Commit Graph

208 Commits

Author SHA1 Message Date
Colin Dellow 3ff08a3bc3 Intersects: faster queries for negative case (#635) 2024-01-08 20:01:01 +00:00
Colin Dellow a600524c90 extend NextRelation/FindInRelation to nodes (#632) 2024-01-07 16:51:08 +00:00
Richard Fairhurst 65829e48cd GeoJSON as alternative to shapefiles (#630) 2024-01-01 23:08:08 +00:00
Colin Dellow 2bb131b5c4 run Docker build on PRs (#627) 2023-12-30 13:43:35 +00:00
Colin Dellow ae1981b0f0 use vtzero instead of libprotobuf (#625) 2023-12-28 21:31:01 +00:00
Colin Dellow 12ed2414d9 use protozero (#623) 2023-12-28 20:07:50 +00:00
Colin Dellow d62c480deb be able to render the planet with 32gb of RAM (#618)
* move OutputObjects to mmap store

For the planet, we need 1.3B output objects, 12 bytes per, so ~15GB
of RAM.

* treat objects at low zoom specially

For GB, ~0.3% of objects are visible at low zooms.

I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.

This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.

* make more explicit that this is unexpected

* extend --materialize-geometries to nodes

For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.

This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet if NA has similar characteristics as the planet.

Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.

This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits, we need 34
to store the Node ID, so that leaves us with 1.5 bits => can divide into
3 stores.

Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.

* rejig AttributePair layout

On g++, this reduces the size from 48 bytes to 34 bytes.

There aren't _that_ many attribute pairs, even on the planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box

* fix initialization order warning

* add PooledString

Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.

The key benefits - 16 bytes, not 32 bytes (g++) or 24 bytes (clang).

When it does allocate (for strings longer than 15 bytes), it allocates
from a pool so there's less per-allocation overhead.

* add tests for attribute store

...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things

* rejig isHot

Break dependency on AttributePair, just work on std::string

* teach PooledString to work with std::string

...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.

* use PooledString in AttributePair

AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there's probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.

Runtime doesn't seem affected.

There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15 byte short string optimization.

Of the remaining 60%, many are Latin-alphabet title cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.

Even in the most optimistic case where:

- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (=16m)
- of these, we can make 100% fit in SSO

...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.

* log timings

When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.

This also might be useful in detecting perf regressions more granularly.

* AppendVector: an append-only chunked vector

When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.

However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.

Instead, use our own container that tries to get the best of both worlds.

Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator impelementation may not get exercised in
practice.

* fix progress when --store present

* mutex on RelationScan progress output

* make NodeStore/WayStore shardable

This adds three methods to the stores:

- `shard()` returns which shard you are
- `shards()` returns how many shards total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X

SortedNodeStore/SortedWayStore are not implemented yet, that'll come in
a future commit.

This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.

Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.

* add minimal SortedNodeStore test

I'm going to rejig the innards of this class, so let's have some tests.

* stop using internal linkage for atomics

In order to shard the stores, we need to have multiple instances
of the class.

Two things block this currently: atomics at file-level, and
thread-locals.

Moving the atomics to the class is easy.

Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.

* SortedNodeStore: abstract TLS behind storage()

Still only supports 1 class, but this is a step along the path.

* SortedWayStore: abstract TLS behind storage()

* SortedNodeStore: support multiple instances

* SortedWayStorage: support multiple instances

* actually fix the low zoom object collection

D'oh, this "worked" due to two bugs cancelling each other:

(a) the code to find things in the low zoom list never found anything,
    because it assumed a base z6 tile of 0/0

(b) we weren't returning early, so the normal code still ran

Rejigged to actually do what I was intending

* AppendVector tweaks

* more low zoom fixes

* implement SortedNodeStore::contains

* implement SortedWayStore::contains

* use TileCoordinatesSet

* faster covered tile enumeration

Do a single pass,  rather than one pass per zoom.

* add ShardedNodeStore

This distributes nodes into one of 8 shards, trying to roughly group
parts of the globe by complexity.

This should help with locality when writing tiles.

A future commit will add a ShardedWayStore and teach read_pbf to read in
a locality-aware manner, which should help when reading ways.

* add ShardedWayStore

Add `--shard-stores` flag.

It's not clear yet this'll be a win, will need to benchmark.

The cost of reading the PBF blocks repeatedly is a bit higher than I was
expecting. It might be worth seeing if we can index the blocks to skip
fruitless reads.

* fewer, more balanced shards

* skip ReadPhase::Ways passes if node store is empty

* support multiple passes for ReadPhase::Relations

* fix check for first way

* adjust shards

With this distribution, no node shard is more than ~8.5GB.

* Relations: fix effectiveShards > 1 check

Oops, bug that very moderately affected performance in the non
`--shard-stores` case

* extend --materialize-geometries to LayerAsCentroid

It turns out that about 20% of LayerAsCentroid calls are for nodes,
which this branch could already do.

The remaining calls are predominantly ways, e.g. housenumbers.

We always materialize relation centroids, as they're expensive to
compute.

In GB, this saves about 6.4M points, ~102M. Scaled to the planet, it's
perhaps a 4.5GB savings, which should let us use a more aggressive shard
strategy.

It seems to add 3-4 seconds to the time to process GB.

* add `DequeMap`, change AttributeStore to use it

This implements the idea in https://github.com/systemed/tilemaker/issues/622#issuecomment-1866813888

Rather than storing a `deque<T>` and a `flat_map<T*, uint32_t>`,
store a `deque<T>` and `vector<uint32_t>`, to save 8 bytes per
AttributePair and AttributeSet.

* capture s(this)

Seems to save ~1.5 seconds on GB

* fix warning

* fix warning, really

* fewer shards

Shard 1 (North America) is ~4.8GB of nodes, shard 4 (some of Europe) is
3.7GB. Even ignoring the memory savings in the recent commits, these
could be merged.

* extract option parsing to own file

We'd like to have different defaults based on whether `--store` is
present. Now that option parsing will have some more complex logic,
let's pull it into its own class so it can be more easily tested.

* use sensible defaults based on presence of --store

* improve test coverage

* fixes

* update number of shards to 6

This has no performance impact as we never put anything in the 7th
shard, and so we skip doing the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phase.

The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.

Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb

The new peak memory is ~12.2GB.

I am a little perplexed -- the runtime on a 16-core server was
previously:

```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s
```

But with the most recent commits on this branch, it was:

```
real	118m50.098s
user	1531m13.026s
sys	34m7.252s
```

This is incredibly suspicious. I also tried re-running commit
bbf0957c1e, and got:

```
real	123m15.534s
user	1546m25.196s
sys	38m17.093s
```

...so I can't explain why the earlier runs took 195 min.

Ideas:

- the planet changed between runs, and a horribly broken geometry was
  fixed

- Vultr gives quite different machines for the same class of server

- perhaps most likely: I failed to click "CPU-optimized" when picking
  the earlier server, and got a slow machine the first time, and a fast
  machine the second time. I'm pretty sure I paid the same $, so I'm
  not sure I believe this.

I don't think I really believe that a 33% reduction in runtime is
explained by any of those, though. Anyway, just another thing to
be befuddled by.

* --store uses lazy geometries; permit overriding

I did some experiments on a Hetzner 48-core box with 192GB of RAM:

--store, materialize geometries:
real 65m34.327s
user 2297m50.204s
sys 65m0.901s

The process often failed to use 100% of CPU--if you naively divide
user+sys/real you get ~36, whereas the ideal would be ~48.

Looking at stack traces, it seemed to coincide with calls to Boost's
rbtree_best_fit allocator.

Maybe:

- we're doing disk I/O, and it's just slower than recomputing the geometries
- we're using the Boost mmap library suboptimally -- maybe there's
  some other allocator we could be using. I think we use the mmap
  allocator like a simple bump allocator, so I don't know why we'd need
  a red-black tree

--store, lazy geometries:
real 55m33.979s
user 2386m27.294s
sys 23m58.973s

Faster, but still some overhead (user+sys/real => ~43)

no --store, materialize geometries: OOM

no --store, lazy geometries (used 175GB):
real 51m27.779s
user 2306m25.309s
sys 16m34.289s

This was almost 100% CPU - user+sys/real => ~45)

From this, I infer:

- `--store` should always default to lazy geometries in order to
  minimize the I/O burden

- `--materialize-geometries` is a good default for non-store usage,
  but it's still useful to be able to override and use lazy geometries,
  if it then means you can fit the data entirely in memory
2023-12-28 15:23:35 +00:00
Richard Fairhurst 5acee418ba PMTiles support (#620) 2023-12-22 10:45:05 +00:00
Colin Dellow 52b62dfbd5 some memory and concurrency improvements (#612)
* extract ClipCache to own file

Some housekeeping: extract clip_cache.cpp

* templatize ClipCache, apply to MultiLineStrings

This provides a very small benefit. I think the reason is two-fold:
there aren't many multilinestrings (relative to multipolygons), and
clipping them is less expensive.

Still, it did seem to provide a small boost, so leaving it in.

* housekeeping: move test, minunit

* --log-tile-timings: verbose timing logs

This isn't super useful to end users, but is useful for developers.

If it's not OK to leave it in, let me know & I'll revert it.

You can then process the log:

```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt  | grep ' took ' | sort -nk3  | grep z$x/ | awk 'BEGIN { min = 999999; max = 0;  }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done

z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```

This shows each zoom's # of tiles, total time, average time, worst case
time (and the tile that caused it).

In general, lower zooms are slower than higher zooms. This seems
intuitively reasonable, as the lower zoom often contains all of
the objects in the higher zoom.

I would have guessed that a lower zoom would cost 4x the next higher
zoom on a per-tile basis. That's sort of the case for `z12->z13`,
`z11->z12`, `z10->z11`, and `z9->z10`. But not so for other zooms,
where it's more like a 2x cost.

Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile.
This is probably because `water` has a minzoom of 6.

This all makes me think that the next big gain will be from re-using
simplifications.

This is sort of the mirror image of the clip cache:

- the clip cache avoids redundant clipping, and needs to be computed
  from lower zooms to higher zooms

- a simplification cache could make simplifying cheaper, but needs to
  be computed from higher zooms to lower zooms

The simplification cache also has two other wrinkles:

1. Is it even valid? e.g. is `simplify(object, 4)` the same as
   `simplify(simplify(object, 2), 2)` ? Maybe it doesn't have to be the
   same, because users are already accepting that we're losing accuracy
   when we simplify.

2. Rendering an object at `z - 1`, needds to (potentially) stitch together
   that object from 4 tiles at `z`. If those have each been simplified,
   we may introduce odd seams where the terminal points don't line up.

* more, smaller caches; run destructors outside lock

* use explicit types

* don't populate unnecessary vectors

* reserve vectors appropriately

* don't eagerly call way:IsClosed()

This saves a very little bit of time, but more importantly, tees up
lazily evaluating the nodes in a way.

* remove locks from geometry stores

Rather than locking on each store call, threads lease a range of the
ID space for points/lines/multilines/polygons. When the thread ends,
it return the lease.

This has some implications:

- the IDs are no longer guaranteed to be contiguous

- shapefiles are a bit weird, as they're loaded on the main
  thread -- so their lease won't be returned until program
  end. This is fine, just pointing it out.

This didn't actually seem to affect runtime that much on my 16 core
machine, but I think it'd help on machines with more cores.

* increase attributestore shards

When scaling to 32+ cores, this shows up as an issue. Try a really
blunt hammer fix.

* read_pbf: less lock contention on status

`std::cout` has some internal locks -- instead, let's synchronize
explicitly outside of it so we control the contention.

If a worker fails to get the lock, just skip that worker's update.

* tile_worker: do syscall 1x/thread, not 1x/tile

* tilemaker: avoid lock contention on status update

If a worker can't get the lock, just skip their update.

* Revert "don't eagerly call way:IsClosed()"

This reverts commit 3e7b9b62d1.

This commit came about from some experiments that I had done
pre-SortedNodeStore.

In that world, lazily evaluating the nodes of a way provided a
meaningful savings if the way was ultimately discarded by the Lua
code.

Post-SortedNodeStore, it doesn't seem to matter as much. Which is great,
as it means the store is much faster, but also means this commit is
just noise.

You can see the POC code in https://github.com/cldellow/tilemaker/tree/lazy-way-nodes

* update ifdef guard, add comments

* lazy way geometries

Tilemaker previously stored the 2D geometries it produced from ways.

This commit makes Tilemaker use the OSM way store to generate linestrings
and polygons that originated with an OSM way. You can get the old
behaviour with `--materialize-geometries`, which is a sensible choice if
you are not memory constrained.

For GB:

before (available via `--materialize-geometries`): 2m11s, 9600MB
this commit:  2m20s, 6400MB

So ~8% slower, but 33% less memory.

I think it's probably reasonable for this to be the default, which has
nice symmetry with compressed nodes and compressed ways being the
default.

Building NA with --store still seems OK - 36min. I was concerned that
the increased node store lookups could be vulnerable to thrashing.
I do see some stuttering during tile writing, but I think the decreased
read iops from a smaller geometry store balance out the increased
read iops from looking up nodes. A future investigation might be to
have SortedWayStore store latplons rather than node IDs -- a bit
more memory, but should be less CPU and less vulnerable to thrashing.

* improve tile coordinate generation

Before writing, we compute the set of tiles to be written.

There were two opportunities for improvement here:

- determining which tiles were covered by our objects: we previously
  used a `std::set`, which has poor memory and runtime behaviour.
  Instead, use a fixed size `std::vector<bool>` -- this takes 64MB
  at z14, but gives much faster insert/query times

- determining if tiles were covered by clipping box: we used
  boost's intersect algorithm before, which required constructing
  a TileBbox and was a bit slow. In the case where the tile is
  contained in a z6 tile that is wholly covered by the clipping
  box, we can short-circuit

This has the most impact when the set of objects or tiles is very
large--e.g. Antarctica, North America or bigger.

* SortedNodeStore: only do arena allocations

On a 48-core server, I noticed lock contention on the mmap allocator.

So let's just always use pools of memory, and pick a bigger pool size.

This means we'll sometimes allocate memory that we don't use.

In the extreme case of Monaco, we only need like 200KB, but we'll
allocate several megs.

As you scale to larger PBFs, the waste trends to 0%, so this should
be fine in practice.

* remove TODO

* fix Windows build

D'oh, clock_gettime is Linux-ish. `std::chrono` may have a
cross-platform option, but it's not clear.

For now, just omit this functionality on Windows. If we want to expose
it, we can explore something in std::chrono or make a wrapper that
calls QueryPerformanceCounter on Windows.

* sigh

* fix bounds check
2023-12-15 17:04:46 +00:00
Richard Fairhurst 6940f98346 Make shapefile processing multi-threaded (#614) 2023-12-15 16:24:56 +00:00
Richard Fairhurst 34c356a9f5 Coastline tweaks (#611) 2023-12-11 20:38:04 +00:00
Richard Fairhurst b2785690d2 GeoJSON writer for debugging (#609) 2023-12-09 21:38:39 +00:00
Colin Dellow 0d951acc8a buffer sqlite writes if needed (#608)
Fixes #596, I think.

Time to process Antarctica drops from 9m5s to 3m51s
2023-12-09 13:48:01 +00:00
Colin Dellow 8300b0cdd9 Alternate node store (#590)
* refactor NodeStore

I'd like to add an alternative NodeStore that can be used when the
`Type_then_ID` property is present in the PBF.

First, a small (?) refactor:

- make `NodeStore` an interface, with two concrete implementations
- extract the NodeStore related things to their own files
- this will cause some churn, as they'll depend on things that also
  need to get extracted to their own files. Short term pain, hopefully
  long term gain in faster compile times.

Changing the invocations of the functions to be virtual may have impact
on performance. Will need to revisit that before committing to virtual
methods.

* change how work is assigned for ReadPhase::Nodes

Currently, when a worker needs work, it gets the next unprocessed block.
This means blocks are read sequentially at a global level, but from
the perspective of each worker, there are gaps in the blocks they see.

For nodes, we'd prefer to give each worker thread contiguous blocks
from the underlying PBF. This will enable a more efficient storage
for PBFs with the `Sort.Type_then_ID` flag.

* add SortedNodeStore

SortedNodeStore is uesful for PBFs with the `Sort.Type_then_ID`
property, e.g. the planet and Geofabrik exports.

It stores nodes in a hierarchy:

- Level 1 is groups: there are 256K groups
- Level 2 is chunks: each group has 256 chunks
- Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of
only 2M -- the space required for the level 1 pointers.

Groups and chunks store their data sparsely. If a group has 7 chunks,
it only uses storage for 7 chunks.

On Great Britain's 184M node PBF, it needs ~9.13 bytes per node.

Looking up a node can be done in fixed time:

First, get some offsets:
- Group: `nodeID / 65536`
- Chunk: `(nodeID / 65536) / 256`
- Position within chunk: `nodeID % 256`

For example, Cape Chignecto Provincial Park has ID 4855703, giving:
- Group 74
- Chunk 23
- Offset 151

Group 74's chunks may be sparse. To map chunk 23 to its physical
location, each group has a 256-bit bitmask indicating which
chunks are present.

Use its physical location to get its `chunkOffset`. That allows you
to get to the `ChunkInfo` struct.

From there, do the same thing to get the node data.

This design should also let us do some interesting things down the road,
like efficiently compressing each chunk using something like delta
encoding, zigzag encoding and bit packing. Then, to avoid paying a
decompression cost, we'd likely give each worker a cache of uncompressed
chunks.

* cmake build

* tidy up

* tweak

* tweak

* derp

* mac/windows build

* fix build?

I don't understand why these can't be passed as a copy in the Windows
and Mac builds. Whatever, try passing a reference.

* fix --store

I think nested containers may not be wired up quite correctly.
Instead, manage the char* buffers directly, rather than as
`std::vector<char>`

I'll fixup the other aspects (attributing libpopcnt, picking
Sorted vs BinarySearch on the fly) later

* attribution for libpopcnt

* simplify read_pbf

All read phases use the same striding-over-batches-of-blocks approach.

This required changing how progress is reported, as block IDs are no
longer globally montonically increasing.

Rather than thread the state into ReadBlock, I just adopted 2 atomic
counters for the whole class -- the progress reporter already assumes
that it's the only thing dumping to stdout, so the purity of avoiding
class-global doesn't buy us anything.

* clear allocatedMemory

* use scale factor 16, not 8

D'oh, if you get a full group where each chunk is full, you need to be
able to express a value _ever so slightly_ larger than 65,536.

North America and Europe have examples of this.

Use a scale factor of 16, not 8. This'll mean some chunks have up to 15
wasted bytes, but it's not a huge deal. (And I have some thoughts on how
to claw it back.)

* comment out debug stats

* windows build

* derp

* use SortedNodeStore if PBFs have Sort.Type_then_ID

* add --compress-nodes

If the user passes `--compress-nodes`, we use [streamvbyte](https://github.com/lemire/streamvbyte)
to compress chunks of nodes in memory.

The impact on read time is not much:
- GB with `--compress-nodes`: 1m42s
- without: 1m35s

But the impact on memory is worthwhile, even across very different
extracts:

North America - 5.52 bytes/node vs 8.48 bytes/node
169482 groups, 18364343 chunks, 1757589784 nodes, needed 9706167278 bytes
169482 groups, 18364343 chunks, 1757589784 nodes, needed 14916095182 bytes

Great Britain - 5.97 bytes/node vs 9.25 bytes/node
163074 groups, 4871807 chunks, 184655287 nodes, needed 1104024510 bytes
163074 groups, 4871807 chunks, 184655287 nodes, needed 1708093150 bytes

Nova Scota - 5.81 bytes/node vs 8.7 bytes/node
26777 groups, 157927 chunks, 12104733 nodes, needed 70337950 bytes
26777 groups, 157927 chunks, 12104733 nodes, needed 105367598 bytes

Monaco - 10.43 bytes/node vs 13.52 bytes/node
1196 groups, 2449 chunks, 30477 nodes, needed 318114 bytes
1196 groups, 2449 chunks, 30477 nodes, needed 412258 bytes

* build

* build

* remove __restrict__ to satisfy windows build

* remove debug print, small memory optimization

* use an arena for small groups

* omit needless words

* better short-circuiting for Type-then-ID PBFs

Track metadata about which blocks have nodes, ways and relations.
By default, we assume any block may contain nodes, ways or relations.

If the PBF supports Type-then-ID PBFs, do a binary search to find the first
blocks with ways and relations.

This means ReadPhase::Nodes can stop without scanning ways/relations.
In addition to avoiding needless work, it makes it easier to assign
each worker a balanced amount of work -- now each worker has only
blocks with nodes, which are about the same effort computationally.

It also makes ReadPhase::ScanRelations faster, as it scans exactly the
blocks with relations, skipping the blocks with ways.

Similarly, ReadPhase::Ways is a bit faster, as it doesn't have to read
the blocks with relations.

For North America, this reduces the time to complete the Nodes and
RelationsScan phase from 2m30s to 1m20s.

For GB, it reduces the time from 22s to 9s.

* ReadPhase::Relations - more parallelism

When processing relations for small extracts, there are often fewer
blocks than cores.

Instead, divide the work more granularly, assigning each of the N
threads 1/Nth of the block to process.

This saves 4-5 seconds (which is cumulatively ~20% of runtime) for
the Canadian province of Nova Scotia.

* extract WayStore, BinarySearchWayStore

* stub in SortedWayStore

...it just throws a lot of exceptions at the moment.

* put SortedNodeStore in a namespace

Also replace some `#define`s with `const`s.

I'm likely going to reuse some names in SortedWayStore, so namespacing
to avoid conflicts.

* don't use SortedWayStore if LocationsOnWays present

* stub in insertLatpLons/insertNodes

* change at() to return a non mmap vector

SortedWayStore won't create mmaped vectors, so we need to return the
lowest common denominator.

This pessimizes performance of BinarySearchWayStore, since it'll have
to allocate vectors on demand.

Longer term: it might be better to return an iterator that hides the heavy
lifting.

* begin drawing the rest of the owl

* flesh out types

* add unit test framework

* naive encoding of ways

Checkpointing since I have something that works.

Future optimizations:

- when all high ints are the same, don't encode them
- compression

* more efficient if high ints are all the same

* extract mmap_allocator.cpp

This is needed to unit test the way store without dragging
in osm_store.

* progress on publishGroup

checkpointing, going to extract a populateMask(...) function

* add populateMask function

* finish publishGroup

* SortedWayStore: implement at

* pass node store into SortedWayStore

* fix alignment

* better logs

* way stores should throw std::out_of_range

This is part of the contract, client code will catch it and reject
relations that have missing ways.

* sortednodestore: throw std::out_of_range

* support way compression

* remove dead code, robust against empty ways

* implement clear()

* maybe fix windows build?

very unclear why this is needed, but we seem to be getting C2131 on this
line.

* don't use variable-length arrays on stack

Workaround for MSVC

* avoid more variable-length arrays

* make the other vectors as thread-local

* --no-compress-ways, --no-compress-nodes
2023-12-09 13:47:07 +00:00
Colin Dellow 73771a1e13 Clipping monster polygons 2 (#607)
* clipping monster polygons, take 2

This replaces https://github.com/systemed/tilemaker/pull/606

Rather than dealing with the special case of fully covered or fully
excluded, we keep a cache of recently clipped geometries.

A higher zoom will look in the cache for a clipped geometry of
the object from a lower zoom and re-use it if possible, falling
back to the whole geometry only if no cache entry is found.

Hudson Bay, master: 22m27s
this PR: 1m28s

This relies on the assumption that clipping a clipped polygon gives
the same result as clipping the original polygon. That seems
reasonable to me, but if you know of edge cases to consider,
let me know.

* more robustness against invalid Antarctica nodes

This is another fix in the vein of https://github.com/systemed/tilemaker/pull/595/commits/066ca0a91ea2981f55e63cd616af07a5dfff3125

Testing Antarctica with this PR: 9m5s

The generated mbtiles doesn't seem to be convertible to a pmtiles file
-- `pmtiles convert` fails with:

```
main.go:162: Failed to convert /home/cldellow/src/basemap/tiles.mbtiles, Missing row
```

I've never generated Antarctica with the previous code, so I'm not sure
if this is specific to this PR or not.

My intuition is this is not specific to this PR. Wild guess is that
we're creating tiles with no features, due to https://github.com/systemed/tilemaker/blob/d470dc94fea6ae74e948c1a858c6e1228c9f4bf9/src/tile_worker.cpp#L210, and pmtiles doesn't like that.
2023-12-08 13:01:35 +00:00
Richard Fairhurst 0dbbe6ed38 Better hash function for TileCoordinates 2023-12-07 17:19:40 +00:00
Richard Fairhurst d470dc94fe Merge pull request #602 from systemed/scale_backtrack
Avoid creating small artefacts when scaling
2023-12-05 09:46:09 +00:00
Colin Dellow 4824688c14 AttributeStore: skip locks when outputting (#599)
The locks in AttributeStore are necessary only during PBF reading,
to avoid concurrent mutations corrupting things.

Once we're writing the mbtiles, it's safe to read without acquiring
the lock. This eliminates ~9% of system time, and ~2-3% of wall clock
time.

The PR also adds a `finalize()` to AttributeStore, AttributeKeyStore and
AttributePairStore. Nothing actually uses this yet - I initially checked
the `finalized` variable and threw if an unsafe method was called, but that
gave up the speed benefits, so I removed it again.

Perhaps in the future, a debug build could leave such checks in to
detect programming errors.
2023-12-04 21:29:49 +00:00
systemed 2614985dbd Tidy logging, remove tsl 2023-12-04 21:25:35 +00:00
systemed 351b8ff2cc Backtrack when scaling to avoid spikes 2023-12-04 18:53:24 +00:00
systemed b33ccf04ee Merge branch 'master' into refactor_geometries 2023-12-02 21:14:29 +00:00
systemed c217e4d864 Merge branch 'master' into refactor_geometries 2023-12-02 21:05:18 +00:00
Colin Dellow 5e06647cf3 Remove output object ref (#595) 2023-12-02 21:04:40 +00:00
Richard Fairhurst 1da4be97dc Optimise sqlite inserts (#598) 2023-12-01 13:10:24 +00:00
systemed 33ad2c186e Revert "Update to latest version of kleunen/boost_geometry_correct (#581)"
This reverts commit cfea9728ac.
2023-11-27 13:24:04 +00:00
systemed ff0b9d6676 Merge branch 'master' into refactor_geometries 2023-11-19 18:06:53 +00:00
Richard Fairhurst cfea9728ac Update to latest version of kleunen/boost_geometry_correct (#581) 2023-11-19 18:05:02 +00:00
Colin Dellow 3b3b8f1d3a AttributeStore memory tweaks (#583)
* make AttributeStore::get const

I think AttributeStore lives forever, and AttributeSets are immutable
once added to it, so we can avoid the copy.

* use a string pool for AttributeSet keys

There are relatively few unique key values for attributes, e.g.
`kind`, `name`, `admin_level`.

The Shortbread schema has only ~50 or so. I imagine OMT is similar,
but haven't checked.

We generate lots of AttributePairs -- on the order of tens of millions
for GB, and std::string has an overhead of 32 bytes. By using a string
pool and storing only an offset into it, we can save a few hundred MB
of RAM.

* lock-free reads for keys, vector for pairs

This is the groundwork for implementing two future improvements:

- hot/cold pairs: there is a bimodal distribution of attribute
  frequency. `landuse=wood`, `tunnel=0` are often duplicated.
  `name=Sneed's Seed & Feed` is not.

  In the future, we'll try to re-use the "hot" pairs to avoid
  paying the cost of an AttributePair for them.

- "short vectors" - similar to the short string optimization,
  we should be able to pack up to 6 pairs (3 hot, 3 cold) in
  the overhead that a vector would otherwise use.

As it stands, this commit increases memory usage. But we'll claw
a lot of it back, and then some.

* Have a "hot" shard for popular pairs

If a pair looks like it might be re-usable, put it in a special
shard and be able to re-use it.

The special shard is limited to max 64K items, teeing up future
work to have a simple vector for AttributeSets with few pairs.

* treat 0 as a sentinel

* de-dupe all AttributePairs

The stats I was looking at were counting AttributePairs via
AttributeSets, which of course presents a misleading image
of how many duplicate AttributePairs there are, because by
that point, they've already been deduped.

De-duping doesn't add that much runtime overhead--and it could
probably be improved by someone who knows more C++ concurrency
tricks than me.

* store pointers in pairMaps, optimize debug spew

`Tile_Value` is a really memory-expensive object. Since we maintain
long-lived references to the canonical AttributePair, we can store
pointers to save a bit more memory.

Now that value->AttributePairs are guaranteed to be 1:1, we can do our
debug statistics on ints, and translate to pairs only when writing to
stdout.

* use boost::container::flat_map over std::map

Doesn't appreciably affect runtime, saves a bit of memory.

* don't memoize hash function

Now that there is a 1:1 mapping between values and AttributePairs,
it's trivial to compute the hash on demand.

* output_object: avoid Tile_Value temporaries

Also const-ify a few things

* defer creating Tile_Value

Tile_Value is a big union that takes up 96 bytes,
but for our purposes, we're happy with a union of
string, float and bool -- which can be expressed
in 28 bytes. We need a discriminator variable, but
due to alignment, that's free.

I also consider `boost::variant<bool, float, string>`,
but it seemed to take 40 bytes.

I worried that not having a pool of Tile_Values would
affect PBF writing time, but it seems unaffected.

* adjust headers, remove unneeded rng

* any integer 0 <= 25 is eligible for hot pool

This is useful for ranks, which run from 1..25

* Use a small vector optimization for pair indexes

`vector<uint32_t>` takes 24 bytes just to store its internal pointers.
If you actually want to store a `uint32_t` in it, it'll then allocate
some memory on the heap, taking a further 32-64 bytes depending on STL
and malloc implementations.

56-88 bytes! For a single `uint32_t`! Outrageous.

Instead, store references to pair indexes in an array of shorts. If
the pairs don't fit in the array, upgrade it to a vector.

Since we previously arranged for very popular pairs like `amenity=toilets`
to have small indexes, our array of shorts is capable of storing between
4 and 8 pairs before we need to upgade to a vector. Most AttributeSets
will not need to use a vector.

* simplify AttributeKeyStore

* use camelCase

* re-write to avoid static lifetime

AttributeKeyStore/AttributePairStore have the same lifetime as
AttributeStore, so just make them owned by it.

This results in slightly more convoluted code, but avoids having
them floating around as globals.

* reduce lock contention

* Improve TileCoordinates hash function

x ^ y will only use as many bits as max(x, y), but tiles
only use the full 32-bit space at z16, so we're leaving
a lot of the hash space on the table.

* d'oh, avoid looking up the key name needlessly

* change AttributeXyz(...) to be last-written wins

Previously, if you set the same key to different values, it was
not guaranteed that the last value written would win.

* remove misleading comment

* include deque

* include map

* return vector, not set

set seems a bit like overkill - we already know the items are unique,
and the consumer is likely just going to iterate over them

* avoid GNU-specific initializer

also avoid hardcoding 12

* Revert "Improve TileCoordinates hash function"

This reverts commit 7570737715.

Oops, I think this change isn't meaningful, and is a result of me
misreading the original code.

It might still be an improvement to do something like
`hash(x << 16) ^ hash(y)`, since the default TileCoordinate is only 16
bits, but that can be considered independent of this PR.

* remove dead code

* avoid copying AttributePairs

They're long-lived, so pass pointers

* OutputObjects - greatly reduce need for locks

I'm slowly remembering how to write concurrent code...

* AttributeKeyStore: use a TLS cache

This should reduce futex contention significantly. I'll
apply the same change for AttributePairStore's shard 0, then
measure.

* AttributePairStore: reduce lock contention

* ensure atomics are initialized

Per https://stackoverflow.com/questions/36320008/whats-the-default-value-for-a-stdatomic,
they aren't initialized by default. Somewhat surprised this didn't
result in crashes.

* don't store duplicate way geometries

A common pattern is:

```lua
way:Layer("waterway", false)
...
way:Layer("waterway_names", false)
```

Previously, we'd process the geometry twice, and store a second copy of
it in memory. Instead, re-use the previously stored geometry.

This saves another ~1GB of memory for the GB extract. It doesn't
seem to affect runtime - I think we only re-use linestrings, and
linestrings are relatively cheap to do `is_valid` on.

It seems like with the rest of the work on this branch, the
`OutputObjectXyz` classes are very thin -- inspecting `geomType` in
order to construct the right was a bit tedious, so I removed them.
2023-11-19 17:59:05 +00:00
systemed d9bb72b929 Merge branch 'master' into refactor_geometries 2023-11-16 21:16:44 +00:00
Colin Dellow 8b45a8a33e Avoid copying strings/tag maps (#577) 2023-11-12 19:03:02 +00:00
Colin Dellow 8c059d9cf5 Re-use ifstream and OSMLuaProcessing (#578) 2023-11-12 17:58:25 +00:00
Colin Dellow 9c7081baed shard the node store to reduce memory usage (#571) 2023-11-12 17:36:22 +00:00
systemed 4db8c25417 Merge branch 'master' into refactor_geometries 2023-11-04 12:58:56 +00:00
Richard Fairhurst dbdb4da097 RestartRelations() to reset relation subscript (#548) 2023-10-09 11:30:03 +01:00
systemed 05c3cda557 Set maximum zoom for feature limit 2023-10-04 10:17:59 +01:00
systemed 0ea137fedb Per-tile feature limit as per #547 2023-10-03 17:52:02 +01:00
systemed 2249202751 Keep output list as OutputObjects for longer 2023-09-19 22:17:21 +01:00
systemed 22a3e64c68 Store objects sequentially (breaks include_ids) 2023-09-19 18:43:16 +01:00
systemed e2f6e17374 Use tsl::ordered_set for attribute storage 2023-09-19 12:24:05 +01:00
systemed adcdb6e892 Merge branch 'master' into refactor_geometries 2023-09-18 18:24:02 +01:00
Richard Fairhurst 439c6b1027 Report OSM ID on Lua processing error (#535) 2023-09-18 14:22:36 +01:00
Richard Fairhurst 528aa25dec Support type=boundary relations as native multipolygons (#508) 2023-07-23 08:57:39 +01:00
holzgeist d3c46edacb fix: CMake build on Arch Linux (#503)
* added missing include
* set required C++ version to 17

Co-authored-by: Tobias Ollmann <tobias@holzgeist.at>
2023-07-16 13:32:06 +01:00
systemed 23d8e6da59 Store attribute sets by index 2023-06-18 14:45:04 +01:00
systemed 856951daae Reinstate output/attribute pair (memory leak) 2023-06-13 12:25:27 +01:00
systemed b7cf92991d Use cached geometries rather than regenerating 2023-06-12 12:31:13 +01:00
systemed 1f8e5187ef Common add-to-index code across osm and shp 2023-06-12 12:01:40 +01:00
systemed ed14fc427d Break pair into separate output/attribute lists 2023-06-12 10:09:32 +01:00
systemed 95fa2a0b57 Optimise bitfield ordering (20 bytes -> 16) 2023-06-11 13:59:29 +01:00
systemed 594b48b94a Additional store only needed for indexed objects 2023-06-11 13:45:47 +01:00