mirror of https://github.com/systemed/tilemaker.git
synced 2026-05-09 18:00:09 -04:00
branch: master, 3 commits
5b96711a09: --shard-stores and --compact aren't compatible
CompactNodeStore doesn't know how to compute whether it contains a node, which is a prerequisite for sharding. The two settings don't make much sense together anyway: sharding would create N CompactNodeStores, each of which takes as much memory as a single one, since each will likely contain a large node ID. This differs from BinarySearchNodeStore and SortedNodeStore, where each of the N store instances takes roughly 1/N of the memory.

Instead:
- fail faster and more clearly by throwing if CompactNodeStore.contains is called
- don't enable sharding if --compact is passed
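A minimal sketch of the fail-fast behavior described above. The class and method names follow the commit message, but the surrounding interface is an assumption, not the actual tilemaker code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: CompactNodeStore indexes nodes densely by ID, so it
// cannot cheaply answer membership queries. Throwing here surfaces the
// incompatibility loudly instead of silently mis-sharding.
struct CompactNodeStoreSketch {
	bool contains(size_t /*shard*/, uint64_t /*id*/) const {
		throw std::runtime_error(
			"CompactNodeStore does not support contains(); "
			"--compact and --shard-stores are incompatible");
	}
	// With --compact, sharding is never enabled: always a single shard.
	size_t shards() const { return 1; }
};
```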
d62c480deb: be able to render the planet with 32GB of RAM (#618)
* move OutputObjects to mmap store
For the planet, we need 1.3B output objects, 12 bytes per, so ~15GB
of RAM.
* treat objects at low zoom specially
For GB, ~0.3% of objects are visible at low zooms.
I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.
This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.
* make more explicit that this is unexpected
* extend --materialize-geometries to nodes
For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.
This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet if NA has similar characteristics as the planet.
Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.
This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits and need 34
to store the node ID, so that leaves us with ~1.5 bits => we can divide into
3 stores.
Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.
* rejig AttributePair layout
On g++, this reduces the size from 48 bytes to 34 bytes.
There aren't _that_ many attribute pairs, even on the planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box.
* fix initialization order warning
* add PooledString
Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.
The key benefits: it's 16 bytes, not 32 bytes (g++) or 24 bytes (clang).
When it does allocate (for strings longer than 15 bytes), it allocates
from a pool, so there's less per-allocation overhead.
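A hypothetical sketch of what such a 16-byte layout could look like. The real PooledString differs; the field names, the sentinel trick, and the global pool here are assumptions for illustration only:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Sketch: strings <= 15 bytes are stored inline (short string optimization);
// longer strings live in a shared pool, identified by offset + length, so
// each long string avoids a per-allocation malloc header.
struct PooledStringSketch {
	// One shared arena; real code would manage growth and thread safety.
	static std::vector<char> pool;

	union {
		struct { uint8_t len; char data[15]; } small;  // len <= 15: inline
		struct { uint8_t tag; uint8_t pad[3];          // tag 255: pooled
		         uint32_t offset; uint32_t len; uint32_t unused; } large;
	} u;

	explicit PooledStringSketch(const std::string& s) {
		if (s.size() <= 15) {
			u.small.len = (uint8_t)s.size();
			std::memcpy(u.small.data, s.data(), s.size());
		} else {
			u.large.tag = 255; // not a valid small length, marks pooled form
			u.large.offset = (uint32_t)pool.size();
			u.large.len = (uint32_t)s.size();
			pool.insert(pool.end(), s.begin(), s.end());
		}
	}

	std::string toString() const {
		if (u.small.len <= 15)
			return std::string(u.small.data, u.small.len);
		return std::string(&pool[u.large.offset], u.large.len);
	}
};
std::vector<char> PooledStringSketch::pool;

static_assert(sizeof(PooledStringSketch) == 16, "must stay at 16 bytes");
```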
* add tests for attribute store
...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things
* rejig isHot
Break dependency on AttributePair, just work on std::string
* teach PooledString to work with std::string
...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.
* use PooledString in AttributePair
AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there are probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.
Runtime doesn't seem affected.
There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15 byte short string optimization.
Of the remaining 60%, many are Latin-alphabet title cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.
Even in the most optimistic case where:
- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (= 16M)
- of these, we can make 100% fit in SSO
...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.
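For illustration only, the 5-bits-per-letter encoding floated above (not implemented in this commit) could look like the following sketch, which lowercases title-cased Latin strings before packing and re-title-cases them on unpacking. All names here are hypothetical:

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// Pack 'a'-'z' and space into 5 bits each (0 = space, 1-26 = a-z).
// Case is dropped; title case is recovered from word position on decode.
// 24 such characters would fit in a 15-byte SSO buffer vs 15 raw bytes.
std::vector<uint8_t> pack5(const std::string& s) {
	std::vector<uint8_t> out((s.size() * 5 + 7) / 8, 0);
	for (size_t i = 0; i < s.size(); i++) {
		uint8_t code = s[i] == ' ' ? 0 : (uint8_t)(std::tolower(s[i]) - 'a' + 1);
		size_t bit = i * 5;
		out[bit / 8] |= (uint8_t)(code << (bit % 8));
		if (bit % 8 > 3) // code straddles a byte boundary
			out[bit / 8 + 1] |= (uint8_t)(code >> (8 - bit % 8));
	}
	return out;
}

std::string unpack5(const std::vector<uint8_t>& in, size_t length) {
	std::string out;
	bool startOfWord = true;
	for (size_t i = 0; i < length; i++) {
		size_t bit = i * 5;
		uint8_t code = in[bit / 8] >> (bit % 8);
		if (bit % 8 > 3)
			code |= (uint8_t)(in[bit / 8 + 1] << (8 - bit % 8));
		code &= 31;
		char c = code == 0 ? ' ' : (char)('a' + code - 1);
		if (c != ' ' && startOfWord) c = (char)std::toupper(c);
		startOfWord = (c == ' ');
		out += c;
	}
	return out;
}
```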
* log timings
When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.
This also might be useful in detecting perf regressions more granularly.
* AppendVector: an append-only chunked vector
When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.
However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.
Instead, use our own container that tries to get the best of both worlds.
Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator implementation may not get exercised in
practice.
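The core idea (large fixed-size chunks, stable element addresses) can be sketched as follows; this is a simplified stand-in, not the actual AppendVector, and it omits the iterator and the mmap backing:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Append-only chunked vector: growing never moves old elements (unlike
// std::vector), and big chunks keep related data on the same OS pages
// (unlike std::deque's small 512-byte blocks).
template <typename T, size_t ChunkElements = 4096>
class AppendVectorSketch {
	std::vector<std::unique_ptr<T[]>> chunks;
	size_t count = 0;
public:
	void push_back(const T& value) {
		if (count % ChunkElements == 0) // current chunk full: add a new one
			chunks.push_back(std::make_unique<T[]>(ChunkElements));
		chunks[count / ChunkElements][count % ChunkElements] = value;
		count++;
	}
	T& operator[](size_t i) { return chunks[i / ChunkElements][i % ChunkElements]; }
	size_t size() const { return count; }
};
```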
* fix progress when --store present
* mutex on RelationScan progress output
* make NodeStore/WayStore shardable
This adds three methods to the stores:
- `shard()` returns which shard this store is
- `shards()` returns how many shards there are in total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X
SortedNodeStore/SortedWayStore are not implemented yet, that'll come in
a future commit.
This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.
Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.
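The interface described above can be sketched like this; the method names come from the commit message, but the exact signatures and the trivial single-shard implementation are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sharding interface added to the node/way stores.
class NodeStoreSketch {
public:
	virtual ~NodeStoreSketch() = default;
	virtual size_t shard() const = 0;  // which shard this store is
	virtual size_t shards() const = 0; // how many shards in total
	// does shard N have an item with id X?
	virtual bool contains(size_t shard, uint64_t id) const = 0;
};

// Trivial unsharded store, for illustration: everything lives in shard 0.
class SingleShardStore : public NodeStoreSketch {
public:
	size_t shard() const override { return 0; }
	size_t shards() const override { return 1; }
	bool contains(size_t s, uint64_t) const override { return s == 0; }
};
```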
* add minimal SortedNodeStore test
I'm going to rejig the innards of this class, so let's have some tests.
* stop using internal linkage for atomics
In order to shard the stores, we need to have multiple instances
of the class.
Two things block this currently: atomics at file-level, and
thread-locals.
Moving the atomics to the class is easy.
Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.
* SortedNodeStore: abstract TLS behind storage()
Still only supports 1 class, but this is a step along the path.
* SortedWayStore: abstract TLS behind storage()
* SortedNodeStore: support multiple instances
* SortedWayStorage: support multiple instances
* actually fix the low zoom object collection
D'oh, this "worked" due to two bugs cancelling each other:
(a) the code to find things in the low zoom list never found anything,
because it assumed a base z6 tile of 0/0
(b) we weren't returning early, so the normal code still ran
Rejigged to actually do what I was intending.
* AppendVector tweaks
* more low zoom fixes
* implement SortedNodeStore::contains
* implement SortedWayStore::contains
* use TileCoordinatesSet
* faster covered tile enumeration
Do a single pass, rather than one pass per zoom.
* add ShardedNodeStore
This distributes nodes into one of 8 shards, trying to roughly group
parts of the globe by complexity.
This should help with locality when writing tiles.
A future commit will add a ShardedWayStore and teach read_pbf to read in
a locality-aware manner, which should help when reading ways.
* add ShardedWayStore
Add `--shard-stores` flag.
It's not clear yet this'll be a win, will need to benchmark.
The cost of reading the PBF blocks repeatedly is a bit higher than I was
expecting. It might be worth seeing if we can index the blocks to skip
fruitless reads.
* fewer, more balanced shards
* skip ReadPhase::Ways passes if node store is empty
* support multiple passes for ReadPhase::Relations
* fix check for first way
* adjust shards
With this distribution, no node shard is more than ~8.5GB.
* Relations: fix effectiveShards > 1 check
Oops, a bug that moderately affected performance in the non-`--shard-stores` case.
* extend --materialize-geometries to LayerAsCentroid
It turns out that about 20% of LayerAsCentroid calls are for nodes,
which this branch could already do.
The remaining calls are predominantly ways, e.g. housenumbers.
We always materialize relation centroids, as they're expensive to
compute.
In GB, this saves about 6.4M points, ~102MB. Scaled to the planet, it's
perhaps a 4.5GB savings, which should let us use a more aggressive shard
strategy.
It seems to add 3-4 seconds to the time to process GB.
* add `DequeMap`, change AttributeStore to use it
This implements the idea in https://github.com/systemed/tilemaker/issues/622#issuecomment-1866813888
Rather than storing a `deque<T>` and a `flat_map<T*, uint32_t>`,
store a `deque<T>` and `vector<uint32_t>`, to save 8 bytes per
AttributePair and AttributeSet.
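The deque-plus-sorted-index idea can be sketched as below. This is a simplified illustration, not the actual DequeMap; the method names are assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Values live in a deque (stable addresses, no reallocation-copy); a vector
// of indexes into that deque, kept sorted by value, replaces the
// flat_map<T*, uint32_t>, saving the 8-byte pointer per entry.
template <typename T>
class DequeMapSketch {
	std::deque<T> values;
	std::vector<uint32_t> sortedIndexes; // indexes into values, sorted by value
public:
	// Return the index of value, inserting it if absent (deduplicating add).
	uint32_t add(const T& value) {
		auto it = std::lower_bound(
			sortedIndexes.begin(), sortedIndexes.end(), value,
			[&](uint32_t i, const T& v) { return values[i] < v; });
		if (it != sortedIndexes.end() && values[*it] == value)
			return *it; // already present
		uint32_t index = (uint32_t)values.size();
		values.push_back(value);
		sortedIndexes.insert(it, index);
		return index;
	}
	const T& at(uint32_t index) const { return values[index]; }
	size_t size() const { return values.size(); }
};
```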
* capture s(this)
Seems to save ~1.5 seconds on GB
* fix warning
* fix warning, really
* fewer shards
Shard 1 (North America) is ~4.8GB of nodes, shard 4 (some of Europe) is
3.7GB. Even ignoring the memory savings in the recent commits, these
could be merged.
* extract option parsing to own file
We'd like to have different defaults based on whether `--store` is
present. Now that option parsing will have some more complex logic,
let's pull it into its own class so it can be more easily tested.
* use sensible defaults based on presence of --store
* improve test coverage
* fixes
* update number of shards to 6
This has no performance impact, as we never put anything in the 7th
shard, and so we skip the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phases.
The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.
Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb
The new peak memory is ~12.2GB.
I am a little perplexed -- the runtime on a 16-core server was
previously:
```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real 195m7.819s
user 2473m52.322s
sys 73m13.116s
```
But with the most recent commits on this branch, it was:
```
real 118m50.098s
user 1531m13.026s
sys 34m7.252s
```
This is incredibly suspicious. I also tried re-running commit
8300b0cdd9: Alternate node store (#590)
* refactor NodeStore

I'd like to add an alternative NodeStore that can be used when the `Type_then_ID` property is present in the PBF.

First, a small (?) refactor:
- make `NodeStore` an interface, with two concrete implementations
- extract the NodeStore-related things to their own files - this will cause some churn, as they'll depend on things that also need to get extracted to their own files. Short-term pain, hopefully long-term gain in faster compile times.

Changing the invocations of the functions to be virtual may have an impact on performance. Will need to revisit that before committing to virtual methods.

* change how work is assigned for ReadPhase::Nodes

Currently, when a worker needs work, it gets the next unprocessed block. This means blocks are read sequentially at a global level, but from the perspective of each worker, there are gaps in the blocks they see.

For nodes, we'd prefer to give each worker thread contiguous blocks from the underlying PBF. This will enable a more efficient storage for PBFs with the `Sort.Type_then_ID` flag.

* add SortedNodeStore

SortedNodeStore is useful for PBFs with the `Sort.Type_then_ID` property, e.g. the planet and Geofabrik exports. It stores nodes in a hierarchy:
- Level 1 is groups: there are 256K groups
- Level 2 is chunks: each group has 256 chunks
- Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of only 2MB -- the space required for the level 1 pointers.

Groups and chunks store their data sparsely. If a group has 7 chunks, it only uses storage for 7 chunks. On Great Britain's 184M-node PBF, it needs ~9.13 bytes per node.

Looking up a node can be done in fixed time. First, get some offsets:
- Group: `nodeID / 65536`
- Chunk: `(nodeID % 65536) / 256`
- Position within chunk: `nodeID % 256`

For example, Cape Chignecto Provincial Park has ID 4855703, giving:
- Group 74
- Chunk 23
- Offset 151

Group 74's chunks may be sparse.
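The lookup arithmetic above, checked against the worked example:

```cpp
#include <cassert>
#include <cstdint>

// 256 nodes per chunk, 256 chunks per group => 65,536 nodes per group.
struct NodePosition { uint64_t group, chunk, offset; };

NodePosition locate(uint64_t nodeID) {
	return {
		nodeID / 65536,        // which group
		(nodeID % 65536) / 256, // which chunk within the group
		nodeID % 256            // position within the chunk
	};
}
```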
To map chunk 23 to its physical location, each group has a 256-bit bitmask indicating which chunks are present. Use its physical location to get its `chunkOffset`. That allows you to get to the `ChunkInfo` struct. From there, do the same thing to get the node data.

This design should also let us do some interesting things down the road, like efficiently compressing each chunk using something like delta encoding, zigzag encoding and bit packing. Then, to avoid paying a decompression cost, we'd likely give each worker a cache of uncompressed chunks.

* cmake build
* tidy up
* tweak
* tweak
* derp
* mac/windows build
* fix build?

I don't understand why these can't be passed as a copy in the Windows and Mac builds. Whatever, try passing a reference.

* fix --store

I think nested containers may not be wired up quite correctly. Instead, manage the char* buffers directly, rather than as `std::vector<char>`.

I'll fix up the other aspects (attributing libpopcnt, picking Sorted vs BinarySearch on the fly) later.

* attribution for libpopcnt
* simplify read_pbf

All read phases use the same striding-over-batches-of-blocks approach.

This required changing how progress is reported, as block IDs are no longer globally monotonically increasing. Rather than thread the state into ReadBlock, I just adopted 2 atomic counters for the whole class -- the progress reporter already assumes that it's the only thing dumping to stdout, so the purity of avoiding class-global state doesn't buy us anything.

* clear allocatedMemory
* use scale factor 16, not 8

D'oh, if you get a full group where each chunk is full, you need to be able to express a value _ever so slightly_ larger than 65,536. North America and Europe have examples of this.

Use a scale factor of 16, not 8. This'll mean some chunks have up to 15 wasted bytes, but it's not a huge deal. (And I have some thoughts on how to claw it back.)
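The sparse-chunk mapping described earlier in this message is a rank query over the presence bitmask. The real code computes ranks with popcount (hence the libpopcnt attribution); this loop-based version is just an illustrative sketch:

```cpp
#include <bitset>
#include <cassert>

// A 256-bit bitmask records which chunks exist in a group. The physical
// slot of chunk N is the number of set bits below N, because present
// chunks are stored contiguously in presence order.
int physicalChunkIndex(const std::bitset<256>& mask, int chunk) {
	if (!mask[chunk])
		return -1; // chunk not present in this group
	int rank = 0;
	for (int i = 0; i < chunk; i++)
		if (mask[i]) rank++;
	return rank;
}
```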
* comment out debug stats
* windows build
* derp
* use SortedNodeStore if PBFs have Sort.Type_then_ID
* add --compress-nodes

If the user passes `--compress-nodes`, we use [streamvbyte](https://github.com/lemire/streamvbyte) to compress chunks of nodes in memory.

The impact on read time is not much:
- GB with `--compress-nodes`: 1m42s
- without: 1m35s

But the impact on memory is worthwhile, even across very different extracts:

North America - 5.52 bytes/node vs 8.48 bytes/node
169482 groups, 18364343 chunks, 1757589784 nodes, needed 9706167278 bytes
169482 groups, 18364343 chunks, 1757589784 nodes, needed 14916095182 bytes

Great Britain - 5.97 bytes/node vs 9.25 bytes/node
163074 groups, 4871807 chunks, 184655287 nodes, needed 1104024510 bytes
163074 groups, 4871807 chunks, 184655287 nodes, needed 1708093150 bytes

Nova Scotia - 5.81 bytes/node vs 8.7 bytes/node
26777 groups, 157927 chunks, 12104733 nodes, needed 70337950 bytes
26777 groups, 157927 chunks, 12104733 nodes, needed 105367598 bytes

Monaco - 10.43 bytes/node vs 13.52 bytes/node
1196 groups, 2449 chunks, 30477 nodes, needed 318114 bytes
1196 groups, 2449 chunks, 30477 nodes, needed 412258 bytes

* build
* build
* remove __restrict__ to satisfy windows build
* remove debug print, small memory optimization
* use an arena for small groups
* omit needless words
* better short-circuiting for Type-then-ID PBFs

Track metadata about which blocks have nodes, ways and relations.

By default, we assume any block may contain nodes, ways or relations. If the PBF has the Type-then-ID property, do a binary search to find the first blocks with ways and relations. This means ReadPhase::Nodes can stop without scanning ways/relations.

In addition to avoiding needless work, it makes it easier to assign each worker a balanced amount of work -- now each worker has only blocks with nodes, which are about the same effort computationally.
It also makes ReadPhase::ScanRelations faster, as it scans exactly the blocks with relations, skipping the blocks with ways. Similarly, ReadPhase::Ways is a bit faster, as it doesn't have to read the blocks with relations.

For North America, this reduces the time to complete the Nodes and RelationsScan phases from 2m30s to 1m20s. For GB, it reduces the time from 22s to 9s.

* ReadPhase::Relations - more parallelism

When processing relations for small extracts, there are often fewer blocks than cores. Instead, divide the work more granularly, assigning each of the N threads 1/Nth of the block to process.

This saves 4-5 seconds (which is cumulatively ~20% of runtime) for the Canadian province of Nova Scotia.

* extract WayStore, BinarySearchWayStore
* stub in SortedWayStore

...it just throws a lot of exceptions at the moment.

* put SortedNodeStore in a namespace

Also replace some `#define`s with `const`s.

I'm likely going to reuse some names in SortedWayStore, so namespacing to avoid conflicts.

* don't use SortedWayStore if LocationsOnWays present
* stub in insertLatpLons/insertNodes
* change at() to return a non-mmap vector

SortedWayStore won't create mmaped vectors, so we need to return the lowest common denominator. This pessimizes performance of BinarySearchWayStore, since it'll have to allocate vectors on demand.

Longer term: it might be better to return an iterator that hides the heavy lifting.

* begin drawing the rest of the owl
* flesh out types
* add unit test framework
* naive encoding of ways

Checkpointing since I have something that works.

Future optimizations:
- when all high ints are the same, don't encode them
- compression

* more efficient if high ints are all the same
* extract mmap_allocator.cpp

This is needed to unit test the way store without dragging in osm_store.

* progress on publishGroup

Checkpointing; going to extract a populateMask(...) function.

* add populateMask function
* finish publishGroup
* SortedWayStore: implement at
* pass node store into SortedWayStore
* fix alignment
* better logs
* way stores should throw std::out_of_range

This is part of the contract; client code will catch it and reject relations that have missing ways.

* sortednodestore: throw std::out_of_range
* support way compression
* remove dead code, robust against empty ways
* implement clear()
* maybe fix windows build?

Very unclear why this is needed, but we seem to be getting C2131 on this line.

* don't use variable-length arrays on stack

Workaround for MSVC.

* avoid more variable-length arrays
* make the other vectors thread-local
* --no-compress-ways, --no-compress-nodes