When I replaced #604 with #626, I botched extracting this part of the
code. I had the trait, which taught kaguya how to serialize
`PossiblyKnownTagValue`, but I missed updating the parameter type
of `Attribute` to actually use it, so it was a no-op.
This PR restores the behaviour of avoiding string copies, but now that
we have protozero's data_view class, we can use that rather than
our own weirdo struct.
This PR generalizes the idea of `node_keys`, adds `way_keys`, and fixes#402.
I'm not too sure if this is generally useful - it's useful for one of my
use cases, and I see someone asking about it in https://github.com/systemed/tilemaker/issues/190
and, elsewhere, in https://github.com/onthegomap/planetiler/issues/99
If you feel it complicates the maintainer story too much, please reject.
The goal is to reduce memory usage for users doing thematic extracts by
not indexing nodes that are only used by uninteresting ways.
For example, North America has ~1.8B nodes, needing 9.7GB of RAM for its node
store. By contrast, if your interest is only to build a railway map, you
require only ~8M nodes, needing 70MB of RAM. Or, to build a map of
national/provincial parks, 12M nodes and ~120MB of RAM.
Currently, a user can achieve this by pre-filtering their PBF using
osmium-tool. If you know exactly what you want, this is a good
long-term solution. But if you're me, flailing about in the OSM data
model, it's convenient to be able to tweak something in the Lua script
and observe the results without having to re-filter the PBF and update
your tilemaker command to use the new PBF.
Sample use cases:
```lua
-- Building a map without building polygons, ~ excludes ways whose
-- only tags are matched by the filter.
way_keys = {"~building"}
```
```lua
-- Building a railway map
way_keys = {"railway"}
```
```lua
-- Building a map of major roads
way_keys = {"highway=motorway", "highway=trunk", "highway=primary", "highway=secondary"}`
```
Nodes used in ways which are used in relations (as identified by
`relation_scan_function`) will always be indexed, regardless of
`node_keys` and `way_keys` settings that might exclude them.
A concrete example, given a Lua script like:
```lua
function way_function()
if Find("railway") ~= "" then
Layer("lines", false)
end
end
```
it takes 13GB of RAM and 100 seconds to process North America.
If you add:
```lua
way_keys = {"railway"}
```
It takes 2GB of RAM and 47 seconds.
Notes:
1. This is based on `lua-interop-3`, as it interacts with files that are
changed by that. I can rebase against master after lua-interop-3 is
merged.
2. The names `node_keys` and `way_keys` are perhaps out of date, as they
can now express conditions on the values of tags in addition to their
keys. Leaving them as-is is nice, as it's not a breaking change.
But if breaking changes are OK, maybe these should be
`node_filters` and `way_filters` ?
3. Maybe the value for `node_keys` in the OMT profile should be
expressed in terms of a negation, e.g. `node_keys = {"~created_by"}`?
This would avoid issues like https://github.com/systemed/tilemaker/issues/337
4. This also adds a SIGUSR1 handler during OSM processing, which prints
the ID of the object currently being processed. This is helpful for
tracking down slow geometries.
Cherry-picked from
https://github.com/systemed/tilemaker/pull/604/commits/b3221667a9d2366410dbfdc7f25f3062d7a135ef,
https://github.com/systemed/tilemaker/pull/604/commits/5c807a9841b866c6dc403141effd4c9d14459034,
https://github.com/systemed/tilemaker/pull/604/commits/13b3465f1c80052aa2d622e3915af08b8c5eae9a
and fixed up to work with protozero's data_view structure.
Original commit messages below, the timings will vary but the idea is
the same:
Faster tagmap
=====
Building a std::map for tags is somewhat expensive, especially when
we know that the number of tags is usually quite small.
Instead, use a custom structure that does a crappy-but-fast hash
to put the keys/values in one of 16 buckets, then linear search
the bucket.
For GB, before:
```
real 1m11.507s
user 16m49.604s
sys 0m17.381s
```
After:
```
real 1m9.557s
user 16m28.826s
sys 0m17.937s
```
Saving 2 seconds of wall clock and 20 seconds of user time doesn't
seem like much, but (a) it's not nothing and (b) having the tags
in this format will enable us to thwart some of Lua's defensive
copies in a subsequent commit.
A note about the hash function: hashing each letter of the string
using boost::hash_combine eliminated the time savings.
Faster Find()/Holds()
=====
We (ab?)use kaguya's parameter serialization machinery. Rather than
take a `std::string`, we take a `KnownTagKey` and teach Lua how to
convert a Lua string into a `KnownTagKey`.
This avoids the need to do a defensive copy of the string when coming
from Lua.
It provides a modest boost:
```
real 1m8.859s
user 16m13.292s
sys 0m18.104s
```
Most keys are short enough to fit in the small-string optimization, so
this doesn't help us avoid mallocs. An exception is `addr:housenumber`,
which, at 16 bytes, exceeds g++'s limit of 15 bytes.
It should be possible to also apply a similar trick to the `Attribute(...)`
functions, to avoid defensive copies of strings that we've seen as keys
or values.
avoid malloc for Attribute with long strings
=====
After:
```
real 1m8.124s
user 16m6.620s
sys 0m16.808s
```
Looks like we're solidly into diminishing returns at this point.
* extract ClipCache to own file
Some housekeeping: extract clip_cache.cpp
* templatize ClipCache, apply to MultiLineStrings
This provides a very small benefit. I think the reason is two-fold:
there aren't many multilinestrings (relative to multipolygons), and
clipping them is less expensive.
Still, it did seem to provide a small boost, so leaving it in.
* housekeeping: move test, minunit
* --log-tile-timings: verbose timing logs
This isn't super useful to end users, but is useful for developers.
If it's not OK to leave it in, let me know & I'll revert it.
You can then process the log:
```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt | grep ' took ' | sort -nk3 | grep z$x/ | awk 'BEGIN { min = 999999; max = 0; }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done
z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```
This shows each zoom's # of tiles, total time, average time, worst case
time (and the tile that caused it).
In general, lower zooms are slower than higher zooms. This seems
intuitively reasonable, as the lower zoom often contains all of
the objects in the higher zoom.
I would have guessed that a lower zoom would cost 4x the next higher
zoom on a per-tile basis. That's sort of the case for `z12->z13`,
`z11->z12`, `z10->z11`, and `z9->z10`. But not so for other zooms,
where it's more like a 2x cost.
Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile.
This is probably because `water` has a minzoom of 6.
This all makes me think that the next big gain will be from re-using
simplifications.
This is sort of the mirror image of the clip cache:
- the clip cache avoids redundant clipping, and needs to be computed
from lower zooms to higher zooms
- a simplification cache could make simplifying cheaper, but needs to
be computed from higher zooms to lower zooms
The simplification cache also has two other wrinkles:
1. Is it even valid? e.g. is `simplify(object, 4)` the same as
`simplify(simplify(object, 2), 2)` ? Maybe it doesn't have to be the
same, because users are already accepting that we're losing accuracy
when we simplify.
2. Rendering an object at `z - 1`, needds to (potentially) stitch together
that object from 4 tiles at `z`. If those have each been simplified,
we may introduce odd seams where the terminal points don't line up.
* more, smaller caches; run destructors outside lock
* use explicit types
* don't populate unnecessary vectors
* reserve vectors appropriately
* don't eagerly call way:IsClosed()
This saves a very little bit of time, but more importantly, tees up
lazily evaluating the nodes in a way.
* remove locks from geometry stores
Rather than locking on each store call, threads lease a range of the
ID space for points/lines/multilines/polygons. When the thread ends,
it return the lease.
This has some implications:
- the IDs are no longer guaranteed to be contiguous
- shapefiles are a bit weird, as they're loaded on the main
thread -- so their lease won't be returned until program
end. This is fine, just pointing it out.
This didn't actually seem to affect runtime that much on my 16 core
machine, but I think it'd help on machines with more cores.
* increase attributestore shards
When scaling to 32+ cores, this shows up as an issue. Try a really
blunt hammer fix.
* read_pbf: less lock contention on status
`std::cout` has some internal locks -- instead, let's synchronize
explicitly outside of it so we control the contention.
If a worker fails to get the lock, just skip that worker's update.
* tile_worker: do syscall 1x/thread, not 1x/tile
* tilemaker: avoid lock contention on status update
If a worker can't get the lock, just skip their update.
* Revert "don't eagerly call way:IsClosed()"
This reverts commit 3e7b9b62d1.
This commit came about from some experiments that I had done
pre-SortedNodeStore.
In that world, lazily evaluating the nodes of a way provided a
meaningful savings if the way was ultimately discarded by the Lua
code.
Post-SortedNodeStore, it doesn't seem to matter as much. Which is great,
as it means the store is much faster, but also means this commit is
just noise.
You can see the POC code in https://github.com/cldellow/tilemaker/tree/lazy-way-nodes
* update ifdef guard, add comments
* lazy way geometries
Tilemaker previously stored the 2D geometries it produced from ways.
This commit makes Tilemaker use the OSM way store to generate linestrings
and polygons that originated with an OSM way. You can get the old
behaviour with `--materialize-geometries`, which is a sensible choice if
you are not memory constrained.
For GB:
before (available via `--materialize-geometries`): 2m11s, 9600MB
this commit: 2m20s, 6400MB
So ~8% slower, but 33% less memory.
I think it's probably reasonable for this to be the default, which has
nice symmetry with compressed nodes and compressed ways being the
default.
Building NA with --store still seems OK - 36min. I was concerned that
the increased node store lookups could be vulnerable to thrashing.
I do see some stuttering during tile writing, but I think the decreased
read iops from a smaller geometry store balance out the increased
read iops from looking up nodes. A future investigation might be to
have SortedWayStore store latplons rather than node IDs -- a bit
more memory, but should be less CPU and less vulnerable to thrashing.
* improve tile coordinate generation
Before writing, we compute the set of tiles to be written.
There were two opportunities for improvement here:
- determining which tiles were covered by our objects: we previously
used a `std::set`, which has poor memory and runtime behaviour.
Instead, use a fixed size `std::vector<bool>` -- this takes 64MB
at z14, but gives much faster insert/query times
- determining if tiles were covered by clipping box: we used
boost's intersect algorithm before, which required constructing
a TileBbox and was a bit slow. In the case where the tile is
contained in a z6 tile that is wholly covered by the clipping
box, we can short-circuit
This has the most impact when the set of objects or tiles is very
large--e.g. Antarctica, North America or bigger.
* SortedNodeStore: only do arena allocations
On a 48-core server, I noticed lock contention on the mmap allocator.
So let's just always use pools of memory, and pick a bigger pool size.
This means we'll sometimes allocate memory that we don't use.
In the extreme case of Monaco, we only need like 200KB, but we'll
allocate several megs.
As you scale to larger PBFs, the waste trends to 0%, so this should
be fine in practice.
* remove TODO
* fix Windows build
D'oh, clock_gettime is Linux-ish. `std::chrono` may have a
cross-platform option, but it's not clear.
For now, just omit this functionality on Windows. If we want to expose
it, we can explore something in std::chrono or make a wrapper that
calls QueryPerformanceCounter on Windows.
* sigh
* fix bounds check
* refactor NodeStore
I'd like to add an alternative NodeStore that can be used when the
`Type_then_ID` property is present in the PBF.
First, a small (?) refactor:
- make `NodeStore` an interface, with two concrete implementations
- extract the NodeStore related things to their own files
- this will cause some churn, as they'll depend on things that also
need to get extracted to their own files. Short term pain, hopefully
long term gain in faster compile times.
Changing the invocations of the functions to be virtual may have impact
on performance. Will need to revisit that before committing to virtual
methods.
* change how work is assigned for ReadPhase::Nodes
Currently, when a worker needs work, it gets the next unprocessed block.
This means blocks are read sequentially at a global level, but from
the perspective of each worker, there are gaps in the blocks they see.
For nodes, we'd prefer to give each worker thread contiguous blocks
from the underlying PBF. This will enable a more efficient storage
for PBFs with the `Sort.Type_then_ID` flag.
* add SortedNodeStore
SortedNodeStore is uesful for PBFs with the `Sort.Type_then_ID`
property, e.g. the planet and Geofabrik exports.
It stores nodes in a hierarchy:
- Level 1 is groups: there are 256K groups
- Level 2 is chunks: each group has 256 chunks
- Level 3 is nodes: each chunk has 256 nodes
This allows us to store 2^34 nodes, with a fixed overhead of
only 2M -- the space required for the level 1 pointers.
Groups and chunks store their data sparsely. If a group has 7 chunks,
it only uses storage for 7 chunks.
On Great Britain's 184M node PBF, it needs ~9.13 bytes per node.
Looking up a node can be done in fixed time:
First, get some offsets:
- Group: `nodeID / 65536`
- Chunk: `(nodeID / 65536) / 256`
- Position within chunk: `nodeID % 256`
For example, Cape Chignecto Provincial Park has ID 4855703, giving:
- Group 74
- Chunk 23
- Offset 151
Group 74's chunks may be sparse. To map chunk 23 to its physical
location, each group has a 256-bit bitmask indicating which
chunks are present.
Use its physical location to get its `chunkOffset`. That allows you
to get to the `ChunkInfo` struct.
From there, do the same thing to get the node data.
This design should also let us do some interesting things down the road,
like efficiently compressing each chunk using something like delta
encoding, zigzag encoding and bit packing. Then, to avoid paying a
decompression cost, we'd likely give each worker a cache of uncompressed
chunks.
* cmake build
* tidy up
* tweak
* tweak
* derp
* mac/windows build
* fix build?
I don't understand why these can't be passed as a copy in the Windows
and Mac builds. Whatever, try passing a reference.
* fix --store
I think nested containers may not be wired up quite correctly.
Instead, manage the char* buffers directly, rather than as
`std::vector<char>`
I'll fixup the other aspects (attributing libpopcnt, picking
Sorted vs BinarySearch on the fly) later
* attribution for libpopcnt
* simplify read_pbf
All read phases use the same striding-over-batches-of-blocks approach.
This required changing how progress is reported, as block IDs are no
longer globally montonically increasing.
Rather than thread the state into ReadBlock, I just adopted 2 atomic
counters for the whole class -- the progress reporter already assumes
that it's the only thing dumping to stdout, so the purity of avoiding
class-global doesn't buy us anything.
* clear allocatedMemory
* use scale factor 16, not 8
D'oh, if you get a full group where each chunk is full, you need to be
able to express a value _ever so slightly_ larger than 65,536.
North America and Europe have examples of this.
Use a scale factor of 16, not 8. This'll mean some chunks have up to 15
wasted bytes, but it's not a huge deal. (And I have some thoughts on how
to claw it back.)
* comment out debug stats
* windows build
* derp
* use SortedNodeStore if PBFs have Sort.Type_then_ID
* add --compress-nodes
If the user passes `--compress-nodes`, we use [streamvbyte](https://github.com/lemire/streamvbyte)
to compress chunks of nodes in memory.
The impact on read time is not much:
- GB with `--compress-nodes`: 1m42s
- without: 1m35s
But the impact on memory is worthwhile, even across very different
extracts:
North America - 5.52 bytes/node vs 8.48 bytes/node
169482 groups, 18364343 chunks, 1757589784 nodes, needed 9706167278 bytes
169482 groups, 18364343 chunks, 1757589784 nodes, needed 14916095182 bytes
Great Britain - 5.97 bytes/node vs 9.25 bytes/node
163074 groups, 4871807 chunks, 184655287 nodes, needed 1104024510 bytes
163074 groups, 4871807 chunks, 184655287 nodes, needed 1708093150 bytes
Nova Scota - 5.81 bytes/node vs 8.7 bytes/node
26777 groups, 157927 chunks, 12104733 nodes, needed 70337950 bytes
26777 groups, 157927 chunks, 12104733 nodes, needed 105367598 bytes
Monaco - 10.43 bytes/node vs 13.52 bytes/node
1196 groups, 2449 chunks, 30477 nodes, needed 318114 bytes
1196 groups, 2449 chunks, 30477 nodes, needed 412258 bytes
* build
* build
* remove __restrict__ to satisfy windows build
* remove debug print, small memory optimization
* use an arena for small groups
* omit needless words
* better short-circuiting for Type-then-ID PBFs
Track metadata about which blocks have nodes, ways and relations.
By default, we assume any block may contain nodes, ways or relations.
If the PBF supports Type-then-ID PBFs, do a binary search to find the first
blocks with ways and relations.
This means ReadPhase::Nodes can stop without scanning ways/relations.
In addition to avoiding needless work, it makes it easier to assign
each worker a balanced amount of work -- now each worker has only
blocks with nodes, which are about the same effort computationally.
It also makes ReadPhase::ScanRelations faster, as it scans exactly the
blocks with relations, skipping the blocks with ways.
Similarly, ReadPhase::Ways is a bit faster, as it doesn't have to read
the blocks with relations.
For North America, this reduces the time to complete the Nodes and
RelationsScan phase from 2m30s to 1m20s.
For GB, it reduces the time from 22s to 9s.
* ReadPhase::Relations - more parallelism
When processing relations for small extracts, there are often fewer
blocks than cores.
Instead, divide the work more granularly, assigning each of the N
threads 1/Nth of the block to process.
This saves 4-5 seconds (which is cumulatively ~20% of runtime) for
the Canadian province of Nova Scotia.
* extract WayStore, BinarySearchWayStore
* stub in SortedWayStore
...it just throws a lot of exceptions at the moment.
* put SortedNodeStore in a namespace
Also replace some `#define`s with `const`s.
I'm likely going to reuse some names in SortedWayStore, so namespacing
to avoid conflicts.
* don't use SortedWayStore if LocationsOnWays present
* stub in insertLatpLons/insertNodes
* change at() to return a non mmap vector
SortedWayStore won't create mmaped vectors, so we need to return the
lowest common denominator.
This pessimizes performance of BinarySearchWayStore, since it'll have
to allocate vectors on demand.
Longer term: it might be better to return an iterator that hides the heavy
lifting.
* begin drawing the rest of the owl
* flesh out types
* add unit test framework
* naive encoding of ways
Checkpointing since I have something that works.
Future optimizations:
- when all high ints are the same, don't encode them
- compression
* more efficient if high ints are all the same
* extract mmap_allocator.cpp
This is needed to unit test the way store without dragging
in osm_store.
* progress on publishGroup
checkpointing, going to extract a populateMask(...) function
* add populateMask function
* finish publishGroup
* SortedWayStore: implement at
* pass node store into SortedWayStore
* fix alignment
* better logs
* way stores should throw std::out_of_range
This is part of the contract, client code will catch it and reject
relations that have missing ways.
* sortednodestore: throw std::out_of_range
* support way compression
* remove dead code, robust against empty ways
* implement clear()
* maybe fix windows build?
very unclear why this is needed, but we seem to be getting C2131 on this
line.
* don't use variable-length arrays on stack
Workaround for MSVC
* avoid more variable-length arrays
* make the other vectors as thread-local
* --no-compress-ways, --no-compress-nodes
* make AttributeStore::get const
I think AttributeStore lives forever, and AttributeSets are immutable
once added to it, so we can avoid the copy.
* use a string pool for AttributeSet keys
There are relatively few unique key values for attributes, e.g.
`kind`, `name`, `admin_level`.
The Shortbread schema has only ~50 or so. I imagine OMT is similar,
but haven't checked.
We generate lots of AttributePairs -- on the order of tens of millions
for GB, and std::string has an overhead of 32 bytes. By using a string
pool and storing only an offset into it, we can save a few hundred MB
of RAM.
* lock-free reads for keys, vector for pairs
This is the groundwork for implementing two future improvements:
- hot/cold pairs: there is a bimodal distribution of attribute
frequency. `landuse=wood`, `tunnel=0` are often duplicated.
`name=Sneed's Seed & Feed` is not.
In the future, we'll try to re-use the "hot" pairs to avoid
paying the cost of an AttributePair for them.
- "short vectors" - similar to the short string optimization,
we should be able to pack up to 6 pairs (3 hot, 3 cold) in
the overhead that a vector would otherwise use.
As it stands, this commit increases memory usage. But we'll claw
a lot of it back, and then some.
* Have a "hot" shard for popular pairs
If a pair looks like it might be re-usable, put it in a special
shard and be able to re-use it.
The special shard is limited to max 64K items, teeing up future
work to have a simple vector for AttributeSets with few pairs.
* treat 0 as a sentinel
* de-dupe all AttributePairs
The stats I was looking at were counting AttributePairs via
AttributeSets, which of course presents a misleading image
of how many duplicate AttributePairs there are, because by
that point, they've already been deduped.
De-duping doesn't add that much runtime overhead--and it could
probably be improved by someone who knows more C++ concurrency
tricks than me.
* store pointers in pairMaps, optimize debug spew
`Tile_Value` is a really memory-expensive object. Since we maintain
long-lived references to the canonical AttributePair, we can store
pointers to save a bit more memory.
Now that value->AttributePairs are guaranteed to be 1:1, we can do our
debug statistics on ints, and translate to pairs only when writing to
stdout.
* use boost::container::flat_map over std::map
Doesn't appreciably affect runtime, saves a bit of memory.
* don't memoize hash function
Now that there is a 1:1 mapping between values and AttributePairs,
it's trivial to compute the hash on demand.
* output_object: avoid Tile_Value temporaries
Also const-ify a few things
* defer creating Tile_Value
Tile_Value is a big union that takes up 96 bytes,
but for our purposes, we're happy with a union of
string, float and bool -- which can be expressed
in 28 bytes. We need a discriminator variable, but
due to alignment, that's free.
I also consider `boost::variant<bool, float, string>`,
but it seemed to take 40 bytes.
I worried that not having a pool of Tile_Values would
affect PBF writing time, but it seems unaffected.
* adjust headers, remove unneeded rng
* any integer 0 <= 25 is eligible for hot pool
This is useful for ranks, which run from 1..25
* Use a small vector optimization for pair indexes
`vector<uint32_t>` takes 24 bytes just to store its internal pointers.
If you actually want to store a `uint32_t` in it, it'll then allocate
some memory on the heap, taking a further 32-64 bytes depending on STL
and malloc implementations.
56-88 bytes! For a single `uint32_t`! Outrageous.
Instead, store references to pair indexes in an array of shorts. If
the pairs don't fit in the array, upgrade it to a vector.
Since we previously arranged for very popular pairs like `amenity=toilets`
to have small indexes, our array of shorts is capable of storing between
4 and 8 pairs before we need to upgade to a vector. Most AttributeSets
will not need to use a vector.
* simplify AttributeKeyStore
* use camelCase
* re-write to avoid static lifetime
AttributeKeyStore/AttributePairStore have the same lifetime as
AttributeStore, so just make them owned by it.
This results in slightly more convoluted code, but avoids having
them floating around as globals.
* reduce lock contention
* Improve TileCoordinates hash function
x ^ y will only use as many bits as max(x, y), but tiles
only use the full 32-bit space at z16, so we're leaving
a lot of the hash space on the table.
* d'oh, avoid looking up the key name needlessly
* change AttributeXyz(...) to be last-written wins
Previously, if you set the same key to different values, it was
not guaranteed that the last value written would win.
* remove misleading comment
* include deque
* include map
* return vector, not set
set seems a bit like overkill - we already know the items are unique,
and the consumer is likely just going to iterate over them
* avoid GNU-specific initializer
also avoid hardcoding 12
* Revert "Improve TileCoordinates hash function"
This reverts commit 7570737715.
Oops, I think this change isn't meaningful, and is a result of me
misreading the original code.
It might still be an improvement to do something like
`hash(x << 16) ^ hash(y)`, since the default TileCoordinate is only 16
bits, but that can be considered independent of this PR.
* remove dead code
* avoid copying AttributePairs
They're long-lived, so pass pointers
* OutputObjects - greatly reduce need for locks
I'm slowly remembering how to write concurrent code...
* AttributeKeyStore: use a TLS cache
This should reduce futex contention significantly. I'll
apply the same change for AttributePairStore's shard 0, then
measure.
* AttributePairStore: reduce lock contention
* ensure atomics are initialized
Per https://stackoverflow.com/questions/36320008/whats-the-default-value-for-a-stdatomic,
they aren't initialized by default. Somewhat surprised this didn't
result in crashes.
* don't store duplicate way geometries
A common pattern is:
```lua
way:Layer("waterway", false)
...
way:Layer("waterway_names", false)
```
Previously, we'd process the geometry twice, and store a second copy of
it in memory. Instead, re-use the previously stored geometry.
This saves another ~1GB of memory for the GB extract. It doesn't
seem to affect runtime - I think we only re-use linestrings, and
linestrings are relatively cheap to do `is_valid` on.
It seems like with the rest of the work on this branch, the
`OutputObjectXyz` classes are very thin -- inspecting `geomType` in
order to construct the right was a bit tedious, so I removed them.
* Add support for a z_order field and sort output objects by z_order
PostGIS import tools like Osm2pgsql and Imposm can write a z_order field
to the database table in order to allow map styles to render features
ordered by road class and vertical layer. Tilemaker now gets a Lua
callback to set the z_order for an OSM object (default 0) and will sort
the vectortile features by z_order. z_order is also taken into account
for combining of features. z_order values are limited to 1-byte unsigned
integer.
* Document new ZOrder callback function
* Drop unnecessary variable
* Implement z_order sorting for the OpenMapTiles example config
This commit also adapts the minimum zoom levels to the lastest version.
* Parallel reading
* Fix win build
* Simplifier correctly init nodes
* Use generator to open input file(s)
* Bit more consistent naming of osm_store files
* Make number of locks in attribute store function of number of threads
* Small optimization, don't use virtual method calls
* Restore nodeListPolygon correct
* Don't drop self intersecting polygons