Commit Graph

246 Commits

Author SHA1 Message Date
systemed 2b25f4ee8f Add per-layer ability to disable multipoints 2024-10-13 22:17:33 +01:00
John Bayly 6ecacf4646 Add optional --quiet flag to tilemaker (#754) 2024-10-13 21:37:08 +01:00
John Bayly a9681849dd Add AllKeys and AllTags Lua functions (#755)
See systemed/tilemaker#748
2024-10-13 21:35:18 +01:00
Colin Dellow fe8399e522 use libdeflate rather than zlib (#769) 2024-10-13 18:03:44 +01:00
Colin Dellow 88b8b6f6b3 be less chatty when run non-interactively (#767) 2024-10-13 15:41:32 +01:00
Colin Dellow 76ef3e8232 don't require any Lua functions (#770) 2024-10-13 15:40:54 +01:00
Colin Dellow e42aaa7516 faster Intersects queries (#765) 2024-10-03 09:06:39 +01:00
Colin Dellow 7f03430456 Fix #750: allow no more than 512 attribute names (#760)
This fixes two issues:

- use an unsigned type, so we can use the whole 9 bits and have 512
  keys, not 256
- fix the bounds check in AttributeKeyStore to reflect the lower
  threshold that was introduced in #618

Hat tip @oobayly for reporting this.
2024-09-21 14:10:28 +01:00
Colin Dellow 6509f0cf50 Boost 186 (#759) 2024-09-20 17:20:48 +01:00
Richard Fairhurst eab08d189a Fix indexing nodes when basezoom>14 (#728) 2024-06-09 12:52:38 +01:00
Richard Fairhurst 1c0638fc45 Fix reading bools from shapefiles (#715) 2024-05-08 17:55:37 +01:00
Colin Dellow ad86ab4a01 remove FAT_TILE_INDEX (#700) 2024-04-01 17:24:47 +01:00
Richard Fairhurst a393cef6f6 Faster polygon combining (#681) 2024-02-17 11:07:14 +00:00
Richard Fairhurst d8fbd52661 Merge pull request #668 from cldellow/geojson-write-points-and-linestrings
GeoJSON writer: support points and linestrings
2024-02-10 21:03:54 +00:00
Colin Dellow 6951cd7346 windows build: uint -> uint8_t 2024-02-06 21:02:59 -05:00
Colin Dellow 93f618574f AttributePair: use uint, not char
Since we only allocate 4 bits and `char` is signed, the usable space
was -8..7. Larger values (like, say, 12) overflow, and get interpreted
as a negative value, which means they don't act as a filter, since all
zoom values are natural numbers.

The tests didn't actually test that a zoom value could be roundtripped.
I updated them, and verified they failed before the code change, and
passed after the code change.

I've also allocated an extra bit so that we support minzooms up to z31,
vs just up to z15, since I think (?) some people generate up to z16.
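
A toy reproduction of the signedness trap (illustrative struct names; whether a plain `char` bit-field is signed is implementation-defined, but it commonly is):

```cpp
#include <cstdio>

struct SignedZoom   { char minzoom : 4; };          // usable range -8..7 when signed
struct UnsignedZoom { unsigned char minzoom : 4; }; // usable range 0..15

int main() {
    SignedZoom bad;
    UnsignedZoom good;
    bad.minzoom = 12;   // overflows the signed field; typically wraps to -4
    good.minzoom = 12;  // fits
    std::printf("%d %u\n", (int)bad.minzoom, (unsigned)good.minzoom); // e.g. "-4 12"
    return 0;
}
```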
2024-02-06 20:54:20 -05:00
Colin Dellow 537b9b1fe4 GeoJSON writer: support points and linestrings 2024-02-04 01:03:31 -05:00
Colin Dellow 5b96711a09 --shard-stores and --compact aren't compatible
CompactNodeStore doesn't know how to compute if it contains a node,
which is a prerequisite for sharding.

The two settings don't make much sense together: sharding will create N
CompactNodeStores, each of which will take as much memory as a single one,
since each will likely have a large node ID.

This differs from BinarySearchNodeStore and SortedNodeStore, where each of
the N store instances will take roughly 1/N memory.

Instead:

- fail faster and more clearly by throwing if CompactNodeStore.contains is
  called
- don't enable sharding if --compact is passed
2024-01-26 20:15:34 -05:00
Colin Dellow d3de0fb4b1 warn when --compact used on non-renumbered PBF
A future commit will make it safe to do so, but it's still
memory-inefficient to use --compact on a non-renumbered PBF
2024-01-26 19:56:51 -05:00
Colin Dellow ac37ce3faf parallelize geojsonl reading
For a ~300MB geojson file of mine, this decreases wall clock time from
2.5s to 1s.
2024-01-21 12:46:46 -05:00
Colin Dellow 5b8f009a1e support GeoJSON lines format
Described at https://stevage.github.io/ndgeojson/; each feature is given
its own line, rather than being wrapped in a FeatureCollection.
2024-01-20 11:05:00 -05:00
Colin Dellow dedbfe7db2 Centroid(...) should return nil on error
Previously, calling Centroid(...) on an invalid geometry (such as
https://www.openstreetmap.org/relation/9769005, which I think gets
simplified to having 0 rings) would throw, killing the lua process.

Instead, return nil.
2024-01-18 18:55:10 -05:00
systemed 4230c3db2d Simplify runtime options
This changes the default to lazy geometries for both in-memory and on-disk.
--fast selects materialized geometries when running in memory, and unsharded
stores when running on disk.
2024-01-14 23:20:13 +00:00
Colin Dellow 7c8abef5e2 V3 post relation scan (#642) 2024-01-14 17:34:41 +00:00
Colin Dellow 5511d7680a fixup LayerAsCentroid(...)
Eep, two fixes here as well:

- I had rejigged how the skipping of LayerAsCentroid's algorithm
  argument worked; this rejigging ultimately broke it entirely, as `i`
  would never get incremented.

- If `way_keys` is provided, we are no longer guaranteed that we'll have
  stored the `label` node of the relation
2024-01-12 00:04:02 -05:00
Colin Dellow b77340c870 remove PossiblyKnownTagValue
When I replaced #604 with #626, I botched extracting this part of the
code. I had the trait, which taught kaguya how to serialize
`PossiblyKnownTagValue`, but I missed updating the parameter type
of `Attribute` to actually use it, so it was a no-op.

This PR restores the behaviour of avoiding string copies, but now that
we have protozero's data_view class, we can use that rather than
our own weirdo struct.
2024-01-11 23:29:50 -05:00
systemed 9080cdb10e Merge branch 'way_keys' into v3 2024-01-10 18:01:27 +00:00
systemed 4051605a49 Remove checks for obsolete Boost versions 2024-01-10 17:32:53 +00:00
systemed 73ac9ce1fe Remove Mapsplit support 2024-01-10 17:00:04 +00:00
systemed e768595c4b Merge branch 'lua_interop' into v3 2024-01-10 15:44:09 +00:00
Colin Dellow 3ff08a3bc3 Intersects: faster queries for negative case (#635) 2024-01-08 20:01:01 +00:00
Colin Dellow abb2e95f83 Merge remote-tracking branch 'origin/master' into lua-interop-3 2024-01-07 13:51:46 -05:00
Colin Dellow a600524c90 extend NextRelation/FindInRelation to nodes (#632) 2024-01-07 16:51:08 +00:00
Richard Fairhurst 65829e48cd GeoJSON as alternative to shapefiles (#630) 2024-01-01 23:08:08 +00:00
Colin Dellow 2bb131b5c4 run Docker build on PRs (#627) 2023-12-30 13:43:35 +00:00
Colin Dellow 3c1740ad4d generalize node_keys; add way_keys
This PR generalizes the idea of `node_keys`, adds `way_keys`, and fixes #402.

I'm not too sure if this is generally useful - it's useful for one of my
use cases, and I see someone asking about it in https://github.com/systemed/tilemaker/issues/190
and, elsewhere, in https://github.com/onthegomap/planetiler/issues/99

If you feel it complicates the maintainer story too much, please reject.

The goal is to reduce memory usage for users doing thematic extracts by
not indexing nodes that are only used by uninteresting ways.

For example, North America has ~1.8B nodes, needing 9.7GB of RAM for its node
store. By contrast, if your interest is only to build a railway map, you
require only ~8M nodes, needing 70MB of RAM. Or, to build a map of
national/provincial parks, 12M nodes and ~120MB of RAM.

Currently, a user can achieve this by pre-filtering their PBF using
osmium-tool. If you know exactly what you want, this is a good
long-term solution. But if you're me, flailing about in the OSM data
model, it's convenient to be able to tweak something in the Lua script
and observe the results without having to re-filter the PBF and update
your tilemaker command to use the new PBF.

Sample use cases:

```lua
-- Building a map without building polygons; `~` excludes ways whose
-- only tags are matched by the filter.
way_keys = {"~building"}
```

```lua
-- Building a railway map
way_keys = {"railway"}
```

```lua
-- Building a map of major roads
way_keys = {"highway=motorway", "highway=trunk", "highway=primary", "highway=secondary"}`
```
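
These entries could plausibly be parsed and matched along these lines (a hedged C++ sketch with invented names, not the actual implementation):

```cpp
#include <string>

// One way_keys entry: "key", "key=value", or "~key" (a veto).
enum class FilterKind { AcceptKey, AcceptKeyValue, RejectKey };

struct TagFilter {
    FilterKind kind;
    std::string key, value;

    static TagFilter parse(const std::string& entry) {
        if (!entry.empty() && entry[0] == '~')
            return {FilterKind::RejectKey, entry.substr(1), ""};
        auto eq = entry.find('=');
        if (eq == std::string::npos)
            return {FilterKind::AcceptKey, entry, ""};
        return {FilterKind::AcceptKeyValue, entry.substr(0, eq), entry.substr(eq + 1)};
    }

    bool matches(const std::string& k, const std::string& v) const {
        switch (kind) {
            case FilterKind::AcceptKey:      return k == key;
            case FilterKind::AcceptKeyValue: return k == key && v == value;
            case FilterKind::RejectKey:      return k == key; // a match here vetoes the way
        }
        return false;
    }
};
```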

Nodes used in ways which are used in relations (as identified by
`relation_scan_function`) will always be indexed, regardless of
`node_keys` and `way_keys` settings that might exclude them.

A concrete example, given a Lua script like:

```lua
function way_function()
  if Find("railway") ~= "" then
    Layer("lines", false)
  end
end
```

it takes 13GB of RAM and 100 seconds to process North America.

If you add:

```lua
way_keys = {"railway"}
```

It takes 2GB of RAM and 47 seconds.

Notes:

1. This is based on `lua-interop-3`, as it interacts with files that are
   changed by that. I can rebase against master after lua-interop-3 is
   merged.

2. The names `node_keys` and `way_keys` are perhaps out of date, as they
   can now express conditions on the values of tags in addition to their
   keys. Leaving them as-is is nice, as it's not a breaking change.
   But if breaking changes are OK, maybe these should be
   `node_filters` and `way_filters` ?

3. Maybe the value for `node_keys` in the OMT profile should be
   expressed in terms of a negation, e.g. `node_keys = {"~created_by"}`?
   This would avoid issues like https://github.com/systemed/tilemaker/issues/337

4. This also adds a SIGUSR1 handler during OSM processing, which prints
   the ID of the object currently being processed. This is helpful for
   tracking down slow geometries.
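
   For illustration, such a handler might look like this (hedged sketch, POSIX-only, invented names; a strictly conforming handler would avoid snprintf):

```cpp
#include <atomic>
#include <csignal>
#include <cstdint>
#include <cstdio>
#include <unistd.h>

std::atomic<uint64_t> currentId{0}; // workers store each object's ID as they start it

void onUsr1(int) {
    // snprintf is not formally async-signal-safe; acceptable for a debug aid.
    char buf[64];
    int n = std::snprintf(buf, sizeof(buf), "processing id %llu\n",
                          (unsigned long long)currentId.load());
    if (n > 0) {
        ssize_t unused = write(STDERR_FILENO, buf, (size_t)n);
        (void)unused;
    }
}

int main() {
    std::signal(SIGUSR1, onUsr1);
    // ... processing loop calls currentId.store(id) for each object ...
}
```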
2023-12-29 18:02:11 -05:00
Colin Dellow 6ba38b056d buffer objects when object index contended 2023-12-28 16:43:16 -05:00
Colin Dellow 515a0211e0 add thread-local cache for attributepairs 2023-12-28 16:43:16 -05:00
Colin Dellow f6807c4a2c move duplicate attribute handling outside of locks 2023-12-28 16:43:16 -05:00
Colin Dellow c87497dfa2 RelationScanStore: more granular locks
On a 48-core machine, this phase currently achieves only 400% CPU usage,
I think due to these locks
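
"More granular" here presumably means lock striping by relation ID, roughly like this hedged sketch (invented names):

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

using RelationID = uint64_t;
using WayID = uint64_t;

class StripedRelationStore {
    static const size_t Stripes = 256;
    std::mutex locks[Stripes];
    std::map<RelationID, std::vector<WayID>> members[Stripes];

public:
    void add(RelationID rel, WayID way) {
        size_t s = rel % Stripes; // writers collide only within one stripe
        std::lock_guard<std::mutex> guard(locks[s]);
        members[s][rel].push_back(way);
    }
};
```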
2023-12-28 16:43:16 -05:00
Colin Dellow 89f43ea7f3 try to avoid lock contention on AttributeStore
On a 48-core machine, I still see lots of lock contention.
AttributeStore:add is one place.

Add a thread-local cache that can be consulted without taking the shared
lock. The intuition here is that there are 1.3B objects, and 40M
attribute sets. Thus, on average, an attribute set is reused 32 times.

However, average is probably misleading -- the distribution is likely not
uniform, e.g. the median attribute set is probably reused 1-2 times, and
some exceptional attribute sets (e.g. `natural=tree`) are reused thousands of times.

For GB on a 16-core machine, this avoids 27M of 36M locks.
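
In outline (a hedged sketch with invented names, not the actual code):

```cpp
#include <cstdint>
#include <map>
#include <mutex>

using SetHash = uint64_t;  // illustrative: a hash identifying an attribute set
using SetIndex = uint32_t;

std::mutex sharedLock;
std::map<SetHash, SetIndex> sharedIndex;

SetIndex addShared(SetHash h) {
    std::lock_guard<std::mutex> guard(sharedLock); // the contended path
    return sharedIndex.emplace(h, (SetIndex)sharedIndex.size()).first->second;
}

SetIndex add(SetHash h) {
    // Hot sets (e.g. natural=tree) hit this map and never touch the shared lock.
    thread_local std::map<SetHash, SetIndex> cache;
    auto it = cache.find(h);
    if (it != cache.end()) return it->second;
    SetIndex idx = addShared(h);
    cache.emplace(h, idx);
    return idx;
}
```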
2023-12-28 16:43:16 -05:00
Colin Dellow c4518f3cca faster tag map, faster Find()/Holds(), avoid mallocs
Cherry-picked from
https://github.com/systemed/tilemaker/pull/604/commits/b3221667a9d2366410dbfdc7f25f3062d7a135ef,
https://github.com/systemed/tilemaker/pull/604/commits/5c807a9841b866c6dc403141effd4c9d14459034,
https://github.com/systemed/tilemaker/pull/604/commits/13b3465f1c80052aa2d622e3915af08b8c5eae9a
and fixed up to work with protozero's data_view structure.

Original commit messages follow; the timings will vary but the idea is
the same:

Faster tagmap
=====

Building a std::map for tags is somewhat expensive, especially when
we know that the number of tags is usually quite small.

Instead, use a custom structure that does a crappy-but-fast hash
to put the keys/values in one of 16 buckets, then linear search
the bucket.
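
The shape of the structure, as a hedged sketch (illustrative only, including the deliberately cheap hash):

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

struct TagMap {
    struct Entry { std::string_view key, value; };
    std::array<std::vector<Entry>, 16> buckets;

    // Deliberately crude: touch as few bytes as possible (see the note below
    // about per-character hashing eating the savings).
    static size_t bucketOf(std::string_view key) {
        size_t h = key.size();
        if (!key.empty()) h += (unsigned char)key.front() + (unsigned char)key.back();
        return h & 15;
    }

    void add(std::string_view key, std::string_view value) {
        buckets[bucketOf(key)].push_back({key, value});
    }

    std::string_view find(std::string_view key) const {
        for (const Entry& e : buckets[bucketOf(key)]) // linear scan of a tiny bucket
            if (e.key == key) return e.value;
        return {};
    }
};
```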

For GB, before:
```
real 1m11.507s
user 16m49.604s
sys 0m17.381s
```

After:
```
real	1m9.557s
user	16m28.826s
sys	0m17.937s
```

Saving 2 seconds of wall clock and 20 seconds of user time doesn't
seem like much, but (a) it's not nothing and (b) having the tags
in this format will enable us to thwart some of Lua's defensive
copies in a subsequent commit.

A note about the hash function: hashing each letter of the string
using boost::hash_combine eliminated the time savings.

Faster Find()/Holds()
=====

We (ab?)use kaguya's parameter serialization machinery. Rather than
take a `std::string`, we take a `KnownTagKey` and teach Lua how to
convert a Lua string into a `KnownTagKey`.

This avoids the need to do a defensive copy of the string when coming
from Lua.

It provides a modest boost:

```
real 1m8.859s
user 16m13.292s
sys 0m18.104s
```

Most keys are short enough to fit in the small-string optimization, so
this doesn't help us avoid mallocs. An exception is `addr:housenumber`,
which, at 16 bytes, exceeds g++'s limit of 15 bytes.

It should be possible to also apply a similar trick to the `Attribute(...)`
functions, to avoid defensive copies of strings that we've seen as keys
or values.

avoid malloc for Attribute with long strings
=====

After:

```
real	1m8.124s
user	16m6.620s
sys	0m16.808s
```

Looks like we're solidly into diminishing returns at this point.
2023-12-28 16:43:16 -05:00
Colin Dellow ae1981b0f0 use vtzero instead of libprotobuf (#625) 2023-12-28 21:31:01 +00:00
Colin Dellow 12ed2414d9 use protozero (#623) 2023-12-28 20:07:50 +00:00
Colin Dellow d62c480deb be able to render the planet with 32gb of RAM (#618)
* move OutputObjects to mmap store

For the planet, we need 1.3B output objects, 12 bytes per, so ~15GB
of RAM.

* treat objects at low zoom specially

For GB, ~0.3% of objects are visible at low zooms.

I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.

This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.

* make more explicit that this is unexpected

* extend --materialize-geometries to nodes

For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.

This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet if NA has similar characteristics to the planet.

Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.

This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits, we need 34
to store the Node ID, so that leaves us with 1.5 bits => can divide into
3 stores.

Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.

* rejig AttributePair layout

On g++, this reduces the size from 48 bytes to 34 bytes.

There aren't _that_ many attribute pairs, even on the planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box

* fix initialization order warning

* add PooledString

Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.

The key benefit: 16 bytes, not 32 bytes (g++) or 24 bytes (clang).

When it does allocate (for strings longer than 15 bytes), it allocates
from a pool so there's less per-allocation overhead.
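
The core idea, as a hedged sketch (the real class presumably allocates pool space in large blocks; this simplistic version just shows a possible 16-byte layout):

```cpp
#include <cstdint>
#include <cstring>
#include <deque>
#include <string_view>
#include <vector>

std::deque<std::vector<char>> pool; // simplistic backing pool

struct PooledString {
    // Either 15 inline bytes + a size byte, or a flagged (block, size) reference.
    uint8_t data[16];

    PooledString(std::string_view s) {
        if (s.size() <= 15) {
            data[0] = (uint8_t)s.size(); // small-string path: no allocation
            std::memcpy(data + 1, s.data(), s.size());
        } else {
            data[0] = 0x80; // flag: stored in the pool
            pool.emplace_back(s.begin(), s.end());
            uint32_t block = (uint32_t)(pool.size() - 1);
            uint32_t size = (uint32_t)s.size();
            std::memcpy(data + 1, &block, 4);
            std::memcpy(data + 5, &size, 4);
        }
    }

    std::string_view view() const {
        if (!(data[0] & 0x80)) return {(const char*)data + 1, data[0]};
        uint32_t block, size;
        std::memcpy(&block, data + 1, 4);
        std::memcpy(&size, data + 5, 4);
        return {pool[block].data(), size};
    }
};
```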

* add tests for attribute store

...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things

* rejig isHot

Break dependency on AttributePair, just work on std::string

* teach PooledString to work with std::string

...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.

* use PooledString in AttributePair

AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there's probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.

Runtime doesn't seem affected.

There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15 byte short string optimization.

Of the remaining 60%, many are Latin-alphabet title cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.

Even in the most optimistic case where:

- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (= 16M)
- of these, we can make 100% fit in SSO

...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.

* log timings

When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.

This also might be useful in detecting perf regressions more granularly.

* AppendVector: an append-only chunked vector

When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.

However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.

Instead, use our own container that tries to get the best of both worlds.

Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator implementation may not get exercised in
practice.
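
The container's core, as a hedged sketch (iterators omitted; chunk size illustrative):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, size_t ChunkSize = 8192>
class AppendVector {
    // Chunks are separately allocated, so growth never moves existing elements,
    // and each large chunk keeps same-region data together on OS pages.
    std::vector<std::unique_ptr<std::vector<T>>> chunks;
    size_t count = 0;

public:
    void push_back(const T& value) {
        if (count % ChunkSize == 0) {
            chunks.push_back(std::make_unique<std::vector<T>>());
            chunks.back()->reserve(ChunkSize);
        }
        chunks.back()->push_back(value);
        ++count;
    }
    T& operator[](size_t i) { return (*chunks[i / ChunkSize])[i % ChunkSize]; }
    size_t size() const { return count; }
};
```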

* fix progress when --store present

* mutex on RelationScan progress output

* make NodeStore/WayStore shardable

This adds three methods to the stores:

- `shard()` returns which shard you are
- `shards()` returns how many shards total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X

SortedNodeStore/SortedWayStore are not implemented yet, that'll come in
a future commit.

This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.

Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.
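
The added surface might look roughly like this (hedged; the actual signatures may differ):

```cpp
#include <cstddef>
#include <cstdint>

using NodeID = uint64_t;

class NodeStore {
public:
    virtual ~NodeStore() = default;
    virtual size_t shard() const = 0;  // which shard this store instance is
    virtual size_t shards() const = 0; // how many shards exist in total
    // does shard N hold an item with this id?
    virtual bool contains(size_t shard, NodeID id) const = 0;
};
```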

* add minimal SortedNodeStore test

I'm going to rejig the innards of this class, so let's have some tests.

* stop using internal linkage for atomics

In order to shard the stores, we need to have multiple instances
of the class.

Two things block this currently: atomics at file-level, and
thread-locals.

Moving the atomics to the class is easy.

Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.

* SortedNodeStore: abstract TLS behind storage()

Still only supports 1 class, but this is a step along the path.

* SortedWayStore: abstract TLS behind storage()

* SortedNodeStore: support multiple instances

* SortedWayStorage: support multiple instances

* actually fix the low zoom object collection

D'oh, this "worked" due to two bugs cancelling each other:

(a) the code to find things in the low zoom list never found anything,
    because it assumed a base z6 tile of 0/0

(b) we weren't returning early, so the normal code still ran

Rejigged to actually do what I was intending

* AppendVector tweaks

* more low zoom fixes

* implement SortedNodeStore::contains

* implement SortedWayStore::contains

* use TileCoordinatesSet

* faster covered tile enumeration

Do a single pass, rather than one pass per zoom.

* add ShardedNodeStore

This distributes nodes into one of 8 shards, trying to roughly group
parts of the globe by complexity.

This should help with locality when writing tiles.

A future commit will add a ShardedWayStore and teach read_pbf to read in
a locality-aware manner, which should help when reading ways.

* add ShardedWayStore

Add `--shard-stores` flag.

It's not clear yet this'll be a win, will need to benchmark.

The cost of reading the PBF blocks repeatedly is a bit higher than I was
expecting. It might be worth seeing if we can index the blocks to skip
fruitless reads.

* fewer, more balanced shards

* skip ReadPhase::Ways passes if node store is empty

* support multiple passes for ReadPhase::Relations

* fix check for first way

* adjust shards

With this distribution, no node shard is more than ~8.5GB.

* Relations: fix effectiveShards > 1 check

Oops, bug that very moderately affected performance in the non
`--shard-stores` case

* extend --materialize-geometries to LayerAsCentroid

It turns out that about 20% of LayerAsCentroid calls are for nodes,
which this branch could already do.

The remaining calls are predominantly ways, e.g. housenumbers.

We always materialize relation centroids, as they're expensive to
compute.

In GB, this saves about 6.4M points, ~102MB. Scaled to the planet, it's
perhaps a 4.5GB savings, which should let us use a more aggressive shard
strategy.

It seems to add 3-4 seconds to the time to process GB.

* add `DequeMap`, change AttributeStore to use it

This implements the idea in https://github.com/systemed/tilemaker/issues/622#issuecomment-1866813888

Rather than storing a `deque<T>` and a `flat_map<T*, uint32_t>`,
store a `deque<T>` and `vector<uint32_t>`, to save 8 bytes per
AttributePair and AttributeSet.
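
The shape of the idea, as a hedged sketch (the real DequeMap likely differs; note the O(n) insert into the index vector):

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

template <typename T>
class DequeMap {
    std::deque<T> items;          // stable storage; index = insertion order
    std::vector<uint32_t> sorted; // indices into items, ordered by value

public:
    // Returns the index of value, inserting it if not already present.
    uint32_t add(const T& value) {
        auto it = std::lower_bound(sorted.begin(), sorted.end(), value,
            [&](uint32_t idx, const T& v) { return items[idx] < v; });
        if (it != sorted.end() && items[*it] == value) return *it;
        uint32_t idx = (uint32_t)items.size();
        items.push_back(value);
        sorted.insert(it, idx);
        return idx;
    }
    const T& operator[](uint32_t idx) const { return items[idx]; }
};
```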

* capture s(this)

Seems to save ~1.5 seconds on GB

* fix warning

* fix warning, really

* fewer shards

Shard 1 (North America) is ~4.8GB of nodes, shard 4 (some of Europe) is
3.7GB. Even ignoring the memory savings in the recent commits, these
could be merged.

* extract option parsing to own file

We'd like to have different defaults based on whether `--store` is
present. Now that option parsing will have some more complex logic,
let's pull it into its own class so it can be more easily tested.

* use sensible defaults based on presence of --store

* improve test coverage

* fixes

* update number of shards to 6

This has no performance impact as we never put anything in the 7th
shard, and so we skip doing the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phases.

The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.

Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb

The new peak memory is ~12.2GB.

I am a little perplexed -- the runtime on a 16-core server was
previously:

```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s
```

But with the most recent commits on this branch, it was:

```
real	118m50.098s
user	1531m13.026s
sys	34m7.252s
```

This is incredibly suspicious. I also tried re-running commit
bbf0957c1e, and got:

```
real	123m15.534s
user	1546m25.196s
sys	38m17.093s
```

...so I can't explain why the earlier runs took 195 min.

Ideas:

- the planet changed between runs, and a horribly broken geometry was
  fixed

- Vultr gives quite different machines for the same class of server

- perhaps most likely: I failed to click "CPU-optimized" when picking
  the earlier server, and got a slow machine the first time, and a fast
  machine the second time. I'm pretty sure I paid the same $, so I'm
  not sure I believe this.

I don't think I really believe that a 33% reduction in runtime is
explained by any of those, though. Anyway, just another thing to
be befuddled by.

* --store uses lazy geometries; permit overriding

I did some experiments on a Hetzner 48-core box with 192GB of RAM:

--store, materialize geometries:
real 65m34.327s
user 2297m50.204s
sys 65m0.901s

The process often failed to use 100% of CPU--if you naively divide
user+sys/real you get ~36, whereas the ideal would be ~48.

Looking at stack traces, it seemed to coincide with calls to Boost's
rbtree_best_fit allocator.

Maybe:

- we're doing disk I/O, and it's just slower than recomputing the geometries
- we're using the Boost mmap library suboptimally -- maybe there's
  some other allocator we could be using. I think we use the mmap
  allocator like a simple bump allocator, so I don't know why we'd need
  a red-black tree

--store, lazy geometries:
real 55m33.979s
user 2386m27.294s
sys 23m58.973s

Faster, but still some overhead (user+sys/real => ~43)

no --store, materialize geometries: OOM

no --store, lazy geometries (used 175GB):
real 51m27.779s
user 2306m25.309s
sys 16m34.289s

This was almost 100% CPU (user+sys/real => ~45)

From this, I infer:

- `--store` should always default to lazy geometries in order to
  minimize the I/O burden

- `--materialize-geometries` is a good default for non-store usage,
  but it's still useful to be able to override and use lazy geometries,
  if it then means you can fit the data entirely in memory
2023-12-28 15:23:35 +00:00
Richard Fairhurst 5acee418ba PMTiles support (#620) 2023-12-22 10:45:05 +00:00
Colin Dellow 52b62dfbd5 some memory and concurrency improvements (#612)
* extract ClipCache to own file

Some housekeeping: extract clip_cache.cpp

* templatize ClipCache, apply to MultiLineStrings

This provides a very small benefit. I think the reason is two-fold:
there aren't many multilinestrings (relative to multipolygons), and
clipping them is less expensive.

Still, it did seem to provide a small boost, so leaving it in.

* housekeeping: move test, minunit

* --log-tile-timings: verbose timing logs

This isn't super useful to end users, but is useful for developers.

If it's not OK to leave it in, let me know & I'll revert it.

You can then process the log:

```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt  | grep ' took ' | sort -nk3  | grep z$x/ | awk 'BEGIN { min = 999999; max = 0;  }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done

z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```

This shows each zoom's # of tiles, total time, average time, worst case
time (and the tile that caused it).

In general, lower zooms are slower than higher zooms. This seems
intuitively reasonable, as the lower zoom often contains all of
the objects in the higher zoom.

I would have guessed that a lower zoom would cost 4x the next higher
zoom on a per-tile basis. That's sort of the case for `z12->z13`,
`z11->z12`, `z10->z11`, and `z9->z10`. But not so for other zooms,
where it's more like a 2x cost.

Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile.
This is probably because `water` has a minzoom of 6.

This all makes me think that the next big gain will be from re-using
simplifications.

This is sort of the mirror image of the clip cache:

- the clip cache avoids redundant clipping, and needs to be computed
  from lower zooms to higher zooms

- a simplification cache could make simplifying cheaper, but needs to
  be computed from higher zooms to lower zooms

The simplification cache also has two other wrinkles:

1. Is it even valid? e.g. is `simplify(object, 4)` the same as
   `simplify(simplify(object, 2), 2)` ? Maybe it doesn't have to be the
   same, because users are already accepting that we're losing accuracy
   when we simplify.

2. Rendering an object at `z - 1` needs to (potentially) stitch together
   that object from 4 tiles at `z`. If those have each been simplified,
   we may introduce odd seams where the terminal points don't line up.

* more, smaller caches; run destructors outside lock

* use explicit types

* don't populate unnecessary vectors

* reserve vectors appropriately

* don't eagerly call way:IsClosed()

This saves a very little bit of time, but more importantly, tees up
lazily evaluating the nodes in a way.

* remove locks from geometry stores

Rather than locking on each store call, threads lease a range of the
ID space for points/lines/multilines/polygons. When the thread ends,
it returns the lease.

This has some implications:

- the IDs are no longer guaranteed to be contiguous

- shapefiles are a bit weird, as they're loaded on the main
  thread -- so their lease won't be returned until program
  end. This is fine, just pointing it out.

This didn't actually seem to affect runtime that much on my 16 core
machine, but I think it'd help on machines with more cores.
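
Leasing might look like this hedged sketch (invented names; lease size illustrative):

```cpp
#include <cstdint>
#include <mutex>

class IdAllocator {
    std::mutex lock;
    uint64_t next = 0;

public:
    static constexpr uint64_t LeaseSize = 10000;

    // Take the lock once per lease, not once per stored geometry.
    uint64_t lease() {
        std::lock_guard<std::mutex> guard(lock);
        uint64_t start = next;
        next += LeaseSize;
        return start;
    }
};

thread_local uint64_t leaseStart = 0;
thread_local uint64_t leaseUsed = IdAllocator::LeaseSize;

uint64_t nextGeometryId(IdAllocator& alloc) {
    if (leaseUsed == IdAllocator::LeaseSize) { // current lease exhausted
        leaseStart = alloc.lease();
        leaseUsed = 0;
    }
    return leaseStart + leaseUsed++; // hence IDs are not globally contiguous
}
```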

* increase attributestore shards

When scaling to 32+ cores, this shows up as an issue. Try a really
blunt hammer fix.

* read_pbf: less lock contention on status

`std::cout` has some internal locks -- instead, let's synchronize
explicitly outside of it so we control the contention.

If a worker fails to get the lock, just skip that worker's update.
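
The pattern is try-lock-or-skip, as in this hedged sketch:

```cpp
#include <cstdint>
#include <iostream>
#include <mutex>

std::mutex statusLock;

void maybeReportProgress(uint64_t done, uint64_t total) {
    std::unique_lock<std::mutex> guard(statusLock, std::try_to_lock);
    if (!guard.owns_lock()) return; // another worker is printing; don't wait
    std::cout << "\rBlocks: " << done << "/" << total << std::flush;
}
```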

* tile_worker: do syscall 1x/thread, not 1x/tile

* tilemaker: avoid lock contention on status update

If a worker can't get the lock, just skip their update.

* Revert "don't eagerly call way:IsClosed()"

This reverts commit 3e7b9b62d1.

This commit came about from some experiments that I had done
pre-SortedNodeStore.

In that world, lazily evaluating the nodes of a way provided a
meaningful savings if the way was ultimately discarded by the Lua
code.

Post-SortedNodeStore, it doesn't seem to matter as much. Which is great,
as it means the store is much faster, but also means this commit is
just noise.

You can see the POC code in https://github.com/cldellow/tilemaker/tree/lazy-way-nodes

* update ifdef guard, add comments

* lazy way geometries

Tilemaker previously stored the 2D geometries it produced from ways.

This commit makes Tilemaker use the OSM way store to generate linestrings
and polygons that originated with an OSM way. You can get the old
behaviour with `--materialize-geometries`, which is a sensible choice if
you are not memory constrained.

For GB:

before (available via `--materialize-geometries`): 2m11s, 9600MB
this commit:  2m20s, 6400MB

So ~8% slower, but 33% less memory.

I think it's probably reasonable for this to be the default, which has
nice symmetry with compressed nodes and compressed ways being the
default.

Building NA with --store still seems OK - 36min. I was concerned that
the increased node store lookups could be vulnerable to thrashing.
I do see some stuttering during tile writing, but I think the decreased
read iops from a smaller geometry store balance out the increased
read iops from looking up nodes. A future investigation might be to
have SortedWayStore store latplons rather than node IDs -- a bit
more memory, but should be less CPU and less vulnerable to thrashing.

* improve tile coordinate generation

Before writing, we compute the set of tiles to be written.

There were two opportunities for improvement here:

- determining which tiles were covered by our objects: we previously
  used a `std::set`, which has poor memory and runtime behaviour.
  Instead, use a fixed size `std::vector<bool>` -- this takes 64MB
  at z14, but gives much faster insert/query times

- determining if tiles were covered by clipping box: we used
  boost's intersect algorithm before, which required constructing
  a TileBbox and was a bit slow. In the case where the tile is
  contained in a z6 tile that is wholly covered by the clipping
  box, we can short-circuit

This has the most impact when the set of objects or tiles is very
large--e.g. Antarctica, North America or bigger.
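
The bitmap half of this, as a hedged sketch for a single zoom level:

```cpp
#include <cstdint>
#include <vector>

class TileCoordinatesSet {
    uint32_t zoom;
    std::vector<bool> bits; // fixed size: one bit per tile at this zoom

public:
    explicit TileCoordinatesSet(uint32_t z) : zoom(z), bits(1ull << (2 * z)) {}
    bool test(uint32_t x, uint32_t y) const { return bits[((uint64_t)x << zoom) | y]; }
    void set(uint32_t x, uint32_t y) { bits[((uint64_t)x << zoom) | y] = true; }
};
```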

* SortedNodeStore: only do arena allocations

On a 48-core server, I noticed lock contention on the mmap allocator.

So let's just always use pools of memory, and pick a bigger pool size.

This means we'll sometimes allocate memory that we don't use.

In the extreme case of Monaco, we only need like 200KB, but we'll
allocate several megs.

As you scale to larger PBFs, the waste trends to 0%, so this should
be fine in practice.

* remove TODO

* fix Windows build

D'oh, clock_gettime is Linux-ish. `std::chrono` may have a
cross-platform option, but it's not clear.

For now, just omit this functionality on Windows. If we want to expose
it, we can explore something in std::chrono or make a wrapper that
calls QueryPerformanceCounter on Windows.
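
For reference, a portable timing sketch with `std::chrono::steady_clock`, which does exist on Windows:

```cpp
#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();
    // ... do the work being timed, e.g. write one tile ...
    double ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "took " << ms << " ms" << std::endl;
}
```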

* sigh

* fix bounds check
2023-12-15 17:04:46 +00:00
Richard Fairhurst 6940f98346 Make shapefile processing multi-threaded (#614) 2023-12-15 16:24:56 +00:00
Richard Fairhurst 34c356a9f5 Coastline tweaks (#611) 2023-12-11 20:38:04 +00:00
Richard Fairhurst b2785690d2 GeoJSON writer for debugging (#609) 2023-12-09 21:38:39 +00:00