Mirror of https://github.com/systemed/tilemaker.git, synced 2026-05-06 16:30:00 -04:00 (branch: master, 1 commit)

Commit 52b62dfbd5: some memory and concurrency improvements (#612)
* extract ClipCache to own file
Some housekeeping: extract clip_cache.cpp
* templatize ClipCache, apply to MultiLineStrings
This provides a very small benefit. I think the reason is twofold:
there aren't many multilinestrings (relative to multipolygons), and
clipping them is less expensive.
Still, it did seem to provide a small boost, so I'm leaving it in.
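As a rough illustration of the templatization, here is a minimal sketch of a clip cache parameterized on the geometry type, so one code path can serve both multipolygons and multilinestrings. All names and the exact key shape are my assumptions for illustration; the real implementation lives in clip_cache.cpp:
```cpp
// Hypothetical sketch only; not tilemaker's actual ClipCache API.
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <tuple>

template <typename Geometry>
class ClipCache {
public:
	// Look up the clipped geometry for object `id` in tile (z, x, y);
	// returns nullptr on a cache miss.
	std::shared_ptr<const Geometry> get(uint64_t id, uint16_t z, uint32_t x, uint32_t y) const {
		std::lock_guard<std::mutex> lock(mutex_);
		auto it = cache_.find(Key{id, z, x, y});
		return it == cache_.end() ? nullptr : it->second;
	}

	void add(uint64_t id, uint16_t z, uint32_t x, uint32_t y,
	         std::shared_ptr<const Geometry> clipped) {
		std::lock_guard<std::mutex> lock(mutex_);
		cache_.emplace(Key{id, z, x, y}, std::move(clipped));
	}

private:
	using Key = std::tuple<uint64_t, uint16_t, uint32_t, uint32_t>;
	mutable std::mutex mutex_;
	std::map<Key, std::shared_ptr<const Geometry>> cache_;
};
```
With the template in place, `ClipCache<MultiPolygon>` and `ClipCache<MultiLinestring>` can share one implementation, which is presumably why applying it to multilinestrings was cheap to try.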
* housekeeping: move test, minunit
* --log-tile-timings: verbose timing logs
This isn't super useful to end users, but is useful for developers.
If it's not OK to leave it in, let me know & I'll revert it.
You can then process the log:
```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt | grep ' took ' | sort -nk3 | grep z$x/ | awk 'BEGIN { min = 999999; max = 0; }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done
z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```
This shows, for each zoom: the number of tiles, total time, average time
per tile, and the worst-case time (with the tile that caused it).
In general, lower zooms are slower than higher zooms. This seems
intuitively reasonable, as a lower-zoom tile often contains all of
the objects in the corresponding higher-zoom tiles.
I would have guessed that a lower zoom would cost 4x the next higher
zoom on a per-tile basis, since each tile at zoom `z` covers the same
area as four tiles at zoom `z + 1`. That's sort of the case for `z12->z13`,
`z11->z12`, `z10->z11`, and `z9->z10`. But not so for the other zooms,
where it's more like a 2x cost.
Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile.
This is probably because `water` has a minzoom of 6.
This all makes me think that the next big gain will be from re-using
simplifications.
This is sort of the mirror image of the clip cache:
- the clip cache avoids redundant clipping, and needs to be computed
from lower zooms to higher zooms
- a simplification cache could make simplifying cheaper, but needs to
be computed from higher zooms to lower zooms
The simplification cache also has two other wrinkles:
1. Is it even valid? e.g. is `simplify(object, 4)` the same as
`simplify(simplify(object, 2), 2)`? Maybe it doesn't have to be the
same, because users are already accepting that we're losing accuracy
when we simplify. (See the sketch after this list.)
2. Rendering an object at `z - 1` needs to (potentially) stitch together
that object from 4 tiles at `z`. If those have each been simplified,
we may introduce odd seams where the terminal points don't line up.
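Wrinkle 1 is easy to probe empirically. Here is a minimal sketch using Boost.Geometry (which tilemaker builds on); the linestring and tolerances are invented for illustration, and the `0.5`/`1.0` pairing just mirrors the `simplify(object, 2)` vs `simplify(object, 4)` question above:
```cpp
#include <iostream>
#include <boost/geometry.hpp>
#include <boost/geometry/geometries/linestring.hpp>
#include <boost/geometry/geometries/point_xy.hpp>

namespace bg = boost::geometry;
using Point = bg::model::d2::point_xy<double>;
using Line = bg::model::linestring<Point>;

int main() {
	Line original;
	bg::read_wkt("LINESTRING(0 0,1 0.5,2 0,3 1.5,4 0,5 0.5,6 0)", original);

	Line onePass, intermediate, twoPasses;
	bg::simplify(original, onePass, 1.0);        // one pass, larger tolerance
	bg::simplify(original, intermediate, 0.5);   // first of two smaller passes...
	bg::simplify(intermediate, twoPasses, 0.5);  // ...then simplify the result again

	std::cout << "one pass:   " << bg::wkt(onePass) << "\n";
	std::cout << "two passes: " << bg::wkt(twoPasses) << "\n";
}
```
In general the two outputs can differ, since Douglas-Peucker simplification (Boost's default) doesn't compose; so a simplification cache really would be trading some accuracy, as noted above.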
* more, smaller caches; run destructors outside lock
* use explicit types
* don't populate unnecessary vectors
* reserve vectors appropriately
* don't eagerly call way:IsClosed()
This saves a very small amount of time but, more importantly, it tees
up lazily evaluating a way's nodes.
* remove locks from geometry stores
Rather than locking on each store call, threads lease a range of the
ID space for points/lines/multilines/polygons. When the thread ends,
it returns the lease (see the sketch below).
This has some implications:
- the IDs are no longer guaranteed to be contiguous
- shapefiles are a bit weird, as they're loaded on the main
thread -- so their lease won't be returned until program
end. This is fine, just pointing it out.
This didn't actually seem to affect runtime that much on my 16-core
machine, but I think it'd help on machines with more cores.
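A sketch of the leasing idea, with hypothetical names (the real stores and their lease bookkeeping differ): a shared atomic counter hands out blocks of IDs, and each thread's hot path then allocates with plain increments, no locks:
```cpp
// Illustrative sketch only; not tilemaker's actual geometry store API.
#include <atomic>
#include <cstdint>

class IdLeaser {
public:
	static constexpr uint64_t BlockSize = 10000;

	// The only cross-thread operation: grab the next contiguous block.
	uint64_t leaseBlock() { return next_.fetch_add(BlockSize); }

private:
	std::atomic<uint64_t> next_{0};
};

class ThreadLease {
public:
	explicit ThreadLease(IdLeaser& leaser) : leaser_(leaser) {}

	// Hot path: no locks, no atomics, just local increments.
	uint64_t nextId() {
		if (remaining_ == 0) {
			current_ = leaser_.leaseBlock();
			remaining_ = IdLeaser::BlockSize;
		}
		remaining_--;
		return current_++;
	}

private:
	IdLeaser& leaser_;
	uint64_t current_ = 0;
	uint64_t remaining_ = 0;
};
```
IDs left unused in a thread's final block are handed back when the lease is returned, which is one way to see why the IDs are no longer guaranteed to be contiguous.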
* increase attributestore shards
When scaling to 32+ cores, contention here shows up as an issue. Try a
really blunt-hammer fix: increase the number of shards.
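For context, the sharding pattern looks roughly like this (a hypothetical sketch, not the real AttributeStore): keys hash to one of N independently locked maps, so raising N directly lowers the chance that two threads collide:
```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

template <size_t Shards>   // raising Shards is the "blunt hammer"
class ShardedStore {
public:
	// Interns `key`, returning a stable ID. Two threads only contend
	// when their keys hash to the same shard.
	uint32_t add(const std::string& key) {
		size_t shard = std::hash<std::string>{}(key) % Shards;
		std::lock_guard<std::mutex> lock(mutexes_[shard]);
		auto& map = maps_[shard];
		auto it = map.find(key);
		if (it != map.end()) return it->second;
		// Interleave IDs by shard so they stay unique without coordination.
		uint32_t id = static_cast<uint32_t>(shard + Shards * map.size());
		map.emplace(key, id);
		return id;
	}

private:
	std::array<std::mutex, Shards> mutexes_;
	std::array<std::unordered_map<std::string, uint32_t>, Shards> maps_;
};
```
Usage is just `ShardedStore<64> store;` with a bigger number; the commit effectively turns that knob up.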
* read_pbf: less lock contention on status
`std::cout` has some internal locks -- instead, let's synchronize
explicitly outside of it so we control the contention.
If a worker fails to get the lock, just skip that worker's update.
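The pattern is just `try_lock` instead of `lock`; a sketch with invented names (the same idea reappears in the tilemaker status update below):
```cpp
#include <cstdint>
#include <iostream>
#include <mutex>

std::mutex statusMutex;  // guards the status line explicitly, outside std::cout

void reportProgress(uint64_t processed, uint64_t total) {
	// If another worker is printing, skip this update entirely; a later
	// block will report soon enough, and we never block the hot path.
	if (!statusMutex.try_lock()) return;
	std::cout << "\rBlocks " << processed << "/" << total << std::flush;
	statusMutex.unlock();
}
```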
* tile_worker: do syscall 1x/thread, not 1x/tile
* tilemaker: avoid lock contention on status update
If a worker can't get the lock, just skip their update.
* Revert "don't eagerly call way:IsClosed()"
This reverts commit