Sarah Hoffmann [Sun, 25 Apr 2021 09:47:29 +0000 (11:47 +0200)]
move housenumber handling to tokenizer
Normalization and token computation are now done in the tokenizer.
The tokenizer keeps a cache of the hundred most used house numbers
to keep the number of calls to the database low.
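A minimal sketch of the caching idea, not the actual implementation. It assumes a hypothetical fetch_token_from_db() helper standing in for the database call and uses Python's LRU cache, which keeps the 100 most recently used entries rather than the most frequently used ones:

```python
from functools import lru_cache

# Records every simulated database round trip so the effect of the cache is visible.
db_calls = []


def fetch_token_from_db(norm_number):
    """Hypothetical stand-in for the database call that creates or looks up a token."""
    db_calls.append(norm_number)
    return f"hnr:{norm_number}"


@lru_cache(maxsize=100)
def _cached_token(norm_number):
    # Only reached on a cache miss.
    return fetch_token_from_db(norm_number)


def housenumber_token(raw):
    """Normalize the house number (simplified here) and return its token."""
    return _cached_token(raw.strip().lower())
```

Repeated lookups of the same normalized number then reach the database only once.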
Sarah Hoffmann [Sat, 24 Apr 2021 20:35:46 +0000 (22:35 +0200)]
introduce name analyzer
The name analyzer is the actual workhorse of the tokenizer. It
is instantiated once per thread and provides all functions for
analysing names and queries.
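The per-thread instantiation can be sketched roughly as follows. The NameAnalyzer class, its normalize() method and get_analyzer() are illustrative names for this sketch and are not taken from the commit:

```python
import threading


class NameAnalyzer:
    """Hypothetical analyzer holding per-thread state (e.g. a database connection)."""

    def normalize(self, name):
        # Simplified normalization: lower-case and collapse whitespace.
        return " ".join(name.lower().split())


_local = threading.local()


def get_analyzer():
    """Return the analyzer belonging to the calling thread, creating it on first use."""
    if not hasattr(_local, "analyzer"):
        _local.analyzer = NameAnalyzer()
    return _local.analyzer
```

Each thread gets its own instance, so the analyzer's state never needs locking.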
Sarah Hoffmann [Fri, 23 Apr 2021 14:15:00 +0000 (16:15 +0200)]
add extra column for tokenizer
Add a jsonb column to the placex and location_property_osmline tables
which can be used by the installed tokenizer as required. No other
part of the software will use or otherwise rely on this column.
Sarah Hoffmann [Fri, 23 Apr 2021 13:49:38 +0000 (15:49 +0200)]
introduce external processing in indexer
Indexing is now split into three parts: first a preparation step
that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation which is still done inside the update trigger in
the database.
The second processing step doesn't do anything useful yet.
Sarah Hoffmann [Thu, 22 Apr 2021 20:47:34 +0000 (22:47 +0200)]
move word table and normalisation SQL into tokenizer
Creating and populating the word table is now the responsibility
of the tokenizer.
The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
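As a rough illustration of the template approach: the value is read from the stored properties and substituted into the SQL when the functions are installed. The property key, the SQL text and render_sql() are made up for this sketch, and plain str.format() stands in for whatever templating the project uses:

```python
# Hypothetical SQL snippet with a placeholder instead of a get_maxwordfreq() call.
SQL_TEMPLATE = "SELECT word_id FROM word WHERE search_name_count < {max_word_freq}"


def render_sql(template, properties):
    """Fill in the limit from the properties saved in the database at import time."""
    # Reading the saved value (not the current configuration) keeps it stable
    # after installation. int() also guards against injecting arbitrary SQL.
    max_word_freq = int(properties["tokenizer_maxwordfreq"])
    return template.format(max_word_freq=max_word_freq)
```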
Sarah Hoffmann [Wed, 21 Apr 2021 13:38:52 +0000 (15:38 +0200)]
add migration for configurable tokenizer
Adds a migration that initialises a legacy tokenizer for
an existing database. The migration is not active yet as
it will need completion when more functionality is added
to the legacy tokenizer.
Sarah Hoffmann [Wed, 21 Apr 2021 07:57:17 +0000 (09:57 +0200)]
introduce tokenizer modules
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.
The legacy tokenizer implements Nominatim's traditional algorithms.
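A simplified sketch of how such a selection could be pinned to a database. The registry, the class names, the "tokenizer" property key and both functions are hypothetical, and a plain dict stands in for the properties stored in the database:

```python
class LegacyTokenizer:
    name = "legacy"


class ExperimentalTokenizer:
    """Hypothetical second tokenizer, only here to show that the choice is pinned."""
    name = "experimental"


TOKENIZERS = {cls.name: cls for cls in (LegacyTokenizer, ExperimentalTokenizer)}


def create_tokenizer(properties, requested):
    """At import time: pick a tokenizer and record the choice in the database."""
    if "tokenizer" in properties:
        raise RuntimeError("database already has a tokenizer installed")
    properties["tokenizer"] = requested
    return TOKENIZERS[requested]()


def get_tokenizer(properties, configured):
    """Later runs: ignore the current configuration and use the recorded choice."""
    return TOKENIZERS[properties["tokenizer"]]()
```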
Sarah Hoffmann [Fri, 30 Apr 2021 08:08:29 +0000 (10:08 +0200)]
remove support for AUX housenumber tables
These tables have never been actively maintained and the code is
completely untested. With the upcoming changes, it is unlikely
that the code remains usable.
This removes the aux tables and all code that references them.
Sarah Hoffmann [Mon, 19 Apr 2021 14:54:22 +0000 (16:54 +0200)]
simplify token precomputation
Rename function to reflect that it is only used for precomputation.
The token IDs are not really needed, so don't bother to compute
the array of tokens.
fix index on location_property_tiger (parent_place_id)
Looks like 2af82975cd968ec09683ae5b16a9aa157a7f2176
accidentally renamed an index. Because of the added "if not
exists" clause, the index doesn't get created. This
significantly slows down reverse queries because they now
require full scans on location_property_tiger.
Without this fix, reverse queries can take 8s on a full
planet install on an r5.8xlarge instance in EC2.
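The failure mode can be reproduced in miniature. This sketch assumes the renamed index collided with a name that was already in use, and it runs on SQLite rather than PostgreSQL; the table search_name and the index name are invented for the example:

```python
import sqlite3


def indexes_on(conn, table):
    """Return the names of all indexes defined on the given table."""
    return [row[1] for row in conn.execute(f"PRAGMA index_list('{table}')")]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE search_name (place_id INTEGER);
        CREATE TABLE location_property_tiger (parent_place_id INTEGER);
        CREATE INDEX idx_shared_name ON search_name (place_id);
        -- Same name, different table: silently skipped instead of failing.
        CREATE INDEX IF NOT EXISTS idx_shared_name
            ON location_property_tiger (parent_place_id);
    """)
    return (indexes_on(conn, "search_name"),
            indexes_on(conn, "location_property_tiger"))
```

The second statement reports no error, yet location_property_tiger ends up without any index, so lookups by parent_place_id fall back to full scans.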
Sarah Hoffmann [Sat, 17 Apr 2021 09:07:04 +0000 (11:07 +0200)]
add support index when continuing import at index phase
Indexing scans the placex table sequentially
on the initial import. That is okay because we know that all
rows need to be processed anyway. When continuing the import,
however, a large part might already be indexed, so that the
process spends a lot of time going through rows that are no
longer of interest. Create a supporting index for all unindexed
rows to speed up the scan. This is the same index as used later
for updates.
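A supporting index of this kind is a partial index that covers only the rows still waiting to be indexed. The sketch below shows the idea on SQLite; the index name and the reduced table layout are assumptions, not the definitions used by the project:

```python
import sqlite3


def build(conn, total=1000, unindexed=(7, 500)):
    """Create a toy placex table where only a few rows still need indexing."""
    conn.execute(
        "CREATE TABLE placex (place_id INTEGER PRIMARY KEY, indexed_status INTEGER)")
    conn.executemany(
        "INSERT INTO placex VALUES (?, ?)",
        [(i, 1 if i in unindexed else 0) for i in range(1, total + 1)])
    # Partial index: contains entries only for rows with indexed_status > 0.
    conn.execute(
        "CREATE INDEX idx_placex_pending ON placex (place_id) WHERE indexed_status > 0")


def pending_rows(conn):
    """Fetch the rows that still need processing."""
    return [r[0] for r in conn.execute(
        "SELECT place_id FROM placex WHERE indexed_status > 0 ORDER BY place_id")]
```

Because the index holds only the unindexed rows, it stays small and shrinks as indexing progresses.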
Sarah Hoffmann [Fri, 9 Apr 2021 19:10:00 +0000 (21:10 +0200)]
simplify name matching between boundary and place node
Instead of normalising the names simply compare them in lower
case. This removes the dependency on the tokenizer for
linking boundaries and nodes. When looking up the linked places
by place type also allow that one name is simply contained in the
other. This catches the frequent case where one of the names has
an addendum (e.g. Newport vs. City of Newport).
Drops the special index for the name lookup and instead relies
on a slightly extended version of the geometry index used for
reverse lookup. Saves around 100MB on a planet.
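The comparison rule described above, written out as a small Python sketch. The real matching happens in SQL inside the database; the function name and the allow_containment flag are illustrative only:

```python
def names_match(boundary_name, node_name, allow_containment=False):
    """Compare two place names in lower case.

    With allow_containment=True (the lookup by place type), a match is also
    accepted when one name is contained in the other.
    """
    a = boundary_name.lower()
    b = node_name.lower()
    if not a or not b:
        return False  # an empty name would trivially be "contained" in anything
    if a == b:
        return True
    return allow_containment and (a in b or b in a)
```

For example, "Newport" and "City of Newport" match only when containment is allowed.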