Sarah Hoffmann [Thu, 23 Sep 2021 14:57:24 +0000 (16:57 +0200)]
icu tokenizer: switch to matching against partial names
When matching address parts from addr:* tags against place names,
the address names where so far converted to full names and compared
those to the place names. This can become problematic with the new
ICU tokenizer once we introduce creation of different variants
depending on the place name context. It wouldn't be clear which
variant to produce to get a match, so we would have to create all of
them. To work around this issue, switch to using the partial terms
for matching. This introduces a larger fuzziness between matches but
that shouldn't be a problem because matching is always geographically
restricted.
The search terms created for address parts have a different problem:
they are already created before we even know if they are going to be
used. This can lead to spurious entries in the word table, which slows
down searching. This problem can also be circumvented by using only
partial terms for the search terms. In terms of searching that means
that the address terms would not get the full-word boost, but given
that the case where an address part does not exist as an OSM object
should be the exception, this is likely acceptable.
Sarah Hoffmann [Wed, 22 Sep 2021 20:20:02 +0000 (22:20 +0200)]
move name matching into tokenizer module
Instead of requesting the match tokens from the tokenizer
when looking for parent streets/places and address parts,
hand in the saved tokens and ask if they match. This gives
the tokenizer more freedom to decide how name matching
should be done.
Sarah Hoffmann [Fri, 24 Sep 2021 15:37:31 +0000 (17:37 +0200)]
adjust address ranks for Spain
Adjusts levels for boundaries according to the list on
https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative
* no admin_level 5, so drop that from addresses
* admin_level 6 has the province
* admin_level 7 has the county when it exists
Also reranks place=province so that it matches up with
admin_level 6 and introduces place=civil_parish which
is used as a place node for some admin_level=9 boundaries
in Galicia.
Sarah Hoffmann [Fri, 3 Sep 2021 16:16:12 +0000 (18:16 +0200)]
introduce generic YAML config loader
Adds a function to the Configuration class to load a YAML
file. This means that searching for the file is generalised
and works the same now for all configuration files. Changes
the search logic, so that it is always possible to have a
custom version of the configuration file in the project
directory.
Sarah Hoffmann [Thu, 2 Sep 2021 16:13:45 +0000 (18:13 +0200)]
reduce penalty for special searches by name
Additional penalty for special terms with operator None
should only go to near searches. To reduce the number
of produced searches, restrict the none operator to
appear only in conjunction with the name.
Sarah Hoffmann [Fri, 20 Aug 2021 19:53:13 +0000 (21:53 +0200)]
move linking of places to the preparation stage
Linked places may bring in extra names. These names need to be
processed by the tokenizer. That means that the linking needs
to be done before the data is handed to the tokenizer. Move finding
the linked place into the preparation stage and update the name
fields. Everything else is still done in the indexing stage.
Sarah Hoffmann [Tue, 17 Aug 2021 21:11:47 +0000 (23:11 +0200)]
rename legacy_icu tokenizer to icu tokenizer
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
Sarah Hoffmann [Tue, 17 Aug 2021 12:28:55 +0000 (14:28 +0200)]
move special hack for US states to legacy tokenizer
The hack for IL, AL and LA is only needed because these abbreviations
are removed by the legacy tokenizer as a stop word. There is no need
to keep the hack for future tokenizers. Move it therefore to the
token extraction function.
Sarah Hoffmann [Mon, 16 Aug 2021 09:48:25 +0000 (11:48 +0200)]
add mkdocstrings requirement for building docs
mkdocstrings also needs access to the Python sources, so set
a PYTHONPATH accordingly. This makes running mkdocs directly
a bit awkward, therefore add a `make serve-doc` target.
Sarah Hoffmann [Thu, 12 Aug 2021 09:09:46 +0000 (11:09 +0200)]
php: make word list a first-class object
This separates the logic of creating word sets from the Phrase
class. A tokenizer may now derived the word sets any way they
like. The SimpleWordList class provides a standard implementation
for splitting phrases on spaces.
Sarah Hoffmann [Thu, 29 Jul 2021 19:25:59 +0000 (21:25 +0200)]
remove country restriction from tokenizer
Restricting tokens due to the search context is better done in
the generic search part instead of repeating the same test in
every tokenizer implementation.
Sarah Hoffmann [Sat, 14 Aug 2021 21:48:06 +0000 (23:48 +0200)]
port multi-region update scripts to nominatim tool
Also updates the documentation. For the simple case of just
importing multiple regions, provide simplified instructions
that use the new multi-file import feature.
Sarah Hoffmann [Sun, 25 Jul 2021 13:08:11 +0000 (15:08 +0200)]
reinstate word column in icu word table
Postgresql is very bad at creating statistics for jsonb
columns. The result is that the query planer tends to
use JIT for queries with a where over 'info' even when
there is an index.
Sarah Hoffmann [Sat, 24 Jul 2021 10:12:31 +0000 (12:12 +0200)]
bdd tests: do not query word table directly
The BDD tests cannot make assumptions about the structure of the
word table anymore because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.
Sarah Hoffmann [Thu, 22 Jul 2021 15:24:43 +0000 (17:24 +0200)]
adapt unit test for new word table
Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.
Sarah Hoffmann [Tue, 20 Jul 2021 08:27:06 +0000 (10:27 +0200)]
new word table layout for icu tokenizer
The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.