Sarah Hoffmann [Thu, 12 Aug 2021 09:09:46 +0000 (11:09 +0200)]
php: make word list a first-class object
This separates the logic of creating word sets from the Phrase
class. A tokenizer may now derive the word sets any way it
likes. The SimpleWordList class provides a standard implementation
for splitting phrases on spaces.
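
Illustration only: a minimal Python sketch of what "word sets" derived by
splitting on spaces could look like. It is not the PHP code of this commit;
the function names split_on_spaces and word_sets are hypothetical, and the
sketch assumes a word set is one way of grouping consecutive words of the
phrase into multi-word terms.

```python
def split_on_spaces(phrase):
    """Split a phrase into words on whitespace (assumed behaviour)."""
    return phrase.split()


def word_sets(words):
    """Yield every way of grouping consecutive words into terms."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        # Join the first i words into one term, then recurse on the rest.
        head = ' '.join(words[:i])
        for rest in word_sets(words[i:]):
            yield [head] + rest


all_sets = list(word_sets(split_on_spaces('hauptstr 5 berlin')))
```

A phrase of n words yields 2^(n-1) groupings here; a real implementation
would likely cap the number of words per term.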
Sarah Hoffmann [Thu, 29 Jul 2021 19:25:59 +0000 (21:25 +0200)]
remove country restriction from tokenizer
Restricting tokens due to the search context is better done in
the generic search part instead of repeating the same test in
every tokenizer implementation.
Sarah Hoffmann [Sat, 14 Aug 2021 21:48:06 +0000 (23:48 +0200)]
port multi-region update scripts to nominatim tool
Also updates the documentation. For the simple case of just
importing multiple regions, provide simplified instructions
that use the new multi-file import feature.
Sarah Hoffmann [Sun, 25 Jul 2021 13:08:11 +0000 (15:08 +0200)]
reinstate word column in icu word table
PostgreSQL is very bad at creating statistics for jsonb
columns. The result is that the query planner tends to
use JIT for queries with a WHERE clause over 'info' even
when there is an index.
Sarah Hoffmann [Sat, 24 Jul 2021 10:12:31 +0000 (12:12 +0200)]
bdd tests: do not query word table directly
The BDD tests cannot make assumptions about the structure of the
word table anymore because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.
Sarah Hoffmann [Thu, 22 Jul 2021 15:24:43 +0000 (17:24 +0200)]
adapt unit test for new word table
Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.
Sarah Hoffmann [Tue, 20 Jul 2021 08:27:06 +0000 (10:27 +0200)]
new word table layout for icu tokenizer
The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.
Sarah Hoffmann [Sat, 17 Jul 2021 18:24:33 +0000 (20:24 +0200)]
move SearchDescription building into tokens
Moving the logic for extending the SearchDescription into the
token classes splits up the code and makes it more readable.
More importantly: it allows tokenizers to define custom token
classes in the future.
Sarah Hoffmann [Thu, 15 Jul 2021 12:48:20 +0000 (14:48 +0200)]
remove Token from explicit input for SearchDescription extension
The token string is only required by the PartialToken type, so
it can simply save the token string internally. No need to pass
it to every type.
Also moves the check for multi-word partials to the token loader
code in the tokenizer. Multi-word partials can only happen with
the legacy tokenizer and when the database was loaded with an
older version of Nominatim. No need to keep the check for
everybody.
Sarah Hoffmann [Thu, 15 Jul 2021 12:12:59 +0000 (14:12 +0200)]
factor out query position
Moves token and phrase position and phrase type into a separate
class that is handed in when assembling the search description.
This drastically reduces the number of parameters for the function
to extend the search descriptions and gives us more flexibility
in the future for more complex positional analysis.
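
Illustration only: a Python sketch of the parameter-object idea described
above. The class name QueryPosition, its fields and its helper methods are
assumptions for the example and do not mirror the PHP class of this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPosition:
    """Bundles the positional context of a token (hypothetical fields)."""
    token_index: int    # position of the token within its phrase
    phrase_index: int   # position of the phrase within the query
    phrase_type: str    # type of the phrase; values are assumptions

    def is_first_token(self):
        return self.token_index == 0

    def is_first_phrase(self):
        return self.phrase_index == 0


# One object is handed around instead of three separate parameters.
pos = QueryPosition(token_index=0, phrase_index=1, phrase_type='street')
```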
Sarah Hoffmann [Wed, 14 Jul 2021 20:17:17 +0000 (22:17 +0200)]
remove special status of partial tokens
Full-word tokens are no longer marked by a space at the
beginning of the token. Use the new Partial token category
instead. This removes a couple of special cases we don't
really need.
The word table still has the space for compatibility reasons,
so the tokenizer code needs to get rid of it when loading the
tokens.
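
Illustration only: a Python sketch of dropping the compatibility space
while loading tokens. The function name load_token is hypothetical, and the
sketch assumes that a leading space in the word table marks a full-word
token, as in the old layout described above.

```python
def load_token(word_token):
    """Return (token, is_full_word) for a raw word-table entry."""
    if word_token.startswith(' '):
        # Old marker for full words: strip it, keep the information.
        return word_token[1:], True
    return word_token, False


full = load_token(' berlin')
partial = load_token('berlin')
```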
Sarah Hoffmann [Fri, 2 Jul 2021 13:05:17 +0000 (15:05 +0200)]
restrict partial word counting to names of reasonable length
The partial word count does not split names to save a bit of time.
The result is that it might encounter unreasonably long names
which in truth consist of multiple words. No accurate statistics
are needed so simply restrict the count to words shorter than
75 characters.
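
Illustration only: a Python sketch of the length restriction. The names
MAX_WORD_LENGTH and count_words are hypothetical, and the real counting
does not necessarily happen in Python.

```python
from collections import Counter

MAX_WORD_LENGTH = 75  # limit taken from the commit message


def count_words(names):
    """Count unsplit names, skipping unreasonably long ones."""
    counts = Counter()
    for name in names:
        if len(name) < MAX_WORD_LENGTH:
            counts[name] += 1
    return counts


counts = count_words(['main street', 'x' * 80, 'main street'])
```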
Sarah Hoffmann [Thu, 1 Jul 2021 15:56:23 +0000 (17:56 +0200)]
fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.
Also forbid composing a word after a space was added at the
end by a previous replacement.
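
Illustration only: a Python sketch of the shared-space problem using
regular expressions. It is not the replacement code fixed by this commit;
the replacement table and function names are hypothetical. The naive
version consumes the space between two adjacent words, so the second word
no longer matches. The second version uses lookarounds so that the space
is checked but not consumed.

```python
import re

REPLACEMENTS = {'street': 'st', 'road': 'rd'}  # hypothetical rules
WORDS = '|'.join(map(re.escape, REPLACEMENTS))


def replace_naive(text):
    """Each match consumes its surrounding spaces (shows the problem)."""
    pattern = re.compile(' (' + WORDS + ') ')
    repl = lambda m: ' ' + REPLACEMENTS[m.group(1)] + ' '
    return pattern.sub(repl, ' ' + text + ' ').strip()


def replace_shared_space(text):
    """Spaces are only asserted, so neighbours can share one."""
    pattern = re.compile('(?<= )(' + WORDS + ')(?= )')
    repl = lambda m: REPLACEMENTS[m.group(1)]
    return pattern.sub(repl, ' ' + text + ' ').strip()


naive = replace_naive('street road')
fixed = replace_shared_space('street road')
```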
Sarah Hoffmann [Wed, 30 Jun 2021 19:37:29 +0000 (21:37 +0200)]
import abbreviations from OSM Wiki
Replaces the variant rules with a slightly cleaned-up
version of the abbreviation lists at
https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations
Sarah Hoffmann [Sat, 26 Jun 2021 17:38:08 +0000 (19:38 +0200)]
improve normalization
Make sure all special symbols are removed during normalization already.
Those won't be interpreted in any way because they are unlikely to be
searched for.
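
Illustration only: a Python sketch of removing symbols during
normalization. The commit does not say which characters count as "special
symbols"; this sketch assumes the Unicode symbol categories (Sm, Sc, Sk,
So), and the function name strip_symbols is hypothetical.

```python
import unicodedata


def strip_symbols(text):
    """Drop characters in Unicode symbol categories, tidy whitespace."""
    kept = (c for c in text
            if not unicodedata.category(c).startswith('S'))
    # Collapse the whitespace left behind by removed symbols.
    return ' '.join(''.join(kept).split())


cleaned = strip_symbols('Café ★ Bar')
```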