Sarah Hoffmann [Thu, 22 Jul 2021 15:24:43 +0000 (17:24 +0200)]
adapt unit test for new word table
Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.
Sarah Hoffmann [Tue, 20 Jul 2021 08:27:06 +0000 (10:27 +0200)]
new word table layout for icu tokenizer
The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.
Sarah Hoffmann [Sat, 17 Jul 2021 18:24:33 +0000 (20:24 +0200)]
move SearchDescription building into tokens
Moving the logic for extending the SearchDescription into the
token classes splits up the code and makes it more readable.
More importantly: it allows tokenizer to define custom token
classes in the future.
Sarah Hoffmann [Thu, 15 Jul 2021 12:48:20 +0000 (14:48 +0200)]
remove Token from explicit input for SearchDescription extension
The token string is only required by the PartialToken type, so
it can simply save the token string internally. No need to pass
it to every type.
Also moves the check for multi-word partials to the token loader
code in the tokenizer. Multi-word partials can only happen with
the legacy tokenizer and when the database was loaded with an
older version of Nominatim. No need to keep the check for
everybody.
Sarah Hoffmann [Thu, 15 Jul 2021 12:12:59 +0000 (14:12 +0200)]
factor out query position
Moves token and phrase position and phrase type into a separate
class that is handed in when assembling the search description.
This drastically reduces the number of parameters for the function
to extend the search descriptions and gives us more flexibility
in the future for more complex positional analysis.
Sarah Hoffmann [Wed, 14 Jul 2021 20:17:17 +0000 (22:17 +0200)]
remove special status of partial tokens
Full-word tokens are no longer marked by a space at the
beginning of the token. Use the new Partial token category
instead. This removes a couple of special casing, we don't
really need.
The word table still has the space for compatibility reasons,
so the tokenizer code needs to get rid of it when loading the
tokens.
Sarah Hoffmann [Fri, 2 Jul 2021 13:05:17 +0000 (15:05 +0200)]
restrict partial word counting to names of reasoanble length
The partial word count does not split names to save a bit of time.
The result is that it might enounter unreasonably long names
which in truth consist of multiple words. No accurate statistics
are needed so simply restrict the count to words shorter than
75 characters.
Sarah Hoffmann [Thu, 1 Jul 2021 15:56:23 +0000 (17:56 +0200)]
fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.
Also forbit composing a word after a space was added in the
end by a previous replacement.
Sarah Hoffmann [Wed, 30 Jun 2021 19:37:29 +0000 (21:37 +0200)]
import abbreviations from OSM Wiki
Replaces the variant rules with a slightly cleaned-up
version of the abbreviation lists at
https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations
Sarah Hoffmann [Sat, 26 Jun 2021 17:38:08 +0000 (19:38 +0200)]
improve normalization
Make sure all special symbols are removed during normalization already.
Those won't be interpreted in any way because they are unlikely to be
searched for.
Sarah Hoffmann [Thu, 24 Jun 2021 18:02:07 +0000 (20:02 +0200)]
switch to a more flexible variant description format
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
Sarah Hoffmann [Fri, 11 Jun 2021 08:03:31 +0000 (10:03 +0200)]
make compund decomposition pure import feature
Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compund decomposition lists for an existing database.
Sarah Hoffmann [Fri, 28 May 2021 20:06:13 +0000 (22:06 +0200)]
move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.
Sarah Hoffmann [Wed, 26 May 2021 18:50:34 +0000 (20:50 +0200)]
icu tokenizer: move transliteration rules in separate file
The tokenizer configuration has become difficult to handle
due to the additional manual transliteration rules. Allow
to have a separate rule file that is given to the ICU library
as is.
Sarah Hoffmann [Sat, 26 Jun 2021 09:20:25 +0000 (11:20 +0200)]
remove penalty for full words in address
Now that mutli-word partials no longer exist, multi-word full
words need to be used to search in addresses and therefore no
longer should have a penalty.
Also changes the condition when a full word is included into
the address. It is no longer relevant if an equivalent partial
exists but only if the term consists of more than one word.
Sarah Hoffmann [Mon, 21 Jun 2021 14:32:54 +0000 (16:32 +0200)]
increase minimum Python to 3.6
Python 3.6 introduces formatted string literals and
flag enums as well as a much faster dict implementation.
These changes make the code so much simpler as to warrant
dropping Python 3.5 support.
Affected distributions are Ubuntu 16.04 and Debian Stretch.