Sarah Hoffmann [Thu, 28 Apr 2022 19:38:00 +0000 (21:38 +0200)]
keep inherited address parts after indexing
The inherited housenumber is needed for display output. We can't
take the one from the housenumber field because it is already
normalized. Remove the inherited address only when reindexing.
Sarah Hoffmann [Thu, 28 Apr 2022 15:20:56 +0000 (17:20 +0200)]
ICU: better letter identification in normalization
The Letter class does not include non-spacing marks that can also
have a consonant or vowel meaning, especially in Indian languages.
Use the alnum property instead which includes them all. Also
include the vowel-canceling Virama, which is not a letter by itself
but changes the transliteration.
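
A minimal sketch of the problem, using Python's unicodedata module rather
than ICU (so it only illustrates general categories, not the ICU alnum
property itself). The helper name strip_non_letters and the sample word are
chosen for illustration only.

```python
import unicodedata


def strip_non_letters(text):
    """Keep only characters whose general category is a Letter (L*)."""
    return ''.join(ch for ch in text
                   if unicodedata.category(ch).startswith('L'))


# Hindi "kitab": KA + vowel sign I + TA + vowel sign AA + BA
word = '\u0915\u093f\u0924\u093e\u092c'

# Print the general category of each character, plus the Virama (U+094D).
for ch in word + '\u094d':
    print(f'U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}')

# A Letter-only filter silently drops both vowel signs.
stripped = strip_non_letters(word)
```

The vowel signs are reported as combining marks (M*), not letters, so a
filter based on the Letter class alone loses them.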
Sarah Hoffmann [Fri, 22 Apr 2022 12:32:19 +0000 (14:32 +0200)]
further tweaking of address distance
For point features, keep using the distance to centroid.
For area features, add a tie breaker for the case where the
center point falls on the boundary.
Sarah Hoffmann [Thu, 21 Apr 2022 19:56:59 +0000 (21:56 +0200)]
change distance computation between place and address part
Instead of computing the distance to the centroid of the area
compute the distance of the area to the centroid of the feature.
This means we give preference to the area that covers the centroid.
It's still a heuristic, but one that is a bit less random.
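
A rough sketch of the difference between the two distance measures. It
assumes axis-aligned rectangles as stand-ins for area geometries and plain
Euclidean distance; all names and coordinates are hypothetical and the real
computation is done in the database.

```python
from math import hypot


def centroid(rect):
    """Centre of a rectangle given as (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = rect
    return ((minx + maxx) / 2, (miny + maxy) / 2)


def dist_to_centroid(point, rect):
    """Old measure: feature centroid to area centroid."""
    cx, cy = centroid(rect)
    return hypot(point[0] - cx, point[1] - cy)


def dist_to_area(point, rect):
    """New measure: feature centroid to the area itself (0 when covered)."""
    minx, miny, maxx, maxy = rect
    dx = max(minx - point[0], 0, point[0] - maxx)
    dy = max(miny - point[1], 0, point[1] - maxy)
    return hypot(dx, dy)


areas = {
    'large': (0, 0, 10, 10),   # covers the feature, centroid far away
    'small': (10, 4, 12, 6),   # does not cover it, centroid close by
}
feature = (9.5, 5)

old_choice = min(areas, key=lambda n: dist_to_centroid(feature, areas[n]))
new_choice = min(areas, key=lambda n: dist_to_area(feature, areas[n]))
```

With these sample values the old measure prefers the small neighbouring
area, while the new one prefers the area that actually covers the feature.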
Sarah Hoffmann [Sun, 20 Mar 2022 10:31:42 +0000 (11:31 +0100)]
restore the tokenizer directory when missing
Automatically repopulate the tokenizer/ directory with the PHP stub
and the PostgreSQL module when the directory is missing. This makes
it possible to switch working directories and, in particular, to run
the service from a different machine than the one it was installed on.
Users still need to make sure that .env files are set up correctly
or they will shoot themselves in the foot.
Sarah Hoffmann [Fri, 18 Mar 2022 09:48:53 +0000 (10:48 +0100)]
remove special case for operator names
The OSM data has been sufficiently cleaned up by now that
the operator no longer needs to be considered a name tag.
Use 'brand' as the searchable alternative.
Sarah Hoffmann [Thu, 17 Mar 2022 10:02:02 +0000 (11:02 +0100)]
merge linked names correctly into namedetails
Convert the '_place_*' entries back to normal entries before
returning them in the 'namedetails' section. If the name field is
duplicated, keep the '_place_*' notation. This preserves the
behaviour from before _place_ names were introduced but adds the additional
names from the linked place for reference.
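
The merge rule can be sketched as follows. This is an illustration in
Python, assuming names are held in a flat key/value mapping; the function
name merge_linked_names and the sample data are hypothetical and not taken
from the code base.

```python
PREFIX = '_place_'


def merge_linked_names(names):
    """Turn '_place_*' keys back into normal keys unless that would clash."""
    merged = {}
    for key, value in names.items():
        if key.startswith(PREFIX):
            plain = key[len(PREFIX):]
            # Keep the prefixed key only if the plain key is already taken.
            merged[key if plain in names else plain] = value
        else:
            merged[key] = value
    return merged


stored = {'name': 'Bern', '_place_name': 'Berne', '_place_name:it': 'Berna'}
result = merge_linked_names(stored)
```

Here 'name:it' comes back as a normal entry, while '_place_name' keeps its
prefix because 'name' already exists.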
Sarah Hoffmann [Wed, 16 Mar 2022 15:38:52 +0000 (16:38 +0100)]
save differing linked place names in extra fields
This keeps the names traceable and ensures that all names are searchable
when they differ. Do not keep names when they are exactly the same
to save some space. Linked names are cleaned out before relinking.
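
A small sketch of this bookkeeping, assuming names are plain key/value
mappings and that "the same" means an identical value under the same key.
The function name relink_names and the sample data are hypothetical.

```python
PREFIX = '_place_'


def relink_names(own_names, linked_names):
    """Store differing names of the linked place under '_place_*' keys."""
    # Clean out names left over from a previous linking round.
    result = {k: v for k, v in own_names.items() if not k.startswith(PREFIX)}
    for key, value in linked_names.items():
        # Identical names are skipped to save space.
        if result.get(key) != value:
            result[PREFIX + key] = value
    return result


own = {'name': 'Bern', 'name:fr': 'Berne', '_place_name:xx': 'stale'}
linked = {'name': 'Bern', 'name:fr': 'Berne (ville)', 'name:it': 'Berna'}
relinked = relink_names(own, linked)
```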
Sarah Hoffmann [Tue, 1 Mar 2022 07:54:15 +0000 (08:54 +0100)]
do not expand records in select list
An expression of the form 'SELECT (func()).*' will be expanded
by PostgreSQL _before_ execution with the result that the function
will be called as many times as there are fields in the record.
This is not what we want. The function call needs to go into
the FROM clause instead.
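
The effect can be modelled outside the database. The sketch below is a
Python analogy, not SQL: func() stands in for a record-returning function
and a counter tracks how often it runs. The SQL forms appear only in the
comments, and the field names are hypothetical.

```python
calls = 0


def func():
    """Stand-in for an expensive record-returning SQL function."""
    global calls
    calls += 1
    return {'a': 1, 'b': 2, 'c': 3}


FIELDS = ('a', 'b', 'c')

# SELECT (func()).*
# behaves like: SELECT (func()).a, (func()).b, (func()).c
calls = 0
expanded = {field: func()[field] for field in FIELDS}
calls_in_select_list = calls

# SELECT r.* FROM func() AS r
# evaluates the function once and reads the fields from the result.
calls = 0
record = func()
from_clause = {field: record[field] for field in FIELDS}
calls_in_from_clause = calls
```

Both variants return the same fields, but the first one runs the function
once per field.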
Sarah Hoffmann [Wed, 16 Feb 2022 10:15:43 +0000 (11:15 +0100)]
add framework for analysing housenumbers
This lays the groundwork for adding variants for housenumbers.
When analysis is enabled, then the 'word' field in the word table
is used as usual, so that variants can be created. There will be
only one analyser allowed which must have the fixed name
'@housenumber'.
Sarah Hoffmann [Tue, 15 Feb 2022 13:38:03 +0000 (14:38 +0100)]
handle unknown analyzer
When changing something in the default configuration of the sanitizers
that refers to an analyzer that is not yet loaded, there shouldn't be
any errors.
Sarah Hoffmann [Tue, 15 Feb 2022 11:15:18 +0000 (12:15 +0100)]
move generation of normalized token form to analyzer
This gives the analyzer more flexibility in choosing the normalized
form. In particular, an analyzer creating different variants can choose
the variant that will be used as the canonical form.