git.openstreetmap.org Git - nominatim.git/log

Sarah Hoffmann [Tue, 20 Jul 2021 08:27:06 +0000 (10:27 +0200)]

new word table layout for icu tokenizer

The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.

Sarah Hoffmann [Wed, 28 Jul 2021 09:28:49 +0000 (11:28 +0200)]

fix typos in tokenizer docs

Sarah Hoffmann [Mon, 26 Jul 2021 10:38:56 +0000 (12:38 +0200)]

Merge pull request #2401 from lonvia/port-add-data-to-python

Port add-data functions from PHP to Python

Sarah Hoffmann [Sun, 25 Jul 2021 21:44:22 +0000 (23:44 +0200)]

adapt cli tests to Python port for add-data

Sarah Hoffmann [Sun, 25 Jul 2021 21:30:46 +0000 (23:30 +0200)]

remove unused update script

Sarah Hoffmann [Sun, 25 Jul 2021 21:29:15 +0000 (23:29 +0200)]

replace add-data function with native Python code

Sarah Hoffmann [Sun, 25 Jul 2021 16:14:12 +0000 (18:14 +0200)]

move add-data subcommand into a separate file

Sarah Hoffmann [Tue, 20 Jul 2021 08:08:31 +0000 (10:08 +0200)]

fix parameters for TokenWord creation

Sarah Hoffmann [Mon, 19 Jul 2021 12:28:02 +0000 (14:28 +0200)]

Merge pull request #2397 from lonvia/increase-minimum-required-versions

Increase minimum required PostgreSQL version to 9.5

Sarah Hoffmann [Mon, 19 Jul 2021 08:24:57 +0000 (10:24 +0200)]

remove special code for pre-9.5 PostgreSQL

9.5 is now the minimum requirement.

Sarah Hoffmann [Mon, 19 Jul 2021 08:15:32 +0000 (10:15 +0200)]

increase minimum version for PostgreSQL to 9.5

This is the minimum version we can test with the CI.
With 9.5 there is also complete support for jsonb available.

Sarah Hoffmann [Mon, 19 Jul 2021 08:14:14 +0000 (10:14 +0200)]

require Python 3.6 also in CMakeFile

This had been forgotten when increasing the minimum Python version.

Sarah Hoffmann [Mon, 19 Jul 2021 07:42:37 +0000 (09:42 +0200)]

Merge pull request #2396 from lonvia/partial-word-token

Reorganise code that builds the SearchDescription

Sarah Hoffmann [Sun, 18 Jul 2021 18:20:22 +0000 (20:20 +0200)]

make all Token members private

Sarah Hoffmann [Sun, 18 Jul 2021 14:52:37 +0000 (16:52 +0200)]

merge marking rare name with adding name token

Only name tokens can be rare, so this should be the same
function.

Sarah Hoffmann [Sun, 18 Jul 2021 14:10:42 +0000 (16:10 +0200)]

add documentation for public interface of SearchDescription

Sarah Hoffmann [Sat, 17 Jul 2021 20:01:35 +0000 (22:01 +0200)]

factor out check if a token fits current search

Saves allocating an empty array.

Sarah Hoffmann [Sat, 17 Jul 2021 18:24:33 +0000 (20:24 +0200)]

move SearchDescription building into tokens

Moving the logic for extending the SearchDescription into the
token classes splits up the code and makes it more readable.
More importantly, it allows tokenizers to define custom token
classes in the future.
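
The idea of letting each token class extend the search itself can be sketched as follows. This is a minimal Python illustration, not the project's code: the class names `NameToken` and `PostcodeToken`, the method `extend_search` and the reduced `SearchDescription` are all assumptions made for the example.

```python
class SearchDescription:
    """Hypothetical, much reduced stand-in for the real class."""

    def __init__(self, names=(), postcode=None):
        self.names = list(names)
        self.postcode = postcode


class NameToken:
    """Hypothetical token type carrying a name word."""

    def __init__(self, word):
        self.word = word

    def extend_search(self, search):
        # A name word can always be appended to the name part.
        return [SearchDescription(search.names + [self.word], search.postcode)]


class PostcodeToken:
    """Hypothetical token type carrying a postcode."""

    def __init__(self, postcode):
        self.postcode = postcode

    def extend_search(self, search):
        # A search holds at most one postcode; otherwise nothing is produced.
        if search.postcode is not None:
            return []
        return [SearchDescription(search.names, self.postcode)]


def extend_all(searches, token):
    """Ask the token to extend every search; the caller needs no type checks."""
    return [new for search in searches for new in token.extend_search(search)]
```

Because the caller only relies on `extend_search`, a tokenizer could add its own token class without touching `extend_all`.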

Sarah Hoffmann [Thu, 15 Jul 2021 12:48:20 +0000 (14:48 +0200)]

remove Token from explicit input for SearchDescription extension

The token string is only required by the PartialToken type, so
it can simply save the token string internally. No need to pass
it to every type.

Also moves the check for multi-word partials to the token loader
code in the tokenizer. Multi-word partials can only happen with
the legacy tokenizer and when the database was loaded with an
older version of Nominatim. No need to keep the check for
everybody.

Sarah Hoffmann [Thu, 15 Jul 2021 12:12:59 +0000 (14:12 +0200)]

factor out query position

Moves token and phrase position and phrase type into a separate
class that is handed in when assembling the search description.
This drastically reduces the number of parameters for the function
to extend the search descriptions and gives us more flexibility
in the future for more complex positional analysis.
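
A rough sketch of such a position class, written in Python for illustration only. The name `QueryPosition`, its fields and the helper `describe` are assumptions and do not mirror the actual implementation.

```python
class QueryPosition:
    """Hypothetical bundle of the positional data for one token."""

    def __init__(self, phrase_index, token_index, num_tokens, phrase_type=""):
        self.phrase_index = phrase_index  # which phrase of the query
        self.token_index = token_index    # position of the token in the phrase
        self.num_tokens = num_tokens      # number of tokens in the phrase
        self.phrase_type = phrase_type    # e.g. a structured-query field name

    def is_first_token(self):
        return self.token_index == 0

    def is_last_token(self):
        return self.token_index == self.num_tokens - 1


def describe(token, position):
    """Takes one position object instead of several loose parameters."""
    where = "start" if position.is_first_token() else "inside"
    return "{} at {} of phrase {}".format(token, where, position.phrase_index)
```

Positional questions such as "is this the last token of the phrase?" then live in one place and can grow without changing every function signature.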

Sarah Hoffmann [Wed, 14 Jul 2021 20:17:17 +0000 (22:17 +0200)]

remove special status of partial tokens

Full-word tokens are no longer marked by a space at the
beginning of the token. Use the new Partial token category
instead. This removes a couple of special cases that we don't
really need.

The word table still has the space for compatibility reasons,
so the tokenizer code needs to get rid of it when loading the
tokens.
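
Stripping the compatibility space while loading could look roughly like this. The function name `load_token` and the returned tuple are assumptions for illustration; only the leading-space convention comes from the commit message.

```python
def load_token(word_token):
    """Classify a raw word-table entry (hypothetical helper).

    Rows for full words still carry a leading space in the table,
    so the space is removed here and the type is returned explicitly.
    """
    if word_token.startswith(" "):
        return ("full", word_token[1:])
    return ("partial", word_token)
```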

Sarah Hoffmann [Tue, 13 Jul 2021 14:54:51 +0000 (16:54 +0200)]

introduce a separate token type for partials

This means that the leading space can be removed as a partial
word indicator.

Sarah Hoffmann [Tue, 13 Jul 2021 14:46:12 +0000 (16:46 +0200)]

Merge pull request #2393 from lonvia/fix-flake8-issues

Fix flake8 issues

Sarah Hoffmann [Mon, 12 Jul 2021 20:05:22 +0000 (22:05 +0200)]

use psycopg's SQL quoting where possible

Use the SQL formatting supplied with psycopg whenever the
query needs to be put together from snippets.

Sarah Hoffmann [Mon, 12 Jul 2021 19:08:20 +0000 (21:08 +0200)]

add helper function for execute_values

Make psycopg2's convenience function accessible through
the cursor.

Sarah Hoffmann [Mon, 12 Jul 2021 18:32:46 +0000 (20:32 +0200)]

provide wrapper function for DROP TABLE

Use psycopg2 formatting to ensure correct quoting.

Sarah Hoffmann [Mon, 12 Jul 2021 15:45:42 +0000 (17:45 +0200)]

more formatting fixes

Found by flake8.

Sarah Hoffmann [Mon, 12 Jul 2021 15:14:59 +0000 (17:14 +0200)]

Merge pull request #2391 from lonvia/fix-sonar-issues

Fix bugs and code smells found by Sonarqube

Sarah Hoffmann [Mon, 12 Jul 2021 12:58:44 +0000 (14:58 +0200)]

factor out connection reset code

Sarah Hoffmann [Mon, 12 Jul 2021 12:47:50 +0000 (14:47 +0200)]

simplify analyse function

Sarah Hoffmann [Mon, 12 Jul 2021 12:43:50 +0000 (14:43 +0200)]

split up variant computation for better readability

Sarah Hoffmann [Mon, 12 Jul 2021 09:53:25 +0000 (11:53 +0200)]

reorganise process_place function

Move address processing into its own function as it is
rather extensive.

Sarah Hoffmann [Mon, 12 Jul 2021 09:41:05 +0000 (11:41 +0200)]

simplify website setup code

Use format strings and move variable quoting code into an extra
function.

Sarah Hoffmann [Mon, 12 Jul 2021 09:33:09 +0000 (11:33 +0200)]

avoid repeated patterns for table name

Sarah Hoffmann [Sun, 11 Jul 2021 22:16:25 +0000 (00:16 +0200)]

simplify if statements

Sarah Hoffmann [Sun, 11 Jul 2021 21:48:16 +0000 (23:48 +0200)]

convert single case switch to if statement

Sarah Hoffmann [Sun, 11 Jul 2021 21:22:16 +0000 (23:22 +0200)]

avoid local variable assignment

Sarah Hoffmann [Sun, 11 Jul 2021 18:21:12 +0000 (20:21 +0200)]

fix more missing braces on one-liners

Sarah Hoffmann [Sun, 11 Jul 2021 18:14:25 +0000 (20:14 +0200)]

remove dead code

Sarah Hoffmann [Sun, 11 Jul 2021 18:10:13 +0000 (20:10 +0200)]

do not intermix params with and without default

Sarah Hoffmann [Sun, 11 Jul 2021 17:24:04 +0000 (19:24 +0200)]

directly return data in function

The temporary variable is not necessary.

Sarah Hoffmann [Sun, 11 Jul 2021 17:11:37 +0000 (19:11 +0200)]

remove unnecessarily nested ifs

Found by Sonarqube.

Sarah Hoffmann [Sun, 11 Jul 2021 17:10:04 +0000 (19:10 +0200)]

remove unused functions

The functions were necessary for the transitory code
to Python and are no longer used.

Sarah Hoffmann [Sun, 11 Jul 2021 16:23:42 +0000 (18:23 +0200)]

avoid multiple returns of same value

Found by Sonarqube.

Sarah Hoffmann [Sat, 10 Jul 2021 12:59:38 +0000 (14:59 +0200)]

always use brackets on if statements

This adds brackets around all one-line if statements that did
not have them yet.

Sarah Hoffmann [Fri, 9 Jul 2021 14:36:42 +0000 (16:36 +0200)]

remove unused variables

As reported by sonarqube.

Sarah Hoffmann [Fri, 9 Jul 2021 10:50:35 +0000 (12:50 +0200)]

fix bad use of echo in PHP output

Sarah Hoffmann [Fri, 9 Jul 2021 10:32:37 +0000 (12:32 +0200)]

Merge pull request #2390 from lonvia/responsible-disclosure

Add security issue disclosure policy

Sarah Hoffmann [Fri, 9 Jul 2021 09:36:59 +0000 (11:36 +0200)]

add security issue disclosure policy

Sarah Hoffmann [Wed, 7 Jul 2021 12:39:53 +0000 (14:39 +0200)]

Merge pull request #2384 from lonvia/actions-add-icu-tokenizer

CI: run tests on Ubuntu 18

Sarah Hoffmann [Tue, 6 Jul 2021 21:04:01 +0000 (23:04 +0200)]

add missing pyyaml requirement

Sarah Hoffmann [Tue, 6 Jul 2021 20:52:57 +0000 (22:52 +0200)]

enable PHP 7.2 for Ubuntu 18 CI

Sarah Hoffmann [Tue, 6 Jul 2021 14:10:18 +0000 (16:10 +0200)]

cannot use capture_output in subprocess.run

Only available since Python 3.7.
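
For reference, output can be captured on Python 3.6 by passing the pipes explicitly, which is what `capture_output=True` abbreviates from 3.7 on. The wrapper `run_captured` below is a hypothetical helper for illustration, not the function changed in this commit.

```python
import subprocess
import sys


def run_captured(cmd):
    """Run cmd and return (returncode, stdout, stderr) as text.

    Uses explicit pipes and universal_newlines, both available in
    Python 3.6, instead of capture_output and text.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True, check=False)
    return proc.returncode, proc.stdout, proc.stderr


# Example call: run the current interpreter with a one-line script.
code, out, err = run_captured([sys.executable, "-c", "print('ok')"])
```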

Sarah Hoffmann [Tue, 6 Jul 2021 07:54:11 +0000 (09:54 +0200)]

remove default parameter for namedtuple

This is only available since Python 3.7.
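
On Python 3.6 the same effect can be had by assigning the defaults to the generated `__new__`. The tuple `Point` below is a made-up example type; whether the commit used this idiom or simply passed all fields explicitly is not stated in the message.

```python
from collections import namedtuple

# Equivalent of namedtuple(..., defaults=("",)) without the 3.7 parameter.
Point = namedtuple("Point", ["x", "y", "label"])

# Defaults apply to the rightmost fields, so only 'label' becomes optional.
Point.__new__.__defaults__ = ("",)

p = Point(1, 2)
```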

Sarah Hoffmann [Mon, 5 Jul 2021 15:15:07 +0000 (17:15 +0200)]

CI: run tests on older Ubuntu version as well

Sarah Hoffmann [Mon, 5 Jul 2021 10:34:34 +0000 (12:34 +0200)]

Merge pull request #2382 from lonvia/remove-json-config

Remove outdated ICU tokenizer JSON config

Sarah Hoffmann [Mon, 5 Jul 2021 10:34:16 +0000 (12:34 +0200)]

Merge pull request #2383 from lonvia/remove-more-names

Exclude name:etymology and name:signed

Sarah Hoffmann [Mon, 5 Jul 2021 09:04:16 +0000 (11:04 +0200)]

exclude name:etymology and name:signed

name:etymology contains a description of the name origin and is
thus more informative than search-worthy.

name:signed basically indicates that the feature does not have
a name.

Sarah Hoffmann [Mon, 5 Jul 2021 09:01:35 +0000 (11:01 +0200)]

remove outdated ICU tokenizer JSON config

Sarah Hoffmann [Mon, 5 Jul 2021 08:32:38 +0000 (10:32 +0200)]

Merge pull request #2371 from lonvia/increase-python-version

Increase minimum required Python version to 3.6

Sarah Hoffmann [Mon, 5 Jul 2021 08:32:16 +0000 (10:32 +0200)]

Merge pull request #2381 from lonvia/reorganise-abbreviations

Reorganise abbreviation handling

Sarah Hoffmann [Sun, 4 Jul 2021 08:44:58 +0000 (10:44 +0200)]

add warning about experimental nature of ICU tokenizer

Sarah Hoffmann [Fri, 2 Jul 2021 14:42:13 +0000 (16:42 +0200)]

limit the number of variants that can be produced

Sarah Hoffmann [Fri, 2 Jul 2021 13:05:17 +0000 (15:05 +0200)]

restrict partial word counting to names of reasonable length

The partial word count does not split names to save a bit of time.
The result is that it might encounter unreasonably long names
which in truth consist of multiple words. No accurate statistics
are needed so simply restrict the count to words shorter than
75 characters.
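
The length restriction amounts to a simple filter while counting. The sketch below assumes the names arrive as a plain list of strings and uses a hypothetical function name; the real counting may well happen in SQL.

```python
from collections import Counter

MAX_WORD_LENGTH = 75  # limit taken from the commit message


def count_words(names):
    """Count names as they are, skipping unreasonably long ones."""
    counts = Counter()
    for name in names:
        # Names are not split, so an over-long entry is most likely
        # several words in truth and would only distort the statistics.
        if len(name) < MAX_WORD_LENGTH:
            counts[name] += 1
    return counts
```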

Sarah Hoffmann [Thu, 1 Jul 2021 15:56:23 +0000 (17:56 +0200)]

fix subsequent replacements

Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbid composing a word after a space was added at the
end by a previous replacement.

Sarah Hoffmann [Wed, 30 Jun 2021 19:52:33 +0000 (21:52 +0200)]

leave ICU variant properties empty for now

Saving unused properties causes unnecessary duplicates.

Sarah Hoffmann [Wed, 30 Jun 2021 19:37:29 +0000 (21:37 +0200)]

import abbreviations from OSM Wiki

Replaces the variant rules with a slightly cleaned-up
version of the abbreviation lists at
https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations

Sarah Hoffmann [Sat, 26 Jun 2021 17:38:08 +0000 (19:38 +0200)]

improve normalization

Make sure all special symbols are removed during normalization already.
Those won't be interpreted in any way because they are unlikely to be
searched for.

Sarah Hoffmann [Sat, 26 Jun 2021 09:57:09 +0000 (11:57 +0200)]

only consider partials in multi-words for initial count

This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.
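
One possible reading of this rule, sketched in Python: words only contribute to the frequency count when they occur as part of a multi-word name. The function name and the whitespace splitting are assumptions for illustration.

```python
from collections import Counter


def count_partials(names):
    """Count words, but only those appearing inside multi-word names."""
    counts = Counter()
    for name in names:
        words = name.split()
        # A name consisting of a single word does not add to the count,
        # so a frequent stand-alone name is less likely to be excluded.
        if len(words) > 1:
            counts.update(words)
    return counts
```

With the input `["hauptstrasse", "alte hauptstrasse", "alte post"]`, the stand-alone `hauptstrasse` is ignored and the word is counted once, from `alte hauptstrasse` only.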

Sarah Hoffmann [Sat, 26 Jun 2021 08:13:33 +0000 (10:13 +0200)]

add documentation for ICU tokenizer configuration

Sarah Hoffmann [Thu, 24 Jun 2021 18:02:07 +0000 (20:02 +0200)]

switch to a more flexible variant description format

The new format combines compound splitting and abbreviation.
It also allows restricting rules by additional conditions
(like language or region). This latter ability is not used
yet.

Sarah Hoffmann [Sun, 20 Jun 2021 21:45:33 +0000 (23:45 +0200)]

use yaml tag syntax to mark include files

Sarah Hoffmann [Tue, 15 Jun 2021 07:02:17 +0000 (09:02 +0200)]

add dependency on datrie

Sarah Hoffmann [Tue, 15 Jun 2021 06:59:03 +0000 (08:59 +0200)]

tests for composing decomposed suffixes

Sarah Hoffmann [Fri, 11 Jun 2021 08:03:31 +0000 (10:03 +0200)]

make compound decomposition a pure import feature

Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compound decomposition lists for an existing database.
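
Producing the decomposed form as an extra name variant at import time might look like the sketch below. The suffix list and the function `name_variants` are hypothetical; the real rules come from the tokenizer configuration.

```python
SUFFIXES = ("strasse", "weg")  # hypothetical decomposition list


def name_variants(name):
    """Return the name plus a decomposed variant for each matching suffix."""
    variants = {name}
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            stem = name[:-len(suffix)]
            # Skip names that are already written as separate words.
            if not stem.endswith(" "):
                variants.add(stem + " " + suffix)
    return variants
```

Since both spellings are stored with the name, the query side only has to normalize the input and look it up.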

Sarah Hoffmann [Thu, 10 Jun 2021 15:18:23 +0000 (17:18 +0200)]

complete tests for icu tokenizer

Sarah Hoffmann [Thu, 10 Jun 2021 08:28:46 +0000 (10:28 +0200)]

fix full term token in special phrases

Sarah Hoffmann [Thu, 10 Jun 2021 08:06:49 +0000 (10:06 +0200)]

complete tests for rule loader

Sarah Hoffmann [Thu, 10 Jun 2021 07:36:43 +0000 (09:36 +0200)]

correctly quote strings when copying in data

Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
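
A minimal sketch of such a class, assuming PostgreSQL's COPY text format (tab-separated columns, `\N` for NULL, backslash escapes for tab, newline, carriage return and backslash). The class name `CopyBuffer` and its methods are illustrative and may differ from the actual implementation.

```python
import io


class CopyBuffer:
    """Hypothetical buffer collecting rows in COPY text format."""

    def __init__(self):
        self.buffer = io.StringIO()

    @staticmethod
    def quote(value):
        """Escape one column value; None becomes the NULL marker."""
        if value is None:
            return "\\N"
        # The backslash must be escaped first, or later escapes get doubled.
        return (str(value).replace("\\", "\\\\")
                          .replace("\t", "\\t")
                          .replace("\n", "\\n")
                          .replace("\r", "\\r"))

    def add(self, *columns):
        """Append one row with correctly escaped columns."""
        self.buffer.write("\t".join(self.quote(c) for c in columns) + "\n")

    def getvalue(self):
        return self.buffer.getvalue()
```

Callers then never assemble copy lines by hand, so a tab or backslash inside a name cannot break the row structure.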

Sarah Hoffmann [Wed, 9 Jun 2021 13:07:36 +0000 (15:07 +0200)]

update unit tests for adapted abbreviation code

Sarah Hoffmann [Wed, 9 Jun 2021 08:53:39 +0000 (10:53 +0200)]

add abbreviations from legacy tokenizer

These abbreviations are not a perfect fit anymore because
abbreviation replacement is now applied before transliteration.

Sarah Hoffmann [Sun, 6 Jun 2021 09:00:44 +0000 (11:00 +0200)]

adapt tests for ICU tokenizer

Sarah Hoffmann [Fri, 28 May 2021 20:06:13 +0000 (22:06 +0200)]

move abbreviation computation into import phase

This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency on the Python library datrie.

Sarah Hoffmann [Wed, 26 May 2021 18:50:34 +0000 (20:50 +0200)]

icu tokenizer: move transliteration rules into a separate file

The tokenizer configuration has become difficult to handle
due to the additional manual transliteration rules. Allow
a separate rule file that is given to the ICU library
as is.

Sarah Hoffmann [Sat, 3 Jul 2021 19:14:43 +0000 (21:14 +0200)]

docs: nominatim-ui should be installed from the release

The development version does not provide the pre-packaged
dist directory anymore.

Sarah Hoffmann [Sat, 26 Jun 2021 14:21:08 +0000 (16:21 +0200)]

Merge pull request #2373 from lonvia/tweak-search-cost

Further tweaking of search cost

Sarah Hoffmann [Sat, 26 Jun 2021 09:20:25 +0000 (11:20 +0200)]

remove penalty for full words in address

Now that multi-word partials no longer exist, multi-word full
words need to be used to search in addresses and therefore no
longer should have a penalty.

Also changes the condition when a full word is included into
the address. It is no longer relevant if an equivalent partial
exists but only if the term consists of more than one word.

Sarah Hoffmann [Sat, 26 Jun 2021 08:31:55 +0000 (10:31 +0200)]

adjust penalty for housenumber-in-name searches

When searching for house numbers in the name (for place-only
terms) then the same penalties need to apply as for the
regular house number search.

Change the code to first compute the penalties and then create
the new search variants.

Sarah Hoffmann [Mon, 21 Jun 2021 14:32:54 +0000 (16:32 +0200)]

increase minimum Python to 3.6

Python 3.6 introduces formatted string literals and
flag enums as well as a much faster dict implementation.
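
Both features in a short example. The enum `PhraseType` and the variables are made up for illustration; only `enum.Flag`, `enum.auto` and f-strings themselves are the Python 3.6 additions referred to.

```python
from enum import Flag, auto


class PhraseType(Flag):
    """Hypothetical flag enum; members can be combined with '|'."""
    NAME = auto()
    STREET = auto()
    CITY = auto()


# Flags combine into sets that still support membership tests.
allowed = PhraseType.NAME | PhraseType.STREET

count = 3
kind = PhraseType.STREET
# Formatted string literal: expressions are evaluated inside the braces.
message = f"{count} tokens of type {kind.name}"
```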
These changes make the code so much simpler as to warrant
dropping Python 3.5 support.

Affected distributions are Ubuntu 16.04 and Debian Stretch.

Sarah Hoffmann [Fri, 18 Jun 2021 08:58:41 +0000 (10:58 +0200)]

make sure old data gets deleted on place type change

When changing from some other place type to place=postcode
make sure that the old place type entry in the place table
is deleted.

Sarah Hoffmann [Thu, 17 Jun 2021 22:28:10 +0000 (00:28 +0200)]

update postcode in place if it already exists

Sarah Hoffmann [Thu, 17 Jun 2021 13:30:05 +0000 (15:30 +0200)]

Merge pull request #2369 from lonvia/exclude-poi-from-housenumber-search

Do not return POIs when dropping house number in query

Sarah Hoffmann [Thu, 17 Jun 2021 10:05:33 +0000 (12:05 +0200)]

do not return POIs when dropping house number in query

We've previously added searching through rank 30 in a house
number search to enable searches for house number+name.
This had the unintended side effect that rank 30 objects
are also returned in a search that dropped the house number
from the query. This is wrong because POIs cannot function
as a parent to a house number.

This fix drops all rank 30 objects from the results for a
house number search if they do not match the requested house
number.
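
The filter described above, reduced to a Python sketch over plain dictionaries. The keys `rank` and `housenumber` and the function name are assumptions; the actual fix lives in the search code and its SQL.

```python
POI_RANK = 30  # rank of POI-level objects, as named in the commit message


def filter_results(results, housenumber):
    """Drop rank 30 objects whose house number is not the requested one."""
    kept = []
    for place in results:
        if place["rank"] == POI_RANK and place.get("housenumber") != housenumber:
            # A POI cannot be the parent of a house number, so it is no
            # valid fallback when the number itself did not match.
            continue
        kept.append(place)
    return kept
```

Streets and other lower-rank objects pass through unchanged, since they can legitimately act as the parent of the requested house number.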

Sarah Hoffmann [Wed, 16 Jun 2021 09:45:07 +0000 (11:45 +0200)]

Merge pull request #2360 from AntoJvlt/postcodes-place-table

Use place instead of placex to compute postcodes

AntoJvlt [Sat, 12 Jun 2021 13:46:08 +0000 (15:46 +0200)]

Improved performance of the postcodes query and some code cleaning

AntoJvlt [Sat, 12 Jun 2021 13:35:51 +0000 (15:35 +0200)]

Always delete old placex entry for type=postcode when inserting a new one into the place table

AntoJvlt [Wed, 9 Jun 2021 07:24:25 +0000 (09:24 +0200)]

Handle postcode type change in place insert trigger