git.openstreetmap.org Git - nominatim.git/log
Sarah Hoffmann [Tue, 10 Aug 2021 12:51:35 +0000 (14:51 +0200)]
define formal public Python interface for tokenizer

This introduces an abstract class for the Tokenizer/Analyzer
for documentation purposes.

Sarah Hoffmann [Sat, 31 Jul 2021 07:49:29 +0000 (09:49 +0200)]
docs: querying and tokenizers

Sarah Hoffmann [Thu, 29 Jul 2021 18:54:33 +0000 (20:54 +0200)]
docs: add developer doc page for Tokenizer

Sarah Hoffmann [Mon, 16 Aug 2021 06:48:28 +0000 (08:48 +0200)]
Merge pull request #2424 from lonvia/multi-country-import

Update instructions for importing multiple regions

Sarah Hoffmann [Sun, 15 Aug 2021 20:00:50 +0000 (22:00 +0200)]
Merge pull request #2423 from hummeltech/patch-1

Fix old paths for `phpcs` when using `make test`

Sarah Hoffmann [Sun, 15 Aug 2021 15:49:22 +0000 (17:49 +0200)]
ignore words without id for status

Sarah Hoffmann [Sun, 15 Aug 2021 10:24:13 +0000 (12:24 +0200)]
split up large setup function

Sarah Hoffmann [Sat, 14 Aug 2021 21:48:06 +0000 (23:48 +0200)]
port multi-region update scripts to nominatim tool

Also updates the documentation. For the simple case of just
importing multiple regions, provide simplified instructions
that use the new multi-file import feature.

Fixes #2365.

Sarah Hoffmann [Sat, 14 Aug 2021 20:46:35 +0000 (22:46 +0200)]
update osm2pgsql to 1.5.1

Sarah Hoffmann [Sat, 14 Aug 2021 19:42:21 +0000 (21:42 +0200)]
allow multiple files for the import command

The files are forwarded to osm2pgsql which is now able to merge
them correctly.

David Hummel [Thu, 12 Aug 2021 20:34:18 +0000 (13:34 -0700)]
Fix old paths for `phpcs` when using `make test`

These paths no longer exist since db3ced17bbfff00411f506d8c84419c875959d5e; they are now all located under `lib-php`.

Sarah Hoffmann [Sun, 8 Aug 2021 09:09:36 +0000 (11:09 +0200)]
Merge pull request #2413 from osm-search/helm-chart

Installation docs - link to Kubernetes install project

mtmail [Tue, 3 Aug 2021 10:02:35 +0000 (12:02 +0200)]
Installation docs - link to Kubernetes install project

As reported by @robjuz in https://github.com/osm-search/Nominatim/discussions/2412

Sarah Hoffmann [Wed, 28 Jul 2021 12:28:49 +0000 (14:28 +0200)]
Merge pull request #2408 from lonvia/icu-change-word-table-layout

Change table layout of word table for ICU tokenizer

Sarah Hoffmann [Sun, 25 Jul 2021 14:29:04 +0000 (16:29 +0200)]
php: force use of global Exception class

Sarah Hoffmann [Sun, 25 Jul 2021 13:30:47 +0000 (15:30 +0200)]
fix Python linting errors

Sarah Hoffmann [Sun, 25 Jul 2021 13:13:49 +0000 (15:13 +0200)]
fix linting issues in PHP

Sarah Hoffmann [Sun, 25 Jul 2021 13:08:11 +0000 (15:08 +0200)]
reinstate word column in icu word table

PostgreSQL is very bad at creating statistics for jsonb
columns. The result is that the query planner tends to
use JIT for queries with a where over 'info' even when
there is an index.

Sarah Hoffmann [Sat, 24 Jul 2021 10:12:31 +0000 (12:12 +0200)]
bdd tests: do not query word table directly

The BDD tests cannot make assumptions about the structure of the
word table anymore because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.

Sarah Hoffmann [Thu, 22 Jul 2021 15:24:43 +0000 (17:24 +0200)]
adapt unit test for new word table

Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.

Sarah Hoffmann [Wed, 21 Jul 2021 09:37:14 +0000 (11:37 +0200)]
convert word info column to json before copying

Sarah Hoffmann [Wed, 21 Jul 2021 08:52:34 +0000 (10:52 +0200)]
adapt special terms lookup to new word table

Sarah Hoffmann [Wed, 21 Jul 2021 08:41:38 +0000 (10:41 +0200)]
switch word tokens to new word table layout

Sarah Hoffmann [Tue, 20 Jul 2021 19:11:01 +0000 (21:11 +0200)]
switch special phrases to new word table format

Sarah Hoffmann [Tue, 20 Jul 2021 10:11:12 +0000 (12:11 +0200)]
switch postcode tokens to new word table layout

Sarah Hoffmann [Tue, 20 Jul 2021 09:36:20 +0000 (11:36 +0200)]
switch housenumber tokens to new word table layout

Sarah Hoffmann [Tue, 20 Jul 2021 09:21:13 +0000 (11:21 +0200)]
switch country name tokens to new word table layout

Sarah Hoffmann [Tue, 20 Jul 2021 08:27:06 +0000 (10:27 +0200)]
new word table layout for icu tokenizer

The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.

Sarah Hoffmann [Wed, 28 Jul 2021 09:28:49 +0000 (11:28 +0200)]
fix typos in tokenizer docs

Sarah Hoffmann [Mon, 26 Jul 2021 10:38:56 +0000 (12:38 +0200)]
Merge pull request #2401 from lonvia/port-add-data-to-python

Port add-data functions from PHP to Python

Sarah Hoffmann [Sun, 25 Jul 2021 21:44:22 +0000 (23:44 +0200)]
adapt cli tests to Python port for add-data

Sarah Hoffmann [Sun, 25 Jul 2021 21:30:46 +0000 (23:30 +0200)]
remove unused update script

Sarah Hoffmann [Sun, 25 Jul 2021 21:29:15 +0000 (23:29 +0200)]
replace add-data function with native Python code

Sarah Hoffmann [Sun, 25 Jul 2021 16:14:12 +0000 (18:14 +0200)]
move add-data subcommand into a separate file

Sarah Hoffmann [Tue, 20 Jul 2021 08:08:31 +0000 (10:08 +0200)]
fix parameters for TokenWord creation

Sarah Hoffmann [Mon, 19 Jul 2021 12:28:02 +0000 (14:28 +0200)]
Merge pull request #2397 from lonvia/increase-minimum-required-versions

Increase minimum required PostgreSQL version to 9.5

Sarah Hoffmann [Mon, 19 Jul 2021 08:24:57 +0000 (10:24 +0200)]
remove special code for pre9.5 postgresql

9.5 is now the minimum requirement.

Sarah Hoffmann [Mon, 19 Jul 2021 08:15:32 +0000 (10:15 +0200)]
increase minimum version for PostgreSQL to 9.5

This is the minimum version we can test with the CI.
With 9.5 there is also complete support for jsonb available.

Sarah Hoffmann [Mon, 19 Jul 2021 08:14:14 +0000 (10:14 +0200)]
require Python 3.6 also in CMakeFile

This had been forgotten when increasing the minimum Python version.

Sarah Hoffmann [Mon, 19 Jul 2021 07:42:37 +0000 (09:42 +0200)]
Merge pull request #2396 from lonvia/partial-word-token

Reorganise code that builds the SearchDescription

Sarah Hoffmann [Sun, 18 Jul 2021 18:20:22 +0000 (20:20 +0200)]
make all Token members private

Sarah Hoffmann [Sun, 18 Jul 2021 14:52:37 +0000 (16:52 +0200)]
merge marking rare name with adding name token

Only name tokens can be rare, so this should be the same
function.

Sarah Hoffmann [Sun, 18 Jul 2021 14:10:42 +0000 (16:10 +0200)]
add documentation for public interface of SearchDescription

Sarah Hoffmann [Sat, 17 Jul 2021 20:01:35 +0000 (22:01 +0200)]
factor out check if a token fits current search

Saves allocating an empty array.

Sarah Hoffmann [Sat, 17 Jul 2021 18:24:33 +0000 (20:24 +0200)]
move SearchDescription building into tokens

Moving the logic for extending the SearchDescription into the
token classes splits up the code and makes it more readable.
More importantly, it allows tokenizers to define custom token
classes in the future.

Sarah Hoffmann [Thu, 15 Jul 2021 12:48:20 +0000 (14:48 +0200)]
remove Token from explicit input for SearchDescription extension

The token string is only required by the PartialToken type, so
it can simply save the token string internally. No need to pass
it to every type.

Also moves the check for multi-word partials to the token loader
code in the tokenizer. Multi-word partials can only happen with
the legacy tokenizer and when the database was loaded with an
older version of Nominatim. No need to keep the check for
everybody.

Sarah Hoffmann [Thu, 15 Jul 2021 12:12:59 +0000 (14:12 +0200)]
factor out query position

Moves token and phrase position and phrase type into a separate
class that is handed in when assembling the search description.
This drastically reduces the number of parameters for the function
to extend the search descriptions and gives us more flexibility
in the future for more complex positional analysis.

Sarah Hoffmann [Wed, 14 Jul 2021 20:17:17 +0000 (22:17 +0200)]
remove special status of partial tokens

Full-word tokens are no longer marked by a space at the
beginning of the token. Use the new Partial token category
instead. This removes a couple of special cases that we don't
really need.

The word table still has the space for compatibility reasons,
so the tokenizer code needs to get rid of it when loading the
tokens.

Sarah Hoffmann [Tue, 13 Jul 2021 14:54:51 +0000 (16:54 +0200)]
introduce a separate token type for partials

This means that the leading space can be removed as a partial
word indicator.

Sarah Hoffmann [Tue, 13 Jul 2021 14:46:12 +0000 (16:46 +0200)]
Merge pull request #2393 from lonvia/fix-flake8-issues

Fix flake8 issues

Sarah Hoffmann [Mon, 12 Jul 2021 20:05:22 +0000 (22:05 +0200)]
use psycopg's SQL quoting where possible

Use the SQL formatting supplied with psycopg whenever the
query needs to be put together from snippets.

Sarah Hoffmann [Mon, 12 Jul 2021 19:08:20 +0000 (21:08 +0200)]
add helper function for execute_values

Make psycopg2's convenience function accessible through
the cursor.

Sarah Hoffmann [Mon, 12 Jul 2021 18:32:46 +0000 (20:32 +0200)]
provide wrapper function for DROP TABLE

Use psycopg2 formatting to ensure correct quoting.

Sarah Hoffmann [Mon, 12 Jul 2021 15:45:42 +0000 (17:45 +0200)]
more formatting fixes

Found by flake8.

Sarah Hoffmann [Mon, 12 Jul 2021 15:14:59 +0000 (17:14 +0200)]
Merge pull request #2391 from lonvia/fix-sonar-issues

Fix bugs and code smells found by Sonarqube

Sarah Hoffmann [Mon, 12 Jul 2021 12:58:44 +0000 (14:58 +0200)]
factor out connection reset code

Sarah Hoffmann [Mon, 12 Jul 2021 12:47:50 +0000 (14:47 +0200)]
simplify analyse function

Sarah Hoffmann [Mon, 12 Jul 2021 12:43:50 +0000 (14:43 +0200)]
split up variant computation for better readability

Sarah Hoffmann [Mon, 12 Jul 2021 09:53:25 +0000 (11:53 +0200)]
reorganise process_place function

Move address processing into its own function as it is
rather extensive.

Sarah Hoffmann [Mon, 12 Jul 2021 09:41:05 +0000 (11:41 +0200)]
simplify website setup code

Use format strings and move variable quoting code into an extra
function.

Sarah Hoffmann [Mon, 12 Jul 2021 09:33:09 +0000 (11:33 +0200)]
avoid repeated patterns for table name

Sarah Hoffmann [Sun, 11 Jul 2021 22:16:25 +0000 (00:16 +0200)]
simplify if statements

Sarah Hoffmann [Sun, 11 Jul 2021 21:48:16 +0000 (23:48 +0200)]
convert single case switch to if statement

Sarah Hoffmann [Sun, 11 Jul 2021 21:22:16 +0000 (23:22 +0200)]
avoid local variable assignment

Sarah Hoffmann [Sun, 11 Jul 2021 18:21:12 +0000 (20:21 +0200)]
fix more missing braces on one-liners

Sarah Hoffmann [Sun, 11 Jul 2021 18:14:25 +0000 (20:14 +0200)]
remove dead code

Sarah Hoffmann [Sun, 11 Jul 2021 18:10:13 +0000 (20:10 +0200)]
do not intermix params with and without default

Sarah Hoffmann [Sun, 11 Jul 2021 17:24:04 +0000 (19:24 +0200)]
directly return data in function

The temporary variable is not necessary.

Sarah Hoffmann [Sun, 11 Jul 2021 17:11:37 +0000 (19:11 +0200)]
remove unnecessarily nested ifs

Found by Sonarqube.

Sarah Hoffmann [Sun, 11 Jul 2021 17:10:04 +0000 (19:10 +0200)]
remove unused functions

The functions were needed by the transitional code during the
move to Python and are no longer used.

Sarah Hoffmann [Sun, 11 Jul 2021 16:23:42 +0000 (18:23 +0200)]
avoid multiple returns of same value

Found by Sonarqube.

Sarah Hoffmann [Sat, 10 Jul 2021 12:59:38 +0000 (14:59 +0200)]
always use brackets on if statements

This adds brackets around all one-line if statements that did
not have them yet.

Sarah Hoffmann [Fri, 9 Jul 2021 14:36:42 +0000 (16:36 +0200)]
remove unused variables

As reported by sonarqube.

Sarah Hoffmann [Fri, 9 Jul 2021 10:50:35 +0000 (12:50 +0200)]
fix bad use of echo in PHP output

Sarah Hoffmann [Fri, 9 Jul 2021 10:32:37 +0000 (12:32 +0200)]
Merge pull request #2390 from lonvia/responsible-disclosure

Add security issue disclosure policy

Sarah Hoffmann [Fri, 9 Jul 2021 09:36:59 +0000 (11:36 +0200)]
add security issue disclosure policy

Sarah Hoffmann [Wed, 7 Jul 2021 12:39:53 +0000 (14:39 +0200)]
Merge pull request #2384 from lonvia/actions-add-icu-tokenizer

CI: run tests on Ubuntu 18

Sarah Hoffmann [Tue, 6 Jul 2021 21:04:01 +0000 (23:04 +0200)]
add missing pyyaml requirement

Sarah Hoffmann [Tue, 6 Jul 2021 20:52:57 +0000 (22:52 +0200)]
enable PHP 7.2 for Ubuntu 18 CI

Sarah Hoffmann [Tue, 6 Jul 2021 14:10:18 +0000 (16:10 +0200)]
cannot use capture_output in subprocess.run

Only available since Python 3.7.
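
A minimal sketch of a spelling that also works on Python 3.6, assuming the goal is simply to collect both output streams. The helper name `run_captured` is made up for this illustration and is not taken from the Nominatim code:

```python
import subprocess
import sys


def run_captured(cmd):
    """Run cmd and capture stdout and stderr, also on Python 3.6."""
    # capture_output=True only exists since Python 3.7. Passing PIPE for
    # both streams is the equivalent spelling on older versions.
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, check=True)


# Use the running interpreter as a portable example command.
result = run_captured([sys.executable, "-c", "print('ok')"])
print(result.stdout.decode().strip())
```

The captured streams come back as bytes here; decoding is left to the caller.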

Sarah Hoffmann [Tue, 6 Jul 2021 07:54:11 +0000 (09:54 +0200)]
remove default parameter for namedtuple

This is only available in Python 3.7 and later.
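
For illustration only, one common way to get default values on a namedtuple without the `defaults` argument. The commit itself may simply have dropped the defaults and passed every field explicitly; the type `Example` and its fields are hypothetical:

```python
from collections import namedtuple

# Hypothetical tuple type, not taken from the Nominatim code.
Example = namedtuple("Example", ["x", "y", "label"])

# Python 3.7+ could pass defaults=(None,) to namedtuple() directly.
# On 3.6 the defaults can be attached to __new__ instead; they apply
# to the rightmost fields, here only to 'label'.
Example.__new__.__defaults__ = (None,)

item = Example(1, 2)
print(item)
```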

Sarah Hoffmann [Mon, 5 Jul 2021 15:15:07 +0000 (17:15 +0200)]
CI: run tests on older Ubuntu version as well

Sarah Hoffmann [Mon, 5 Jul 2021 10:34:34 +0000 (12:34 +0200)]
Merge pull request #2382 from lonvia/remove-json-config

Remove outdated ICU tokenizer JSON config

Sarah Hoffmann [Mon, 5 Jul 2021 10:34:16 +0000 (12:34 +0200)]
Merge pull request #2383 from lonvia/remove-more-names

Exclude name:etymology and name:signed

Sarah Hoffmann [Mon, 5 Jul 2021 09:04:16 +0000 (11:04 +0200)]
exclude name:etymology and name:signed

name:etymology contains a description of the name origin and is
thus more informative than search-worthy.

name:signed basically indicates that the feature does not have
a name.

Sarah Hoffmann [Mon, 5 Jul 2021 09:01:35 +0000 (11:01 +0200)]
remove outdated ICU tokenizer JSON config

Sarah Hoffmann [Mon, 5 Jul 2021 08:32:38 +0000 (10:32 +0200)]
Merge pull request #2371 from lonvia/increase-python-version

Increase minimum required Python version to 3.6

Sarah Hoffmann [Mon, 5 Jul 2021 08:32:16 +0000 (10:32 +0200)]
Merge pull request #2381 from lonvia/reorganise-abbreviations

Reorganise abbreviation handling

Sarah Hoffmann [Sun, 4 Jul 2021 08:44:58 +0000 (10:44 +0200)]
add warning about experimental nature of ICU tokenizer

Sarah Hoffmann [Fri, 2 Jul 2021 14:42:13 +0000 (16:42 +0200)]
limit the number of variants that can be produced

Sarah Hoffmann [Fri, 2 Jul 2021 13:05:17 +0000 (15:05 +0200)]
restrict partial word counting to names of reasonable length

The partial word count does not split names to save a bit of time.
The result is that it might encounter unreasonably long names
which in truth consist of multiple words. No accurate statistics
are needed, so simply restrict the count to words shorter than
75 characters.

Sarah Hoffmann [Thu, 1 Jul 2021 15:56:23 +0000 (17:56 +0200)]
fix subsequent replacements

Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbid composing a word after a space was added at the
end by a previous replacement.

Sarah Hoffmann [Wed, 30 Jun 2021 19:52:33 +0000 (21:52 +0200)]
leave ICU variant properties empty for now

Saving unused properties causes unnecessary duplicates.

Sarah Hoffmann [Wed, 30 Jun 2021 19:37:29 +0000 (21:37 +0200)]
import abbreviations from OSM Wiki

Replaces the variant rules with a slightly cleaned-up
version of the abbreviation lists at
https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations

Sarah Hoffmann [Sat, 26 Jun 2021 17:38:08 +0000 (19:38 +0200)]
improve normalization

Make sure all special symbols are removed during normalization already.
Those won't be interpreted in any way because they are unlikely to be
searched for.

Sarah Hoffmann [Sat, 26 Jun 2021 09:57:09 +0000 (11:57 +0200)]
only consider partials in multi-words for initial count

This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.

Sarah Hoffmann [Sat, 26 Jun 2021 08:13:33 +0000 (10:13 +0200)]
add documentation for ICU tokenizer configuration

Sarah Hoffmann [Thu, 24 Jun 2021 18:02:07 +0000 (20:02 +0200)]
switch to a more flexible variant description format

The new format combines compound splitting and abbreviation.
It also allows rules to be restricted by additional conditions
(like language or region). This latter ability is not used
yet.

Sarah Hoffmann [Sun, 20 Jun 2021 21:45:33 +0000 (23:45 +0200)]
use yaml tag syntax to mark include files

Sarah Hoffmann [Tue, 15 Jun 2021 07:02:17 +0000 (09:02 +0200)]
add dependency on datrie