git.openstreetmap.org Git - nominatim.git/log
nominatim.git
3 years agoreorganize and complete tests around generic token analysis
Sarah Hoffmann [Wed, 6 Oct 2021 15:03:37 +0000 (17:03 +0200)]
reorganize and complete tests around generic token analysis

3 years agoadd tests for sanitizer tagging language
Sarah Hoffmann [Wed, 6 Oct 2021 10:29:25 +0000 (12:29 +0200)]
add tests for sanitizer tagging language

3 years agoapply variants by languages
Sarah Hoffmann [Tue, 5 Oct 2021 15:18:10 +0000 (17:18 +0200)]
apply variants by languages

Adds a tagger for names by language so that the analyzer of that
language is used. Thus variants are now only applied to names
in the specific language and only to name tags, no longer to
reference-like tags.

3 years agouse analyser provided in the 'analyzer' property
Sarah Hoffmann [Tue, 5 Oct 2021 12:10:32 +0000 (14:10 +0200)]
use analyser provided in the 'analyzer' property

Implements per-name choice of analyzer. If a non-default
analyzer is chosen, then the 'word' identifier is extended
with the name of the analyzer, so that we still have unique
items.
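
A minimal sketch of how such an identifier could stay unique across analyzers. The function name `make_word_key` and the `@` separator are assumptions made for illustration; the commit message does not say how the identifier is extended.

```python
from typing import Optional


def make_word_key(name: str, analyzer: Optional[str] = None) -> str:
    """Build a lookup key for a name (hypothetical helper).

    The default analyzer is represented by None and keeps the plain name.
    Any other analyzer appends its name, so the same string analysed by
    two different analyzers yields two distinct keys.
    """
    if analyzer is None:
        return name
    # '@' is an assumed separator, chosen only for this sketch.
    return f"{name}@{analyzer}"
```

With this scheme the same name tagged for two analyzers produces two separate entries instead of colliding.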

3 years agoremove support for properties on variants
Sarah Hoffmann [Tue, 5 Oct 2021 08:29:36 +0000 (10:29 +0200)]
remove support for properties on variants

Those are not going to be used in the near future, so no need to
carry that code around just now.

3 years agoprecompute replacements while loading configuration
Sarah Hoffmann [Tue, 5 Oct 2021 08:20:08 +0000 (10:20 +0200)]
precompute replacements while loading configuration

3 years agomove parsing of token analysis config to analyzer
Sarah Hoffmann [Mon, 4 Oct 2021 16:31:58 +0000 (18:31 +0200)]
move parsing of token analysis config to analyzer

Adds a second callback for the analyzer which is responsible
for parsing the configuration rules and converting them to
whatever format necessary. This way, each analyzer implementation
can define its own configuration rules.
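
The two-callback split described above might look roughly like the following. The callback names `configure` and `create`, the `variants` key and the `source => target` rule syntax are all assumptions for this sketch, not the module interface from the commit.

```python
from typing import Any, Callable, Dict, List


def configure(rules: Dict[str, Any]) -> Dict[str, str]:
    """Assumed first callback: parse the raw rules once, at load time."""
    replacements: Dict[str, str] = {}
    for entry in rules.get("variants", []):
        # Each rule is assumed to look like "street => st".
        source, target = (part.strip() for part in entry.split("=>"))
        replacements[source] = target
    return replacements


def create(config: Dict[str, str]) -> Callable[[str], List[str]]:
    """Assumed second callback: build the analysis function from the
    already parsed configuration."""
    def analyze(name: str) -> List[str]:
        variants = [name]
        for source, target in config.items():
            if source in name:
                variants.append(name.replace(source, target))
        return variants

    return analyze
```

Because parsing happens in `configure`, a different analyzer implementation is free to accept a completely different rule format.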

3 years agomake token analyzers configurable modules
Sarah Hoffmann [Mon, 4 Oct 2021 15:34:30 +0000 (17:34 +0200)]
make token analyzers configurable modules

Adds a mandatory section 'analyzer' to the token-analysis entries
which defines which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.

3 years ago extend ICU config to accommodate multiple analysers
Sarah Hoffmann [Mon, 4 Oct 2021 14:40:28 +0000 (16:40 +0200)]
extend ICU config to accommodate multiple analysers

Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' parameter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
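
The uniqueness rule can be sketched as a small indexing step. The function name `index_analysis_entries` and the dictionary layout of an entry are hypothetical; only the rule itself (unique 'id', at most one entry without it) comes from the commit message.

```python
from typing import Any, Dict, List, Optional


def index_analysis_entries(
    entries: List[Dict[str, Any]],
) -> Dict[Optional[str], Dict[str, Any]]:
    """Index configuration entries by their 'id' (hypothetical helper).

    The entry without an 'id' is stored under None and acts as the
    default. A second entry with the same id, or a second entry without
    any id, is rejected.
    """
    indexed: Dict[Optional[str], Dict[str, Any]] = {}
    for entry in entries:
        key = entry.get("id")
        if key in indexed:
            label = "default entry" if key is None else f"id '{key}'"
            raise ValueError(f"duplicate {label} in token analysis config")
        indexed[key] = entry
    return indexed
```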

3 years agomove flatten_config_list into config module
Sarah Hoffmann [Mon, 4 Oct 2021 09:56:54 +0000 (11:56 +0200)]
move flatten_config_list into config module

For general usage by other modules.

3 years agoMerge pull request #2458 from lonvia/add-tokenizer-preprocessing
Sarah Hoffmann [Fri, 1 Oct 2021 19:53:34 +0000 (21:53 +0200)]
Merge pull request #2458 from lonvia/add-tokenizer-preprocessing

Add a "sanitation" step for name and address tags before token processing

3 years agoreplace test variable for PG env tests
Sarah Hoffmann [Fri, 1 Oct 2021 08:51:41 +0000 (10:51 +0200)]
replace test variable for PG env tests

'tty' was removed in PG14 and causes an error.

3 years ago add unit tests for new sanitizer functions
Sarah Hoffmann [Fri, 1 Oct 2021 07:50:17 +0000 (09:50 +0200)]
add unit tests for new sanitizer functions

3 years agointroduce sanitizer step before token analysis
Sarah Hoffmann [Thu, 30 Sep 2021 19:30:13 +0000 (21:30 +0200)]
introduce sanitizer step before token analysis

Sanitizer functions allow transforming name and address tags before
they are handed to the tokenizer. These transformations are visible
only to the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both were previously hard-coded in the tokenizer.
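
To illustrate what the two transformations do, here is a rough sketch. The function names, the default delimiters and the exact handling of brackets are assumptions; the real sanitizers may behave differently, for example by keeping the original name alongside the stripped one.

```python
import re
from typing import List


def split_name_list(value: str, delimiters: str = ";,") -> List[str]:
    """Split a tag value holding several names into single names.

    The default delimiters are an assumption made for this sketch.
    """
    parts = re.split(f"[{re.escape(delimiters)}]", value)
    return [part.strip() for part in parts if part.strip()]


def strip_bracket_addition(value: str) -> str:
    """Remove a trailing addition in round brackets from a name."""
    return re.sub(r"\s*\(.*\)\s*$", "", value).strip()
```

Running both steps before tokenization means the tokenizer only ever sees single, cleaned names.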

3 years agounify ICUNameProcessorRules and ICURuleLoader
Sarah Hoffmann [Wed, 29 Sep 2021 15:37:04 +0000 (17:37 +0200)]
unify ICUNameProcessorRules and ICURuleLoader

There is no need for the additional layer of indirection that
the ICUNameProcessorRules class adds. The ICURuleLoader can
fill the database properties directly.

3 years agofix typo
Sarah Hoffmann [Wed, 29 Sep 2021 12:16:09 +0000 (14:16 +0200)]
fix typo

3 years agoexport more data for the tokenizer name preparation
Sarah Hoffmann [Wed, 29 Sep 2021 09:54:14 +0000 (11:54 +0200)]
export more data for the tokenizer name preparation

Adds class, type, country and rank to the exported information
and removes the rather odd hack for countries. Whether a place
represents a country boundary can now be computed by the tokenizer.

3 years agoadd wrapper class for place data passed to tokenizer
Sarah Hoffmann [Wed, 29 Sep 2021 08:37:54 +0000 (10:37 +0200)]
add wrapper class for place data passed to tokenizer

This is mostly for convenience and documentation purposes.

3 years agoMerge pull request #2455 from lonvia/adjust-address-levels-slovakia
Sarah Hoffmann [Tue, 28 Sep 2021 09:21:08 +0000 (11:21 +0200)]
Merge pull request #2455 from lonvia/adjust-address-levels-slovakia

Adjust address levels for boundaries in Slovakia

3 years agoMerge pull request #2454 from lonvia/sort-out-token-assignment-in-sql
Sarah Hoffmann [Tue, 28 Sep 2021 07:45:15 +0000 (09:45 +0200)]
Merge pull request #2454 from lonvia/sort-out-token-assignment-in-sql

ICU tokenizer: switch match method to using partial terms

3 years agoadjust address levels for boundaries in Slovakia
Sarah Hoffmann [Mon, 27 Sep 2021 21:32:11 +0000 (23:32 +0200)]
adjust address levels for boundaries in Slovakia

Levels chosen according to OSM wiki. Mainly moves admin_level 6
to county level and admin_level 8 to city/town level. Higher
levels are adjusted accordingly.

Fixes #2453.

3 years agoadapt tests to new ICU address token handling
Sarah Hoffmann [Mon, 27 Sep 2021 15:36:23 +0000 (17:36 +0200)]
adapt tests to new ICU address token handling

3 years agoremove unused parameter
Sarah Hoffmann [Mon, 27 Sep 2021 12:58:43 +0000 (14:58 +0200)]
remove unused parameter

3 years agoMerge pull request #2452 from lonvia/update-houses-on-street-name-change
Sarah Hoffmann [Mon, 27 Sep 2021 12:55:50 +0000 (14:55 +0200)]
Merge pull request #2452 from lonvia/update-houses-on-street-name-change

Force update of surrounding houses when street or place name changes

3 years agoicu tokenizer: switch to matching against partial names
Sarah Hoffmann [Thu, 23 Sep 2021 14:57:24 +0000 (16:57 +0200)]
icu tokenizer: switch to matching against partial names

When matching address parts from addr:* tags against place names,
the address names were so far converted to full names and then
compared to the place names. This can become problematic with the new
ICU tokenizer once we introduce creation of different variants
depending on the place name context. It wouldn't be clear which
variant to produce to get a match, so we would have to create all of
them. To work around this issue, switch to using the partial terms
for matching. This introduces a larger fuzziness between matches but
that shouldn't be a problem because matching is always geographically
restricted.

The search terms created for address parts have a different problem:
they are already created before we even know if they are going to be
used. This can lead to spurious entries in the word table, which slows
down searching. This problem can also be circumvented by using only
partial terms for the search terms. In terms of searching that means
that the address terms would not get the full-word boost, but given
that the case where an address part does not exist as an OSM object
should be the exception, this is likely acceptable.
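
The idea of matching on partial terms rather than full names can be sketched as a set comparison. The helper names and the subset semantics (every partial term of the address part must occur in the place name) are assumptions for illustration; real tokens are produced by the tokenizer, not by splitting on whitespace.

```python
from typing import Iterable, Set


def partial_terms(name: str) -> Set[str]:
    """Very rough stand-in for the tokenizer: one partial term per word."""
    return set(name.lower().split())


def matches_by_partials(addr_value: str, place_names: Iterable[str]) -> bool:
    """Return True if all partial terms of the address part occur in at
    least one of the place names (assumed matching rule)."""
    wanted = partial_terms(addr_value)
    if not wanted:
        return False
    return any(wanted <= partial_terms(name) for name in place_names)
```

The match is fuzzier than a full-name comparison: "Main Street" now also matches a place called "Old Main Street". That is tolerable only because candidates are already restricted geographically.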

3 years agoadapt documentation for SQL tokenizer interface
Sarah Hoffmann [Wed, 22 Sep 2021 20:54:14 +0000 (22:54 +0200)]
adapt documentation for SQL tokenizer interface

3 years agomove name matching into tokenizer module
Sarah Hoffmann [Wed, 22 Sep 2021 20:20:02 +0000 (22:20 +0200)]
move name matching into tokenizer module

Instead of requesting the match tokens from the tokenizer
when looking for parent streets/places and address parts,
hand in the saved tokens and ask if they match. This gives
the tokenizer more freedom to decide how name matching
should be done.

3 years agoforce update on rank30 children when place name changes
Sarah Hoffmann [Mon, 27 Sep 2021 09:04:17 +0000 (11:04 +0200)]
force update on rank30 children when place name changes

Name changes may have an effect on parenting. Don't update
surrounding rank30 objects with addr:place tags as this is
potentially too expensive.

3 years agoforce update of surrounding houses when street name changes
Sarah Hoffmann [Mon, 27 Sep 2021 08:20:26 +0000 (10:20 +0200)]
force update of surrounding houses when street name changes

When the street changes its name then this may cause changes
in the parenting of rank-30 objects with an addr:street
tag.

Fixes #2242.

3 years agoslightly increase radius to look for postcodes
Sarah Hoffmann [Fri, 24 Sep 2021 21:56:42 +0000 (23:56 +0200)]
slightly increase radius to look for postcodes

3 years agoMerge pull request #2449 from lonvia/address-ranking-spain
Sarah Hoffmann [Fri, 24 Sep 2021 20:48:21 +0000 (22:48 +0200)]
Merge pull request #2449 from lonvia/address-ranking-spain

Adjust address ranks for Spain

3 years agoadjust address ranks for Spain
Sarah Hoffmann [Fri, 24 Sep 2021 15:37:31 +0000 (17:37 +0200)]
adjust address ranks for Spain

Adjusts levels for boundaries according to the list on
https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative

* no admin_level 5, so drop that from addresses
* admin_level 6 has the province
* admin_level 7 has the county when it exists

Also reranks place=province so that it matches up with
admin_level 6 and introduces place=civil_parish which
is used as a place node for some admin_level=9 boundaries
in Galicia.

3 years agoMerge pull request #2447 from lonvia/fix-dynamic-address-assignment
Sarah Hoffmann [Sun, 19 Sep 2021 13:57:28 +0000 (15:57 +0200)]
Merge pull request #2447 from lonvia/fix-dynamic-address-assignment

Fix dynamic assignment of address parts

3 years agoCI: install locale for CentOS
Sarah Hoffmann [Sun, 19 Sep 2021 11:49:11 +0000 (13:49 +0200)]
CI: install locale for CentOS

3 years agoRemove the installation warning
Sarah Hoffmann [Sun, 19 Sep 2021 11:01:32 +0000 (13:01 +0200)]
Remove the installation warning

Installation has become a lot easier.

3 years agofix dynamic assignment of address parts
Sarah Hoffmann [Sun, 19 Sep 2021 08:54:05 +0000 (10:54 +0200)]
fix dynamic assignment of address parts

A boolean check for dynamic changes of address parts is not
sufficient. The order of choice should be:

 1. an addr:* part matches the name
 2. the address part surrounds the object
 3. the address part was declared as isaddress

The implementation uses a slightly different ordering
to avoid geometry checks unless strictly necessary (isaddress
is false and no matching address).

See #2446.
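
The order of choice listed above can be expressed as a ranking function. The `AddressPart` class and its field names are hypothetical. The sketch also treats the geometry check as a precomputed boolean, whereas the commit explicitly reorders the checks to avoid computing it unless necessary.

```python
from dataclasses import dataclass


@dataclass
class AddressPart:
    """Hypothetical candidate for one address level."""
    name: str
    matches_addr_tag: bool   # an addr:* part matches the name
    surrounds_object: bool   # the address part surrounds the object
    isaddress: bool          # the part was declared as isaddress


def choice_rank(part: AddressPart) -> int:
    """Lower rank means the candidate is preferred."""
    if part.matches_addr_tag:
        return 0
    if part.surrounds_object:
        return 1
    if part.isaddress:
        return 2
    return 3
```

Picking the best candidate is then `min(candidates, key=choice_rank)`.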

3 years agoMerge pull request #2440 from lonvia/generic-config-loader
Sarah Hoffmann [Sat, 4 Sep 2021 15:41:15 +0000 (17:41 +0200)]
Merge pull request #2440 from lonvia/generic-config-loader

Add generic loader for YAML configuration files

3 years agofix indent
Sarah Hoffmann [Sat, 4 Sep 2021 08:30:35 +0000 (10:30 +0200)]
fix indent

3 years agouse yaml config loader for country info
Sarah Hoffmann [Fri, 3 Sep 2021 22:22:21 +0000 (00:22 +0200)]
use yaml config loader for country info

3 years agoadd tests for generic YAML config reader
Sarah Hoffmann [Fri, 3 Sep 2021 20:31:30 +0000 (22:31 +0200)]
add tests for generic YAML config reader

3 years agointroduce generic YAML config loader
Sarah Hoffmann [Fri, 3 Sep 2021 16:16:12 +0000 (18:16 +0200)]
introduce generic YAML config loader

Adds a function to the Configuration class to load a YAML
file. This means that searching for the file is generalised
and works the same now for all configuration files. Changes
the search logic, so that it is always possible to have a
custom version of the configuration file in the project
directory.

Move ICU tokenizer to use new load function.
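
The search logic can be sketched as follows: look in the project directory first, then fall back to the default configuration directory. The function name `find_config_file` and its parameters are assumptions; the sketch only covers locating the file, not parsing the YAML.

```python
from pathlib import Path


def find_config_file(filename: str, project_dir: Path, config_dir: Path) -> Path:
    """Locate a configuration file (hypothetical helper).

    A custom version in the project directory takes precedence over the
    version in the default configuration directory.
    """
    for directory in (project_dir, config_dir):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"{filename} not found in {project_dir} or {config_dir}"
    )
```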

3 years agoMerge pull request #2437 from lonvia/tweak-ranking-searches
Sarah Hoffmann [Fri, 3 Sep 2021 12:16:23 +0000 (14:16 +0200)]
Merge pull request #2437 from lonvia/tweak-ranking-searches

Some more tweaks for search interpretation

3 years agoMerge pull request #2436 from lonvia/country-configuration
Sarah Hoffmann [Fri, 3 Sep 2021 06:55:36 +0000 (08:55 +0200)]
Merge pull request #2436 from lonvia/country-configuration

Move configuration of default languages into a configuration file

3 years agoreduce penalty for special searches by name
Sarah Hoffmann [Thu, 2 Sep 2021 16:13:45 +0000 (18:13 +0200)]
reduce penalty for special searches by name

Additional penalty for special terms with operator None
should only go to near searches. To reduce the number
of produced searches, restrict the none operator to
appear only in conjunction with the name.

3 years agofurther increase penalty on housenumbers without numbers
Sarah Hoffmann [Thu, 2 Sep 2021 16:11:49 +0000 (18:11 +0200)]
further increase penalty on housenumbers without numbers

Make the penalty dependent on the length of the token:
no penalty for one letter house numbers and increasing one
for more letters.
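
A sketch of a length-dependent penalty of this kind. The function name, the step size of 0.5 and the linear growth are made up for illustration; the commit message only states that one-letter house numbers get no penalty and longer ones an increasing penalty.

```python
def housenumber_penalty(token: str, step: float = 0.5) -> float:
    """Extra penalty for a house number token without digits.

    Tokens containing a digit are not penalised here. A single letter is
    free as well; every further letter adds `step` (an assumed value).
    """
    if any(ch.isdigit() for ch in token):
        return 0.0
    return step * max(len(token) - 1, 0)
```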

3 years agoremove language and partition from name import
Sarah Hoffmann [Wed, 1 Sep 2021 09:37:30 +0000 (11:37 +0200)]
remove language and partition from name import

3 years agoread partition and languages from config file
Sarah Hoffmann [Wed, 1 Sep 2021 21:51:53 +0000 (23:51 +0200)]
read partition and languages from config file

3 years agomove country name generation to country_info module
Sarah Hoffmann [Wed, 1 Sep 2021 20:08:39 +0000 (22:08 +0200)]
move country name generation to country_info module

3 years agomove generation of country tables in own module
Sarah Hoffmann [Wed, 1 Sep 2021 14:02:10 +0000 (16:02 +0200)]
move generation of country tables in own module

3 years agoadd country configuration
Sarah Hoffmann [Wed, 1 Sep 2021 09:27:03 +0000 (11:27 +0200)]
add country configuration

The new configuration saves the default language(s) originally
maintained in the OSM wiki as well as the partition information.

3 years agoMerge pull request #2435 from lonvia/simplified-to-traditional-chinese
Sarah Hoffmann [Tue, 31 Aug 2021 13:29:26 +0000 (15:29 +0200)]
Merge pull request #2435 from lonvia/simplified-to-traditional-chinese

icu: normalise simplified to traditional chinese

3 years agoicu: normalise simplified to traditional chinese
Sarah Hoffmann [Tue, 31 Aug 2021 09:18:34 +0000 (11:18 +0200)]
icu: normalise simplified to traditional chinese

The conversion is unambiguous in most cases, so that the
information loss is minimal.

3 years agoMerge pull request #2434 from lonvia/vagrant-scripts-in-actions
Sarah Hoffmann [Sun, 29 Aug 2021 08:11:59 +0000 (10:11 +0200)]
Merge pull request #2434 from lonvia/vagrant-scripts-in-actions

Test installation instructions via CI

3 years agoCI: use packaged source also for test runs
Sarah Hoffmann [Mon, 23 Aug 2021 22:31:20 +0000 (00:31 +0200)]
CI: use packaged source also for test runs

3 years agoCI: unify jobs for different vagrant scripts
Sarah Hoffmann [Mon, 23 Aug 2021 15:41:13 +0000 (17:41 +0200)]
CI: unify jobs for different vagrant scripts

3 years agoadd workflow for centos 8
Sarah Hoffmann [Sun, 22 Aug 2021 16:42:20 +0000 (18:42 +0200)]
add workflow for centos 8

3 years agoCI: use vagrant scripts for import tests
Sarah Hoffmann [Sat, 21 Aug 2021 08:45:22 +0000 (10:45 +0200)]
CI: use vagrant scripts for import tests

Use vanilla docker images of Ubuntu and leave the setup
to the vagrant scripts. Then do the usual import tests.

Also fixes a couple of issues found with the scripts.

3 years agoMerge pull request #2432 from Mastercuber/patch-1
Sarah Hoffmann [Sun, 22 Aug 2021 07:32:31 +0000 (09:32 +0200)]
Merge pull request #2432 from Mastercuber/patch-1

Added postcode

3 years agoAdded postcode
Mastercuber [Sun, 22 Aug 2021 00:52:41 +0000 (02:52 +0200)]
Added postcode

Added postcode to the list of addressdetails

3 years agoAdd link to fixthemap to issue template
Sarah Hoffmann [Sat, 21 Aug 2021 18:36:16 +0000 (20:36 +0200)]
Add link to fixthemap to issue template

3 years agoMerge pull request #2429 from lonvia/place-name-to-admin-boundary
Sarah Hoffmann [Sat, 21 Aug 2021 08:21:39 +0000 (10:21 +0200)]
Merge pull request #2429 from lonvia/place-name-to-admin-boundary

Indexing: move linking of places to the preparation stage

3 years agomove linking of places to the preparation stage
Sarah Hoffmann [Fri, 20 Aug 2021 19:53:13 +0000 (21:53 +0200)]
move linking of places to the preparation stage

Linked places may bring in extra names. These names need to be
processed by the tokenizer. That means that the linking needs
to be done before the data is handed to the tokenizer. Move finding
the linked place into the preparation stage and update the name
fields. Everything else is still done in the indexing stage.

3 years agoMerge pull request #2428 from lonvia/rename-icu-tokenizer
Sarah Hoffmann [Wed, 18 Aug 2021 13:02:19 +0000 (15:02 +0200)]
Merge pull request #2428 from lonvia/rename-icu-tokenizer

Rename legacy_icu tokenizer to icu tokenizer

3 years agoadapt CI workflow to new tokenizer name
Sarah Hoffmann [Wed, 18 Aug 2021 07:08:20 +0000 (09:08 +0200)]
adapt CI workflow to new tokenizer name

3 years agorename legacy_icu tokenizer to icu tokenizer
Sarah Hoffmann [Tue, 17 Aug 2021 21:11:47 +0000 (23:11 +0200)]
rename legacy_icu tokenizer to icu tokenizer

The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.

3 years agoMerge pull request #2427 from lonvia/remove-us-states-special-casing
Sarah Hoffmann [Tue, 17 Aug 2021 19:55:32 +0000 (21:55 +0200)]
Merge pull request #2427 from lonvia/remove-us-states-special-casing

Move US state hack into legacy tokenizer

3 years agomove special hack for US states to legacy tokenizer
Sarah Hoffmann [Tue, 17 Aug 2021 12:28:55 +0000 (14:28 +0200)]
move special hack for US states to legacy tokenizer

The hack for IL, AL and LA is only needed because these abbreviations
are removed by the legacy tokenizer as a stop word. There is no need
to keep the hack for future tokenizers. Move it therefore to the
token extraction function.
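
The kind of replacement this hack performs could look like the sketch below. The function name and the regular expression are assumptions; in particular, the sketch assumes the abbreviation only counts when it forms a complete comma-separated part of the query.

```python
import re

# Mapping for the three abbreviations named in the commit message.
US_STATE_ABBREVIATIONS = {"il": "illinois", "al": "alabama", "la": "louisiana"}

# Abbreviation at the start of the query or after a comma, followed by
# another comma or the end of the query (assumed rule).
_STATE_PATTERN = re.compile(r"(^|,\s*)(il|al|la)(?=\s*(?:,|$))", re.IGNORECASE)


def expand_us_state(query: str) -> str:
    """Replace a stand-alone state abbreviation with the state name."""
    def replace(match: "re.Match[str]") -> str:
        return match.group(1) + US_STATE_ABBREVIATIONS[match.group(2).lower()]

    return _STATE_PATTERN.sub(replace, query)
```

Expanding the abbreviation before token extraction keeps it from being dropped as a stop word.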

3 years agoadd tests for US state hacks
Sarah Hoffmann [Tue, 17 Aug 2021 08:49:07 +0000 (10:49 +0200)]
add tests for US state hacks

IL, AL and LA are replaced with the US state in Geocode because
the old tokenizer would simply remove the abbreviations otherwise.

3 years agoMerge pull request #2425 from lonvia/tokenizer-documentation
Sarah Hoffmann [Tue, 17 Aug 2021 07:38:03 +0000 (09:38 +0200)]
Merge pull request #2425 from lonvia/tokenizer-documentation

Introduce official Tokenizer API

3 years agoadd mkdocstrings requirement for building docs
Sarah Hoffmann [Mon, 16 Aug 2021 09:48:25 +0000 (11:48 +0200)]
add mkdocstrings requirement for building docs

mkdocstrings also needs access to the Python sources, so set
a PYTHONPATH accordingly. This makes running mkdocs directly
a bit awkward, therefore add a `make serve-doc` target.

3 years agodocs: extend explanation of query phrase
Sarah Hoffmann [Mon, 16 Aug 2021 07:57:01 +0000 (09:57 +0200)]
docs: extend explanation of query phrase

3 years agoadd documentation for PHP part of tokenizer
Sarah Hoffmann [Thu, 12 Aug 2021 09:21:50 +0000 (11:21 +0200)]
add documentation for PHP part of tokenizer

3 years agophp: make word list a first-class object
Sarah Hoffmann [Thu, 12 Aug 2021 09:09:46 +0000 (11:09 +0200)]
php: make word list a first-class object

This separates the logic of creating word sets from the Phrase
class. A tokenizer may now derive the word sets any way they
like. The SimpleWordList class provides a standard implementation
for splitting phrases on spaces.

3 years agoremove country restriction from tokenizer
Sarah Hoffmann [Thu, 29 Jul 2021 19:25:59 +0000 (21:25 +0200)]
remove country restriction from tokenizer

Restricting tokens due to the search context is better done in
the generic search part instead of repeating the same test in
every tokenizer implementation.

3 years agodocument tokenizer SQL interface
Sarah Hoffmann [Tue, 10 Aug 2021 15:31:04 +0000 (17:31 +0200)]
document tokenizer SQL interface

3 years agodefine formal public Python interface for tokenizer
Sarah Hoffmann [Tue, 10 Aug 2021 12:51:35 +0000 (14:51 +0200)]
define formal public Python interface for tokenizer

This introduces an abstract class for the Tokenizer/Analyzer
for documentation purposes.

3 years agodocs: querying and tokenizers
Sarah Hoffmann [Sat, 31 Jul 2021 07:49:29 +0000 (09:49 +0200)]
docs: querying and tokenizers

3 years agodocs: add developer doc page for Tokenizer
Sarah Hoffmann [Thu, 29 Jul 2021 18:54:33 +0000 (20:54 +0200)]
docs: add developer doc page for Tokenizer

3 years agoMerge pull request #2424 from lonvia/multi-country-import
Sarah Hoffmann [Mon, 16 Aug 2021 06:48:28 +0000 (08:48 +0200)]
Merge pull request #2424 from lonvia/multi-country-import

Update instructions for importing multiple regions

3 years agoMerge pull request #2423 from hummeltech/patch-1
Sarah Hoffmann [Sun, 15 Aug 2021 20:00:50 +0000 (22:00 +0200)]
Merge pull request #2423 from hummeltech/patch-1

Fix old paths for `phpcs` when using `make test`

3 years agoignore words without id for status
Sarah Hoffmann [Sun, 15 Aug 2021 15:49:22 +0000 (17:49 +0200)]
ignore words without id for status

3 years agosplit up large setup function
Sarah Hoffmann [Sun, 15 Aug 2021 10:24:13 +0000 (12:24 +0200)]
split up large setup function

3 years agoport multi-region update scripts to nominatim tool
Sarah Hoffmann [Sat, 14 Aug 2021 21:48:06 +0000 (23:48 +0200)]
port multi-region update scripts to nominatim tool

Also updates the documentation. For the simple case of just
importing multiple regions, provide simplified instructions
that use the new multi-file import feature.

Fixes #2365.

3 years agoupdate osm2pgsql to 1.5.1
Sarah Hoffmann [Sat, 14 Aug 2021 20:46:35 +0000 (22:46 +0200)]
update osm2pgsql to 1.5.1

3 years agoallow multiple files for the import command
Sarah Hoffmann [Sat, 14 Aug 2021 19:42:21 +0000 (21:42 +0200)]
allow multiple files for the import command

The files are forwarded to osm2pgsql which is now able to merge
them correctly.

3 years agoFix old paths for `phpcs` when using `make test`
David Hummel [Thu, 12 Aug 2021 20:34:18 +0000 (13:34 -0700)]
Fix old paths for `phpcs` when using `make test`

These paths no longer exist since db3ced17bbfff00411f506d8c84419c875959d5e, they are now all located under `lib-php`

3 years agoMerge pull request #2413 from osm-search/helm-chart
Sarah Hoffmann [Sun, 8 Aug 2021 09:09:36 +0000 (11:09 +0200)]
Merge pull request #2413 from osm-search/helm-chart

Installation docs - link to Kubernetes install project

3 years agoInstallation docs - link to Kubernetes install project
mtmail [Tue, 3 Aug 2021 10:02:35 +0000 (12:02 +0200)]
Installation docs - link to Kubernetes install project

As reported by @robjuz in https://github.com/osm-search/Nominatim/discussions/2412

3 years agoMerge pull request #2408 from lonvia/icu-change-word-table-layout
Sarah Hoffmann [Wed, 28 Jul 2021 12:28:49 +0000 (14:28 +0200)]
Merge pull request #2408 from lonvia/icu-change-word-table-layout

Change table layout of word table for ICU tokenizer

3 years agophp: force use of global Exception class
Sarah Hoffmann [Sun, 25 Jul 2021 14:29:04 +0000 (16:29 +0200)]
php: force use of global Exception class

3 years ago fix Python linting errors
Sarah Hoffmann [Sun, 25 Jul 2021 13:30:47 +0000 (15:30 +0200)]
fix Python linting errors

3 years ago fix linting issues in PHP
Sarah Hoffmann [Sun, 25 Jul 2021 13:13:49 +0000 (15:13 +0200)]
fix linting issues in PHP

3 years agoreinstate word column in icu word table
Sarah Hoffmann [Sun, 25 Jul 2021 13:08:11 +0000 (15:08 +0200)]
reinstate word column in icu word table

PostgreSQL is very bad at creating statistics for jsonb
columns. The result is that the query planner tends to
use JIT for queries with a where over 'info' even when
there is an index.

3 years agobdd tests: do not query word table directly
Sarah Hoffmann [Sat, 24 Jul 2021 10:12:31 +0000 (12:12 +0200)]
bdd tests: do not query word table directly

The BDD tests cannot make assumptions about the structure of the
word table anymore because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.

3 years agoadapt unit test for new word table
Sarah Hoffmann [Thu, 22 Jul 2021 15:24:43 +0000 (17:24 +0200)]
adapt unit test for new word table

Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.

3 years agoconvert word info column to json before copying
Sarah Hoffmann [Wed, 21 Jul 2021 09:37:14 +0000 (11:37 +0200)]
convert word info column to json before copying

3 years agoadapt special terms lookup to new word table
Sarah Hoffmann [Wed, 21 Jul 2021 08:52:34 +0000 (10:52 +0200)]
adapt special terms lookup to new word table

3 years agoswitch word tokens to new word table layout
Sarah Hoffmann [Wed, 21 Jul 2021 08:41:38 +0000 (10:41 +0200)]
switch word tokens to new word table layout

3 years agoswitch special phrases to new word table format
Sarah Hoffmann [Tue, 20 Jul 2021 19:11:01 +0000 (21:11 +0200)]
switch special phrases to new word table format

3 years agoswitch postcode tokens to new word table layout
Sarah Hoffmann [Tue, 20 Jul 2021 10:11:12 +0000 (12:11 +0200)]
switch postcode tokens to new word table layout