X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/c4b8a3b7680ae51da4c7f0ac0849ecd9fe3d5660..3460a5c230308aa3c0bea66d2fa8502ce647dc36:/docs/develop/Tokenizers.md?ds=sidebyside

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md
index 743e637c..f4a55adc 100644
--- a/docs/develop/Tokenizers.md
+++ b/docs/develop/Tokenizers.md
@@ -6,7 +6,7 @@ tokenizers that use different strategies for normalisation. This page describes
 how tokenizers are expected to work and the public API that needs to be
 implemented when creating a new tokenizer. For information on how to configure
 a specific tokenizer for a database see the
-[tokenizer chapter in the administration guide](../admin/Tokenizers.md).
+[tokenizer chapter in the Customization Guide](../customize/Tokenizers.md).
 
 ## Generic Architecture
 
@@ -50,7 +50,7 @@ tokenizer's internal token lists and creating a list of all token IDs for
 the specific place. This list is later needed in the PL/pgSQL part where the
 indexer needs to add the token IDs to the appropriate search tables. To be
 able to communicate the list between the Python part and the pl/pgSQL trigger,
-the placex table contains a special JSONB column `token_info` which is there
+the `placex` table contains a special JSONB column `token_info` which is there
 for the exclusive use of the tokenizer.
 
 The Python part of the tokenizer returns a structured information about the
@@ -67,12 +67,17 @@ consequently not create any special indexes on it.
 
 ### Querying
 
-The tokenizer is responsible for the initial parsing of the query. It needs
-to split the query into appropriate words and terms and match them against
-the saved tokens in the database. It then returns the list of possibly matching
-tokens and the list of possible splits to the query parser. The parser uses
-this information to compute all possible interpretations of the query and
-rank them accordingly.
+At query time, Nominatim builds up multiple _interpretations_ of the search
+query. Each of these interpretations is tried against the database in order
+of the likelihood with which they match to the search query. The first
+interpretation that yields results wins.
+
+The interpretations are encapsulated in the `SearchDescription` class. An
+instance of this class is created by applying a sequence of
+_search tokens_ to an initially empty SearchDescription. It is the
+responsibility of the tokenizer to parse the search query and derive all
+possible sequences of search tokens. To that end the tokenizer needs to parse
+the search query and look up matching words in its own data structures.
 
 ## Tokenizer API
 
@@ -86,21 +91,16 @@ for a custom tokenizer implementation.
 
 ### Directory Structure
 
-Nominatim expects two files for a tokenizer:
-
-* `nominiatim/tokenizer/_tokenizer.py` containing the Python part of the
-  implementation
-* `lib-php/tokenizer/_tokenizer.php` with the PHP part of the
-  implementation
-
-where `` is a unique name for the tokenizer consisting of only lower-case
+Nominatim expects a single file `src/nominatim_db/tokenizer/_tokenizer.py`
+containing the Python part of the implementation.
+`` is a unique name for the tokenizer consisting of only lower-case
 letters, digits and underscore. A tokenizer also needs to install some SQL
 functions. By convention, these should be placed in `lib-sql/tokenizer`.
 
 If the tokenizer has a default configuration file, this should be saved in
 the `settings/_tokenizer.`.
 
-### Configuration and Persistance
+### Configuration and Persistence
 
 Tokenizers may define custom settings for their configuration. All settings
 must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
@@ -125,18 +125,18 @@ class as defined below.
 
 ### Python Tokenizer Class
 
-All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
+All tokenizers must inherit from `nominatim_db.tokenizer.base.AbstractTokenizer`
 and implement the abstract functions defined there.
 
-::: nominatim.tokenizer.base.AbstractTokenizer
-    rendering:
-        heading_level: 4
+::: nominatim_db.tokenizer.base.AbstractTokenizer
+    options:
+        heading_level: 6
 
 ### Python Analyzer Class
 
-::: nominatim.tokenizer.base.AbstractAnalyzer
-    rendering:
-        heading_level: 4
+::: nominatim_db.tokenizer.base.AbstractAnalyzer
+    options:
+        heading_level: 6
 
 ### PL/pgSQL Functions
 
@@ -185,128 +185,95 @@ be listed with a semicolon as delimiter. Must be NULL when the place has no
 house numbers.
 
 ```sql
-FUNCTION token_addr_street_match_tokens(info JSONB) RETURNS INTEGER[]
+FUNCTION token_is_street_address(info JSONB) RETURNS BOOLEAN
 ```
 
-Return the match token IDs by which to search a matching street from the
-`addr:street` tag. These IDs will be matched against the IDs supplied by
-`token_get_name_match_tokens`. Must be NULL when the place has no `addr:street`
-tag.
+Return true if this is an object that should be parented against a street.
+Only relevant for objects with address rank 30.
 
 ```sql
-FUNCTION token_addr_place_match_tokens(info JSONB) RETURNS INTEGER[]
+FUNCTION token_has_addr_street(info JSONB) RETURNS BOOLEAN
 ```
 
-Return the match token IDs by which to search a matching place from the
-`addr:place` tag. These IDs will be matched against the IDs supplied by
-`token_get_name_match_tokens`. Must be NULL when the place has no `addr:place`
-tag.
-
-```sql
-FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
-```
+Return true if there are street names to match against for finding the
+parent of the object.
 
-Return the search token IDs extracted from the `addr:place` tag. These tokens
-are used for searches by address when no matching place can be found in the
-database. Must be NULL when the place has no `addr:place` tag.
 
 ```sql
-CREATE TYPE token_addresstoken AS (
-  key TEXT,
-  match_tokens INT[],
-  search_tokens INT[]
-);
-
-FUNCTION token_get_address_tokens(info JSONB) RETURNS SETOF token_addresstoken
+FUNCTION token_has_addr_place(info JSONB) RETURNS BOOLEAN
 ```
 
-Return the match and search token IDs for explicit `addr:*` tags for the place
-other than `addr:street` and `addr:place`. For each address item there are
-three pieces of information returned:
-
- * _key_ contains the type of address item (city, county, etc.). This is the
-   key handed in with the `address` dictionary.
- * *match_tokens* is the list of token IDs used to find the corresponding
-   place object for the address part. The list is matched against the IDs
-   from `token_get_name_match_tokens`.
- * *search_tokens* is the list of token IDs under which to search the address
-   item. It is used when no corresponding place object was found.
+Return true if there are place names to match against for finding the
+parent of the object.
 
 ```sql
-FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+FUNCTION token_matches_street(info JSONB, street_tokens INTEGER[]) RETURNS BOOLEAN
 ```
 
-Return the normalized version of the given postcode. This function must return
-the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
+Check if the given tokens (previously saved from `token_get_name_match_tokens()`)
+match against the `addr:street` tag name. Must return either NULL or FALSE
+when the place has no `addr:street` tag.
 
 ```sql
-FUNCTION token_strip_info(info JSONB) RETURNS JSONB
+FUNCTION token_matches_place(info JSONB, place_tokens INTEGER[]) RETURNS BOOLEAN
 ```
 
-Return the part of the `token_info` field that should be stored in the database
-permanently. The indexer calls this function when all processing is done and
-replaces the content of the `token_info` column with the returned value before
-the trigger stores the information in the database. May return NULL if no
-information should be stored permanently.
-
-### PHP Tokenizer class
-
-The PHP tokenizer class is instantiated once per request and responsible for
-analyzing the incoming query. Multiple requests may be in flight in
-parallel.
+Check if the given tokens (previously saved from `token_get_name_match_tokens()`)
+match against the `addr:place` tag name. Must return either NULL or FALSE
+when the place has no `addr:place` tag.
 
-The class is expected to be found under the
-name of `\Nominatim\Tokenizer`. To find the class the PHP code includes the file
-`tokenizer/tokenizer.php` in the project directory. This file must be created
-when the tokenizer is first set up on import. The file should initialize any
-configuration variables by setting PHP constants and then require the file
-with the actual implementation of the tokenizer.
 
-The tokenizer class must implement the following functions:
-
-```php
-public function __construct(object &$oDB)
+```sql
+FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
 ```
 
-The constructor of the class receives a database connection that can be used
-to query persistent data in the database.
+Return the search token IDs extracted from the `addr:place` tag. These tokens
+are used for searches by address when no matching place can be found in the
+database. Must be NULL when the place has no `addr:place` tag.
 
-```php
-public function checkStatus()
+```sql
+FUNCTION token_get_address_keys(info JSONB) RETURNS SETOF TEXT
 ```
 
-Check that the tokenizer can access its persistent data structures. If there
-is an issue, throw an `\Exception`.
+Return the set of keys for which address information is provided. This
+should correspond to the list of (relevant) `addr:*` tags with the `addr:`
+prefix removed or the keys used in the `address` dictionary of the place info.
 
-```php
-public function normalizeString(string $sTerm) : string
+```sql
+FUNCTION token_get_address_search_tokens(info JSONB, key TEXT) RETURNS INTEGER[]
 ```
 
-Normalize string to a form to be used for comparisons when reordering results.
-Nominatim reweighs results how well the final display string matches the actual
-query. Before comparing result and query, names and query are normalised against
-this function. The tokenizer can thus remove all properties that should not be
-taken into account for reweighing, e.g. special characters or case.
+Return the array of search tokens for the given address part. `key` can be
+expected to be one of those returned with `token_get_address_keys()`. The
+search tokens are added to the address search vector of the place, when no
+corresponding OSM object could be found for the given address part from which
+to copy the name information.
 
-```php
-public function tokensForSpecialTerm(string $sTerm) : array
+```sql
+FUNCTION token_matches_address(info JSONB, key TEXT, tokens INTEGER[])
 ```
 
-Return the list of special term tokens that match the given term.
+Check if the given tokens match against the address part `key`.
+
+__Warning:__ the tokens that are handed in are the lists previously saved
+from `token_get_name_search_tokens()`, _not_ from the match token list. This
+is an historical oddity which will be fixed at some point in the future.
+Currently, tokenizers are encouraged to make sure that matching works against
+both the search token list and the match token list.
 
-```php
-public function extractTokensFromPhrases(array &$aPhrases) : TokenList
+```sql
+FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
 ```
 
-Parse the given phrases, splitting them into word lists and retrieve the
-matching tokens.
+Return the postcode for the object, if any exists. The postcode must be in
+the form that should also be presented to the end-user.
 
-For each phrase in the list of phrases, the function must analyse the phrase
-string and then call `setWordSets()` to communicate the result of the analysis.
-A word set is a list of strings, where each string refers to a search token.
-A phrase may have multiple interpretations. Therefore a list of word sets is
-usually attached to the phrase. The search tokens themselves are returned
-by the function in an associative array, where the key corresponds to the
-strings given in the word sets. The value is a list of search tokens. Thus
-a single string in the list of word sets may refer to multiple search tokens.
+```sql
+FUNCTION token_strip_info(info JSONB) RETURNS JSONB
+```
 
+Return the part of the `token_info` field that should be stored in the database
+permanently. The indexer calls this function when all processing is done and
+replaces the content of the `token_info` column with the returned value before
+the trigger stores the information in the database. May return NULL if no
+information should be stored permanently.
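
The contracts of the new boolean helpers and of `token_strip_info` can be modelled in a few lines of Python. This is only a sketch of the documented semantics, not how any shipped tokenizer implements them: the real functions are PL/pgSQL, and the `token_info` layout used here (keys `names`, `street`, `hnr`) as well as the `keep` parameter are assumptions made for illustration. SQL `NULL` is represented by `None`.

```python
from typing import Iterable, Optional

# Hypothetical token_info layout, invented for this example only.
# A real tokenizer is free to structure the JSONB column as it likes.
TokenInfo = dict


def token_has_addr_street(info: TokenInfo) -> bool:
    """True if there are street names to match against."""
    return "street" in info


def token_matches_street(info: TokenInfo,
                         street_tokens: Iterable[int]) -> Optional[bool]:
    """True if any saved match token overlaps the addr:street tokens.

    Returns None (standing in for NULL) when the place has no
    addr:street information, as the contract allows NULL or FALSE.
    """
    own = info.get("street")
    if own is None:
        return None
    return not set(own).isdisjoint(street_tokens)


def token_strip_info(info: TokenInfo,
                     keep: Iterable[str] = ("names",)) -> Optional[TokenInfo]:
    """Keep only the parts that should be stored permanently.

    Returns None when nothing is left to store.
    """
    wanted = set(keep)
    kept = {key: value for key, value in info.items() if key in wanted}
    return kept or None


info = {"names": [11, 12], "street": [101, 102], "hnr": "4;4a"}
print(token_has_addr_street(info))
print(token_matches_street(info, [102, 500]))
print(token_strip_info(info))
```

In this model, the address-related entries are only needed while the place is being indexed, which is why `token_strip_info` drops them before the row is stored.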