X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/e25e268e2e730a81e0bb9e4528947fdc86ca56dd..c4b8a3b7680ae51da4c7f0ac0849ecd9fe3d5660:/docs/develop/Tokenizers.md

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md
index e10587a6..743e637c 100644
--- a/docs/develop/Tokenizers.md
+++ b/docs/develop/Tokenizers.md
@@ -73,3 +73,240 @@ the saved tokens in the database. It then returns the list of possibly matching
 tokens and the list of possible splits to the query parser. The parser uses
 this information to compute all possible interpretations of the query and rank
 them accordingly.
+
+## Tokenizer API
+
+The following section describes the functions that must be implemented
+for a custom tokenizer.
+
+!!! warning
+    This API is currently in early alpha status. While this API is meant to
+    be a public API on which other tokenizers may be implemented, it is
+    still far from stable at the moment.
+
+### Directory Structure
+
+Nominatim expects two files for a tokenizer:
+
+* `nominatim/tokenizer/_tokenizer.py` containing the Python part of the
+  implementation
+* `lib-php/tokenizer/_tokenizer.php` with the PHP part of the
+  implementation
+
+where `` is a unique name for the tokenizer consisting of only lower-case
+letters, digits and underscores. A tokenizer also needs to install some SQL
+functions. By convention, these should be placed in `lib-sql/tokenizer`.
+
+If the tokenizer has a default configuration file, it should be saved in
+`settings/_tokenizer.`.
+
+### Configuration and Persistence
+
+Tokenizers may define custom settings for their configuration. All settings
+must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
+persistent. Transient settings are loaded from the configuration file when
+Nominatim is started and may thus be changed at any time. Persistent settings
+are tied to a database installation and must only be read during installation
+time.
If they are needed at runtime, they must be saved into the
+`nominatim_properties` table and later loaded from there.
+
+### The Python module
+
+The Python module is expected to export a single factory function:
+
+```python
+def create(dsn: str, data_dir: Path) -> AbstractTokenizer
+```
+
+The `dsn` parameter contains the DSN of the Nominatim database. The `data_dir`
+is a directory in the project directory that the tokenizer may use to save
+database-specific data. The function must return an instance of the tokenizer
+class as defined below.
+
+### Python Tokenizer Class
+
+All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
+and implement the abstract functions defined there.
+
+::: nominatim.tokenizer.base.AbstractTokenizer
+    rendering:
+        heading_level: 4
+
+### Python Analyzer Class
+
+::: nominatim.tokenizer.base.AbstractAnalyzer
+    rendering:
+        heading_level: 4
+
+### PL/pgSQL Functions
+
+The tokenizer must provide access functions for the `token_info` column
+to the indexer, which extracts the necessary information for the global
+search tables. If the tokenizer needs additional SQL functions for private
+use, then these functions must be prefixed with `token_` in order to ensure
+that there are no naming conflicts with the SQL indexer code.
+
+The following functions are expected:
+
+```sql
+FUNCTION token_get_name_search_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return an array of token IDs of search terms that should match
+the name(s) for the given place. These tokens are used to look up the place
+by name and, where the place functions as part of an address for another place,
+by address. Must return NULL when the place has no name.
+
+```sql
+FUNCTION token_get_name_match_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return an array of token IDs of full names of the place that should be used
+to match addresses.
The list of match tokens is usually stricter than the list of
+search tokens because it is used to find a match between two OSM tag values which
+are expected to contain matching full names. Partial terms should not be
+used for match tokens. Must return NULL when the place has no name.
+
+```sql
+FUNCTION token_get_housenumber_search_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return an array of token IDs of house number tokens that apply to the place.
+Note that a place may have multiple house numbers, for example when apartments
+each have their own number. Must be NULL when the place has no house numbers.
+
+```sql
+FUNCTION token_normalized_housenumber(info JSONB) RETURNS TEXT
+```
+
+Return the house number(s) in the normalized form that can be matched against
+a house number token text. If a place has multiple house numbers, they must
+be listed with a semicolon as delimiter. Must be NULL when the place has no
+house numbers.
+
+```sql
+FUNCTION token_addr_street_match_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return the match token IDs used to find a matching street for the
+`addr:street` tag. These IDs will be matched against the IDs supplied by
+`token_get_name_match_tokens`. Must be NULL when the place has no `addr:street`
+tag.
+
+```sql
+FUNCTION token_addr_place_match_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return the match token IDs used to find a matching place for the
+`addr:place` tag. These IDs will be matched against the IDs supplied by
+`token_get_name_match_tokens`. Must be NULL when the place has no `addr:place`
+tag.
+
+```sql
+FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return the search token IDs extracted from the `addr:place` tag. These tokens
+are used for searches by address when no matching place can be found in the
+database. Must be NULL when the place has no `addr:place` tag.
+
+```sql
+CREATE TYPE token_addresstoken AS (
+  key TEXT,
+  match_tokens INT[],
+  search_tokens INT[]
+);
+
+FUNCTION token_get_address_tokens(info JSONB) RETURNS SETOF token_addresstoken
+```
+
+Return the match and search token IDs for explicit `addr:*` tags for the place
+other than `addr:street` and `addr:place`. For each address item, three
+pieces of information are returned:
+
+ * _key_ contains the type of address item (city, county, etc.). This is the
+   key handed in with the `address` dictionary.
+ * *match_tokens* is the list of token IDs used to find the corresponding
+   place object for the address part. The list is matched against the IDs
+   from `token_get_name_match_tokens`.
+ * *search_tokens* is the list of token IDs under which to search the address
+   item. It is used when no corresponding place object was found.
+
+```sql
+FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+```
+
+Return the normalized version of the given postcode. This function must return
+the same value as the Python function `AbstractAnalyzer.normalize_postcode()`.
+
+```sql
+FUNCTION token_strip_info(info JSONB) RETURNS JSONB
+```
+
+Return the part of the `token_info` field that should be stored in the database
+permanently. The indexer calls this function when all processing is done and
+replaces the content of the `token_info` column with the returned value before
+the trigger stores the information in the database. May return NULL if no
+information should be stored permanently.
+
+### PHP Tokenizer Class
+
+The PHP tokenizer class is instantiated once per request and is responsible for
+analyzing the incoming query. Multiple requests may be in flight in
+parallel.
+
+The class is expected to be found under the
+name of `\Nominatim\Tokenizer`. To find the class, the PHP code includes the file
+`tokenizer/tokenizer.php` in the project directory. This file must be created
+when the tokenizer is first set up on import.
The file should initialize any
+configuration variables by setting PHP constants and then require the file
+with the actual implementation of the tokenizer.
+
+The tokenizer class must implement the following functions:
+
+```php
+public function __construct(object &$oDB)
+```
+
+The constructor of the class receives a database connection that can be used
+to query persistent data in the database.
+
+```php
+public function checkStatus()
+```
+
+Check that the tokenizer can access its persistent data structures. If there
+is an issue, throw an `\Exception`.
+
+```php
+public function normalizeString(string $sTerm) : string
+```
+
+Normalize a string to a form to be used for comparisons when reordering results.
+Nominatim reweighs results by how well the final display string matches the actual
+query. Before comparing result and query, names and query are normalized with
+this function. The tokenizer can thus remove all properties that should not be
+taken into account for reweighing, e.g. special characters or case.
+
+```php
+public function tokensForSpecialTerm(string $sTerm) : array
+```
+
+Return the list of special term tokens that match the given term.
+
+```php
+public function extractTokensFromPhrases(array &$aPhrases) : TokenList
+```
+
+Parse the given phrases, split them into word lists and retrieve the
+matching tokens.
+
+For each phrase in the list of phrases, the function must analyze the phrase
+string and then call `setWordSets()` to communicate the result of the analysis.
+A word set is a list of strings, where each string refers to a search token.
+A phrase may have multiple interpretations. Therefore a list of word sets is
+usually attached to the phrase. The search tokens themselves are returned
+by the function in an associative array, where the key corresponds to the
+strings given in the word sets. The value is a list of search tokens. Thus
+a single string in the list of word sets may refer to multiple search tokens.
+
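The relationship between word sets and the returned token map, as described for `extractTokensFromPhrases()`, can be sketched in Python. This is only an illustration of the data shapes, not the PHP implementation: the helper names `word_sets` and `extract_tokens`, the `TOKEN_TABLE` contents and the token IDs are all hypothetical, and a real tokenizer would look up tokens in the database rather than in a dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the tokens saved in the database.
# A single term may map to several search tokens.
TOKEN_TABLE: Dict[str, List[int]] = {
    "new": [101],
    "york": [102],
    "new york": [201, 202],
}


def word_sets(words: List[str], max_words: int = 3) -> List[List[str]]:
    """Enumerate every split of `words` into contiguous terms.

    Each returned list is one word set, i.e. one interpretation
    of the phrase.
    """
    if not words:
        return [[]]
    result = []
    for size in range(1, min(max_words, len(words)) + 1):
        head = " ".join(words[:size])
        for tail in word_sets(words[size:], max_words):
            result.append([head] + tail)
    return result


def extract_tokens(phrase: str) -> Tuple[List[List[str]], Dict[str, List[int]]]:
    """Return the word sets of a phrase and the token map keyed by term."""
    sets = word_sets(phrase.lower().split())
    tokens: Dict[str, List[int]] = {}
    for word_set in sets:
        for term in word_set:
            if term in TOKEN_TABLE:
                # The key is the string used in the word sets,
                # the value is the list of matching search tokens.
                tokens[term] = TOKEN_TABLE[term]
    return sets, tokens


sets, tokens = extract_tokens("New York")
print(sets)
print(tokens)
```

In this sketch the phrase "New York" yields two word sets, one per interpretation, and the term `new york` refers to two search tokens at once, which mirrors the one-to-many mapping described above.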