X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/720c7b751906a41d2cdcfae7c461d74907771037..df6f70d223e8fb3129be03662fa90dfeb561309e:/docs/develop/Tokenizers.md

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md
index 2b4da005..a1dae78b 100644
--- a/docs/develop/Tokenizers.md
+++ b/docs/develop/Tokenizers.md
@@ -91,21 +91,21 @@ for a custom tokenizer implementation.
 
 ### Directory Structure
 
-Nominatim expects two files for a tokenizer:
+Nominatim expects two files containing the Python part of the implementation:
 
-* `nominatim/tokenizer/_tokenizer.py` containing the Python part of the
-  implementation
-* `lib-php/tokenizer/_tokenizer.php` with the PHP part of the
-  implementation
+ * `src/nominatim_db/tokenizer/_tokenizer.py` contains the tokenizer
+   code used during import and
+ * `src/nominatim_api/search/_tokenizer.py` has the code used during
+   query time.
 
-where `` is a unique name for the tokenizer consisting of only lower-case
+`` is a unique name for the tokenizer consisting of only lower-case
 letters, digits and underscore. A tokenizer also needs to install some SQL
 functions. By convention, these should be placed in `lib-sql/tokenizer`.
 
 If the tokenizer has a default configuration file, this should be saved in
-the `settings/_tokenizer.`.
+`settings/_tokenizer.`.
 
-### Configuration and Persistance
+### Configuration and Persistence
 
 Tokenizers may define custom settings for their configuration. All settings
 must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
@@ -115,9 +115,11 @@ are tied to a database installation and must only be read during installation
 time. If they are needed for the runtime then they must be saved into the
 `nominatim_properties` table and later loaded from there.
 
-### The Python module
+### The Python modules
 
-The Python module is expect to export a single factory function:
+#### `src/nominatim_db/tokenizer/`
+
+The import Python module is expected to export a single factory function:
 
 ```python
 def create(dsn: str, data_dir: Path) -> AbstractTokenizer
@@ -128,20 +130,41 @@ is a directory in the project directory that the tokenizer may use to save
 database-specific data. The function must return the instance of the tokenizer
 class as defined below.
 
+#### `src/nominatim_api/search/`
+
+The query-time Python module must also export a factory function:
+
+``` python
+def create_query_analyzer(conn: SearchConnection) -> AbstractQueryAnalyzer
+```
+
+The `conn` parameter contains the current search connection. See the
+[library documentation](../library/Low-Level-DB-Access.md#searchconnection-class)
+for details on the class. The function must return the instance of the tokenizer
+class as defined below.
+
+
 ### Python Tokenizer Class
 
-All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
+All tokenizers must inherit from `nominatim_db.tokenizer.base.AbstractTokenizer`
 and implement the abstract functions defined there.
 
-::: nominatim.tokenizer.base.AbstractTokenizer
-    rendering:
-        heading_level: 4
+::: nominatim_db.tokenizer.base.AbstractTokenizer
+    options:
+        heading_level: 6
 
 ### Python Analyzer Class
 
-::: nominatim.tokenizer.base.AbstractAnalyzer
-    rendering:
-        heading_level: 4
+::: nominatim_db.tokenizer.base.AbstractAnalyzer
+    options:
+        heading_level: 6
+
+
+### Python Query Analyzer Class
+
+::: nominatim_api.search.query_analyzer_factory.AbstractQueryAnalyzer
+    options:
+        heading_level: 6
 
 ### PL/pgSQL Functions
 
@@ -189,6 +212,28 @@ a house number token text. If a place has multiple house numbers they must
 be listed with a semicolon as delimiter. Must be NULL when the place has no
 house numbers.
 
+```sql
+FUNCTION token_is_street_address(info JSONB) RETURNS BOOLEAN
+```
+
+Return true if this is an object that should be parented against a street.
+Only relevant for objects with address rank 30.
+
+```sql
+FUNCTION token_has_addr_street(info JSONB) RETURNS BOOLEAN
+```
+
+Return true if there are street names to match against for finding the
+parent of the object.
+
+
+```sql
+FUNCTION token_has_addr_place(info JSONB) RETURNS BOOLEAN
+```
+
+Return true if there are place names to match against for finding the
+parent of the object.
+
 ```sql
 FUNCTION token_matches_street(info JSONB, street_tokens INTEGER[]) RETURNS BOOLEAN
 ```
@@ -245,11 +290,11 @@ Currently, tokenizers are encouraged to make sure that matching works against
 both the search token list and the match token list.
 
 ```sql
-FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
 ```
 
-Return the normalized version of the given postcode. This function must return
-the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
+Return the postcode for the object, if any exists. The postcode must be in
+the form that should also be presented to the end-user.
 
 ```sql
 FUNCTION token_strip_info(info JSONB) RETURNS JSONB
@@ -260,73 +305,3 @@ permanently. The indexer calls this function when all processing is done and
 replaces the content of the `token_info` column with the returned value before
 the trigger stores the information in the database. May return NULL if no
 information should be stored permanently.
-
-### PHP Tokenizer class
-
-The PHP tokenizer class is instantiated once per request and responsible for
-analyzing the incoming query. Multiple requests may be in flight in
-parallel.
-
-The class is expected to be found under the
-name of `\Nominatim\Tokenizer`. To find the class the PHP code includes the file
-`tokenizer/tokenizer.php` in the project directory. This file must be created
-when the tokenizer is first set up on import. The file should initialize any
-configuration variables by setting PHP constants and then require the file
-with the actual implementation of the tokenizer.
-
-The tokenizer class must implement the following functions:
-
-```php
-public function __construct(object &$oDB)
-```
-
-The constructor of the class receives a database connection that can be used
-to query persistent data in the database.
-
-```php
-public function checkStatus()
-```
-
-Check that the tokenizer can access its persistent data structures. If there
-is an issue, throw an `\Exception`.
-
-```php
-public function normalizeString(string $sTerm) : string
-```
-
-Normalize string to a form to be used for comparisons when reordering results.
-Nominatim reweighs results how well the final display string matches the actual
-query. Before comparing result and query, names and query are normalised against
-this function. The tokenizer can thus remove all properties that should not be
-taken into account for reweighing, e.g. special characters or case.
-
-```php
-public function tokensForSpecialTerm(string $sTerm) : array
-```
-
-Return the list of special term tokens that match the given term.
-
-```php
-public function extractTokensFromPhrases(array &$aPhrases) : TokenList
-```
-
-Parse the given phrases, splitting them into word lists and retrieve the
-matching tokens.
-
-The phrase array may take on two forms. In unstructured searches (using `q=`
-parameter) the search query is split at the commas and the elements are
-put into a sorted list. For structured searches the phrase array is an
-associative array where the key designates the type of the term (street, city,
-county etc.) The tokenizer may ignore the phrase type at this stage in parsing.
-Matching phrase type and appropriate search token type will be done later
-when the SearchDescription is built.
-
-For each phrase in the list of phrases, the function must analyse the phrase
-string and then call `setWordSets()` to communicate the result of the analysis.
-A word set is a list of strings, where each string refers to a search token.
-A phrase may have multiple interpretations. Therefore a list of word sets is
-usually attached to the phrase. The search tokens themselves are returned
-by the function in an associative array, where the key corresponds to the
-strings given in the word sets. The value is a list of search tokens. Thus
-a single string in the list of word sets may refer to multiple search tokens.
-