docs/develop/Tokenizers.md

   1 # Tokenizers
   2
   3 The tokenizer is the component of Nominatim that is responsible for
   4 analysing names of OSM objects and queries. Nominatim provides different
   5 tokenizers that use different strategies for normalisation. This page describes
   6 how tokenizers are expected to work and the public API that needs to be
   7 implemented when creating a new tokenizer. For information on how to configure
   8 a specific tokenizer for a database see the
   9 [tokenizer chapter in the administration guide](../admin/Tokenizers.md).
  10
  11 ## Generic Architecture
  12
  13 ### About Search Tokens
  14
  15 Search in Nominatim is organised around search tokens. Such a token represents
  16 a string that can be part of the search query. Tokens are used so that the search
  17 index does not need to be organised around strings. Instead the database saves
  18 for each place which tokens match this place's name, address, house number etc.
  19 To be able to distinguish between these different types of information stored
  20 with the place, a search token also always has a certain type: name, house number,
  21 postcode etc.
  22
  23 During search an incoming query is transformed into an ordered list of such
  24 search tokens (or rather many lists, see below) and this list is then converted
  25 into a database query to find the right place.
  26
  27 It is the core task of the tokenizer to create, manage and assign the search
  28 tokens. The tokenizer is involved in two distinct operations:
  29
  30 * __at import time__: scanning names of OSM objects, normalizing them and
  31   building up the list of search tokens.
  32 * __at query time__: scanning the query and returning the appropriate search
  33   tokens.
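
To make the idea concrete, here is a minimal Python sketch of typed search tokens and of the two operations listed above. The names `TokenType`, `SearchToken` and `ToyTokenStore` are invented for illustration and are not part of Nominatim. A real tokenizer keeps its tokens in the database and normalises names far more carefully than the lower-casing used here.

```python
from dataclasses import dataclass
from enum import Enum


class TokenType(Enum):
    NAME = "name"
    HOUSENUMBER = "housenumber"
    POSTCODE = "postcode"


@dataclass(frozen=True)
class SearchToken:
    token_id: int
    token_type: TokenType
    text: str


class ToyTokenStore:
    """In-memory stand-in for the tokenizer's token list (hypothetical)."""

    def __init__(self):
        self._tokens = {}  # maps (type, normalised text) to SearchToken

    def get_or_create(self, token_type, text):
        """Import time: return the token for a string, adding it if it is new."""
        key = (token_type, text.lower())  # crude stand-in for normalisation
        if key not in self._tokens:
            self._tokens[key] = SearchToken(len(self._tokens) + 1, token_type, key[1])
        return self._tokens[key]

    def lookup(self, text):
        """Query time: return all known tokens, of any type, matching the string."""
        wanted = text.lower()
        return [t for t in self._tokens.values() if t.text == wanted]
```

With such a store, a place would be saved together with the IDs returned by `get_or_create()`, while a query term is resolved through `lookup()` and may come back with several candidate types.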
  34
  35
  36 ### Importing
  37
  38 The indexer is responsible for enriching an OSM object (or place) with all data
  39 required for geocoding. It is split into two parts: the controller collects
  40 the places that require updating, enriches the place information as required
  41 and hands the place to PostgreSQL. The controller is part of the Nominatim
  42 library written in Python. Within PostgreSQL, the `placex_update`
  43 trigger is responsible for filling all secondary tables with extra geocoding
  44 information. This part is written in PL/pgSQL.
  45
  46 The tokenizer is involved in both parts. When the indexer prepares a place,
  47 it hands it over to the tokenizer to inspect the names and create all the
  48 search tokens applicable for the place. This usually involves updating the
  49 tokenizer's internal token lists and creating a list of all token IDs for
  50 the specific place. This list is later needed in the PL/pgSQL part where the
  51 indexer needs to add the token IDs to the appropriate search tables. To be
  52 able to communicate the list between the Python part and the PL/pgSQL trigger,
  53 the placex table contains a special JSONB column `token_info` which is there
  54 for the exclusive use of the tokenizer.
  55
  56 The Python part of the tokenizer returns structured information about the
  57 tokens of a place to the indexer, which converts it to JSON and inserts it into
  58 the `token_info` column. The content of the column is then handed to the PL/pgSQL
  59 callbacks of the tokenizer, which extract the required information. Usually
  60 the tokenizer then removes all information from the `token_info` structure,
  61 so that nothing is saved persistently in the table. By that point, everything
  62 that went in should have been processed and put into secondary tables.
  63 This is not a hard requirement, however. If the tokenizer needs to store
  64 additional information about a place permanently, it may do so in the
  65 `token_info` column. It must never run searches over the column, though, and
  66 consequently must not create any special indexes on it.
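
The hand-over through `token_info` can be pictured with a small Python sketch. The key names `names` and `hnr` are assumptions made up for this example. Every tokenizer is free to choose its own layout, and the actual conversion to JSON is done by the indexer.

```python
import json


def build_token_info(name_token_ids, housenumber=None):
    """Collect the per-place token data that the tokenizer hands to the indexer."""
    info = {"names": sorted(name_token_ids)}  # hypothetical key
    if housenumber is not None:
        info["hnr"] = housenumber  # hypothetical key
    return info


def to_token_info_column(info):
    """Serialise the structure as JSON text, as it would go into `token_info`."""
    return json.dumps(info, sort_keys=True)
```

The PL/pgSQL side would then read the same keys back out of the JSONB value.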
  67
  68 ### Querying
  69
  70 The tokenizer is responsible for the initial parsing of the query. It needs
  71 to split the query into appropriate words and terms and match them against
  72 the saved tokens in the database. It then returns the list of possibly matching
  73 tokens and the list of possible splits to the query parser. The parser uses
  74 this information to compute all possible interpretations of the query and
  75 rank them accordingly.
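
As a rough illustration of what 'possible splits' means, the sketch below enumerates every way of grouping consecutive query words into terms and keeps the terms that correspond to a known token. The function names and the `known_tokens` dictionary are hypothetical. Real tokenizers normalise the query first and look tokens up in the database.

```python
def possible_splits(words):
    """Yield every grouping of consecutive words into multi-word terms."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in possible_splits(words[i:]):
            yield [head] + rest


def matching_tokens(words, known_tokens):
    """Map each term that occurs in some split to its token IDs, if it has any."""
    terms = {term for split in possible_splits(words) for term in split}
    return {term: known_tokens[term] for term in sorted(terms) if term in known_tokens}
```

A query of n words has 2^(n-1) such splits, which is why the parser ranks the interpretations instead of treating them all as equal.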
  76
  77 ## Tokenizer API
  78
  79 The following section describes the functions that need to be implemented
  80 for a custom tokenizer implementation.
  81
  82 !!! warning
  83     This API is currently in early alpha status. While this API is meant to
  84     be a public API on which other tokenizers may be implemented, the API is
  85     far away from being stable at the moment.
  86
  87 ### Directory Structure
  88
  89 Nominatim expects two files for a tokenizer:
  90
  91 * `nominatim/tokenizer/<NAME>_tokenizer.py` containing the Python part of the
  92   implementation
  93 * `lib-php/tokenizer/<NAME>_tokenizer.php` with the PHP part of the
  94   implementation
  95
  96 where `<NAME>` is a unique name for the tokenizer consisting of only lower-case
  97 letters, digits and underscore. A tokenizer also needs to install some SQL
  98 functions. By convention, these should be placed in `lib-sql/tokenizer`.
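
The naming rule can be checked mechanically. The helper below is a hypothetical sketch, not part of Nominatim. It assumes that 'lower-case letters' means ASCII `a` to `z` and derives the two expected file locations from a tokenizer name.

```python
import re
from pathlib import PurePosixPath

# Assumption: only ASCII lower-case letters, digits and underscore are allowed.
VALID_NAME = re.compile(r"[a-z0-9_]+")


def tokenizer_files(name):
    """Return the expected Python and PHP file paths for a tokenizer name."""
    if not VALID_NAME.fullmatch(name):
        raise ValueError(f"invalid tokenizer name: {name!r}")
    return (
        PurePosixPath("nominatim/tokenizer") / f"{name}_tokenizer.py",
        PurePosixPath("lib-php/tokenizer") / f"{name}_tokenizer.php",
    )
```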
  99
 100 If the tokenizer has a default configuration file, this should be saved in
 101 `settings/<NAME>_tokenizer.<SUFFIX>`.
 102
 103 ### Configuration and Persistence
 104
 105 Tokenizers may define custom settings for their configuration. All settings
 106 must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
 107 persistent. Transient settings are loaded from the configuration file when
 108 Nominatim is started and may thus be changed at any time. Persistent settings
 109 are tied to a database installation and must only be read at installation
 110 time. If they are needed at runtime, they must be saved into the
 111 `nominatim_properties` table and later loaded from there.
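
The difference between the two kinds of settings can be sketched as follows. Plain dictionaries stand in for the configuration file and for the `nominatim_properties` table, and the setting names used in the usage note are invented. Storing a setting in the table under its configuration name is also an assumption of this example.

```python
PREFIX = "NOMINATIM_TOKENIZER_"


def install(config, properties, persistent_keys):
    """Installation time: copy persistent settings into the properties store."""
    for key in persistent_keys:
        if not key.startswith(PREFIX):
            raise ValueError(f"{key} lacks the {PREFIX} prefix")
        properties[key] = config[key]


def runtime_value(config, properties, key):
    """Runtime: persistent settings come from the store, transient ones from config."""
    return properties[key] if key in properties else config[key]
```

If a hypothetical `NOMINATIM_TOKENIZER_EXAMPLE_RULES` is installed as persistent and the configuration file is edited afterwards, `runtime_value()` still returns the value recorded at installation, while a transient setting follows the edited file.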
 112
 113 ### The Python module
 114
 115 The Python module is expected to export a single factory function:
 116
 117 ```python
 118 def create(dsn: str, data_dir: Path) -> AbstractTokenizer
 119 ```
 120
 121 The `dsn` parameter contains the DSN of the Nominatim database. The `data_dir`
 122 is a directory in the project directory that the tokenizer may use to save
 123 database-specific data. The function must return an instance of the tokenizer
 124 class as defined below.
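
A module skeleton might look like the sketch below. To keep the example self-contained, `AbstractTokenizer` is replaced by an empty stand-in class. A real module would import it from `nominatim.tokenizer.base` and implement its abstract functions. `ExampleTokenizer` is a hypothetical name.

```python
from pathlib import Path


class AbstractTokenizer:
    """Stand-in for nominatim.tokenizer.base.AbstractTokenizer (methods omitted)."""


class ExampleTokenizer(AbstractTokenizer):
    """Hypothetical tokenizer that only remembers its construction parameters."""

    def __init__(self, dsn: str, data_dir: Path) -> None:
        self.dsn = dsn
        self.data_dir = data_dir


def create(dsn: str, data_dir: Path) -> AbstractTokenizer:
    """Factory function that the module exports."""
    return ExampleTokenizer(dsn, data_dir)
```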
 125
 126 ### Python Tokenizer Class
 127
 128 All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
 129 and implement the abstract functions defined there.
 130
 131 ::: nominatim.tokenizer.base.AbstractTokenizer
 132     rendering:
 133         heading_level: 4
 134
 135 ### Python Analyzer Class
 136
 137 ::: nominatim.tokenizer.base.AbstractAnalyzer
 138     rendering:
 139         heading_level: 4
 140
 141 ### PL/pgSQL Functions
 142
 143 The tokenizer must provide access functions for the `token_info` column
 144 to the indexer which extracts the necessary information for the global
 145 search tables. If the tokenizer needs additional SQL functions for private
 146 use, then these functions must be prefixed with `token_` in order to ensure
 147 that there are no naming conflicts with the SQL indexer code.
 148
 149 The following functions are expected:
 150
 151 ```sql
 152 FUNCTION token_get_name_search_tokens(info JSONB) RETURNS INTEGER[]
 153 ```
 154
 155 Return an array of token IDs of search terms that should match
 156 the name(s) for the given place. These tokens are used to look up the place
 157 by name and, where the place functions as part of an address for another place,
 158 by address. Must return NULL when the place has no name.
 159
 160 ```sql
 161 FUNCTION token_get_name_match_tokens(info JSONB) RETURNS INTEGER[]
 162 ```
 163
 164 Return an array of token IDs of full names of the place that should be used
 165 to match addresses. The list of match tokens is usually more strict than
 166 search tokens as it is used to find a match between two OSM tag values which
 167 are expected to contain matching full names. Partial terms should not be
 168 used for match tokens. Must return NULL when the place has no name.
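
The difference between search and match tokens can be modelled in Python. This is one plausible scheme and not the behaviour of any particular tokenizer: full names and their individual words become search tokens, while only full names become match tokens. The function name and the `token_ids` lookup table are hypothetical.

```python
def split_name_tokens(names, token_ids):
    """Return (search_ids, match_ids) for a place's names, or (None, None) if unnamed."""
    if not names:
        return None, None
    full = {n.lower() for n in names}
    partial = {word for n in full for word in n.split()}
    search = sorted(token_ids[t] for t in full | partial if t in token_ids)
    match = sorted(token_ids[t] for t in full if t in token_ids)
    return search, match
```

The `None` results mirror the NULL that both SQL functions must return for a place without a name.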
 169
 170 ```sql
 171 FUNCTION token_get_housenumber_search_tokens(info JSONB) RETURNS INTEGER[]
 172 ```
 173
 174 Return an array of token IDs of house number tokens that apply to the place.
 175 Note that a place may have multiple house numbers, for example when apartments
 176 each have their own number. Must be NULL when the place has no house numbers.
 177
 178 ```sql
 179 FUNCTION token_normalized_housenumber(info JSONB) RETURNS TEXT
 180 ```
 181
 182 Return the house number(s) in the normalized form that can be matched against
 183 a house number token text. If a place has multiple house numbers they must
 184 be listed with a semicolon as delimiter. Must be NULL when the place has no
 185 house numbers.
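
The return contract can be modelled in a few lines of Python. The real function is written in PL/pgSQL. The key `hnr` is a hypothetical part of `token_info` and is assumed to hold a list of already normalized house numbers.

```python
def normalized_housenumber(info):
    """Join normalized house numbers with ';', or return None when there are none."""
    numbers = info.get("hnr") or []
    return ";".join(numbers) if numbers else None
```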
 186
 187 ```sql
 188 FUNCTION token_addr_street_match_tokens(info JSONB) RETURNS INTEGER[]
 189 ```
 190
 191 Return the match token IDs by which to search a matching street from the
 192 `addr:street` tag. These IDs will be matched against the IDs supplied by
 193 `token_get_name_match_tokens`. Must be NULL when the place has no `addr:street`
 194 tag.
 195
 196 ```sql
 197 FUNCTION token_addr_place_match_tokens(info JSONB) RETURNS INTEGER[]
 198 ```
 199
 200 Return the match token IDs by which to search a matching place from the
 201 `addr:place` tag. These IDs will be matched against the IDs supplied by
 202 `token_get_name_match_tokens`. Must be NULL when the place has no `addr:place`
 203 tag.
 204
 205 ```sql
 206 FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
 207 ```
 208
 209 Return the search token IDs extracted from the `addr:place` tag. These tokens
 210 are used for searches by address when no matching place can be found in the
 211 database. Must be NULL when the place has no `addr:place` tag.
 212
 213 ```sql
 214 CREATE TYPE token_addresstoken AS (
 215   key TEXT,
 216   match_tokens INT[],
 217   search_tokens INT[]
 218 );
 219
 220 FUNCTION token_get_address_tokens(info JSONB) RETURNS SETOF token_addresstoken
 221 ```
 222
 223 Return the match and search token IDs for explicit `addr:*` tags for the place
 224 other than `addr:street` and `addr:place`. For each address item there are
 225 three pieces of information returned:
 226
 227  * _key_ contains the type of address item (city, county, etc.). This is the
 228    key handed in with the `address` dictionary.
 229  * *match_tokens* is the list of token IDs used to find the corresponding
 230    place object for the address part. The list is matched against the IDs
 231    from `token_get_name_match_tokens`.
 232  * *search_tokens* is the list of token IDs under which to search the address
 233    item. It is used when no corresponding place object was found.
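
The shape of the returned rows can be mirrored in Python. The `addr` key and its nested `match` and `search` lists are a hypothetical `token_info` layout chosen for this sketch. The real function is PL/pgSQL and returns `token_addresstoken` rows.

```python
from typing import List, NamedTuple


class AddressToken(NamedTuple):
    key: str
    match_tokens: List[int]
    search_tokens: List[int]


def address_tokens(info):
    """Yield one row per explicit address item stored under the `addr` key."""
    for key, item in sorted(info.get("addr", {}).items()):
        if key in ("street", "place"):
            continue  # covered by the dedicated addr:street / addr:place functions
        yield AddressToken(key, item["match"], item["search"])
```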
 234
 235 ```sql
 236 FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
 237 ```
 238
 239 Return the normalized version of the given postcode. This function must return
 240 the same value as the Python function `AbstractAnalyzer.normalize_postcode()`.
 241
 242 ```sql
 243 FUNCTION token_strip_info(info JSONB) RETURNS JSONB
 244 ```
 245
 246 Return the part of the `token_info` field that should be stored in the database
 247 permanently. The indexer calls this function when all processing is done and
 248 replaces the content of the `token_info` column with the returned value before
 249 the trigger stores the information in the database. May return NULL if no
 250 information should be stored permanently.
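
In Python terms, stripping amounts to keeping a fixed set of keys and dropping the rest. The sketch below models the contract only. The default key `hnr` is a hypothetical example of information a tokenizer might want to keep.

```python
def strip_info(info, keep=("hnr",)):
    """Keep only the keys that must persist; return None if nothing remains."""
    kept = {key: info[key] for key in keep if key in info}
    return kept or None
```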