The tokenizer is the component of Nominatim that is responsible for
analysing names of OSM objects and queries. Nominatim provides different
tokenizers that use different strategies for normalisation. This page describes
how tokenizers are expected to work and the public API that needs to be
implemented when creating a new tokenizer. For information on how to configure
a specific tokenizer for a database, see the
[tokenizer chapter in the administration guide](../admin/Tokenizers.md).

## Generic Architecture

### About Search Tokens

Search in Nominatim is organised around search tokens. Such a token represents
a string that can be part of the search query. Tokens are used so that the
search index does not need to be organised around strings. Instead the database
saves for each place which tokens match this place's name, address, house
number etc. To be able to distinguish between these different types of
information stored with the place, a search token also always has a certain
type: name, house number, postcode etc.

During search an incoming query is transformed into an ordered list of such
search tokens (or rather many lists, see below) and this list is then converted
into a database query to find the right place.
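
To make the idea concrete, here is a minimal sketch of how such typed tokens
might be modelled. The class, the token types and the token IDs are purely
illustrative and not part of the Nominatim codebase:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    NAME = auto()
    HOUSENUMBER = auto()
    POSTCODE = auto()

@dataclass(frozen=True)
class SearchToken:
    """Illustrative search token: a typed reference to an indexed word."""
    token_type: TokenType
    token_id: int          # ID stored in the search tables instead of a string

# One possible tokenization of the query "Hauptstrasse 134, 10827 Berlin":
tokens = [
    SearchToken(TokenType.NAME, 523),         # "hauptstrasse"
    SearchToken(TokenType.HOUSENUMBER, 18),   # "134"
    SearchToken(TokenType.POSTCODE, 701),     # "10827"
    SearchToken(TokenType.NAME, 42),          # "berlin"
]
```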

It is the core task of the tokenizer to create, manage and assign the search
tokens. The tokenizer is involved in two distinct operations:

* __at import time__: scanning names of OSM objects, normalizing them and
  building up the list of search tokens.
* __at query time__: scanning the query and returning the appropriate search
  tokens.

### Importing

The indexer is responsible for enriching an OSM object (or place) with all data
required for geocoding. It is split into two parts: the controller collects
the places that require updating, enriches the place information as required
and hands the place to PostgreSQL. The controller is part of the Nominatim
library written in Python. Within PostgreSQL, the `placex_update`
trigger is responsible for filling out all secondary tables with extra
geocoding information. This part is written in PL/pgSQL.

The tokenizer is involved in both parts. When the indexer prepares a place,
it hands it over to the tokenizer to inspect the names and create all the
search tokens applicable for the place. This usually involves updating the
tokenizer's internal token lists and creating a list of all token IDs for
the specific place. This list is later needed in the PL/pgSQL part where the
indexer needs to add the token IDs to the appropriate search tables. To be
able to communicate the list between the Python part and the PL/pgSQL trigger,
the `placex` table contains a special JSONB column `token_info` which is there
for the exclusive use of the tokenizer.
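
Because the layout of `token_info` is private to the tokenizer, each
implementation can choose its own format. A hypothetical tokenizer might hand
the indexer a structure like the following, where all keys and IDs are
invented for illustration:

```python
# What a tokenizer might return for one place; the indexer serialises this
# dict to JSON and writes it into the `token_info` column of `placex`.
token_info = {
    "names": [523, 42, 108],     # search token IDs for the place's names
    "name_matches": [523],       # stricter full-name tokens for address matching
    "hnr_tokens": [18],          # house number token IDs
    "hnr": "134",                # normalized house number text
    "street_matches": [77, 91],  # match tokens derived from addr:street
}
```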

The Python part of the tokenizer returns structured information about the
tokens of a place to the indexer, which converts it to JSON and inserts it into
the `token_info` column. The content of the column is then handed to the
PL/pgSQL callbacks of the tokenizer, which extract the required information.
Usually the tokenizer then removes all information from the `token_info`
structure, so that no information is ever persistently saved in the table.
After all, all information that went in should have been processed and put
into secondary tables. This is, however, not a hard requirement. If the
tokenizer needs to store additional information about a place permanently, it
may do so in the `token_info` column. It just must never execute searches over
it and, consequently, must not create any special indexes on it.

### Querying

At query time, Nominatim builds up multiple _interpretations_ of the search
query. Each of these interpretations is tried against the database in order
of the likelihood with which they match the search query. The first
interpretation that yields results wins.

The interpretations are encapsulated in the `SearchDescription` class. An
instance of this class is created by applying a sequence of
_search tokens_ to an initially empty `SearchDescription`. It is the
responsibility of the tokenizer to parse the search query and derive all
possible sequences of search tokens. To that end the tokenizer needs to parse
the search query and look up matching words in its own data structures.

## Tokenizer API

The following section describes the functions that need to be implemented
for a custom tokenizer implementation.

!!! warning
    This API is currently in early alpha status. While this API is meant to
    be a public API on which other tokenizers may be implemented, it is
    far from stable at the moment.

### Directory Structure

Nominatim expects two files for a tokenizer:

* `nominatim/tokenizer/<NAME>_tokenizer.py` containing the Python part of the
  implementation
* `lib-php/tokenizer/<NAME>_tokenizer.php` with the PHP part of the
  implementation

where `<NAME>` is a unique name for the tokenizer consisting of only lower-case
letters, digits and underscore. A tokenizer also needs to install some SQL
functions. By convention, these should be placed in `lib-sql/tokenizer`.

If the tokenizer has a default configuration file, this should be saved as
`settings/<NAME>_tokenizer.<SUFFIX>`.
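
For example, a hypothetical tokenizer called `mytok` would add the following
files. The name and the configuration suffix are invented here; the suffix
depends on the format the tokenizer uses:

```
nominatim/tokenizer/mytok_tokenizer.py     # Python implementation
lib-php/tokenizer/mytok_tokenizer.php      # PHP query-time implementation
lib-sql/tokenizer/mytok_tokenizer.sql      # SQL functions (by convention)
settings/mytok_tokenizer.yaml              # optional default configuration
```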

### Configuration and Persistence

Tokenizers may define custom settings for their configuration. All settings
must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
persistent. Transient settings are loaded from the configuration file when
Nominatim is started and may thus be changed at any time. Persistent settings
are tied to a database installation and must only be read during installation
time. If they are needed at runtime, they must be saved into the
`nominatim_properties` table and later loaded from there.
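
A minimal sketch of how a tokenizer might persist such a setting, assuming a
`nominatim_properties` table with `property` and `value` columns and using
`psycopg2` directly. Real code would go through Nominatim's own connection
helpers instead:

```python
from typing import Optional

import psycopg2

def save_property(dsn: str, name: str, value: str) -> None:
    """Persist a tokenizer setting so it is available after installation."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Replace any previous value for this property.
            cur.execute("DELETE FROM nominatim_properties WHERE property = %s",
                        (name,))
            cur.execute("INSERT INTO nominatim_properties (property, value) "
                        "VALUES (%s, %s)", (name, value))

def load_property(dsn: str, name: str) -> Optional[str]:
    """Read a persistent setting back at runtime."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT value FROM nominatim_properties "
                        "WHERE property = %s", (name,))
            row = cur.fetchone()
            return row[0] if row else None
```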

### The Python module

The Python module is expected to export a single factory function:

```python
def create(dsn: str, data_dir: Path) -> AbstractTokenizer
```

The `dsn` parameter contains the DSN of the Nominatim database. The `data_dir`
is a directory in the project directory that the tokenizer may use to save
database-specific data. The function must return an instance of the tokenizer
class as defined below.
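
A sketch of what such a factory might look like for the hypothetical `mytok`
tokenizer. `MyTokenizer` is an invented stand-in for a concrete tokenizer
class:

```python
from pathlib import Path

# In a real tokenizer, MyTokenizer would derive from
# nominatim.tokenizer.base.AbstractTokenizer and implement all abstract
# methods documented below; it is only stubbed out here.
class MyTokenizer:
    def __init__(self, dsn: str, data_dir: Path) -> None:
        self.dsn = dsn            # DSN of the Nominatim database
        self.data_dir = data_dir  # project-local directory for private data

def create(dsn: str, data_dir: Path) -> MyTokenizer:
    """Factory function that Nominatim calls to instantiate the tokenizer."""
    return MyTokenizer(dsn, data_dir)
```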

### Python Tokenizer Class

All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
and implement the abstract functions defined there.

::: nominatim.tokenizer.base.AbstractTokenizer
    rendering:
        heading_level: 4

### Python Analyzer Class

::: nominatim.tokenizer.base.AbstractAnalyzer
    rendering:
        heading_level: 4

### PL/pgSQL Functions

The tokenizer must provide access functions for the `token_info` column
to the indexer, which extracts the necessary information for the global
search tables. If the tokenizer needs additional SQL functions for private
use, then these functions must be prefixed with `token_` in order to ensure
that there are no naming conflicts with the SQL indexer code.

The following functions are expected:

```sql
FUNCTION token_get_name_search_tokens(info JSONB) RETURNS INTEGER[]
```

Return an array of token IDs of search terms that should match
the name(s) for the given place. These tokens are used to look up the place
by name and, where the place functions as part of an address for another place,
by address. Must return NULL when the place has no name.

```sql
FUNCTION token_get_name_match_tokens(info JSONB) RETURNS INTEGER[]
```

Return an array of token IDs of full names of the place that should be used
to match addresses. The list of match tokens is usually stricter than the
list of search tokens, as it is used to find a match between two OSM tag values
which are expected to contain matching full names. Partial terms should not be
used for match tokens. Must return NULL when the place has no name.

```sql
FUNCTION token_get_housenumber_search_tokens(info JSONB) RETURNS INTEGER[]
```

Return an array of token IDs of house number tokens that apply to the place.
Note that a place may have multiple house numbers, for example when apartments
each have their own number. Must be NULL when the place has no house numbers.

```sql
FUNCTION token_normalized_housenumber(info JSONB) RETURNS TEXT
```

Return the house number(s) in the normalized form that can be matched against
a house number token text. If a place has multiple house numbers, they must
be listed with a semicolon as delimiter. Must be NULL when the place has no
house numbers.

```sql
FUNCTION token_addr_street_match_tokens(info JSONB) RETURNS INTEGER[]
```

Return the match token IDs with which to search for a matching street from the
`addr:street` tag. These IDs will be matched against the IDs supplied by
`token_get_name_match_tokens`. Must be NULL when the place has no `addr:street`
tag.

```sql
FUNCTION token_addr_place_match_tokens(info JSONB) RETURNS INTEGER[]
```

Return the match token IDs with which to search for a matching place from the
`addr:place` tag. These IDs will be matched against the IDs supplied by
`token_get_name_match_tokens`. Must be NULL when the place has no `addr:place`
tag.

```sql
FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
```

Return the search token IDs extracted from the `addr:place` tag. These tokens
are used for searches by address when no matching place can be found in the
database. Must be NULL when the place has no `addr:place` tag.

```sql
CREATE TYPE token_addresstoken AS (
  key TEXT,
  match_tokens INT[],
  search_tokens INT[]
);

FUNCTION token_get_address_tokens(info JSONB) RETURNS SETOF token_addresstoken
```

Return the match and search token IDs for explicit `addr:*` tags for the place,
other than `addr:street` and `addr:place`. For each address item there are
three pieces of information returned:

* _key_ contains the type of address item (city, county, etc.). This is the
  key handed in with the `address` dictionary.
* *match_tokens* is the list of token IDs used to find the corresponding
  place object for the address part. The list is matched against the IDs
  from `token_get_name_match_tokens`.
* *search_tokens* is the list of token IDs under which to search the address
  item. It is used when no corresponding place object was found.

```sql
FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
```

Return the normalized version of the given postcode. This function must return
the same value as the Python function `AbstractAnalyzer.normalize_postcode()`.

```sql
FUNCTION token_strip_info(info JSONB) RETURNS JSONB
```

Return the part of the `token_info` field that should be stored in the database
permanently. The indexer calls this function when all processing is done and
replaces the content of the `token_info` column with the returned value before
the trigger stores the information in the database. May return NULL if no
information should be stored permanently.
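
To illustrate the contract these functions implement, the sketch below pairs
the hypothetical `token_info` payload from the import section with the value
each accessor would extract from it. All keys and IDs are invented:

```python
# Hypothetical token_info payload written by the Python side of the tokenizer.
token_info = {
    "names": [523, 42, 108],     # token_get_name_search_tokens -> {523,42,108}
    "name_matches": [523],       # token_get_name_match_tokens  -> {523}
    "hnr_tokens": [18],          # token_get_housenumber_search_tokens -> {18}
    "hnr": "134;134a",           # token_normalized_housenumber -> '134;134a'
    "street_matches": [77, 91],  # token_addr_street_match_tokens -> {77,91}
}

# token_strip_info() would typically return None (or an empty object) here,
# so that none of the above is kept in the placex table permanently.
```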

### PHP Tokenizer class

The PHP tokenizer class is instantiated once per request and is responsible
for analyzing the incoming query. Multiple requests may be in flight in
parallel.

The class is expected to be found under the
name `\Nominatim\Tokenizer`. To find the class, the PHP code includes the file
`tokenizer/tokenizer.php` in the project directory. This file must be created
when the tokenizer is first set up on import. The file should initialize any
configuration variables by setting PHP constants and then require the file
with the actual implementation of the tokenizer.

The tokenizer class must implement the following functions:

```php
public function __construct(object &$oDB)
```

The constructor of the class receives a database connection that can be used
to query persistent data in the database.

```php
public function checkStatus()
```

Check that the tokenizer can access its persistent data structures. If there
is an issue, throw an `\Exception`.

```php
public function normalizeString(string $sTerm) : string
```

Normalize the string to a form to be used for comparisons when reordering
results. Nominatim reweighs results by how well the final display string
matches the actual query. Before comparing result and query, names and query
are normalised with this function. The tokenizer can thus remove all
properties that should not be taken into account for reweighing, e.g. special
characters or case.
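
As an illustration of the kind of normalization meant here, the following
Python sketch lower-cases a string and strips diacritics and punctuation. An
actual tokenizer would implement the equivalent logic in its PHP class and
must keep it consistent with its import-time normalization:

```python
import unicodedata

def normalize_string(term: str) -> str:
    """Illustrative normalization: case-fold, strip accents and punctuation."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", term.casefold())
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c)[0] not in ("M", "P"))
    # Collapse runs of whitespace left over after stripping.
    return " ".join(stripped.split())

assert normalize_string("Rue   de l'Église") == "rue de leglise"
```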

```php
public function tokensForSpecialTerm(string $sTerm) : array
```

Return the list of special term tokens that match the given term.

```php
public function extractTokensFromPhrases(array &$aPhrases) : TokenList
```

Parse the given phrases, splitting them into word lists, and retrieve the
matching tokens.

The phrase array may take on two forms. In unstructured searches (using the
`q=` parameter) the search query is split at the commas and the elements are
put into an ordered list. For structured searches the phrase array is an
associative array where the key designates the type of the term (street, city,
county etc.). The tokenizer may ignore the phrase type at this stage of
parsing. Matching phrase type and appropriate search token type will be done
later when the `SearchDescription` is built.
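
The two forms side by side, shown here as Python literals for brevity; the
PHP code receives the equivalent arrays:

```python
# Unstructured search (`q=hauptstr 134, berlin`): an ordered list of phrases.
unstructured_phrases = ["hauptstr 134", "berlin"]

# Structured search (`street=hauptstr 134&city=berlin`): phrase type -> phrase.
structured_phrases = {
    "street": "hauptstr 134",
    "city": "berlin",
}
```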

For each phrase in the list of phrases, the function must analyse the phrase
string and then call `setWordSets()` to communicate the result of the analysis.
A word set is a list of strings, where each string refers to a search token.
A phrase may have multiple interpretations. Therefore a list of word sets is
usually attached to the phrase. The search tokens themselves are returned
by the function in an associative array, where the key corresponds to the
strings given in the word sets. The value is a list of search tokens. Thus
a single string in the list of word sets may refer to multiple search tokens.
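
A sketch of what this decomposition might produce for the phrase
"hauptstr 134". The word sets, token strings and token IDs are all invented
for illustration:

```python
# Two possible interpretations (word sets) of the phrase "hauptstr 134".
word_sets = [
    ["hauptstr", "134"],   # "hauptstr" as a street name, "134" as house number
    ["hauptstr 134"],      # the whole phrase as a single name
]

# Tokens returned by extractTokensFromPhrases(), keyed by word-set string.
# A single string may map to several tokens, e.g. both a house-number
# token and a partial-name token for "134".
tokens = {
    "hauptstr": [("NAME", 523)],
    "134": [("HOUSENUMBER", 18), ("NAME", 207)],
    "hauptstr 134": [("NAME", 894)],
}
```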