docs/customize/Tokenizers.md

   1 # Tokenizers
   2
   3 The tokenizer module in Nominatim is responsible for analysing the names given
   4 to OSM objects and the terms of an incoming query in order to make sure they
   5 can be matched appropriately.
   6
   7 Nominatim offers different tokenizer modules, which behave differently and have
   8 different configuration options. This section describes the tokenizers and how
   9 they can be configured.
  10
  11 !!! important
  12     The use of a tokenizer is tied to a database installation. You need to choose
  13     and configure the tokenizer before starting the initial import. Once the import
  14     is done, you cannot switch to another tokenizer anymore. Reconfiguring the
  15     chosen tokenizer is very limited as well. See the comments in each tokenizer
  16     section.
  17
  18 ## ICU tokenizer
  19
  20 The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
  21 normalize names and queries. It also offers configurable decomposition and
  22 abbreviation handling.
  23 This tokenizer is currently the default.
  24
  25 To enable the tokenizer, add the following line to your project configuration:
  26
  27 ```
  28 NOMINATIM_TOKENIZER=icu
  29 ```
  30
  31 ### How it works
  32
  33 On import the tokenizer processes names in the following three stages:
  34
  35 1. During the **Sanitizer step** incoming names are cleaned up and converted to
  36    **full names**. This step can be used to regularize spelling, split multi-name
  37    tags into their parts and tag names with additional attributes. See the
  38    [Sanitizers section](#sanitizers) below for available cleaning routines.
  39 2. The **Normalization** part removes all information from the full names
  40    that is not relevant for search.
  41 3. The **Token analysis** step takes the normalized full names and creates
  42    all transliterated variants under which the name should be searchable.
  43    See the [Token analysis](#token-analysis) section below for more
  44    information.
  45
  46 During query time, only normalization and transliteration are relevant.
  47 An incoming query is first split into name chunks (this usually means splitting
  48 the string at the commas) and then each part is normalised and transliterated.
  49 The result is used to look up places in the search index.
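
The query-time steps can be sketched in a few lines of Python. This is an illustration only: `str.lower()` and the standard `unicodedata` module stand in for the configured ICU normalization and transliteration rules, and all function names are hypothetical rather than part of Nominatim.

```python
import unicodedata


def normalize(text: str) -> str:
    """Stand-in for the ICU normalization rules: lower-case and clean up spaces."""
    return " ".join(text.lower().split())


def transliterate(text: str) -> str:
    """Stand-in for ICU transliteration: decompose, then drop non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def query_tokens(query: str) -> list[str]:
    """Split a query at commas, then normalize and transliterate each part."""
    parts = [part for part in query.split(",") if part.strip()]
    return [transliterate(normalize(part)) for part in parts]


print(query_tokens("Rue de la Paix,  Genève"))
```

Real ICU rules are far more capable than this ASCII folding (for example, they can transliterate between scripts), so the sketch only shows the order of the steps.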
  50
  51 ### Configuration
  52
  53 The ICU tokenizer is configured using a YAML file which can be set using
  54 `NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import and then
  55 saved as part of the internal database status. Later changes to the variable
  56 have no effect.
  57
  58 Here is an example configuration file:
  59
  60 ``` yaml
  61 normalization:
  62     - ":: lower ()"
  63     - "ß > 'ss'" # German eszett is unambiguously equal to double ss
  64 transliteration:
  65     - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
  66     - ":: Ascii ()"
  67 sanitizers:
  68     - step: split-name-list
  69 token-analysis:
  70     - analyzer: generic
  71       variants:
  72           - !include icu-rules/variants-ca.yaml
  73           - words:
  74               - road -> rd
  75               - bridge -> bdge,br,brdg,bri,brg
  76       mutations:
  77           - pattern: 'ä'
  78             replacements: ['ä', 'ae']
  79 ```
  80
  81 The configuration file contains four sections:
  82 `normalization`, `transliteration`, `sanitizers` and `token-analysis`.
  83
  84 #### Normalization and Transliteration
  85
  86 The normalization and transliteration sections each define a set of
  87 ICU rules that are applied to the names.
  88
  89 The **normalization** rules are applied after sanitation. They should remove
  90 any information that is not relevant for search at all. Usual rules to be
  91 applied here are: lower-casing, removing of special characters, cleanup of
  92 spaces.
  93
  94 The **transliteration** rules are applied at the end of the tokenization
  95 process to transfer the name into an ASCII representation. Transliteration can
  96 be useful to allow for further fuzzy matching, especially between different
  97 scripts.
  98
  99 Each section must contain a list of
 100 [ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
 101 The rules are applied in the order in which they appear in the file.
 102 You can also include additional rules from an external YAML file using the
 103 `!include` tag. The included file must contain a valid YAML list of ICU rules
 104 and may again include other files.
 105
 106 !!! warning
 107     The ICU rule syntax contains special characters that conflict with the
 108     YAML syntax. You should therefore always enclose the ICU rules in
 109     double-quotes.
 110
 111 #### Sanitizers
 112
 113 The sanitizers section defines an ordered list of functions that are applied
 114 to the name and address tags before they are further processed by the tokenizer.
 115 They allow you to clean up the tagging and bring it to a standardized form more
 116 suitable for building the search index.
 117
 118 !!! hint
 119     Sanitizers only have an effect on how the search index is built. They
 120     do not change the information about each place that is saved in the
 121     database. In particular, they have no influence on how the results are
 122     displayed. The returned results always show the original information as
 123     stored in the OpenStreetMap database.
 124
 125 Each entry contains information about a sanitizer to be applied. It has a
 126 mandatory parameter `step` which gives the name of the sanitizer. Depending
 127 on the type, it may have additional parameters to configure its operation.
 128
 129 The order of the list matters. The sanitizers are applied exactly in the order
 130 that is configured. Each sanitizer works on the results of the previous one.
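
Why the order matters can be seen in a minimal sketch of a sanitizer chain. The two functions are simplified, hypothetical stand-ins whose names only echo `split-name-list` and `strip-brace-terms`. They do not reproduce the behaviour or options of the sanitizers shipped with Nominatim.

```python
import re
from typing import Callable

Sanitizer = Callable[[list[str]], list[str]]


def split_names(names: list[str]) -> list[str]:
    """Hypothetical sanitizer: split multi-name values at ';'."""
    return [part.strip() for name in names for part in name.split(";") if part.strip()]


def strip_braces(names: list[str]) -> list[str]:
    """Hypothetical sanitizer: add a copy of each name without '(...)' terms."""
    result = []
    for name in names:
        result.append(name)
        stripped = re.sub(r"\s*\([^)]*\)", "", name).strip()
        if stripped and stripped != name:
            result.append(stripped)
    return result


def run_sanitizers(names: list[str], steps: list[Sanitizer]) -> list[str]:
    """Apply the steps in the configured order, feeding each the previous result."""
    for step in steps:
        names = step(names)
    return names


print(run_sanitizers(["Halle (Saale);Halle"], [split_names, strip_braces]))
print(run_sanitizers(["Halle (Saale);Halle"], [strip_braces, split_names]))
```

The two calls return different lists for the same input, which is why the configuration is an ordered list rather than a set.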
 131
 132 The following is a list of sanitizers that are shipped with Nominatim.
 133
 134 ##### split-name-list
 135
 136 ::: nominatim_db.tokenizer.sanitizers.split_name_list
 137     options:
 138         members: False
 139         heading_level: 6
 140         docstring_section_style: spacy
 141
 142 ##### strip-brace-terms
 143
 144 ::: nominatim_db.tokenizer.sanitizers.strip_brace_terms
 145     options:
 146         members: False
 147         heading_level: 6
 148         docstring_section_style: spacy
 149
 150 ##### tag-analyzer-by-language
 151
 152 ::: nominatim_db.tokenizer.sanitizers.tag_analyzer_by_language
 153     options:
 154         members: False
 155         heading_level: 6
 156         docstring_section_style: spacy
 157
 158 ##### clean-housenumbers
 159
 160 ::: nominatim_db.tokenizer.sanitizers.clean_housenumbers
 161     options:
 162         members: False
 163         heading_level: 6
 164         docstring_section_style: spacy
 165
 166 ##### clean-postcodes
 167
 168 ::: nominatim_db.tokenizer.sanitizers.clean_postcodes
 169     options:
 170         members: False
 171         heading_level: 6
 172         docstring_section_style: spacy
 173
 174 ##### clean-tiger-tags
 175
 176 ::: nominatim_db.tokenizer.sanitizers.clean_tiger_tags
 177     options:
 178         members: False
 179         heading_level: 6
 180         docstring_section_style: spacy
 181
 182 ##### delete-tags
 183
 184 ::: nominatim_db.tokenizer.sanitizers.delete_tags
 185     options:
 186         members: False
 187         heading_level: 6
 188         docstring_section_style: spacy
 189
 190 ##### tag-japanese
 191
 192 ::: nominatim_db.tokenizer.sanitizers.tag_japanese
 193     options:
 194         members: False
 195         heading_level: 6
 196         docstring_section_style: spacy
 197
 198 #### Token Analysis
 199
 200 Token analyzers take a full name and transform it into one or more normalized
 201 forms that are then saved in the search index. In its simplest form, the
 202 analyzer only applies the transliteration rules. More complex analyzers
 203 create additional spelling variants of a name. This is useful to handle
 204 decomposition and abbreviation.
 205
 206 The ICU tokenizer may use different analyzers for different names. To select
 207 the analyzer to be used, the name must be tagged with the `analyzer` attribute
 208 by a sanitizer (see for example the
 209 [tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).
 210
 211 The token-analysis section contains the list of configured analyzers. Each
 212 analyzer must have an `id` parameter that uniquely identifies the analyzer.
 213 The only exception is the default analyzer that is used when no special
 214 analyzer was selected. Two analyzer ids have a special meaning:
 215
 216  * '@housenumber'. If an analyzer with that name is present, it is used
 217    for normalization of house numbers.
 218  * '@postcode'. If an analyzer with that name is present, it is used
 219    for normalization of postcodes.
 220
 221 Different analyzer implementations may exist. To select the implementation,
 222 the `analyzer` parameter must be set. The different implementations are
 223 described in the following.
 224
 225 ##### Generic token analyzer
 226
 227 The generic analyzer `generic` is able to create variants from a list of given
 228 abbreviation and decomposition replacements and introduce spelling variations.
 229
 230 ###### Variants
 231
 232 The optional 'variants' section defines lists of replacements which create alternative
 233 spellings of a name. To create the variants, a name is scanned from left to
 234 right and the longest matching replacement is applied until the end of the
 235 string is reached.
 236
 237 The variants section must contain a list of replacement groups. Each group
 238 defines a set of properties that describes where the replacements are
 239 applicable. In addition, the `words` section defines the list of replacements
 240 to be made. The basic replacement description is of the form:
 241
 242 ```
 243 <source>[,<source>[...]] => <target>[,<target>[...]]
 244 ```
 245
 246 The left side contains one or more `source` terms to be replaced. The right side
 247 lists one or more replacements. Each source is replaced with each replacement
 248 term.
 249
 250 !!! tip
 251     The source and target terms are internally normalized using the
 252     normalization rules given in the configuration. This ensures that the
 253     strings match as expected. In fact, it is better to use unnormalized
 254     words in the configuration because then it is possible to change the
 255     rules for normalization later without having to adapt the variant rules.
 256
 257 ###### Decomposition
 258
 259 In its standard form, only full words match against the source. There
 260 is a special notation to match the prefix and suffix of a word:
 261
 262 ``` yaml
 263 - ~strasse => str  # matches "strasse" as full word and in suffix position
 264 - hinter~ => hntr  # matches "hinter" as full word and in prefix position
 265 ```
 266
 267 There is no facility to match a string in the middle of the word. The suffix
 268 and prefix notation automatically trigger the decomposition mode: two variants
 269 are created for each replacement, one with the replacement attached to the word
 270 and one separate. So in the above example, the tokenization of "hauptstrasse" will
 271 create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
 272 triggers the variants "rote str" and "rotestr". By having decomposition work
 273 both ways, it is sufficient to create the variants at index time. The variant
 274 rules are not applied at query time.
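
The two-way behaviour of decomposition can be sketched as follows. The function name `suffix_variants` is hypothetical, and the sketch handles a single suffix rule and only the first match in a name. The actual analyzer scans the whole name against all configured rules.

```python
def suffix_variants(name: str, source: str, target: str) -> set[str]:
    """Variants for one rule of the form `~source => target`."""
    words = name.split()
    for i, word in enumerate(words):
        if not word.endswith(source):
            continue
        before, after = words[:i], words[i + 1:]
        stem = word[: len(word) - len(source)]
        if not stem and before:
            # Full-word match: the preceding word takes the role of the stem.
            stem = before.pop()
        # Decomposition mode: one attached and one separate form.
        forms = [stem + target, f"{stem} {target}"] if stem else [target]
        return {" ".join(before + [form] + after) for form in forms}
    return {name}  # rule does not apply


print(sorted(suffix_variants("hauptstrasse", "strasse", "str")))
print(sorted(suffix_variants("rote strasse", "strasse", "str")))
```

Both inputs yield an attached and a separate form, so a query written either way finds the same indexed variants.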
 275
 276 To avoid automatic decomposition, use the '|' notation:
 277
 278 ``` yaml
 279 - ~strasse |=> str
 280 ```
 281
 282 simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
 283
 284 ###### Initial and final terms
 285
 286 It is also possible to restrict replacements to the beginning and end of a
 287 name:
 288
 289 ``` yaml
 290 - ^south => s  # matches only at the beginning of the name
 291 - road$ => rd  # matches only at the end of the name
 292 ```
 293
 294 So the first example would trigger a replacement for "south 45th street" but
 295 not for "the south beach restaurant".
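
A small sketch of the anchoring logic, assuming names are compared word by word. The helper `apply_anchored` is hypothetical and only illustrates the `^` and `$` markers, not the rest of the rule syntax.

```python
def apply_anchored(name: str, source: str, target: str) -> str:
    """Apply a replacement restricted by `^` (name start) or `$` (name end)."""
    words = name.split()
    if source.startswith("^"):
        term = source[1:].split()
        # Replace only when the name begins with the source term.
        if term and words[: len(term)] == term:
            words[: len(term)] = [target]
    elif source.endswith("$"):
        term = source[:-1].split()
        # Replace only when the name ends with the source term.
        if term and words[-len(term):] == term:
            words[-len(term):] = [target]
    return " ".join(words)


print(apply_anchored("south 45th street", "^south", "s"))
print(apply_anchored("the south beach restaurant", "^south", "s"))
print(apply_anchored("abbey road", "road$", "rd"))
```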
 296
 297 ###### Replacements vs. variants
 298
 299 The replacement syntax `source => target` works as a pure replacement. It changes
 300 the name instead of creating a variant. To create an additional version, you'd
 301 have to write `source => source,target`. As this is a frequent case, there is
 302 a shortcut notation for it:
 303
 304 ```
 305 <source>[,<source>[...]] -> <target>[,<target>[...]]
 306 ```
 307
 308 The simple arrow causes an additional variant to be added. Note that
 309 decomposition has an effect here on the source as well. So a rule
 310
 311 ``` yaml
 312 - "~strasse -> str"
 313 ```
 314
 315 means that for a word like `hauptstrasse` four variants are created:
 316 `hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.
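
The difference between the two arrows can be made concrete with a tiny parser. The function `parse_rule` is a hypothetical illustration that ignores the `~`, `^`, `$` and `|` markers and handles only plain terms.

```python
def parse_rule(rule: str) -> dict[str, list[str]]:
    """Map each source term to the spellings that end up in the index."""
    if "=>" in rule:
        left, right = rule.split("=>")
        keep_source = False  # pure replacement
    else:
        left, right = rule.split("->")
        keep_source = True   # shortcut: the source is kept as a variant
    sources = [term.strip() for term in left.split(",")]
    targets = [term.strip() for term in right.split(",")]
    return {src: ([src] if keep_source else []) + targets for src in sources}


print(parse_rule("saint,sankt => st"))
print(parse_rule("bridge -> bdge,br"))
```

With `=>` the source spelling disappears from the result, while `->` keeps it alongside the targets.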
 317
 318 ###### Mutations
 319
 320 The 'mutations' section in the configuration describes an additional set of
 321 replacements to be applied after the variants have been computed.
 322
 323 Each mutation is described by two parameters: `pattern` and `replacements`.
 324 The pattern must contain a single regular expression to search for in the
 325 variant name. The regular expressions need to follow the syntax for
 326 [Python regular expressions](https://docs.python.org/3/library/re.html#regular-expression-syntax).
 327 Capturing groups are not permitted.
 328 `replacements` must contain a list of strings that the pattern
 329 should be replaced with. Each occurrence of the pattern is replaced with
 330 all given replacements. Be mindful of combinatorial explosion of variants.
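
To get a feeling for how quickly the number of variants grows, here is a minimal sketch using Python's `re` and `itertools` modules. The function `mutate` is a hypothetical helper, not the analyzer's implementation.

```python
import itertools
import re


def mutate(variant: str, pattern: str, replacements: list[str]) -> list[str]:
    """Replace every occurrence of `pattern` with each replacement, in all combinations."""
    # Without capturing groups, re.split() returns only the text between matches.
    parts = re.split(pattern, variant)
    occurrences = len(parts) - 1
    results = []
    for combo in itertools.product(replacements, repeat=occurrences):
        pieces = [parts[0]]
        for replacement, tail in zip(combo, parts[1:]):
            pieces.append(replacement)
            pieces.append(tail)
        results.append("".join(pieces))
    return results


print(mutate("jägerstätter", "ä", ["ä", "ae"]))
```

A name with n occurrences of the pattern and k replacements produces k to the power of n variants: two occurrences with two replacements give four spellings, while four occurrences with three replacements already give 81.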
 331
 332 ###### Modes
 333
 334 The generic analyzer supports a special mode `variant-only`. When configured,
 335 it consumes the input token and emits only variants (if any exist). Enable
 336 the mode by adding:
 337
 338 ```
 339   mode: variant-only
 340 ```
 341
 342 to the analyzer configuration.
 343
 344 ##### Housenumber token analyzer
 345
 346 The analyzer `housenumbers` is purpose-made to analyze house numbers. It
 347 creates variants with optional spaces between numbers and letters. Thus,
 348 house numbers of the form '3 a', '3A', '3-A' etc. are all considered equivalent.
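
One way to picture this equivalence is a function that maps all spellings onto a common key. This is an assumption made for illustration only: the analyzer itself creates variants rather than a single key, and `housenumber_key` is a hypothetical name.

```python
import re


def housenumber_key(housenumber: str) -> str:
    """Hypothetical canonical form: lower-case, no space or hyphen between digit and letter."""
    text = housenumber.strip().lower()
    return re.sub(r"(?<=\d)[\s-]+(?=[a-z])", "", text)


for spelling in ("3 a", "3A", "3-A"):
    print(spelling, "->", housenumber_key(spelling))
```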
 349
 350 The analyzer cannot be customized.
 351
 352 ##### Postcode token analyzer
 353
 354 The analyzer `postcodes` is purpose-made to analyze postcodes. It supports
 355 a 'lookup' variant of the token, which produces variants with optional
 356 spaces. Use together with the clean-postcodes sanitizer.
 357
 358 The analyzer cannot be customized.
 359
 360 ### Reconfiguration
 361
 362 Changing the configuration after the import is currently not possible, although
 363 this feature may be added at a later time.