X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/34d27ed45cac67fd155f6997d82fa6a5d8946ce9..7aa0aba382ae97fd986f8cbe33d6f2c2ff22eb9d:/docs/develop/ICU-Tokenizer-Modules.md?ds=sidebyside diff --git a/docs/develop/ICU-Tokenizer-Modules.md b/docs/develop/ICU-Tokenizer-Modules.md index 51b189f1..2cf30a56 100644 --- a/docs/develop/ICU-Tokenizer-Modules.md +++ b/docs/develop/ICU-Tokenizer-Modules.md @@ -5,7 +5,14 @@ highly customizable method to pre-process and normalize the name information of the input data before it is added to the search index. It comes with a selection of sanitizers and token analyzers which you can use to adapt your installation to your needs. If the provided modules are not enough, you can -also provide your own implementations. This section describes how to do that. +also provide your own implementations. This section describes the API +of sanitizers and token analysis. + +!!! warning + This API is currently in early alpha status. While this API is meant to + be a public API on which other sanitizers and token analyzers may be + implemented, it is not guaranteed to be stable at the moment. + ## Using non-standard sanitizers and token analyzers @@ -50,9 +57,9 @@ the function. show_source: no heading_level: 6 -### The sanitation function +### The main filter function of the sanitizer -The sanitation function receives a single object of type `ProcessInfo` +The filter function receives a single object of type `ProcessInfo` which has with three members: * `place`: read-only information about the place being processed. @@ -82,11 +89,60 @@ adding extra attributes) or completely replace the list with a different one. show_source: no heading_level: 6 -## Custom token analysis module -Setup of a token analyser is split into two parts: configuration and -analyser factory. A token analysis module must therefore implement two -functions: +### Example: Filter for US street prefixes + +The following sanitizer removes the directional prefixes from street names +in the US: + +``` python +import re + +def _filter_function(obj): + if obj.place.country_code == 'us' \ + and obj.place.rank_address >= 26 and obj.place.rank_address <= 27: + for name in obj.names: + name.name = re.sub(r'^(north|south|west|east) ', + '', + name.name, + flags=re.IGNORECASE) + +def create(config): + return _filter_function +``` + +This is the most simple form of a sanitizer module. If defines a single +filter function and implements the required `create()` function by returning +the filter. + +The filter function first checks if the object is interesting for the +sanitizer. Namely it checks if the place is in the US (through `country_code`) +and it the place is a street (a `rank_address` of 26 or 27). If the +conditions are met, then it goes through all available names and +removes any leading directional prefix using a simple regular expression. + +Save the source code in a file in your project directory, for example as +`us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`: + +``` yaml +... +sanitizers: + - step: us_streets.py +... +``` + +!!! warning + This example is just a simplified show case on how to create a sanitizer. + It is not really read for real-world use: while the sanitizer would + correcly transform `West 5th Street` into `5th Street`. it would also + shorten a simple `North Street` to `Street`. + +For more sanitizer examples, have a look at the sanitizers provided by Nominatim. +They can be found in the directory +[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers). + + +## Custom token analysis module ::: nominatim.tokenizer.token_analysis.base.AnalysisModule rendering: @@ -98,3 +154,74 @@ functions: rendering: show_source: no heading_level: 6 + +### Example: Creating acronym variants for long names + +The following example of a token analysis module creates acronyms from +very long names and adds them as a variant: + +``` python +class AcronymMaker: + """ This class is the actual analyzer. + """ + def __init__(self, norm, trans): + self.norm = norm + self.trans = trans + + + def get_canonical_id(self, name): + # In simple cases, the normalized name can be used as a canonical id. + return self.norm.transliterate(name.name).strip() + + + def compute_variants(self, name): + # The transliterated form of the name always makes up a variant. + variants = [self.trans.transliterate(name)] + + # Only create acronyms from very long words. + if len(name) > 20: + # Take the first letter from each word to form the acronym. + acronym = ''.join(w[0] for w in name.split()) + # If that leds to an acronym with at least three letters, + # add the resulting acronym as a variant. + if len(acronym) > 2: + # Never forget to transliterate the variants before returning them. + variants.append(self.trans.transliterate(acronym)) + + return variants + +# The following two functions are the module interface. + +def configure(rules, normalizer, transliterator): + # There is no configuration to parse and no data to set up. + # Just return an empty configuration. + return None + + +def create(normalizer, transliterator, config): + # Return a new instance of our token analysis class above. + return AcronymMaker(normalizer, transliterator) +``` + +Given the name `Trans-Siberian Railway`, the code above would return the full +name `Trans-Siberian Railway` and the acronym `TSR` as variant, so that +searching would work for both. + +## Sanitizers vs. Token analysis - what to use for variants? + +It is not always clear when to implement variations in the sanitizer and +when to write a token analysis module. Just take the acronym example +above: it would also have been possible to write a sanitizer which adds the +acronym as an additional name to the name list. The result would have been +similar. So which should be used when? + +The most important thing to keep in mind is that variants created by the +token analysis are only saved in the word lookup table. They do not need +extra space in the search index. If there are many spelling variations, this +can mean quite a significant amount of space is saved. + +When creating additional names with a sanitizer, these names are completely +independent. In particular, they can be fed into different token analysis +modules. This gives a much greater flexibility but at the price that the +additional names increase the size of the search index. +