X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/e7574f119eaab63723165f8139455d8af365a21e..3127d59613c54c58a77784ece4c0e2de02d5a282:/docs/develop/ICU-Tokenizer-Modules.md

diff --git a/docs/develop/ICU-Tokenizer-Modules.md b/docs/develop/ICU-Tokenizer-Modules.md
index 2427ab11..daadf899 100644
--- a/docs/develop/ICU-Tokenizer-Modules.md
+++ b/docs/develop/ICU-Tokenizer-Modules.md
@@ -53,27 +53,38 @@ the function.
 ### Sanitizer configuration
 
 ::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
-    rendering:
-        show_source: no
+    options:
         heading_level: 6
 
-### The sanitation function
+### The main filter function of the sanitizer
 
-The sanitation function receives a single object of type `ProcessInfo`
+The filter function receives a single object of type `ProcessInfo`
 which has with three members:
 
- * `place`: read-only information about the place being processed.
+ * `place: PlaceInfo`: read-only information about the place being processed.
    See PlaceInfo below.
- * `names`: The current list of names for the place. Each name is a
-   PlaceName object.
- * `address`: The current list of address names for the place. Each name
-   is a PlaceName object.
+ * `names: List[PlaceName]`: The current list of names for the place.
+ * `address: List[PlaceName]`: The current list of address names for the place.
 
 While the `place` member is provided for information only, the `names` and
 `address` lists are meant to be manipulated by the sanitizer. It may add and
 remove entries, change information within a single entry (for example by
 adding extra attributes) or completely replace the list with a different one.
 
+#### PlaceInfo - information about the place
+
+::: nominatim.data.place_info.PlaceInfo
+    options:
+        heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim.data.place_name.PlaceName
+    options:
+        heading_level: 6
+
+
 ### Example: Filter for US street prefixes
 
 The following sanitizer removes the directional prefixes from street names
@@ -102,58 +113,39 @@ the filter.
 The filter function first checks if the object is interesting for the
 sanitizer. Namely it checks if the place is in the US (through `country_code`)
 and it the place is a street (a `rank_address` of 26 or 27). If the
-conditions are met, then it goes through all available names and replaces
-any removes any leading direction prefix using a simple regular expression.
+conditions are met, then it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
 
 Save the source code in a file in your project directory, for example as
 `us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
 
-```
+``` yaml
 ...
 sanitizers:
     - step: us_streets.py
 ...
 ```
 
-For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
-They can be found in the directory `nominatim/tokenizer/sanitizers`.
-
 !!! warning
     This example is just a simplified show case on how to create a sanitizer.
     It is not really read for real-world use: while the sanitizer would
-    correcly transform `West 5th Street` into `5th Street`. it would also
+    correctly transform `West 5th Street` into `5th Street`. it would also
     shorten a simple `North Street` to `Street`.
 
-#### PlaceInfo - information about the place
-
-::: nominatim.data.place_info.PlaceInfo
-    rendering:
-        show_source: no
-        heading_level: 6
-
-
-#### PlaceName - extended naming information
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
 
-::: nominatim.data.place_name.PlaceName
-    rendering:
-        show_source: no
-        heading_level: 6
 
 
 ## Custom token analysis module
 
-Setup of a token analyser is split into two parts: configuration and
-analyser factory. A token analysis module must therefore implement two
-functions:
-
 ::: nominatim.tokenizer.token_analysis.base.AnalysisModule
-    rendering:
-        show_source: no
+    options:
         heading_level: 6
 
 ::: nominatim.tokenizer.token_analysis.base.Analyzer
-    rendering:
-        show_source: no
+    options:
         heading_level: 6
 
 ### Example: Creating acronym variants for long names