X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/e7574f119eaab63723165f8139455d8af365a21e..ad214753fcdd790916aa1b3b35e679616f4eeb07:/docs/develop/ICU-Tokenizer-Modules.md

diff --git a/docs/develop/ICU-Tokenizer-Modules.md b/docs/develop/ICU-Tokenizer-Modules.md
index 2427ab11..f19002c2 100644
--- a/docs/develop/ICU-Tokenizer-Modules.md
+++ b/docs/develop/ICU-Tokenizer-Modules.md
@@ -14,10 +14,11 @@ of sanitizers and token analysis.
 implemented, it is not guaranteed to be stable at the moment.
 
 
-## Using non-standard sanitizers and token analyzers
+## Using non-standard modules
 
-Sanitizer names (in the `step` property) and token analysis names (in the
-`analyzer`) may refer to externally supplied modules. There are two ways
+Sanitizer names (in the `step` property), token analysis names (in the
+`analyzer`) and query preprocessor names (in the `step` property)
+may refer to externally supplied modules. There are two ways
 to include external modules: through a library or from the project directory.
 
 To include a module from a library, use the absolute import path as name and
@@ -27,6 +28,47 @@ To use a custom module without creating a library, you can put the module
 somewhere in your project directory and then use the relative path to the
 file. Include the whole name of the file including the `.py` ending.
 
+## Custom query preprocessors
+
+A query preprocessor must export a single factory function `create` with
+the following signature:
+
+``` python
+create(config: QueryConfig) -> Callable[[list[Phrase]], list[Phrase]]
+```
+
+The function receives the custom configuration for the preprocessor and
+returns a callable (function or class) with the actual preprocessing
+code. When a query comes in, the callable gets a list of phrases
+and needs to return the transformed list of phrases. The list and phrases
+may be changed in place or a completely new list may be generated.
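The contract described above can be sketched as a small self-contained module. This is only an illustration: the `Phrase` class is a minimal local stand-in with the two fields the docs describe (a real preprocessor would import `nominatim_api.search.Phrase` instead), and `strip-suffix` is a made-up configuration option, not one Nominatim defines.

``` python
from dataclasses import dataclass
from typing import Any, Callable, List

# Minimal stand-in for nominatim_api.search.Phrase, which per the docs
# has exactly two fields. A real module would import the original class.
@dataclass
class Phrase:
    ptype: Any
    text: str

def create(config: dict) -> Callable[[List[Phrase]], List[Phrase]]:
    """ Factory function. `config` is the QueryConfig dictionary with the
        options from icu_tokenizer.yaml; 'strip-suffix' is a hypothetical
        option invented for this sketch.
    """
    suffix = config.get('strip-suffix', '')

    def _process(phrases: List[Phrase]) -> List[Phrase]:
        # Build a completely new list here (modifying the phrases in
        # place would be equally valid). Phrases that end up empty are
        # dropped from the result.
        result = []
        for phrase in phrases:
            text = phrase.text
            if suffix and text.endswith(suffix):
                text = text[:-len(suffix)].rstrip(' ,')
            if text:
                result.append(Phrase(phrase.ptype, text))
        return result

    return _process
```

The callable returned by `create()` is what Nominatim then invokes with the phrase list of each incoming query.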
+
+The `QueryConfig` is a simple dictionary which contains all configuration
+options given in the yaml configuration of the ICU tokenizer. It is up to
+the function to interpret the values.
+
+A `nominatim_api.search.Phrase` describes a part of the query that contains
+one or more independent search terms. Breaking a query into phrases helps
+reduce the number of possible tokens Nominatim has to take into account.
+However, a phrase break is definitive: a multi-term search word cannot go
+over a phrase break.
+A Phrase object has two fields:
+
+ * `ptype` further refines the type of phrase (see list below)
+ * `text` contains the query text for the phrase
+
+The order of phrases matters to Nominatim when doing further processing.
+Thus, while you may split or join phrases, you should not reorder them
+unless you really know what you are doing.
+
+Phrase types (`nominatim_api.search.PhraseType`) can further help narrow
+down how the tokens in the phrase are interpreted. The following phrase
+types are known:
+
+::: nominatim_api.search.PhraseType
+    options:
+        heading_level: 6
+
+
 ## Custom sanitizer modules
 
 A sanitizer module must export a single factory function `create` with the
@@ -52,48 +94,60 @@ the function.
 
 ### Sanitizer configuration
 
-::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
-    rendering:
-        show_source: no
+::: nominatim_db.tokenizer.sanitizers.config.SanitizerConfig
+    options:
        heading_level: 6
 
-### The sanitation function
+### The main filter function of the sanitizer
 
-The sanitation function receives a single object of type `ProcessInfo`
+The filter function receives a single object of type `ProcessInfo`
 which has three members:
 
-   * `place`: read-only information about the place being processed.
+   * `place: PlaceInfo`: read-only information about the place being processed.
      See PlaceInfo below.
-   * `names`: The current list of names for the place. Each name is a
-     PlaceName object.
-   * `address`: The current list of address names for the place. Each name
-     is a PlaceName object.
+   * `names: List[PlaceName]`: The current list of names for the place.
+   * `address: List[PlaceName]`: The current list of address names for the place.
 
 While the `place` member is provided for information only, the `names` and
 `address` lists are meant to be manipulated by the sanitizer. It may add and
 remove entries, change information within a single entry (for example by
 adding extra attributes) or completely replace the list with a different one.
 
+#### PlaceInfo - information about the place
+
+::: nominatim_db.data.place_info.PlaceInfo
+    options:
+        heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim_db.data.place_name.PlaceName
+    options:
+        heading_level: 6
+
+
 ### Example: Filter for US street prefixes
 
 The following sanitizer removes the directional prefixes from street names
 in the US:
 
-``` python
-import re
-
-def _filter_function(obj):
-    if obj.place.country_code == 'us' \
-       and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
-        for name in obj.names:
-            name.name = re.sub(r'^(north|south|west|east) ',
-                               '',
-                               name.name,
-                               flags=re.IGNORECASE)
-
-def create(config):
-    return _filter_function
-```
+!!! example
+    ``` python
+    import re
+
+    def _filter_function(obj):
+        if obj.place.country_code == 'us' \
+           and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+            for name in obj.names:
+                name.name = re.sub(r'^(north|south|west|east) ',
+                                   '',
+                                   name.name,
+                                   flags=re.IGNORECASE)
+
+    def create(config):
+        return _filter_function
+    ```
 
 This is the simplest form of a sanitizer module. It defines a single
 filter function and implements the required `create()` function by returning
@@ -102,58 +156,39 @@ the filter.
 
 The filter function first checks if the object is interesting for the
 sanitizer.
Namely it checks if the place is in the US (through `country_code`) and if the
 place is a street (a `rank_address` of 26 or 27). If the
-conditions are met, then it goes through all available names and replaces
-any removes any leading direction prefix using a simple regular expression.
+conditions are met, then it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
 
 Save the source code in a file in your project directory, for example as
 `us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
 
-```
+``` yaml
 ...
 sanitizers:
     - step: us_streets.py
 ...
 ```
 
-For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
-They can be found in the directory `nominatim/tokenizer/sanitizers`.
-
 !!! warning
     This example is just a simplified showcase on how to create a sanitizer.
-    It is not really read for real-world use: while the sanitizer would
-    correcly transform `West 5th Street` into `5th Street`. it would also
+    It is not really meant for real-world use: while the sanitizer would
+    correctly transform `West 5th Street` into `5th Street`, it would also
    shorten a simple `North Street` to `Street`.
 
-#### PlaceInfo - information about the place
-
-::: nominatim.data.place_info.PlaceInfo
-    rendering:
-        show_source: no
-        heading_level: 6
-
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`src/nominatim_db/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/src/nominatim_db/tokenizer/sanitizers).
 
-#### PlaceName - extended naming information
-
-::: nominatim.data.place_name.PlaceName
-    rendering:
-        show_source: no
-        heading_level: 6
 
 ## Custom token analysis module
 
-Setup of a token analyser is split into two parts: configuration and
-analyser factory.
A token analysis module must therefore implement two -functions: - -::: nominatim.tokenizer.token_analysis.base.AnalysisModule - rendering: - show_source: no +::: nominatim_db.tokenizer.token_analysis.base.AnalysisModule + options: heading_level: 6 -::: nominatim.tokenizer.token_analysis.base.Analyzer - rendering: - show_source: no +::: nominatim_db.tokenizer.token_analysis.base.Analyzer + options: heading_level: 6 ### Example: Creating acronym variants for long names
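The example section above is cut off in this diff. As a rough, hedged sketch of what such a module looks like: the `configure()`/`create()` entry points follow the `AnalysisModule` interface and the `get_canonical_id()`/`compute_variants()` methods follow the `Analyzer` interface referenced above, but the details of the official sample may differ, and the `min-name-length` option is invented for this sketch. The normalizer and transliterator arguments are assumed to be ICU objects exposing a `transliterate()` method.

``` python
# Sketch: add an acronym variant for long names, e.g.
# 'massachusetts institute of technology' -> 'miot'.

class AcronymVariants:
    def __init__(self, norm, trans, min_len):
        self.norm = norm      # assumed: object with transliterate()
        self.trans = trans    # assumed: object with transliterate()
        self.min_len = min_len

    def get_canonical_id(self, name):
        # Use the normalized form of the name as the canonical ID.
        return self.norm.transliterate(name.name).strip()

    def compute_variants(self, canonical_id):
        # The name itself is always a variant; long names additionally
        # get a first-letter acronym, if it is long enough to be useful.
        variants = [self.trans.transliterate(canonical_id)]
        if len(canonical_id) >= self.min_len:
            acronym = ''.join(word[0] for word in canonical_id.split())
            if len(acronym) > 2:
                variants.append(self.trans.transliterate(acronym))
        return variants

def configure(rules, normalizer, transliterator):
    # Interpret the custom yaml options; 'min-name-length' is a
    # made-up option name for this sketch.
    return {'min_len': rules.get('min-name-length', 20)}

def create(normalizer, transliterator, config):
    return AcronymVariants(normalizer, transliterator, config['min_len'])
```

As with sanitizers, the module would be referenced from `icu_tokenizer.yaml` by its import path or by the relative file name including the `.py` ending.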