X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/e7574f119eaab63723165f8139455d8af365a21e..ad214753fcdd790916aa1b3b35e679616f4eeb07:/docs/develop/ICU-Tokenizer-Modules.md

diff --git a/docs/develop/ICU-Tokenizer-Modules.md b/docs/develop/ICU-Tokenizer-Modules.md
index 2427ab11..f19002c2 100644
--- a/docs/develop/ICU-Tokenizer-Modules.md
+++ b/docs/develop/ICU-Tokenizer-Modules.md
@@ -14,10 +14,11 @@ of sanitizers and token analysis.
 implemented, it is not guaranteed to be stable at the moment.
 
 
-## Using non-standard sanitizers and token analyzers
+## Using non-standard modules
 
-Sanitizer names (in the `step` property) and token analysis names (in the
-`analyzer`) may refer to externally supplied modules. There are two ways
+Sanitizer names (in the `step` property), token analysis names (in the
+`analyzer`) and query preprocessor names (in the `step` property)
+may refer to externally supplied modules. There are two ways
 to include external modules: through a library or from the project directory.
 
 To include a module from a library, use the absolute import path as name and
@@ -27,6 +28,47 @@ To use a custom module without creating a library, you can put the module
 somewhere in your project directory and then use the relative path to the
 file. Include the whole name of the file including the `.py` ending.
 
+## Custom query preprocessors
+
+A query preprocessor must export a single factory function `create` with
+the following signature:
+
+``` python
+create(config: QueryConfig) -> Callable[[list[Phrase]], list[Phrase]]
+```
+
+The function receives the custom configuration for the preprocessor and
+returns a callable (function or class) with the actual preprocessing
+code. When a query comes in, the callable gets a list of phrases
+and needs to return the transformed list of phrases. The list and phrases
+may be changed in place or a completely new list may be generated.
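The contract described above can be sketched as a small self-contained module. This is only an illustration: the `Phrase` class is a minimal local stand-in with the two fields the docs describe (a real preprocessor would import `nominatim_api.search.Phrase` instead), and `strip-suffix` is a made-up configuration option, not one Nominatim defines.

``` python
from dataclasses import dataclass
from typing import Any, Callable, List

# Minimal stand-in for nominatim_api.search.Phrase, which per the docs
# has exactly two fields. A real module would import the original class.
@dataclass
class Phrase:
    ptype: Any
    text: str

def create(config: dict) -> Callable[[List[Phrase]], List[Phrase]]:
    """ Factory function. `config` is the QueryConfig dictionary with the
        options from icu_tokenizer.yaml; 'strip-suffix' is a hypothetical
        option invented for this sketch.
    """
    suffix = config.get('strip-suffix', '')

    def _process(phrases: List[Phrase]) -> List[Phrase]:
        # Build a completely new list here (modifying the phrases in
        # place would be equally valid). Phrases that end up empty are
        # dropped from the result.
        result = []
        for phrase in phrases:
            text = phrase.text
            if suffix and text.endswith(suffix):
                text = text[:-len(suffix)].rstrip(' ,')
            if text:
                result.append(Phrase(phrase.ptype, text))
        return result

    return _process
```

The callable returned by `create()` is what Nominatim then invokes with the phrase list of each incoming query.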
+
+The `QueryConfig` is a simple dictionary which contains all configuration
+options given in the yaml configuration of the ICU tokenizer. It is up to
+the function to interpret the values.
+
+A `nominatim_api.search.Phrase` describes a part of the query that contains
+one or more independent search terms. Breaking a query into phrases helps
+reduce the number of possible tokens Nominatim has to take into account.
+However, a phrase break is definitive: a multi-term search word cannot go
+over a phrase break.
+A Phrase object has two fields:
+
+ * `ptype` further refines the type of phrase (see list below)
+ * `text` contains the query text for the phrase
+
+The order of phrases matters to Nominatim when doing further processing.
+Thus, while you may split or join phrases, you should not reorder them
+unless you really know what you are doing.
+
+Phrase types (`nominatim_api.search.PhraseType`) can further help narrow
+down how the tokens in the phrase are interpreted. The following phrase
+types are known:
+
+::: nominatim_api.search.PhraseType
+    options:
+        heading_level: 6
+
+
 ## Custom sanitizer modules
 
 A sanitizer module must export a single factory function `create` with the
@@ -52,48 +94,60 @@ the function.
 
 ### Sanitizer configuration
 
-::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
-    rendering:
-        show_source: no
+::: nominatim_db.tokenizer.sanitizers.config.SanitizerConfig
+    options:
        heading_level: 6
 
-### The sanitation function
+### The main filter function of the sanitizer
 
-The sanitation function receives a single object of type `ProcessInfo`
+The filter function receives a single object of type `ProcessInfo`
 which has three members:
 
-   * `place`: read-only information about the place being processed.
+   * `place: PlaceInfo`: read-only information about the place being processed.
      See PlaceInfo below.
-   * `names`: The current list of names for the place. Each name is a
-     PlaceName object.
-   * `address`: The current list of address names for the place. Each name
-     is a PlaceName object.
+   * `names: List[PlaceName]`: The current list of names for the place.
+   * `address: List[PlaceName]`: The current list of address names for the place.
 
 While the `place` member is provided for information only, the `names` and
 `address` lists are meant to be manipulated by the sanitizer. It may add and
 remove entries, change information within a single entry (for example by
 adding extra attributes) or completely replace the list with a different one.
 
+#### PlaceInfo - information about the place
+
+::: nominatim_db.data.place_info.PlaceInfo
+    options:
+        heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim_db.data.place_name.PlaceName
+    options:
+        heading_level: 6
+
+
 ### Example: Filter for US street prefixes
 
 The following sanitizer removes the directional prefixes from street names
 in the US:
 
-``` python
-import re
-
-def _filter_function(obj):
-    if obj.place.country_code == 'us' \
-       and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
-        for name in obj.names:
-            name.name = re.sub(r'^(north|south|west|east) ',
-                               '',
-                               name.name,
-                               flags=re.IGNORECASE)
-
-def create(config):
-    return _filter_function
-```
+!!! example
+    ``` python
+    import re
+
+    def _filter_function(obj):
+        if obj.place.country_code == 'us' \
+           and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+            for name in obj.names:
+                name.name = re.sub(r'^(north|south|west|east) ',
+                                   '',
+                                   name.name,
+                                   flags=re.IGNORECASE)
+
+    def create(config):
+        return _filter_function
+    ```
 
 This is the simplest form of a sanitizer module. It defines a single
 filter function and implements the required `create()` function by returning
@@ -102,58 +156,39 @@ the filter.
 
 The filter function first checks if the object is interesting for the
 sanitizer.
Namely it checks if the place is in the US (through `country_code`) and if the
 place is a street (a `rank_address` of 26 or 27). If the
-conditions are met, then it goes through all available names and replaces
-any removes any leading direction prefix using a simple regular expression.
+conditions are met, then it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
 
 Save the source code in a file in your project directory, for example as
 `us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
 
-```
+``` yaml
 ...
 sanitizers:
     - step: us_streets.py
 ...
 ```
 
-For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
-They can be found in the directory `nominatim/tokenizer/sanitizers`.
-
 !!! warning
     This example is just a simplified showcase on how to create a sanitizer.
-    It is not really read for real-world use: while the sanitizer would
-    correcly transform `West 5th Street` into `5th Street`. it would also
+    It is not really meant for real-world use: while the sanitizer would
+    correctly transform `West 5th Street` into `5th Street`, it would also
    shorten a simple `North Street` to `Street`.
 
-#### PlaceInfo - information about the place
-
-::: nominatim.data.place_info.PlaceInfo
-    rendering:
-        show_source: no
-        heading_level: 6
-
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`src/nominatim_db/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/src/nominatim_db/tokenizer/sanitizers).
 
-#### PlaceName - extended naming information
-
-::: nominatim.data.place_name.PlaceName
-    rendering:
-        show_source: no
-        heading_level: 6
 
 ## Custom token analysis module
 
-Setup of a token analyser is split into two parts: configuration and
-analyser factory.
A token analysis module must therefore implement two -functions: - -::: nominatim.tokenizer.token_analysis.base.AnalysisModule - rendering: - show_source: no +::: nominatim_db.tokenizer.token_analysis.base.AnalysisModule + options: heading_level: 6 -::: nominatim.tokenizer.token_analysis.base.Analyzer - rendering: - show_source: no +::: nominatim_db.tokenizer.token_analysis.base.Analyzer + options: heading_level: 6 ### Example: Creating acronym variants for long names
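The example section above is cut off in this diff. As a rough, hedged sketch of what such a module looks like: the `configure()`/`create()` entry points follow the `AnalysisModule` interface and the `get_canonical_id()`/`compute_variants()` methods follow the `Analyzer` interface referenced above, but the details of the official sample may differ, and the `min-name-length` option is invented for this sketch. The normalizer and transliterator arguments are assumed to be ICU objects exposing a `transliterate()` method.

``` python
# Sketch: add an acronym variant for long names, e.g.
# 'massachusetts institute of technology' -> 'miot'.

class AcronymVariants:
    def __init__(self, norm, trans, min_len):
        self.norm = norm      # assumed: object with transliterate()
        self.trans = trans    # assumed: object with transliterate()
        self.min_len = min_len

    def get_canonical_id(self, name):
        # Use the normalized form of the name as the canonical ID.
        return self.norm.transliterate(name.name).strip()

    def compute_variants(self, canonical_id):
        # The name itself is always a variant; long names additionally
        # get a first-letter acronym, if it is long enough to be useful.
        variants = [self.trans.transliterate(canonical_id)]
        if len(canonical_id) >= self.min_len:
            acronym = ''.join(word[0] for word in canonical_id.split())
            if len(acronym) > 2:
                variants.append(self.trans.transliterate(acronym))
        return variants

def configure(rules, normalizer, transliterator):
    # Interpret the custom yaml options; 'min-name-length' is a
    # made-up option name for this sketch.
    return {'min_len': rules.get('min-name-length', 20)}

def create(normalizer, transliterator, config):
    return AcronymVariants(normalizer, transliterator, config['min_len'])
```

As with sanitizers, the module would be referenced from `icu_tokenizer.yaml` by its import path or by the relative file name including the `.py` ending.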