From: Sarah Hoffmann
Date: Sun, 31 Jul 2022 17:15:50 +0000 (+0200)
Subject: Merge pull request #2784 from lonvia/doscs-customizing-icu-tokenizer
X-Git-Tag: v4.1.0~4
X-Git-Url: https://git.openstreetmap.org./nominatim.git/commitdiff_plain/e427712cb04baf001d41e34af46bb9fd083202a1?hp=a8b037669ac8a9f52ad0091b83ae4f7f9b78b28e

Merge pull request #2784 from lonvia/doscs-customizing-icu-tokenizer

Document the public API of sanitizers and token analysis modules
---

diff --git a/docs/develop/Development-Environment.md b/docs/develop/Development-Environment.md
index 6bb33f00..58f802f1 100644
--- a/docs/develop/Development-Environment.md
+++ b/docs/develop/Development-Environment.md
@@ -40,7 +40,8 @@ It has the following additional requirements:
 The documentation is built with mkdocs:
 
 * [mkdocs](https://www.mkdocs.org/) >= 1.1.2
-* [mkdocstrings](https://mkdocstrings.github.io/)
+* [mkdocstrings](https://mkdocstrings.github.io/) >= 0.16
+* [mkdocstrings-python-legacy](https://mkdocstrings.github.io/python-legacy/)
 
 ### Installing prerequisites on Ubuntu/Debian
 
diff --git a/docs/develop/ICU-Tokenizer-Modules.md b/docs/develop/ICU-Tokenizer-Modules.md
new file mode 100644
index 00000000..2cf30a56
--- /dev/null
+++ b/docs/develop/ICU-Tokenizer-Modules.md
@@ -0,0 +1,227 @@
+# Writing custom sanitizer and token analysis modules for the ICU tokenizer
+
+The [ICU tokenizer](../customize/Tokenizers.md#icu-tokenizer) provides a
+highly customizable method to pre-process and normalize the name information
+of the input data before it is added to the search index. It comes with a
+selection of sanitizers and token analyzers which you can use to adapt your
+installation to your needs. If the provided modules are not enough, you can
+also provide your own implementations. This section describes the API
+of sanitizers and token analysis modules.
+
+!!! warning
+    This API is currently in early alpha status. While this API is meant to
+    be a public API on which other sanitizers and token analyzers may be
+    implemented, it is not guaranteed to be stable at the moment.
+
+
+## Using non-standard sanitizers and token analyzers
+
+Sanitizer names (in the `step` property) and token analysis names (in the
+`analyzer` property) may refer to externally supplied modules. There are two
+ways to include external modules: through a library or from the project
+directory.
+
+To include a module from a library, use the absolute import path as the name
+and make sure the library can be found in your PYTHONPATH.
+
+To use a custom module without creating a library, you can put the module
+somewhere in your project directory and then use the relative path to the
+file. Include the whole name of the file including the `.py` ending.
+
+## Custom sanitizer modules
+
+A sanitizer module must export a single factory function `create` with the
+following signature:
+
+``` python
+def create(config: SanitizerConfig) -> Callable[[ProcessInfo], None]
+```
+
+The function receives the custom configuration for the sanitizer and must
+return a callable (function or class) that transforms the name and address
+terms of a place. When a place is processed, a `ProcessInfo` object
+is created from the information that was queried from the database. This
+object is sequentially handed to each configured sanitizer, so that each
+sanitizer receives the result of processing from the previous sanitizer.
+After the last sanitizer is finished, the resulting name and address lists
+are forwarded to the token analysis module.
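+
+The following sketch only illustrates the shape of this interface; it is not
+a sanitizer shipped with Nominatim and the names used here are made up for
+the example:
+
+``` python
+def create(config):
+    # 'config' is the SanitizerConfig with any extra parameters given
+    # in the 'sanitizers' section of icu_tokenizer.yaml.
+    def _process(obj):
+        # 'obj' is the ProcessInfo of the place currently being processed.
+        # A real sanitizer would modify obj.names and obj.address in place.
+        pass
+
+    return _process
+```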
+
+Sanitizer functions are instantiated once and then called for each place
+that is imported or updated. They don't need to be thread-safe.
+If multi-threading is used, each thread creates its own instance of
+the function.
+
+### Sanitizer configuration
+
+::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
+    rendering:
+        show_source: no
+        heading_level: 6
+
+### The main filter function of the sanitizer
+
+The filter function receives a single object of type `ProcessInfo`
+which has three members:
+
+ * `place`: read-only information about the place being processed.
+   See PlaceInfo below.
+ * `names`: The current list of names for the place. Each name is a
+   PlaceName object.
+ * `address`: The current list of address names for the place. Each name
+   is a PlaceName object.
+
+While the `place` member is provided for information only, the `names` and
+`address` lists are meant to be manipulated by the sanitizer. It may add and
+remove entries, change information within a single entry (for example by
+adding extra attributes) or completely replace the list with a different one.
+
+#### PlaceInfo - information about the place
+
+::: nominatim.data.place_info.PlaceInfo
+    rendering:
+        show_source: no
+        heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim.data.place_name.PlaceName
+    rendering:
+        show_source: no
+        heading_level: 6
+
+
+### Example: Filter for US street prefixes
+
+The following sanitizer removes the directional prefixes from street names
+in the US:
+
+``` python
+import re
+
+def _filter_function(obj):
+    if obj.place.country_code == 'us' \
+       and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+        for name in obj.names:
+            name.name = re.sub(r'^(north|south|west|east) ',
+                               '',
+                               name.name,
+                               flags=re.IGNORECASE)
+
+def create(config):
+    return _filter_function
+```
+
+This is the simplest form of a sanitizer module. It defines a single
+filter function and implements the required `create()` function by returning
+the filter.
+
+The filter function first checks if the object is interesting for the
+sanitizer. Namely, it checks if the place is in the US (through `country_code`)
+and if the place is a street (a `rank_address` of 26 or 27). If the
+conditions are met, it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
+
+Save the source code in a file in your project directory, for example as
+`us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
+
+``` yaml
+...
+sanitizers:
+    - step: us_streets.py
+...
+```
+
+!!! warning
+    This example is just a simplified showcase of how to create a sanitizer.
+    It is not really ready for real-world use: while the sanitizer would
+    correctly transform `West 5th Street` into `5th Street`, it would also
+    shorten a simple `North Street` to `Street`.
+
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
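+
+A sanitizer can also read its behaviour from the configuration. The
+following sketch is not part of Nominatim; the module name
+`strip_prefixes.py` and the `prefixes` parameter are made up for the
+illustration. It shows how the `SanitizerConfig` accessors described above
+may be used to make the prefix list and the affected name kinds configurable:
+
+``` python
+import re
+
+class _PrefixStripper:
+
+    def __init__(self, config):
+        # 'prefixes' is a custom parameter of this example sanitizer.
+        prefixes = config.get_string_list('prefixes', ['north', 'south'])
+        self.regex = re.compile('^(' + '|'.join(re.escape(p) for p in prefixes) + ') ',
+                                re.IGNORECASE)
+        # Restrict the sanitizer to the name kinds given in 'filter-kind'.
+        # Without the parameter, all names pass the filter.
+        self.filter_kind = config.get_filter_kind()
+
+    def __call__(self, obj):
+        for name in obj.names:
+            if self.filter_kind(name.kind):
+                name.name = self.regex.sub('', name.name)
+
+def create(config):
+    return _PrefixStripper(config)
+```
+
+The corresponding entry in `icu_tokenizer.yaml` would then carry the extra
+parameters next to the `step` property:
+
+``` yaml
+...
+sanitizers:
+    - step: strip_prefixes.py
+      prefixes: [north, south, east, west]
+      filter-kind: [name, alt_name]
+...
+```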
+
+
+## Custom token analysis module
+
+::: nominatim.tokenizer.token_analysis.base.AnalysisModule
+    rendering:
+        show_source: no
+        heading_level: 6
+
+
+::: nominatim.tokenizer.token_analysis.base.Analyzer
+    rendering:
+        show_source: no
+        heading_level: 6
+
+### Example: Creating acronym variants for long names
+
+The following example of a token analysis module creates acronyms from
+very long names and adds them as variants:
+
+``` python
+class AcronymMaker:
+    """ This class is the actual analyzer.
+    """
+    def __init__(self, norm, trans):
+        self.norm = norm
+        self.trans = trans
+
+
+    def get_canonical_id(self, name):
+        # In simple cases, the normalized name can be used as a canonical id.
+        return self.norm.transliterate(name.name).strip()
+
+
+    def compute_variants(self, name):
+        # The transliterated form of the name always makes up a variant.
+        variants = [self.trans.transliterate(name)]
+
+        # Only create acronyms from very long words.
+        if len(name) > 20:
+            # Take the first letter from each word to form the acronym.
+            acronym = ''.join(w[0] for w in name.split())
+            # If that leads to an acronym with at least three letters,
+            # add the resulting acronym as a variant.
+            if len(acronym) > 2:
+                # Never forget to transliterate the variants before returning them.
+                variants.append(self.trans.transliterate(acronym))
+
+        return variants
+
+# The following two functions are the module interface.
+
+def configure(rules, normalizer, transliterator):
+    # There is no configuration to parse and no data to set up.
+    # Just return an empty configuration.
+    return None
+
+
+def create(normalizer, transliterator, config):
+    # Return a new instance of our token analysis class above.
+    return AcronymMaker(normalizer, transliterator)
+```
+
+Given the name `Trans-Siberian Railway`, the code above would return the full
+name `Trans-Siberian Railway` and the acronym `TSR` as variants, so that
+searching would work for both.
+
+## Sanitizers vs. Token analysis - what to use for variants?
+
+It is not always clear when to implement variations in the sanitizer and
+when to write a token analysis module. Just take the acronym example
+above: it would also have been possible to write a sanitizer which adds the
+acronym as an additional name to the name list. The result would have been
+similar. So which should be used when?
+
+The most important thing to keep in mind is that variants created by the
+token analysis are only saved in the word lookup table. They do not need
+extra space in the search index. If there are many spelling variations, this
+can mean quite a significant amount of space is saved.
+
+When creating additional names with a sanitizer, these names are completely
+independent. In particular, they can be fed into different token analysis
+modules. This gives much greater flexibility but at the price that the
+additional names increase the size of the search index.
+
diff --git a/docs/extra.css b/docs/extra.css
index 9289c1d3..3aecf2ef 100644
--- a/docs/extra.css
+++ b/docs/extra.css
@@ -14,10 +14,11 @@ th {
     background-color: #eee;
 }
 
-/* Indentation for mkdocstrings.
-div.doc-contents:not(.first) { - padding-left: 25px; - border-left: 4px solid rgba(230, 230, 230); - margin-bottom: 60px; -}*/ +.doc-object h6 { + margin-bottom: 0.8em; + font-size: 120%; +} +.doc-object { + margin-bottom: 1.3em; +} diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 48fe1d0d..43bb533d 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -39,6 +39,7 @@ nav: - 'Database Layout' : 'develop/Database-Layout.md' - 'Indexing' : 'develop/Indexing.md' - 'Tokenizers' : 'develop/Tokenizers.md' + - 'Custom modules for ICU tokenizer': 'develop/ICU-Tokenizer-Modules.md' - 'Setup for Development' : 'develop/Development-Environment.md' - 'Testing' : 'develop/Testing.md' - 'External Data Sources': 'develop/data-sources.md' @@ -58,7 +59,7 @@ plugins: - search - mkdocstrings: handlers: - python: + python-legacy: rendering: show_source: false show_signature_annotations: false diff --git a/nominatim/data/place_info.py b/nominatim/data/place_info.py index 96912a61..ab895352 100644 --- a/nominatim/data/place_info.py +++ b/nominatim/data/place_info.py @@ -11,8 +11,8 @@ the tokenizer. from typing import Optional, Mapping, Any class PlaceInfo: - """ Data class containing all information the tokenizer gets about a - place it should process the names for. + """ This data class contains all information the tokenizer can access + about a place. """ def __init__(self, info: Mapping[str, Any]) -> None: @@ -21,16 +21,25 @@ class PlaceInfo: @property def name(self) -> Optional[Mapping[str, str]]: - """ A dictionary with the names of the place or None if the place - has no names. + """ A dictionary with the names of the place. Keys and values represent + the full key and value of the corresponding OSM tag. Which tags + are saved as names is determined by the import style. + The property may be None if the place has no names. """ return self._info.get('name') @property def address(self) -> Optional[Mapping[str, str]]: - """ A dictionary with the address elements of the place - or None if no address information is available. + """ A dictionary with the address elements of the place. They key + usually corresponds to the suffix part of the key of an OSM + 'addr:*' or 'isin:*' tag. There are also some special keys like + `country` or `country_code` which merge OSM keys that contain + the same information. See [Import Styles][1] for details. + + The property may be None if the place has no address information. + + [1]: ../customize/Import-Styles.md """ return self._info.get('address') @@ -38,28 +47,30 @@ class PlaceInfo: @property def country_code(self) -> Optional[str]: """ The country code of the country the place is in. Guaranteed - to be a two-letter lower-case string or None, if no country - could be found. + to be a two-letter lower-case string. If the place is not inside + any country, the property is set to None. """ return self._info.get('country_code') @property def rank_address(self) -> int: - """ The computed rank address before rank correction. + """ The [rank address][1] before ant rank correction is applied. + + [1]: ../customize/Ranking.md#address-rank """ return self._info.get('rank_address', 0) def is_a(self, key: str, value: str) -> bool: - """ Check if the place's primary tag corresponds to the given + """ Set to True when the place's primary tag corresponds to the given key and value. """ return self._info.get('class') == key and self._info.get('type') == value def is_country(self) -> bool: - """ Check if the place is a valid country boundary. 
+ """ Set to True when the place is a valid country boundary. """ return self.rank_address == 4 \ and self.is_a('boundary', 'administrative') \ diff --git a/nominatim/data/place_name.py b/nominatim/data/place_name.py new file mode 100644 index 00000000..f4c5e0fa --- /dev/null +++ b/nominatim/data/place_name.py @@ -0,0 +1,78 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# This file is part of Nominatim. (https://nominatim.org) +# +# Copyright (C) 2022 by the Nominatim developer community. +# For a full list of authors see the git log. +""" +Data class for a single name of a place. +""" +from typing import Optional, Dict, Mapping + +class PlaceName: + """ Each name and address part of a place is encapsulated in an object of + this class. It saves not only the name proper but also describes the + kind of name with two properties: + + * `kind` describes the name of the OSM key used without any suffixes + (i.e. the part after the colon removed) + * `suffix` contains the suffix of the OSM tag, if any. The suffix + is the part of the key after the first colon. + + In addition to that, a name may have arbitrary additional attributes. + How attributes are used, depends on the sanitizers and token analysers. + The exception is is the 'analyzer' attribute. This attribute determines + which token analysis module will be used to finalize the treatment of + names. + """ + + def __init__(self, name: str, kind: str, suffix: Optional[str]): + self.name = name + self.kind = kind + self.suffix = suffix + self.attr: Dict[str, str] = {} + + + def __repr__(self) -> str: + return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')" + + + def clone(self, name: Optional[str] = None, + kind: Optional[str] = None, + suffix: Optional[str] = None, + attr: Optional[Mapping[str, str]] = None) -> 'PlaceName': + """ Create a deep copy of the place name, optionally with the + given parameters replaced. In the attribute list only the given + keys are updated. The list is not replaced completely. + In particular, the function cannot to be used to remove an + attribute from a place name. + """ + newobj = PlaceName(name or self.name, + kind or self.kind, + suffix or self.suffix) + + newobj.attr.update(self.attr) + if attr: + newobj.attr.update(attr) + + return newobj + + + def set_attr(self, key: str, value: str) -> None: + """ Add the given property to the name. If the property was already + set, then the value is overwritten. + """ + self.attr[key] = value + + + def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]: + """ Return the given property or the value of 'default' if it + is not set. + """ + return self.attr.get(key, default) + + + def has_attr(self, key: str) -> bool: + """ Check if the given attribute is set. 
+ """ + return key in self.attr diff --git a/nominatim/tokenizer/icu_rule_loader.py b/nominatim/tokenizer/icu_rule_loader.py index f461a1f1..4c36282c 100644 --- a/nominatim/tokenizer/icu_rule_loader.py +++ b/nominatim/tokenizer/icu_rule_loader.py @@ -12,13 +12,15 @@ import io import json import logging +from icu import Transliterator + from nominatim.config import flatten_config_list, Configuration from nominatim.db.properties import set_property, get_property from nominatim.db.connection import Connection from nominatim.errors import UsageError from nominatim.tokenizer.place_sanitizer import PlaceSanitizer from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis -from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyser +from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyzer import nominatim.data.country_info LOG = logging.getLogger() @@ -135,6 +137,11 @@ class ICURuleLoader: if not isinstance(self.analysis_rules, list): raise UsageError("Configuration section 'token-analysis' must be a list.") + norm = Transliterator.createFromRules("rule_loader_normalization", + self.normalization_rules) + trans = Transliterator.createFromRules("rule_loader_transliteration", + self.transliteration_rules) + for section in self.analysis_rules: name = section.get('id', None) if name in self.analysis: @@ -144,8 +151,7 @@ class ICURuleLoader: LOG.fatal("ICU tokenizer configuration has two token " "analyzers with id '%s'.", name) raise UsageError("Syntax error in ICU tokenizer config.") - self.analysis[name] = TokenAnalyzerRule(section, - self.normalization_rules, + self.analysis[name] = TokenAnalyzerRule(section, norm, trans, self.config) @@ -170,7 +176,8 @@ class TokenAnalyzerRule: and creates a new token analyzer on request. """ - def __init__(self, rules: Mapping[str, Any], normalization_rules: str, + def __init__(self, rules: Mapping[str, Any], + normalizer: Any, transliterator: Any, config: Configuration) -> None: analyzer_name = _get_section(rules, 'analyzer') if not analyzer_name or not isinstance(analyzer_name, str): @@ -179,10 +186,11 @@ class TokenAnalyzerRule: self._analysis_mod: AnalysisModule = \ config.load_plugin_module(analyzer_name, 'nominatim.tokenizer.token_analysis') - self.config = self._analysis_mod.configure(rules, normalization_rules) + self.config = self._analysis_mod.configure(rules, normalizer, + transliterator) - def create(self, normalizer: Any, transliterator: Any) -> Analyser: + def create(self, normalizer: Any, transliterator: Any) -> Analyzer: """ Create a new analyser instance for the given rule. """ return self._analysis_mod.create(normalizer, transliterator, self.config) diff --git a/nominatim/tokenizer/icu_token_analysis.py b/nominatim/tokenizer/icu_token_analysis.py index 3c4d7298..7ea31e8e 100644 --- a/nominatim/tokenizer/icu_token_analysis.py +++ b/nominatim/tokenizer/icu_token_analysis.py @@ -11,7 +11,7 @@ into a Nominatim token. from typing import Mapping, Optional, TYPE_CHECKING from icu import Transliterator -from nominatim.tokenizer.token_analysis.base import Analyser +from nominatim.tokenizer.token_analysis.base import Analyzer if TYPE_CHECKING: from typing import Any @@ -19,7 +19,7 @@ if TYPE_CHECKING: class ICUTokenAnalysis: """ Container class collecting the transliterators and token analysis - modules for a single NameAnalyser instance. + modules for a single Analyser instance. 
""" def __init__(self, norm_rules: str, trans_rules: str, @@ -36,7 +36,7 @@ class ICUTokenAnalysis: for name, arules in analysis_rules.items()} - def get_analyzer(self, name: Optional[str]) -> Analyser: + def get_analyzer(self, name: Optional[str]) -> Analyzer: """ Return the given named analyzer. If no analyzer with that name exists, return the default analyzer. """ diff --git a/nominatim/tokenizer/icu_tokenizer.py b/nominatim/tokenizer/icu_tokenizer.py index 83013755..319838a1 100644 --- a/nominatim/tokenizer/icu_tokenizer.py +++ b/nominatim/tokenizer/icu_tokenizer.py @@ -23,7 +23,7 @@ from nominatim.db.sql_preprocessor import SQLPreprocessor from nominatim.data.place_info import PlaceInfo from nominatim.tokenizer.icu_rule_loader import ICURuleLoader from nominatim.tokenizer.place_sanitizer import PlaceSanitizer -from nominatim.tokenizer.sanitizers.base import PlaceName +from nominatim.data.place_name import PlaceName from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis from nominatim.tokenizer.base import AbstractAnalyzer, AbstractTokenizer @@ -324,7 +324,7 @@ class ICUNameAnalyzer(AbstractAnalyzer): postcode_name = place.name.strip().upper() variant_base = None else: - postcode_name = analyzer.normalize(place.name) + postcode_name = analyzer.get_canonical_id(place) variant_base = place.get_attr("variant") if variant_base: @@ -359,7 +359,7 @@ class ICUNameAnalyzer(AbstractAnalyzer): if analyzer is None: variants = [term] else: - variants = analyzer.get_variants_ascii(variant) + variants = analyzer.compute_variants(variant) if term not in variants: variants.append(term) else: @@ -573,17 +573,17 @@ class ICUNameAnalyzer(AbstractAnalyzer): # Otherwise use the analyzer to determine the canonical name. # Per convention we use the first variant as the 'lookup name', the # name that gets saved in the housenumber field of the place. 
- norm_name = analyzer.normalize(hnr.name) - if norm_name: - result = self._cache.housenumbers.get(norm_name, result) + word_id = analyzer.get_canonical_id(hnr) + if word_id: + result = self._cache.housenumbers.get(word_id, result) if result[0] is None: - variants = analyzer.get_variants_ascii(norm_name) + variants = analyzer.compute_variants(word_id) if variants: with self.conn.cursor() as cur: cur.execute("SELECT create_analyzed_hnr_id(%s, %s)", - (norm_name, list(variants))) + (word_id, list(variants))) result = cur.fetchone()[0], variants[0] # type: ignore[no-untyped-call] - self._cache.housenumbers[norm_name] = result + self._cache.housenumbers[word_id] = result return result @@ -650,15 +650,15 @@ class ICUNameAnalyzer(AbstractAnalyzer): for name in names: analyzer_id = name.get_attr('analyzer') analyzer = self.token_analysis.get_analyzer(analyzer_id) - norm_name = analyzer.normalize(name.name) + word_id = analyzer.get_canonical_id(name) if analyzer_id is None: - token_id = norm_name + token_id = word_id else: - token_id = f'{norm_name}@{analyzer_id}' + token_id = f'{word_id}@{analyzer_id}' full, part = self._cache.names.get(token_id, (None, None)) if full is None: - variants = analyzer.get_variants_ascii(norm_name) + variants = analyzer.compute_variants(word_id) if not variants: continue @@ -688,7 +688,7 @@ class ICUNameAnalyzer(AbstractAnalyzer): postcode_name = item.name.strip().upper() variant_base = None else: - postcode_name = analyzer.normalize(item.name) + postcode_name = analyzer.get_canonical_id(item) variant_base = item.get_attr("variant") if variant_base: @@ -703,7 +703,7 @@ class ICUNameAnalyzer(AbstractAnalyzer): variants = {term} if analyzer is not None and variant_base: - variants.update(analyzer.get_variants_ascii(variant_base)) + variants.update(analyzer.compute_variants(variant_base)) with self.conn.cursor() as cur: cur.execute("SELECT create_postcode_word(%s, %s)", diff --git a/nominatim/tokenizer/place_sanitizer.py b/nominatim/tokenizer/place_sanitizer.py index c7dfd1ba..2f76fe34 100644 --- a/nominatim/tokenizer/place_sanitizer.py +++ b/nominatim/tokenizer/place_sanitizer.py @@ -13,7 +13,8 @@ from typing import Optional, List, Mapping, Sequence, Callable, Any, Tuple from nominatim.errors import UsageError from nominatim.config import Configuration from nominatim.tokenizer.sanitizers.config import SanitizerConfig -from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo, PlaceName +from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo +from nominatim.data.place_name import PlaceName from nominatim.data.place_info import PlaceInfo diff --git a/nominatim/tokenizer/sanitizers/base.py b/nominatim/tokenizer/sanitizers/base.py index 692c6d5f..2de868c7 100644 --- a/nominatim/tokenizer/sanitizers/base.py +++ b/nominatim/tokenizer/sanitizers/base.py @@ -7,74 +7,13 @@ """ Common data types and protocols for sanitizers. """ -from typing import Optional, Dict, List, Mapping, Callable +from typing import Optional, List, Mapping, Callable from nominatim.tokenizer.sanitizers.config import SanitizerConfig from nominatim.data.place_info import PlaceInfo +from nominatim.data.place_name import PlaceName from nominatim.typing import Protocol, Final -class PlaceName: - """ A searchable name for a place together with properties. - Every name object saves the name proper and two basic properties: - * 'kind' describes the name of the OSM key used without any suffixes - (i.e. 
the part after the colon removed) - * 'suffix' contains the suffix of the OSM tag, if any. The suffix - is the part of the key after the first colon. - In addition to that, the name may have arbitrary additional attributes. - Which attributes are used, depends on the token analyser. - """ - - def __init__(self, name: str, kind: str, suffix: Optional[str]): - self.name = name - self.kind = kind - self.suffix = suffix - self.attr: Dict[str, str] = {} - - - def __repr__(self) -> str: - return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')" - - - def clone(self, name: Optional[str] = None, - kind: Optional[str] = None, - suffix: Optional[str] = None, - attr: Optional[Mapping[str, str]] = None) -> 'PlaceName': - """ Create a deep copy of the place name, optionally with the - given parameters replaced. In the attribute list only the given - keys are updated. The list is not replaced completely. - In particular, the function cannot to be used to remove an - attribute from a place name. - """ - newobj = PlaceName(name or self.name, - kind or self.kind, - suffix or self.suffix) - - newobj.attr.update(self.attr) - if attr: - newobj.attr.update(attr) - - return newobj - - - def set_attr(self, key: str, value: str) -> None: - """ Add the given property to the name. If the property was already - set, then the value is overwritten. - """ - self.attr[key] = value - - - def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]: - """ Return the given property or the value of 'default' if it - is not set. - """ - return self.attr.get(key, default) - - - def has_attr(self, key: str) -> bool: - """ Check if the given attribute is set. - """ - return key in self.attr - class ProcessInfo: """ Container class for information handed into to handler functions. @@ -113,7 +52,13 @@ class SanitizerHandler(Protocol): def create(self, config: SanitizerConfig) -> Callable[[ProcessInfo], None]: """ - A sanitizer must define a single function `create`. It takes the - dictionary with the configuration information for the sanitizer and - returns a function that transforms name and address. + Create a function for sanitizing a place. + + Arguments: + config: A dictionary with the additional configuration options + specified in the tokenizer configuration + + Return: + The result must be a callable that takes a place description + and transforms name and address as reuqired. """ diff --git a/nominatim/tokenizer/sanitizers/clean_housenumbers.py b/nominatim/tokenizer/sanitizers/clean_housenumbers.py index 5df057d0..417d68d2 100644 --- a/nominatim/tokenizer/sanitizers/clean_housenumbers.py +++ b/nominatim/tokenizer/sanitizers/clean_housenumbers.py @@ -27,7 +27,8 @@ Arguments: from typing import Callable, Iterator, List import re -from nominatim.tokenizer.sanitizers.base import ProcessInfo, PlaceName +from nominatim.tokenizer.sanitizers.base import ProcessInfo +from nominatim.data.place_name import PlaceName from nominatim.tokenizer.sanitizers.config import SanitizerConfig class _HousenumberSanitizer: diff --git a/nominatim/tokenizer/sanitizers/config.py b/nominatim/tokenizer/sanitizers/config.py index f6abf20c..8b9164c6 100644 --- a/nominatim/tokenizer/sanitizers/config.py +++ b/nominatim/tokenizer/sanitizers/config.py @@ -21,19 +21,25 @@ else: _BaseUserDict = UserDict class SanitizerConfig(_BaseUserDict): - """ Dictionary with configuration options for a sanitizer. 
- - In addition to the usual dictionary function, the class provides - accessors to standard sanatizer options that are used by many of the + """ The `SanitizerConfig` class is a read-only dictionary + with configuration options for the sanitizer. + In addition to the usual dictionary functions, the class provides + accessors to standard sanitizer options that are used by many of the sanitizers. """ def get_string_list(self, param: str, default: Sequence[str] = tuple()) -> Sequence[str]: """ Extract a configuration parameter as a string list. - If the parameter value is a simple string, it is returned as a - one-item list. If the parameter value does not exist, the given - default is returned. If the parameter value is a list, it is checked - to contain only strings before being returned. + + Arguments: + param: Name of the configuration parameter. + default: Value to return, when the parameter is missing. + + Returns: + If the parameter value is a simple string, it is returned as a + one-item list. If the parameter value does not exist, the given + default is returned. If the parameter value is a list, it is + checked to contain only strings before being returned. """ values = self.data.get(param, None) @@ -54,9 +60,16 @@ class SanitizerConfig(_BaseUserDict): def get_bool(self, param: str, default: Optional[bool] = None) -> bool: """ Extract a configuration parameter as a boolean. - The parameter must be one of the yaml boolean values or an - user error will be raised. If `default` is given, then the parameter - may also be missing or empty. + + Arguments: + param: Name of the configuration parameter. The parameter must + contain one of the yaml boolean values or an + UsageError will be raised. + default: Value to return, when the parameter is missing. + When set to `None`, the parameter must be defined. + + Returns: + Boolean value of the given parameter. """ value = self.data.get(param, default) @@ -67,15 +80,20 @@ class SanitizerConfig(_BaseUserDict): def get_delimiter(self, default: str = ',;') -> Pattern[str]: - """ Return the 'delimiter' parameter in the configuration as a - compiled regular expression that can be used to split the names on the - delimiters. The regular expression makes sure that the resulting names - are stripped and that repeated delimiters - are ignored but it will still create empty fields on occasion. The - code needs to filter those. - - The 'default' parameter defines the delimiter set to be used when - not explicitly configured. + """ Return the 'delimiters' parameter in the configuration as a + compiled regular expression that can be used to split strings on + these delimiters. + + Arguments: + default: Delimiters to be used when 'delimiters' parameter + is not explicitly configured. + + Returns: + A regular expression pattern which can be used to + split a string. The regular expression makes sure that the + resulting names are stripped and that repeated delimiters + are ignored. It may still create empty fields on occasion. The + code needs to filter those. """ delimiter_set = set(self.data.get('delimiters', default)) if not delimiter_set: @@ -86,13 +104,22 @@ class SanitizerConfig(_BaseUserDict): def get_filter_kind(self, *default: str) -> Callable[[str], bool]: """ Return a filter function for the name kind from the 'filter-kind' - config parameter. The filter functions takes a name item and returns - True when the item passes the filter. + config parameter. - If the parameter is empty, the filter lets all items pass. 
If the - parameter is a string, it is interpreted as a single regular expression - that must match the full kind string. If the parameter is a list then + If the 'filter-kind' parameter is empty, the filter lets all items + pass. If the parameter is a string, it is interpreted as a single + regular expression that must match the full kind string. + If the parameter is a list then any of the regular expressions in the list must match to pass. + + Arguments: + default: Filters to be used, when the 'filter-kind' parameter + is not specified. If omitted then the default is to + let all names pass. + + Returns: + A filter function which takes a name string and returns + True when the item passes the filter. """ filters = self.get_string_list('filter-kind', default) diff --git a/nominatim/tokenizer/token_analysis/base.py b/nominatim/tokenizer/token_analysis/base.py index b2a4386c..68046f96 100644 --- a/nominatim/tokenizer/token_analysis/base.py +++ b/nominatim/tokenizer/token_analysis/base.py @@ -10,33 +10,87 @@ Common data types and protocols for analysers. from typing import Mapping, List, Any from nominatim.typing import Protocol +from nominatim.data.place_name import PlaceName -class Analyser(Protocol): - """ Instance of the token analyser. +class Analyzer(Protocol): + """ The `create()` function of an analysis module needs to return an + object that implements the following functions. """ - def normalize(self, name: str) -> str: - """ Return the normalized form of the name. This is the standard form - from which possible variants for the name can be derived. + def get_canonical_id(self, name: PlaceName) -> str: + """ Return the canonical form of the given name. The canonical ID must + be unique (the same ID must always yield the same variants) and + must be a form from which the variants can be derived. + + Arguments: + name: Extended place name description as prepared by + the sanitizers. + + Returns: + ID string with a canonical form of the name. The string may + be empty, when the analyzer cannot analyze the name at all, + for example because the character set in use does not match. """ - def get_variants_ascii(self, norm_name: str) -> List[str]: - """ Compute the spelling variants for the given normalized name - and transliterate the result. + def compute_variants(self, canonical_id: str) -> List[str]: + """ Compute the transliterated spelling variants for the given + canonical ID. + + Arguments: + canonical_id: ID string previously computed with + `get_canonical_id()`. + + Returns: + A list of possible spelling variants. All strings must have + been transformed with the global normalizer and + transliterator ICU rules. Otherwise they cannot be matched + against the input by the query frontend. + The list may be empty, when there are no useful + spelling variants. This may happen when an analyzer only + usually outputs additional variants to the canonical spelling + and there are no such variants. """ + class AnalysisModule(Protocol): - """ Protocol for analysis modules. + """ The setup of the token analysis is split into two parts: + configuration and analyser factory. A token analysis module must + therefore implement the two functions here described. """ - def configure(self, rules: Mapping[str, Any], normalization_rules: str) -> Any: + def configure(self, rules: Mapping[str, Any], + normalizer: Any, transliterator: Any) -> Any: """ Prepare the configuration of the analysis module. This function should prepare all data that can be shared between instances of this analyser. 
+ + Arguments: + rules: A dictionary with the additional configuration options + as specified in the tokenizer configuration. + normalizer: an ICU Transliterator with the compiled + global normalization rules. + transliterator: an ICU Transliterator with the compiled + global transliteration rules. + + Returns: + A data object with configuration data. This will be handed + as is into the `create()` function and may be + used freely by the analysis module as needed. """ - def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyser: + def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyzer: """ Create a new instance of the analyser. A separate instance of the analyser is created for each thread when used in multi-threading context. + + Arguments: + normalizer: an ICU Transliterator with the compiled normalization + rules. + transliterator: an ICU Transliterator with the compiled + transliteration rules. + config: The object that was returned by the call to configure(). + + Returns: + A new analyzer instance. This must be an object that implements + the Analyzer protocol. """ diff --git a/nominatim/tokenizer/token_analysis/config_variants.py b/nominatim/tokenizer/token_analysis/config_variants.py index d86d8072..1258373e 100644 --- a/nominatim/tokenizer/token_analysis/config_variants.py +++ b/nominatim/tokenizer/token_analysis/config_variants.py @@ -12,8 +12,6 @@ from collections import defaultdict import itertools import re -from icu import Transliterator - from nominatim.config import flatten_config_list from nominatim.errors import UsageError @@ -25,7 +23,7 @@ class ICUVariant(NamedTuple): def get_variant_config(in_rules: Any, - normalization_rules: str) -> Tuple[List[Tuple[str, List[str]]], str]: + normalizer: Any) -> Tuple[List[Tuple[str, List[str]]], str]: """ Convert the variant definition from the configuration into replacement sets. @@ -39,7 +37,7 @@ def get_variant_config(in_rules: Any, vset: Set[ICUVariant] = set() rules = flatten_config_list(in_rules, 'variants') - vmaker = _VariantMaker(normalization_rules) + vmaker = _VariantMaker(normalizer) for section in rules: for rule in (section.get('words') or []): @@ -63,9 +61,8 @@ class _VariantMaker: All text in rules is normalized to make sure the variants match later. """ - def __init__(self, norm_rules: Any) -> None: - self.norm = Transliterator.createFromRules("rule_loader_normalization", - norm_rules) + def __init__(self, normalizer: Any) -> None: + self.norm = normalizer def compute(self, rule: Any) -> Iterator[ICUVariant]: diff --git a/nominatim/tokenizer/token_analysis/generic.py b/nominatim/tokenizer/token_analysis/generic.py index e14f844c..1ed9bf4d 100644 --- a/nominatim/tokenizer/token_analysis/generic.py +++ b/nominatim/tokenizer/token_analysis/generic.py @@ -13,18 +13,19 @@ import itertools import datrie from nominatim.errors import UsageError +from nominatim.data.place_name import PlaceName from nominatim.tokenizer.token_analysis.config_variants import get_variant_config from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator ### Configuration section -def configure(rules: Mapping[str, Any], normalization_rules: str) -> Dict[str, Any]: +def configure(rules: Mapping[str, Any], normalizer: Any, _: Any) -> Dict[str, Any]: """ Extract and preprocess the configuration for this module. 
""" config: Dict[str, Any] = {} config['replacements'], config['chars'] = get_variant_config(rules.get('variants'), - normalization_rules) + normalizer) config['variant_only'] = rules.get('mode', '') == 'variant-only' # parse mutation rules @@ -77,14 +78,14 @@ class GenericTokenAnalysis: self.mutations = [MutationVariantGenerator(*cfg) for cfg in config['mutations']] - def normalize(self, name: str) -> str: + def get_canonical_id(self, name: PlaceName) -> str: """ Return the normalized form of the name. This is the standard form from which possible variants for the name can be derived. """ - return cast(str, self.norm.transliterate(name)).strip() + return cast(str, self.norm.transliterate(name.name)).strip() - def get_variants_ascii(self, norm_name: str) -> List[str]: + def compute_variants(self, norm_name: str) -> List[str]: """ Compute the spelling variants for the given normalized name and transliterate the result. """ diff --git a/nominatim/tokenizer/token_analysis/housenumbers.py b/nominatim/tokenizer/token_analysis/housenumbers.py index a0f4214d..a8ad3ecb 100644 --- a/nominatim/tokenizer/token_analysis/housenumbers.py +++ b/nominatim/tokenizer/token_analysis/housenumbers.py @@ -8,9 +8,10 @@ Specialized processor for housenumbers. Analyses common housenumber patterns and creates variants for them. """ -from typing import Mapping, Any, List, cast +from typing import Any, List, cast import re +from nominatim.data.place_name import PlaceName from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator RE_NON_DIGIT = re.compile('[^0-9]') @@ -20,7 +21,7 @@ RE_NAMED_PART = re.compile(r'[a-z]{4}') ### Configuration section -def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613 +def configure(*_: Any) -> None: """ All behaviour is currently hard-coded. """ return None @@ -42,14 +43,14 @@ class HousenumberTokenAnalysis: self.mutator = MutationVariantGenerator('␣', (' ', '')) - def normalize(self, name: str) -> str: + def get_canonical_id(self, name: PlaceName) -> str: """ Return the normalized form of the housenumber. """ # shortcut for number-only numbers, which make up 90% of the data. - if RE_NON_DIGIT.search(name) is None: - return name + if RE_NON_DIGIT.search(name.name) is None: + return name.name - norm = cast(str, self.trans.transliterate(self.norm.transliterate(name))) + norm = cast(str, self.trans.transliterate(self.norm.transliterate(name.name))) # If there is a significant non-numeric part, use as is. if RE_NAMED_PART.search(norm) is None: # Otherwise add optional spaces between digits and letters. @@ -61,7 +62,7 @@ class HousenumberTokenAnalysis: return norm - def get_variants_ascii(self, norm_name: str) -> List[str]: + def compute_variants(self, norm_name: str) -> List[str]: """ Compute the spelling variants for the given normalized housenumber. Generates variants for optional spaces (marked with '␣'). diff --git a/nominatim/tokenizer/token_analysis/postcodes.py b/nominatim/tokenizer/token_analysis/postcodes.py index 15b20bf9..94e93645 100644 --- a/nominatim/tokenizer/token_analysis/postcodes.py +++ b/nominatim/tokenizer/token_analysis/postcodes.py @@ -8,13 +8,14 @@ Specialized processor for postcodes. Supports a 'lookup' variant of the token, which produces variants with optional spaces. 
""" -from typing import Mapping, Any, List +from typing import Any, List from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator +from nominatim.data.place_name import PlaceName ### Configuration section -def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613 +def configure(*_: Any) -> None: """ All behaviour is currently hard-coded. """ return None @@ -31,10 +32,8 @@ class PostcodeTokenAnalysis: """ Special normalization and variant generation for postcodes. This analyser must not be used with anything but postcodes as - it follows some special rules: `normalize` doesn't necessarily - need to return a standard form as per normalization rules. It - needs to return the canonical form of the postcode that is also - used for output. `get_variants_ascii` then needs to ensure that + it follows some special rules: the canonial ID is the form that + is used for the output. `compute_variants` then needs to ensure that the generated variants once more follow the standard normalization and transliteration, so that postcodes are correctly recognised by the search algorithm. @@ -46,13 +45,13 @@ class PostcodeTokenAnalysis: self.mutator = MutationVariantGenerator(' ', (' ', '')) - def normalize(self, name: str) -> str: + def get_canonical_id(self, name: PlaceName) -> str: """ Return the standard form of the postcode. """ - return name.strip().upper() + return name.name.strip().upper() - def get_variants_ascii(self, norm_name: str) -> List[str]: + def compute_variants(self, norm_name: str) -> List[str]: """ Compute the spelling variants for the given normalized postcode. Takes the canonical form of the postcode, normalizes it using the diff --git a/test/python/tokenizer/token_analysis/test_analysis_postcodes.py b/test/python/tokenizer/token_analysis/test_analysis_postcodes.py index 623bed54..8d966c46 100644 --- a/test/python/tokenizer/token_analysis/test_analysis_postcodes.py +++ b/test/python/tokenizer/token_analysis/test_analysis_postcodes.py @@ -12,6 +12,7 @@ import pytest from icu import Transliterator import nominatim.tokenizer.token_analysis.postcodes as module +from nominatim.data.place_name import PlaceName from nominatim.errors import UsageError DEFAULT_NORMALIZATION = """ :: NFD (); @@ -39,22 +40,22 @@ def analyser(): def get_normalized_variants(proc, name): norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) - return proc.get_variants_ascii(norm.transliterate(name).strip()) + return proc.compute_variants(norm.transliterate(name).strip()) @pytest.mark.parametrize('name,norm', [('12', '12'), ('A 34 ', 'A 34'), ('34-av', '34-AV')]) -def test_normalize(analyser, name, norm): - assert analyser.normalize(name) == norm +def test_get_canonical_id(analyser, name, norm): + assert analyser.get_canonical_id(PlaceName(name=name, kind='', suffix='')) == norm @pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}), ('AB-998', {'ab 998', 'ab998'}), ('23 FGH D3', {'23 fgh d3', '23fgh d3', '23 fghd3', '23fghd3'})]) -def test_get_variants_ascii(analyser, postcode, variants): - out = analyser.get_variants_ascii(postcode) +def test_compute_variants(analyser, postcode, variants): + out = analyser.compute_variants(postcode) assert len(out) == len(set(out)) assert set(out) == variants diff --git a/test/python/tokenizer/token_analysis/test_generic.py b/test/python/tokenizer/token_analysis/test_generic.py index afbd5e9b..976bbd1b 100644 --- a/test/python/tokenizer/token_analysis/test_generic.py +++ 
b/test/python/tokenizer/token_analysis/test_generic.py @@ -30,23 +30,23 @@ def make_analyser(*variants, variant_only=False): rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]} if variant_only: rules['mode'] = 'variant-only' - config = module.configure(rules, DEFAULT_NORMALIZATION) trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION) norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) + config = module.configure(rules, norm, trans) return module.create(norm, trans, config) def get_normalized_variants(proc, name): norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) - return proc.get_variants_ascii(norm.transliterate(name).strip()) + return proc.compute_variants(norm.transliterate(name).strip()) def test_no_variants(): rules = { 'analyzer': 'generic' } - config = module.configure(rules, DEFAULT_NORMALIZATION) trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION) norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) + config = module.configure(rules, norm, trans) proc = module.create(norm, trans, config) @@ -123,7 +123,9 @@ class TestGetReplacements: @staticmethod def configure_rules(*variants): rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]} - return module.configure(rules, DEFAULT_NORMALIZATION) + trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION) + norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) + return module.configure(rules, norm, trans) def get_replacements(self, *variants): diff --git a/test/python/tokenizer/token_analysis/test_generic_mutation.py b/test/python/tokenizer/token_analysis/test_generic_mutation.py index abe31f6d..ff4c3a74 100644 --- a/test/python/tokenizer/token_analysis/test_generic_mutation.py +++ b/test/python/tokenizer/token_analysis/test_generic_mutation.py @@ -31,16 +31,16 @@ class TestMutationNoVariants: 'mutations': [ {'pattern': m[0], 'replacements': m[1]} for m in mutations] } - config = module.configure(rules, DEFAULT_NORMALIZATION) trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION) norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) + config = module.configure(rules, norm, trans) self.analysis = module.create(norm, trans, config) def variants(self, name): norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION) - return set(self.analysis.get_variants_ascii(norm.transliterate(name).strip())) + return set(self.analysis.compute_variants(norm.transliterate(name).strip())) @pytest.mark.parametrize('pattern', ('(capture)', ['a list']))