The documentation is built with mkdocs:
* [mkdocs](https://www.mkdocs.org/) >= 1.1.2
-* [mkdocstrings](https://mkdocstrings.github.io/)
+* [mkdocstrings](https://mkdocstrings.github.io/) >= 0.16
+* [mkdocstrings-python-legacy](https://mkdocstrings.github.io/python-legacy/)
### Installing prerequisites on Ubuntu/Debian
--- /dev/null
+# Writing custom sanitizer and token analysis modules for the ICU tokenizer
+
+The [ICU tokenizer](../customize/Tokenizers.md#icu-tokenizer) provides a
+highly customizable method to pre-process and normalize the name information
+of the input data before it is added to the search index. It comes with a
+selection of sanitizers and token analyzers which you can use to adapt your
+installation to your needs. If the provided modules are not enough, you can
+also provide your own implementations. This section describes the API
+of sanitizers and token analysis modules.
+
+!!! warning
+    This API is currently in early alpha status. While this API is meant to
+    be a public API on which other sanitizers and token analyzers may be
+    implemented, it is not guaranteed to be stable at the moment.
+
+
+## Using non-standard sanitizers and token analyzers
+
+Sanitizer names (in the `step` property) and token analysis names (in the
+`analyzer`) may refer to externally supplied modules. There are two ways
+to include external modules: through a library or from the project directory.
+
+To include a module from a library, use the absolute import path as name and
+make sure the library can be found in your PYTHONPATH.
+
+To use a custom module without creating a library, you can put the module
+somewhere in your project directory and then use the relative path to the
+file. Include the whole name of the file including the `.py` ending.
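+
+For example, a sanitizer configuration could reference both kinds of modules
+like this (the module names below are just placeholders for your own code):
+
+``` yaml
+sanitizers:
+    - step: mylibrary.sanitizers.cleanup_names
+    - step: cleanup_names.py
+```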
+
+## Custom sanitizer modules
+
+A sanitizer module must export a single factory function `create` with the
+following signature:
+
+``` python
+def create(config: SanitizerConfig) -> Callable[[ProcessInfo], None]
+```
+
+The function receives the custom configuration for the sanitizer and must
+return a callable (function or class) that transforms the name and address
+terms of a place. When a place is processed, a `ProcessInfo` object
+is created from the information that was queried from the database. This
+object is sequentially handed to each configured sanitizer, so that each
+sanitizer receives the result of processing from the previous sanitizer.
+After the last sanitizer is finished, the resulting name and address lists
+are forwarded to the token analysis module.
+
+Sanitizer functions are instantiated once and then called for each place
+that is imported or updated. They don't need to be thread-safe.
+If multi-threading is used, each thread creates its own instance of
+the function.
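+
+As a sketch of this pattern, `create()` can evaluate the configuration once and
+return a closure that is then applied to every place. The `skip-kinds`
+configuration option used here is purely hypothetical:
+
+``` python
+def create(config):
+    # Read the configuration only once, when the sanitizer is instantiated.
+    skip_kinds = set(config.get_string_list('skip-kinds'))
+
+    def _filter(obj):
+        # Called for every place: drop all names whose kind was configured
+        # to be skipped.
+        obj.names = [name for name in obj.names if name.kind not in skip_kinds]
+
+    return _filter
+```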
+
+### Sanitizer configuration
+
+::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
+    rendering:
+        show_source: no
+        heading_level: 6
+
+### The main filter function of the sanitizer
+
+The filter function receives a single object of type `ProcessInfo`
+which has three members:
+
+ * `place`: read-only information about the place being processed.
+ See PlaceInfo below.
+ * `names`: The current list of names for the place. Each name is a
+ PlaceName object.
+ * `address`: The current list of address names for the place. Each name
+ is a PlaceName object.
+
+While the `place` member is provided for information only, the `names` and
+`address` lists are meant to be manipulated by the sanitizer. It may add and
+remove entries, change information within a single entry (for example by
+adding extra attributes) or completely replace the list with a different one.
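+
+For instance, a filter function could route certain names to a dedicated
+token analyzer by setting the `analyzer` attribute. The analyzer id
+`historic` used below is only an illustration and would have to be defined
+in your token analysis configuration:
+
+``` python
+def _filter_function(obj):
+    for name in obj.names:
+        # Let historic names be handled by a special token analyzer.
+        if name.kind == 'old_name':
+            name.set_attr('analyzer', 'historic')
+```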
+
+#### PlaceInfo - information about the place
+
+::: nominatim.data.place_info.PlaceInfo
+    rendering:
+        show_source: no
+        heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim.data.place_name.PlaceName
+    rendering:
+        show_source: no
+        heading_level: 6
+
+
+### Example: Filter for US street prefixes
+
+The following sanitizer removes the directional prefixes from street names
+in the US:
+
+``` python
+import re
+
+def _filter_function(obj):
+    if obj.place.country_code == 'us' \
+       and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+        for name in obj.names:
+            name.name = re.sub(r'^(north|south|west|east) ',
+                               '',
+                               name.name,
+                               flags=re.IGNORECASE)
+
+def create(config):
+    return _filter_function
+```
+
+This is the simplest form of a sanitizer module. It defines a single
+filter function and implements the required `create()` function by returning
+the filter.
+
+The filter function first checks if the object is interesting for the
+sanitizer. Namely, it checks if the place is in the US (through `country_code`)
+and if the place is a street (a `rank_address` of 26 or 27). If the
+conditions are met, then it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
+
+Save the source code in a file in your project directory, for example as
+`us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
+
+``` yaml
+...
+sanitizers:
+    - step: us_streets.py
+...
+```
+
+!!! warning
+    This example is just a simplified showcase of how to create a sanitizer.
+    It is not really ready for real-world use: while the sanitizer would
+    correctly transform `West 5th Street` into `5th Street`, it would also
+    shorten a simple `North Street` to `Street`.
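+
+One way to make the example more robust would be to strip the prefix only
+when more than one word follows it, so that names like `North Street` are
+left untouched. A rough sketch of such a check:
+
+``` python
+def _filter_function(obj):
+    if obj.place.country_code == 'us' \
+       and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+        for name in obj.names:
+            words = name.name.split()
+            # Only strip the prefix when at least two more words follow,
+            # so that 'North Street' keeps its name.
+            if len(words) > 2 and words[0].lower() in ('north', 'south', 'west', 'east'):
+                name.name = ' '.join(words[1:])
+```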
+
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
+
+
+## Custom token analysis module
+
+::: nominatim.tokenizer.token_analysis.base.AnalysisModule
+    rendering:
+        show_source: no
+        heading_level: 6
+
+
+::: nominatim.tokenizer.token_analysis.base.Analyzer
+    rendering:
+        show_source: no
+        heading_level: 6
+
+### Example: Creating acronym variants for long names
+
+The following example of a token analysis module creates acronyms from
+very long names and adds them as a variant:
+
+``` python
+class AcronymMaker:
+    """ This class is the actual analyzer.
+    """
+    def __init__(self, norm, trans):
+        self.norm = norm
+        self.trans = trans
+
+
+    def get_canonical_id(self, name):
+        # In simple cases, the normalized name can be used as a canonical id.
+        return self.norm.transliterate(name.name).strip()
+
+
+    def compute_variants(self, name):
+        # The transliterated form of the name always makes up a variant.
+        variants = [self.trans.transliterate(name)]
+
+        # Only create acronyms from very long words.
+        if len(name) > 20:
+            # Take the first letter from each word to form the acronym.
+            acronym = ''.join(w[0] for w in name.split())
+            # If that leads to an acronym with at least three letters,
+            # add the resulting acronym as a variant.
+            if len(acronym) > 2:
+                # Never forget to transliterate the variants before returning them.
+                variants.append(self.trans.transliterate(acronym))
+
+        return variants
+
+# The following two functions are the module interface.
+
+def configure(rules, normalizer, transliterator):
+    # There is no configuration to parse and no data to set up.
+    # Just return an empty configuration.
+    return None
+
+
+def create(normalizer, transliterator, config):
+    # Return a new instance of our token analysis class above.
+    return AcronymMaker(normalizer, transliterator)
+```
+
+Given the name `Trans-Siberian Railway`, the code above would return the full
+name `Trans-Siberian Railway` and the acronym `TSR` as a variant, so that
+searching would work for both.
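+
+To use the analyzer, save the module in your project directory (for example
+as `acronym_maker.py`) and register it with an id of your choice in the
+`token-analysis` section of `icu_tokenizer.yaml`. Keep in mind that a name
+is only handed to this analyzer when a sanitizer has set its `analyzer`
+attribute to that id. A sketch of the configuration:
+
+``` yaml
+...
+token-analysis:
+    - analyzer: generic
+    - id: acronyms
+      analyzer: acronym_maker.py
+...
+```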
+
+## Sanitizers vs. Token analysis - what to use for variants?
+
+It is not always clear when to implement variations in the sanitizer and
+when to write a token analysis module. Just take the acronym example
+above: it would also have been possible to write a sanitizer which adds the
+acronym as an additional name to the name list. The result would have been
+similar. So which should be used when?
+
+The most important thing to keep in mind is that variants created by the
+token analysis are only saved in the word lookup table. They do not need
+extra space in the search index. If there are many spelling variations, this
+can mean quite a significant amount of space is saved.
+
+When creating additional names with a sanitizer, these names are completely
+independent. In particular, they can be fed into different token analysis
+modules. This gives a much greater flexibility but at the price that the
+additional names increase the size of the search index.
+
background-color: #eee;
}
-/* Indentation for mkdocstrings.
-div.doc-contents:not(.first) {
- padding-left: 25px;
- border-left: 4px solid rgba(230, 230, 230);
- margin-bottom: 60px;
-}*/
+.doc-object h6 {
+ margin-bottom: 0.8em;
+ font-size: 120%;
+}
+.doc-object {
+ margin-bottom: 1.3em;
+}
- 'Database Layout' : 'develop/Database-Layout.md'
- 'Indexing' : 'develop/Indexing.md'
- 'Tokenizers' : 'develop/Tokenizers.md'
+ - 'Custom modules for ICU tokenizer': 'develop/ICU-Tokenizer-Modules.md'
- 'Setup for Development' : 'develop/Development-Environment.md'
- 'Testing' : 'develop/Testing.md'
- 'External Data Sources': 'develop/data-sources.md'
- search
- mkdocstrings:
handlers:
- python:
+ python-legacy:
rendering:
show_source: false
show_signature_annotations: false
from typing import Optional, Mapping, Any
class PlaceInfo:
- """ Data class containing all information the tokenizer gets about a
- place it should process the names for.
+ """ This data class contains all information the tokenizer can access
+ about a place.
"""
def __init__(self, info: Mapping[str, Any]) -> None:
@property
def name(self) -> Optional[Mapping[str, str]]:
- """ A dictionary with the names of the place or None if the place
- has no names.
+ """ A dictionary with the names of the place. Keys and values represent
+ the full key and value of the corresponding OSM tag. Which tags
+ are saved as names is determined by the import style.
+ The property may be None if the place has no names.
"""
return self._info.get('name')
@property
def address(self) -> Optional[Mapping[str, str]]:
- """ A dictionary with the address elements of the place
- or None if no address information is available.
+ """ A dictionary with the address elements of the place. The key
+ usually corresponds to the suffix part of the key of an OSM
+ 'addr:*' or 'isin:*' tag. There are also some special keys like
+ `country` or `country_code` which merge OSM keys that contain
+ the same information. See [Import Styles][1] for details.
+
+ The property may be None if the place has no address information.
+
+ [1]: ../customize/Import-Styles.md
"""
return self._info.get('address')
@property
def country_code(self) -> Optional[str]:
""" The country code of the country the place is in. Guaranteed
- to be a two-letter lower-case string or None, if no country
- could be found.
+ to be a two-letter lower-case string. If the place is not inside
+ any country, the property is set to None.
"""
return self._info.get('country_code')
@property
def rank_address(self) -> int:
- """ The computed rank address before rank correction.
+ """ The [rank address][1] before any rank correction is applied.
+
+ [1]: ../customize/Ranking.md#address-rank
"""
return self._info.get('rank_address', 0)
def is_a(self, key: str, value: str) -> bool:
- """ Check if the place's primary tag corresponds to the given
+ """ Set to True when the place's primary tag corresponds to the given
key and value.
"""
return self._info.get('class') == key and self._info.get('type') == value
def is_country(self) -> bool:
- """ Check if the place is a valid country boundary.
+ """ Set to True when the place is a valid country boundary.
"""
return self.rank_address == 4 \
and self.is_a('boundary', 'administrative') \
--- /dev/null
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Data class for a single name of a place.
+"""
+from typing import Optional, Dict, Mapping
+
+class PlaceName:
+ """ Each name and address part of a place is encapsulated in an object of
+ this class. It saves not only the name proper but also describes the
+ kind of name with two properties:
+
+ * `kind` describes the name of the OSM key used without any suffixes
+ (i.e. the part after the colon removed)
+ * `suffix` contains the suffix of the OSM tag, if any. The suffix
+ is the part of the key after the first colon.
+
+ In addition to that, a name may have arbitrary additional attributes.
+ How attributes are used depends on the sanitizers and token analysers.
+ The exception is the 'analyzer' attribute. This attribute determines
+ which token analysis module will be used to finalize the treatment of
+ names.
+ """
+
+ def __init__(self, name: str, kind: str, suffix: Optional[str]):
+ self.name = name
+ self.kind = kind
+ self.suffix = suffix
+ self.attr: Dict[str, str] = {}
+
+
+ def __repr__(self) -> str:
+ return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')"
+
+
+ def clone(self, name: Optional[str] = None,
+ kind: Optional[str] = None,
+ suffix: Optional[str] = None,
+ attr: Optional[Mapping[str, str]] = None) -> 'PlaceName':
+ """ Create a deep copy of the place name, optionally with the
+ given parameters replaced. In the attribute list only the given
+ keys are updated. The list is not replaced completely.
+ In particular, the function cannot be used to remove an
+ attribute from a place name.
+ """
+ newobj = PlaceName(name or self.name,
+ kind or self.kind,
+ suffix or self.suffix)
+
+ newobj.attr.update(self.attr)
+ if attr:
+ newobj.attr.update(attr)
+
+ return newobj
+
+
+ def set_attr(self, key: str, value: str) -> None:
+ """ Add the given property to the name. If the property was already
+ set, then the value is overwritten.
+ """
+ self.attr[key] = value
+
+
+ def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]:
+ """ Return the given property or the value of 'default' if it
+ is not set.
+ """
+ return self.attr.get(key, default)
+
+
+ def has_attr(self, key: str) -> bool:
+ """ Check if the given attribute is set.
+ """
+ return key in self.attr
import json
import logging
+from icu import Transliterator
+
from nominatim.config import flatten_config_list, Configuration
from nominatim.db.properties import set_property, get_property
from nominatim.db.connection import Connection
from nominatim.errors import UsageError
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
-from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyser
+from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyzer
import nominatim.data.country_info
LOG = logging.getLogger()
if not isinstance(self.analysis_rules, list):
raise UsageError("Configuration section 'token-analysis' must be a list.")
+ norm = Transliterator.createFromRules("rule_loader_normalization",
+ self.normalization_rules)
+ trans = Transliterator.createFromRules("rule_loader_transliteration",
+ self.transliteration_rules)
+
for section in self.analysis_rules:
name = section.get('id', None)
if name in self.analysis:
LOG.fatal("ICU tokenizer configuration has two token "
"analyzers with id '%s'.", name)
raise UsageError("Syntax error in ICU tokenizer config.")
- self.analysis[name] = TokenAnalyzerRule(section,
- self.normalization_rules,
+ self.analysis[name] = TokenAnalyzerRule(section, norm, trans,
self.config)
and creates a new token analyzer on request.
"""
- def __init__(self, rules: Mapping[str, Any], normalization_rules: str,
+ def __init__(self, rules: Mapping[str, Any],
+ normalizer: Any, transliterator: Any,
config: Configuration) -> None:
analyzer_name = _get_section(rules, 'analyzer')
if not analyzer_name or not isinstance(analyzer_name, str):
self._analysis_mod: AnalysisModule = \
config.load_plugin_module(analyzer_name, 'nominatim.tokenizer.token_analysis')
- self.config = self._analysis_mod.configure(rules, normalization_rules)
+ self.config = self._analysis_mod.configure(rules, normalizer,
+ transliterator)
- def create(self, normalizer: Any, transliterator: Any) -> Analyser:
+ def create(self, normalizer: Any, transliterator: Any) -> Analyzer:
""" Create a new analyser instance for the given rule.
"""
return self._analysis_mod.create(normalizer, transliterator, self.config)
from typing import Mapping, Optional, TYPE_CHECKING
from icu import Transliterator
-from nominatim.tokenizer.token_analysis.base import Analyser
+from nominatim.tokenizer.token_analysis.base import Analyzer
if TYPE_CHECKING:
from typing import Any
class ICUTokenAnalysis:
""" Container class collecting the transliterators and token analysis
- modules for a single NameAnalyser instance.
+ modules for a single Analyser instance.
"""
def __init__(self, norm_rules: str, trans_rules: str,
for name, arules in analysis_rules.items()}
- def get_analyzer(self, name: Optional[str]) -> Analyser:
+ def get_analyzer(self, name: Optional[str]) -> Analyzer:
""" Return the given named analyzer. If no analyzer with that
name exists, return the default analyzer.
"""
from nominatim.data.place_info import PlaceInfo
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
-from nominatim.tokenizer.sanitizers.base import PlaceName
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
from nominatim.tokenizer.base import AbstractAnalyzer, AbstractTokenizer
postcode_name = place.name.strip().upper()
variant_base = None
else:
- postcode_name = analyzer.normalize(place.name)
+ postcode_name = analyzer.get_canonical_id(place)
variant_base = place.get_attr("variant")
if variant_base:
if analyzer is None:
variants = [term]
else:
- variants = analyzer.get_variants_ascii(variant)
+ variants = analyzer.compute_variants(variant)
if term not in variants:
variants.append(term)
else:
# Otherwise use the analyzer to determine the canonical name.
# Per convention we use the first variant as the 'lookup name', the
# name that gets saved in the housenumber field of the place.
- norm_name = analyzer.normalize(hnr.name)
- if norm_name:
- result = self._cache.housenumbers.get(norm_name, result)
+ word_id = analyzer.get_canonical_id(hnr)
+ if word_id:
+ result = self._cache.housenumbers.get(word_id, result)
if result[0] is None:
- variants = analyzer.get_variants_ascii(norm_name)
+ variants = analyzer.compute_variants(word_id)
if variants:
with self.conn.cursor() as cur:
cur.execute("SELECT create_analyzed_hnr_id(%s, %s)",
- (norm_name, list(variants)))
+ (word_id, list(variants)))
result = cur.fetchone()[0], variants[0] # type: ignore[no-untyped-call]
- self._cache.housenumbers[norm_name] = result
+ self._cache.housenumbers[word_id] = result
return result
for name in names:
analyzer_id = name.get_attr('analyzer')
analyzer = self.token_analysis.get_analyzer(analyzer_id)
- norm_name = analyzer.normalize(name.name)
+ word_id = analyzer.get_canonical_id(name)
if analyzer_id is None:
- token_id = norm_name
+ token_id = word_id
else:
- token_id = f'{norm_name}@{analyzer_id}'
+ token_id = f'{word_id}@{analyzer_id}'
full, part = self._cache.names.get(token_id, (None, None))
if full is None:
- variants = analyzer.get_variants_ascii(norm_name)
+ variants = analyzer.compute_variants(word_id)
if not variants:
continue
postcode_name = item.name.strip().upper()
variant_base = None
else:
- postcode_name = analyzer.normalize(item.name)
+ postcode_name = analyzer.get_canonical_id(item)
variant_base = item.get_attr("variant")
if variant_base:
variants = {term}
if analyzer is not None and variant_base:
- variants.update(analyzer.get_variants_ascii(variant_base))
+ variants.update(analyzer.compute_variants(variant_base))
with self.conn.cursor() as cur:
cur.execute("SELECT create_postcode_word(%s, %s)",
from nominatim.errors import UsageError
from nominatim.config import Configuration
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
-from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo, PlaceName
+from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo
+from nominatim.data.place_name import PlaceName
from nominatim.data.place_info import PlaceInfo
"""
Common data types and protocols for sanitizers.
"""
-from typing import Optional, Dict, List, Mapping, Callable
+from typing import Optional, List, Mapping, Callable
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
from nominatim.data.place_info import PlaceInfo
+from nominatim.data.place_name import PlaceName
from nominatim.typing import Protocol, Final
-class PlaceName:
- """ A searchable name for a place together with properties.
- Every name object saves the name proper and two basic properties:
- * 'kind' describes the name of the OSM key used without any suffixes
- (i.e. the part after the colon removed)
- * 'suffix' contains the suffix of the OSM tag, if any. The suffix
- is the part of the key after the first colon.
- In addition to that, the name may have arbitrary additional attributes.
- Which attributes are used, depends on the token analyser.
- """
-
- def __init__(self, name: str, kind: str, suffix: Optional[str]):
- self.name = name
- self.kind = kind
- self.suffix = suffix
- self.attr: Dict[str, str] = {}
-
-
- def __repr__(self) -> str:
- return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')"
-
-
- def clone(self, name: Optional[str] = None,
- kind: Optional[str] = None,
- suffix: Optional[str] = None,
- attr: Optional[Mapping[str, str]] = None) -> 'PlaceName':
- """ Create a deep copy of the place name, optionally with the
- given parameters replaced. In the attribute list only the given
- keys are updated. The list is not replaced completely.
- In particular, the function cannot to be used to remove an
- attribute from a place name.
- """
- newobj = PlaceName(name or self.name,
- kind or self.kind,
- suffix or self.suffix)
-
- newobj.attr.update(self.attr)
- if attr:
- newobj.attr.update(attr)
-
- return newobj
-
-
- def set_attr(self, key: str, value: str) -> None:
- """ Add the given property to the name. If the property was already
- set, then the value is overwritten.
- """
- self.attr[key] = value
-
-
- def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]:
- """ Return the given property or the value of 'default' if it
- is not set.
- """
- return self.attr.get(key, default)
-
-
- def has_attr(self, key: str) -> bool:
- """ Check if the given attribute is set.
- """
- return key in self.attr
-
class ProcessInfo:
""" Container class for information handed into to handler functions.
def create(self, config: SanitizerConfig) -> Callable[[ProcessInfo], None]:
"""
- A sanitizer must define a single function `create`. It takes the
- dictionary with the configuration information for the sanitizer and
- returns a function that transforms name and address.
+ Create a function for sanitizing a place.
+
+ Arguments:
+ config: A dictionary with the additional configuration options
+ specified in the tokenizer configuration
+
+ Return:
+ The result must be a callable that takes a place description
+ and transforms name and address as required.
"""
from typing import Callable, Iterator, List
import re
-from nominatim.tokenizer.sanitizers.base import ProcessInfo, PlaceName
+from nominatim.tokenizer.sanitizers.base import ProcessInfo
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
class _HousenumberSanitizer:
_BaseUserDict = UserDict
class SanitizerConfig(_BaseUserDict):
- """ Dictionary with configuration options for a sanitizer.
-
- In addition to the usual dictionary function, the class provides
- accessors to standard sanatizer options that are used by many of the
+ """ The `SanitizerConfig` class is a read-only dictionary
+ with configuration options for the sanitizer.
+ In addition to the usual dictionary functions, the class provides
+ accessors to standard sanitizer options that are used by many of the
sanitizers.
"""
def get_string_list(self, param: str, default: Sequence[str] = tuple()) -> Sequence[str]:
""" Extract a configuration parameter as a string list.
- If the parameter value is a simple string, it is returned as a
- one-item list. If the parameter value does not exist, the given
- default is returned. If the parameter value is a list, it is checked
- to contain only strings before being returned.
+
+ Arguments:
+ param: Name of the configuration parameter.
+ default: Value to return, when the parameter is missing.
+
+ Returns:
+ If the parameter value is a simple string, it is returned as a
+ one-item list. If the parameter value does not exist, the given
+ default is returned. If the parameter value is a list, it is
+ checked to contain only strings before being returned.
"""
values = self.data.get(param, None)
def get_bool(self, param: str, default: Optional[bool] = None) -> bool:
""" Extract a configuration parameter as a boolean.
- The parameter must be one of the yaml boolean values or an
- user error will be raised. If `default` is given, then the parameter
- may also be missing or empty.
+
+ Arguments:
+ param: Name of the configuration parameter. The parameter must
+ contain one of the yaml boolean values or a
+ UsageError will be raised.
+ default: Value to return, when the parameter is missing.
+ When set to `None`, the parameter must be defined.
+
+ Returns:
+ Boolean value of the given parameter.
"""
value = self.data.get(param, default)
def get_delimiter(self, default: str = ',;') -> Pattern[str]:
- """ Return the 'delimiter' parameter in the configuration as a
- compiled regular expression that can be used to split the names on the
- delimiters. The regular expression makes sure that the resulting names
- are stripped and that repeated delimiters
- are ignored but it will still create empty fields on occasion. The
- code needs to filter those.
-
- The 'default' parameter defines the delimiter set to be used when
- not explicitly configured.
+ """ Return the 'delimiters' parameter in the configuration as a
+ compiled regular expression that can be used to split strings on
+ these delimiters.
+
+ Arguments:
+ default: Delimiters to be used when 'delimiters' parameter
+ is not explicitly configured.
+
+ Returns:
+ A regular expression pattern which can be used to
+ split a string. The regular expression makes sure that the
+ resulting names are stripped and that repeated delimiters
+ are ignored. It may still create empty fields on occasion. The
+ code needs to filter those.
"""
delimiter_set = set(self.data.get('delimiters', default))
if not delimiter_set:
def get_filter_kind(self, *default: str) -> Callable[[str], bool]:
""" Return a filter function for the name kind from the 'filter-kind'
- config parameter. The filter functions takes a name item and returns
- True when the item passes the filter.
+ config parameter.
- If the parameter is empty, the filter lets all items pass. If the
- parameter is a string, it is interpreted as a single regular expression
- that must match the full kind string. If the parameter is a list then
+ If the 'filter-kind' parameter is empty, the filter lets all items
+ pass. If the parameter is a string, it is interpreted as a single
+ regular expression that must match the full kind string.
+ If the parameter is a list then
any of the regular expressions in the list must match to pass.
+
+ Arguments:
+ default: Filters to be used, when the 'filter-kind' parameter
+ is not specified. If omitted then the default is to
+ let all names pass.
+
+ Returns:
+ A filter function which takes a name string and returns
+ True when the item passes the filter.
"""
filters = self.get_string_list('filter-kind', default)
from typing import Mapping, List, Any
from nominatim.typing import Protocol
+from nominatim.data.place_name import PlaceName
-class Analyser(Protocol):
- """ Instance of the token analyser.
+class Analyzer(Protocol):
+ """ The `create()` function of an analysis module needs to return an
+ object that implements the following functions.
"""
- def normalize(self, name: str) -> str:
- """ Return the normalized form of the name. This is the standard form
- from which possible variants for the name can be derived.
+ def get_canonical_id(self, name: PlaceName) -> str:
+ """ Return the canonical form of the given name. The canonical ID must
+ be unique (the same ID must always yield the same variants) and
+ must be a form from which the variants can be derived.
+
+ Arguments:
+ name: Extended place name description as prepared by
+ the sanitizers.
+
+ Returns:
+ ID string with a canonical form of the name. The string may
+ be empty, when the analyzer cannot analyze the name at all,
+ for example because the character set in use does not match.
"""
- def get_variants_ascii(self, norm_name: str) -> List[str]:
- """ Compute the spelling variants for the given normalized name
- and transliterate the result.
+ def compute_variants(self, canonical_id: str) -> List[str]:
+ """ Compute the transliterated spelling variants for the given
+ canonical ID.
+
+ Arguments:
+ canonical_id: ID string previously computed with
+ `get_canonical_id()`.
+
+ Returns:
+ A list of possible spelling variants. All strings must have
+ been transformed with the global normalizer and
+ transliterator ICU rules. Otherwise they cannot be matched
+ against the input by the query frontend.
+ The list may be empty, when there are no useful
+ spelling variants. This may happen when an analyzer normally only
+ outputs additional variants to the canonical spelling and no
+ such variants exist for the given name.
"""
+
class AnalysisModule(Protocol):
- """ Protocol for analysis modules.
+ """ The setup of the token analysis is split into two parts:
+ configuration and analyser factory. A token analysis module must
+ therefore implement the two functions described here.
"""
- def configure(self, rules: Mapping[str, Any], normalization_rules: str) -> Any:
+ def configure(self, rules: Mapping[str, Any],
+ normalizer: Any, transliterator: Any) -> Any:
""" Prepare the configuration of the analysis module.
This function should prepare all data that can be shared
between instances of this analyser.
+
+ Arguments:
+ rules: A dictionary with the additional configuration options
+ as specified in the tokenizer configuration.
+ normalizer: an ICU Transliterator with the compiled
+ global normalization rules.
+ transliterator: an ICU Transliterator with the compiled
+ global transliteration rules.
+
+ Returns:
+ A data object with configuration data. This will be handed
+ as is into the `create()` function and may be
+ used freely by the analysis module as needed.
"""
- def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyser:
+ def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyzer:
""" Create a new instance of the analyser.
A separate instance of the analyser is created for each thread
when used in multi-threading context.
+
+ Arguments:
+ normalizer: an ICU Transliterator with the compiled normalization
+ rules.
+ transliterator: an ICU Transliterator with the compiled
+ transliteration rules.
+ config: The object that was returned by the call to configure().
+
+ Returns:
+ A new analyzer instance. This must be an object that implements
+ the Analyzer protocol.
"""
import itertools
import re
-from icu import Transliterator
-
from nominatim.config import flatten_config_list
from nominatim.errors import UsageError
def get_variant_config(in_rules: Any,
- normalization_rules: str) -> Tuple[List[Tuple[str, List[str]]], str]:
+ normalizer: Any) -> Tuple[List[Tuple[str, List[str]]], str]:
""" Convert the variant definition from the configuration into
replacement sets.
vset: Set[ICUVariant] = set()
rules = flatten_config_list(in_rules, 'variants')
- vmaker = _VariantMaker(normalization_rules)
+ vmaker = _VariantMaker(normalizer)
for section in rules:
for rule in (section.get('words') or []):
All text in rules is normalized to make sure the variants match later.
"""
- def __init__(self, norm_rules: Any) -> None:
- self.norm = Transliterator.createFromRules("rule_loader_normalization",
- norm_rules)
+ def __init__(self, normalizer: Any) -> None:
+ self.norm = normalizer
def compute(self, rule: Any) -> Iterator[ICUVariant]:
import datrie
from nominatim.errors import UsageError
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.token_analysis.config_variants import get_variant_config
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> Dict[str, Any]:
+def configure(rules: Mapping[str, Any], normalizer: Any, _: Any) -> Dict[str, Any]:
""" Extract and preprocess the configuration for this module.
"""
config: Dict[str, Any] = {}
config['replacements'], config['chars'] = get_variant_config(rules.get('variants'),
- normalization_rules)
+ normalizer)
config['variant_only'] = rules.get('mode', '') == 'variant-only'
# parse mutation rules
self.mutations = [MutationVariantGenerator(*cfg) for cfg in config['mutations']]
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the normalized form of the name. This is the standard form
from which possible variants for the name can be derived.
"""
- return cast(str, self.norm.transliterate(name)).strip()
+ return cast(str, self.norm.transliterate(name.name)).strip()
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
Specialized processor for housenumbers. Analyses common housenumber patterns
and creates variants for them.
"""
-from typing import Mapping, Any, List, cast
+from typing import Any, List, cast
import re
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
RE_NON_DIGIT = re.compile('[^0-9]')
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613
+def configure(*_: Any) -> None:
""" All behaviour is currently hard-coded.
"""
return None
self.mutator = MutationVariantGenerator('␣', (' ', ''))
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the normalized form of the housenumber.
"""
# shortcut for number-only numbers, which make up 90% of the data.
- if RE_NON_DIGIT.search(name) is None:
- return name
+ if RE_NON_DIGIT.search(name.name) is None:
+ return name.name
- norm = cast(str, self.trans.transliterate(self.norm.transliterate(name)))
+ norm = cast(str, self.trans.transliterate(self.norm.transliterate(name.name)))
# If there is a significant non-numeric part, use as is.
if RE_NAMED_PART.search(norm) is None:
# Otherwise add optional spaces between digits and letters.
return norm
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized housenumber.
Generates variants for optional spaces (marked with '␣').
Specialized processor for postcodes. Supports a 'lookup' variant of the
token, which produces variants with optional spaces.
"""
-from typing import Mapping, Any, List
+from typing import Any, List
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
+from nominatim.data.place_name import PlaceName
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613
+def configure(*_: Any) -> None:
""" All behaviour is currently hard-coded.
"""
return None
""" Special normalization and variant generation for postcodes.
This analyser must not be used with anything but postcodes as
- it follows some special rules: `normalize` doesn't necessarily
- need to return a standard form as per normalization rules. It
- needs to return the canonical form of the postcode that is also
- used for output. `get_variants_ascii` then needs to ensure that
+ it follows some special rules: the canonical ID is the form that
+ is used for the output. `compute_variants` then needs to ensure that
the generated variants once more follow the standard normalization
and transliteration, so that postcodes are correctly recognised by
the search algorithm.
self.mutator = MutationVariantGenerator(' ', (' ', ''))
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the standard form of the postcode.
"""
- return name.strip().upper()
+ return name.name.strip().upper()
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized postcode.
Takes the canonical form of the postcode, normalizes it using the
from icu import Transliterator
import nominatim.tokenizer.token_analysis.postcodes as module
+from nominatim.data.place_name import PlaceName
from nominatim.errors import UsageError
DEFAULT_NORMALIZATION = """ :: NFD ();
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return proc.get_variants_ascii(norm.transliterate(name).strip())
+ return proc.compute_variants(norm.transliterate(name).strip())
@pytest.mark.parametrize('name,norm', [('12', '12'),
('A 34 ', 'A 34'),
('34-av', '34-AV')])
-def test_normalize(analyser, name, norm):
- assert analyser.normalize(name) == norm
+def test_get_canonical_id(analyser, name, norm):
+ assert analyser.get_canonical_id(PlaceName(name=name, kind='', suffix='')) == norm
@pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}),
('AB-998', {'ab 998', 'ab998'}),
('23 FGH D3', {'23 fgh d3', '23fgh d3',
'23 fghd3', '23fghd3'})])
-def test_get_variants_ascii(analyser, postcode, variants):
- out = analyser.get_variants_ascii(postcode)
+def test_compute_variants(analyser, postcode, variants):
+ out = analyser.compute_variants(postcode)
assert len(out) == len(set(out))
assert set(out) == variants
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
if variant_only:
rules['mode'] = 'variant-only'
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
return module.create(norm, trans, config)
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return proc.get_variants_ascii(norm.transliterate(name).strip())
+ return proc.compute_variants(norm.transliterate(name).strip())
def test_no_variants():
rules = { 'analyzer': 'generic' }
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
proc = module.create(norm, trans, config)
@staticmethod
def configure_rules(*variants):
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
- return module.configure(rules, DEFAULT_NORMALIZATION)
+ trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
+ norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ return module.configure(rules, norm, trans)
def get_replacements(self, *variants):
'mutations': [ {'pattern': m[0], 'replacements': m[1]}
for m in mutations]
}
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
self.analysis = module.create(norm, trans, config)
def variants(self, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return set(self.analysis.get_variants_ascii(norm.transliterate(name).strip()))
+ return set(self.analysis.compute_variants(norm.transliterate(name).strip()))
@pytest.mark.parametrize('pattern', ('(capture)', ['a list']))