are in the process of consolidating the style. The following rules apply:
* Python code uses the official Python style
- * indention
+ * indentation
* SQL uses 2 spaces
* all other file types use 4 spaces
* [BSD style](https://en.wikipedia.org/wiki/Indent_style#Allman_style) for braces
## Development
Vagrant maps the virtual machine's port 8089 to your host machine. Thus you can
-see Nominatim in action on [locahost:8089](http://localhost:8089/nominatim/).
+see Nominatim in action on [localhost:8089](http://localhost:8089/nominatim/).
You edit code on your host machine in any editor you like. There is no need to
restart any software: just refresh your browser window.
endforeach()
ADD_CUSTOM_TARGET(doc
- COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Centos-8.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Centos-8.md
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Ubuntu-18.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Ubuntu-18.md
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Ubuntu-20.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Ubuntu-20.md
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Ubuntu-22.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Ubuntu-22.md
!!! note
The external module is only needed when using the legacy tokenizer.
- If you have choosen the ICU tokenizer, then you can ignore this section
+ If you have chosen the ICU tokenizer, then you can ignore this section
and follow the standard import documentation.
### Option 1: Compiling the library on the database server
### Installing the required packages
-Nginx has no built-in PHP interpreter. You need to use php-fpm as a deamon for
+Nginx has no built-in PHP interpreter. You need to use php-fpm as a daemon for
serving PHP cgi.
On Ubuntu/Debian install nginx and php-fpm with:
* [Ubuntu 20.04](../appendix/Install-on-Ubuntu-20.md)
* [Ubuntu 18.04](../appendix/Install-on-Ubuntu-18.md)
- * [CentOS 8](../appendix/Install-on-Centos-8.md)
These OS-specific instructions can also be found in executable form
in the `vagrant/` directory.
### Software
!!! Warning
- For larger installations you **must have** PostgreSQL 11+ and Postgis 3+
+ For larger installations you **must have** PostgreSQL 11+ and PostGIS 3+
otherwise import and queries will be slow to the point of being unusable.
- Query performance has marked improvements with PostgrSQL 13+ and Postgis 3.2+.
+ Query performance has marked improvements with PostgreSQL 13+ and PostGIS 3.2+.
For compiling:
### Hardware
A minimum of 2GB of RAM is required or installation will fail. For a full
-planet import 64GB of RAM or more are strongly recommended. Do not report
+planet import 128GB of RAM or more are strongly recommended. Do not report
out of memory problems if you have less than 64GB RAM.
-For a full planet install you will need at least 900GB of hard disk space.
+For a full planet install you will need at least 1TB of hard disk space.
Take into account that the OSM database is growing fast.
Fast disks are essential. Using NVME disks is recommended.
fsync = off
full_page_writes = off
-Don't forget to reenable them after the initial import or you risk database
+Don't forget to re-enable them after the initial import or you risk database
corruption.
# If no endpoint is given, then use search.
RewriteRule ^(/|$) "search.php"
- # If format-html is explicity requested, forward to the UI.
+ # If format-html is explicitly requested, forward to the UI.
RewriteCond %{QUERY_STRING} "format=html"
RewriteRule ^([^/]+)(.php)? ui/$1.html [R,END]
a replication source with an update interval that is an order of magnitude
shorter. For example, if you want to update once a day, use an hourly updated
source. This makes sure that you don't miss an entire day of updates when
- the source is unexpectely late to publish its update.
+ the source is unexpectedly late to publish its update.
If you want to use the source with the same update frequency (e.g. a daily
updated source with daily updates), use the
removed and reimported while updating the database with fresh OSM data.
It is thus not useful to treat it as permanent for later use.
-The combination `osm_type`+`osm_id` is slighly better but remember in
+The combination `osm_type`+`osm_id` is slightly better, but remember that in
OpenStreetMap mappers can delete, split, and recreate places (and those
get a new `osm_id`); there is no link between the old and new ids.
Places can also change their meaning without changing their `osm_id`,
* city_district, district, borough, suburb, subdivision
* hamlet, croft, isolated_dwelling
* neighbourhood, allotments, quarter
- * city_block, residental, farm, farmyard, industrial, commercial, retail
+ * city_block, residential, farm, farmyard, industrial, commercial, retail
* road
* house_number, house_name
* emergency, historic, military, natural, landuse, place, railway,
in the [Import section](../admin/Import.md#filtering-imported-data). These
standard styles may be referenced by their name.
-You can also create your own custom syle. Put the style file into your
+You can also create your own custom style. Put the style file into your
project directory and then set `NOMINATIM_IMPORT_STYLE` to the name of the file.
It is always recommended to start with one of the standard styles and customize
those. You find the standard styles under the name `import-<stylename>.style`
Each country is assigned a partition number in the country_name table (see
below) and the data is then split between a set of tables, one for each
partition. Note that Nominatim still manually manages partitioned tables.
-Native support for partitions in PostgreSQL only became useable with version 13.
+Native support for partitions in PostgreSQL only became usable with version 13.
It will be a little while before Nominatim drops support for older versions.

default languages and saves the assignment of countries to partitions.
* `country_osm_grid` provides a fallback for country geometries
-## Auxilary data tables
+## Auxiliary data tables
-Finally there are some table for auxillary data:
+Finally there are some tables for auxiliary data:
* `location_property_tiger` - saves housenumbers from the TIGER import. Its
layout is similar to that of `location_property_osmline`.
# Setting up Nominatim for Development
-This chapter gives an overview how to set up Nominatim for developement
+This chapter gives an overview how to set up Nominatim for development
and how to run tests.
!!! Important
The documentation is built with mkdocs:
* [mkdocs](https://www.mkdocs.org/) >= 1.1.2
-* [mkdocstrings](https://mkdocstrings.github.io/)
+* [mkdocstrings](https://mkdocstrings.github.io/) >= 0.16
+* [mkdocstrings-python-legacy](https://mkdocstrings.github.io/python-legacy/)
### Installing prerequisites on Ubuntu/Debian
--- /dev/null
+# Writing custom sanitizer and token analysis modules for the ICU tokenizer
+
+The [ICU tokenizer](../customize/Tokenizers.md#icu-tokenizer) provides a
+highly customizable method to pre-process and normalize the name information
+of the input data before it is added to the search index. It comes with a
+selection of sanitizers and token analyzers which you can use to adapt your
+installation to your needs. If the provided modules are not enough, you can
+also provide your own implementations. This section describes the API
+of sanitizers and token analysis.
+
+!!! warning
+ This API is currently in early alpha status. While this API is meant to
+ be a public API on which other sanitizers and token analyzers may be
+ implemented, it is not guaranteed to be stable at the moment.
+
+
+## Using non-standard sanitizers and token analyzers
+
+Sanitizer names (in the `step` property) and token analysis names (in the
+`analyzer` property) may refer to externally supplied modules. There are two ways
+to include external modules: through a library or from the project directory.
+
+To include a module from a library, use the absolute import path as name and
+make sure the library can be found in your PYTHONPATH.
+
+To use a custom module without creating a library, you can put the module
+somewhere in your project directory and then use the relative path to the
+file. Include the whole name of the file including the `.py` ending.
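+
+For example, a sanitizer configuration mixing both variants could look like
+this (the module names here are made up for illustration):
+
+``` yaml
+sanitizers:
+    - step: mycompany.nominatim_ext.cleanup  # loaded from PYTHONPATH
+    - step: cleanup.py                       # loaded from the project directory
+```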
+
+## Custom sanitizer modules
+
+A sanitizer module must export a single factory function `create` with the
+following signature:
+
+``` python
+def create(config: SanitizerConfig) -> Callable[[ProcessInfo], None]
+```
+
+The function receives the custom configuration for the sanitizer and must
+return a callable (function or class) that transforms the name and address
+terms of a place. When a place is processed, a `ProcessInfo` object
+is created from the information that was queried from the database. This
+object is sequentially handed to each configured sanitizer, so that each
+sanitizer receives the result of processing from the previous sanitizer.
+After the last sanitizer is finished, the resulting name and address lists
+are forwarded to the token analysis module.
+
+Sanitizer functions are instantiated once and then called for each place
+that is imported or updated. They don't need to be thread-safe.
+If multi-threading is used, each thread creates its own instance of
+the function.
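+
+As a sketch of this lifecycle, the `create()` function can evaluate its
+configuration once and capture the result in the returned callable. The
+`discard-names` option below is invented for illustration; `get_string_list()`
+is part of the `SanitizerConfig` API described in the next section:
+
+``` python
+def create(config):
+    # Read the (hypothetical) 'discard-names' option once, at instantiation time.
+    unwanted = set(config.get_string_list('discard-names'))
+
+    def _filter(obj):
+        # Replacing the list entirely is allowed; here all unwanted names are dropped.
+        obj.names = [name for name in obj.names if name.name not in unwanted]
+
+    return _filter
+```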
+
+### Sanitizer configuration
+
+::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
+ rendering:
+ show_source: no
+ heading_level: 6
+
+### The main filter function of the sanitizer
+
+The filter function receives a single object of type `ProcessInfo`
+which has three members:
+
+ * `place`: read-only information about the place being processed.
+ See PlaceInfo below.
+ * `names`: The current list of names for the place. Each name is a
+ PlaceName object.
+ * `address`: The current list of address names for the place. Each name
+ is a PlaceName object.
+
+While the `place` member is provided for information only, the `names` and
+`address` lists are meant to be manipulated by the sanitizer. It may add and
+remove entries, change information within a single entry (for example by
+adding extra attributes) or completely replace the list with a different one.
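+
+The following sketch adds an extra spelling as a new entry instead of
+modifying an existing one. The replacement rule is made up for illustration;
+`clone()` is part of the `PlaceName` API described below:
+
+``` python
+def create(config):
+    def _add_variants(obj):
+        # Iterate over a copy because the loop appends to the list.
+        for name in list(obj.names):
+            if name.name.startswith('St. '):
+                # Keep the original name and add the expanded spelling as a variant.
+                obj.names.append(name.clone(name='Saint ' + name.name[4:]))
+
+    return _add_variants
+```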
+
+#### PlaceInfo - information about the place
+
+::: nominatim.data.place_info.PlaceInfo
+ rendering:
+ show_source: no
+ heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim.data.place_name.PlaceName
+ rendering:
+ show_source: no
+ heading_level: 6
+
+
+### Example: Filter for US street prefixes
+
+The following sanitizer removes the directional prefixes from street names
+in the US:
+
+``` python
+import re
+
+def _filter_function(obj):
+ if obj.place.country_code == 'us' \
+ and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+ for name in obj.names:
+ name.name = re.sub(r'^(north|south|west|east) ',
+ '',
+ name.name,
+ flags=re.IGNORECASE)
+
+def create(config):
+ return _filter_function
+```
+
+This is the simplest form of a sanitizer module. It defines a single
+filter function and implements the required `create()` function by returning
+the filter.
+
+The filter function first checks if the object is interesting for the
+sanitizer. Namely, it checks if the place is in the US (through `country_code`)
+and if the place is a street (a `rank_address` of 26 or 27). If the
+conditions are met, then it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
+
+Save the source code in a file in your project directory, for example as
+`us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
+
+``` yaml
+...
+sanitizers:
+ - step: us_streets.py
+...
+```
+
+!!! warning
+ This example is just a simplified showcase of how to create a sanitizer.
+ It is not really ready for real-world use: while the sanitizer would
+ correctly transform `West 5th Street` into `5th Street`, it would also
+ shorten a simple `North Street` to `Street`.
+
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
+
+
+## Custom token analysis module
+
+::: nominatim.tokenizer.token_analysis.base.AnalysisModule
+ rendering:
+ show_source: no
+ heading_level: 6
+
+
+::: nominatim.tokenizer.token_analysis.base.Analyzer
+ rendering:
+ show_source: no
+ heading_level: 6
+
+### Example: Creating acronym variants for long names
+
+The following example of a token analysis module creates acronyms from
+very long names and adds them as a variant:
+
+``` python
+class AcronymMaker:
+ """ This class is the actual analyzer.
+ """
+ def __init__(self, norm, trans):
+ self.norm = norm
+ self.trans = trans
+
+
+ def get_canonical_id(self, name):
+ # In simple cases, the normalized name can be used as a canonical id.
+ return self.norm.transliterate(name.name).strip()
+
+
+ def compute_variants(self, name):
+ # The transliterated form of the name always makes up a variant.
+ variants = [self.trans.transliterate(name)]
+
+ # Only create acronyms from very long words.
+ if len(name) > 20:
+ # Take the first letter from each word to form the acronym.
+ acronym = ''.join(w[0] for w in name.split())
+ # If that leads to an acronym with at least three letters,
+ # add the resulting acronym as a variant.
+ if len(acronym) > 2:
+ # Never forget to transliterate the variants before returning them.
+ variants.append(self.trans.transliterate(acronym))
+
+ return variants
+
+# The following two functions are the module interface.
+
+def configure(rules, normalizer, transliterator):
+ # There is no configuration to parse and no data to set up.
+ # Just return an empty configuration.
+ return None
+
+
+def create(normalizer, transliterator, config):
+ # Return a new instance of our token analysis class above.
+ return AcronymMaker(normalizer, transliterator)
+```
+
+Given the name `Trans-Siberian Railway`, the code above would return the full
+name `Trans-Siberian Railway` and the acronym `TSR` as a variant, so that
+searching would work for both.
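+
+To enable the analyzer, save the module in your project directory (the file
+name `acronym_maker.py` is again just an example) and reference it in the
+`token-analysis` section of your `icu_tokenizer.yaml`, for example under its
+own `id`:
+
+``` yaml
+...
+token-analysis:
+    - analyzer: generic
+    - id: "@acronyms"
+      analyzer: acronym_maker.py
+...
+```
+
+Note that an analyzer registered under an `id` is only applied to names where
+a sanitizer has set the corresponding `analyzer` attribute.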
+
+## Sanitizers vs. Token analysis - what to use for variants?
+
+It is not always clear when to implement variations in the sanitizer and
+when to write a token analysis module. Just take the acronym example
+above: it would also have been possible to write a sanitizer which adds the
+acronym as an additional name to the name list. The result would have been
+similar. So which should be used when?
+
+The most important thing to keep in mind is that variants created by the
+token analysis are only saved in the word lookup table. They do not need
+extra space in the search index. If there are many spelling variations, this
+can mean quite a significant amount of space is saved.
+
+When creating additional names with a sanitizer, these names are completely
+independent. In particular, they can be fed into different token analysis
+modules. This gives a much greater flexibility but at the price that the
+additional names increase the size of the search index.
+
If the tokenizer has a default configuration file, it should be saved
under `settings/<NAME>_tokenizer.<SUFFIX>`.
-### Configuration and Persistance
+### Configuration and Persistence
Tokenizers may define custom settings for their configuration. All settings
must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
## US Census TIGER
-For the United States you can choose to import additonal street-level data.
+For the United States you can choose to import additional street-level data.
The data isn't mixed into OSM data but queried as a fallback when no OSM
result can be found.
background-color: #eee;
}
-/* Indentation for mkdocstrings.
-div.doc-contents:not(.first) {
- padding-left: 25px;
- border-left: 4px solid rgba(230, 230, 230);
- margin-bottom: 60px;
-}*/
+.doc-object h6 {
+ margin-bottom: 0.8em;
+ font-size: 120%;
+}
+.doc-object {
+ margin-bottom: 1.3em;
+}
- 'Database Layout' : 'develop/Database-Layout.md'
- 'Indexing' : 'develop/Indexing.md'
- 'Tokenizers' : 'develop/Tokenizers.md'
+ - 'Custom modules for ICU tokenizer': 'develop/ICU-Tokenizer-Modules.md'
- 'Setup for Development' : 'develop/Development-Environment.md'
- 'Testing' : 'develop/Testing.md'
- 'External Data Sources': 'develop/data-sources.md'
- search
- mkdocstrings:
handlers:
- python:
+ python-legacy:
rendering:
show_source: false
show_signature_annotations: false
$this->bFallback = $oParams->getBool('fallback', $this->bFallback);
- // List of excluded Place IDs - used for more acurate pageing
+ // List of excluded Place IDs - used for more accurate paging
$sExcluded = $oParams->getStringList('exclude_place_ids');
if ($sExcluded) {
foreach ($sExcluded as $iExcludedPlaceID) {
public function getBool($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName]) || strlen($this->aParams[$sName]) == 0) {
+ if (!isset($this->aParams[$sName])
+ || !is_string($this->aParams[$sName])
+ || strlen($this->aParams[$sName]) == 0
+ ) {
return $bDefault;
}
public function getInt($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName])) {
+ if (!isset($this->aParams[$sName]) || is_array($this->aParams[$sName])) {
return $bDefault;
}
public function getFloat($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName])) {
+ if (!isset($this->aParams[$sName]) || is_array($this->aParams[$sName])) {
return $bDefault;
}
public function getString($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName]) || strlen($this->aParams[$sName]) == 0) {
+ if (!isset($this->aParams[$sName])
+ || !is_string($this->aParams[$sName])
+ || strlen($this->aParams[$sName]) == 0
+ ) {
return $bDefault;
}
public function getSet($sName, $aValues, $sDefault = false)
{
- if (!isset($this->aParams[$sName]) || strlen($this->aParams[$sName]) == 0) {
+ if (!isset($this->aParams[$sName])
+ || !is_string($this->aParams[$sName])
+ || strlen($this->aParams[$sName]) == 0
+ ) {
return $sDefault;
}
}
/**
- * Get the orginal phrase of the string.
+ * Get the original phrase of the string.
*/
public function getPhrase()
{
// starts if the search is on POI or street level,
// searches for the nearest POI or street,
// if a street is found and a POI is searched for,
- // the nearest POI which the found street is a parent of is choosen.
+ // the nearest POI which the found street is a parent of is chosen.
$sSQL = 'select place_id,parent_place_id,rank_address,country_code,';
$sSQL .= ' ST_distance('.$sPointSQL.', geometry) as distance';
$sSQL .= ' FROM ';
// We can't reliably go from the closest street to an
// interpolation line because the closest interpolation
// may have a different street segment as a parent.
- // Therefore allow an interpolation line to take precendence
+ // Therefore allow an interpolation line to take precedence
// even when the street is closer.
$fDistance = $iRankAddress < 28 ? 0.001 : $aPlace['distance'];
}
* Add the given full-word token to the list of terms to search for in the
* name.
*
- * @param interger iId ID of term to add.
+ * @param integer iId ID of term to add.
* @param bool bRareName True if the term is infrequent enough to not
* require other constraints for efficient search.
*/
*
* @return mixed[] An array with two fields: IDs contains the list of
* matching place IDs and houseNumber the houseNumber
- * if appicable or -1 if not.
+ * if applicable or -1 if not.
*/
public function query(&$oDB, $iMinRank, $iMaxRank, $iLimit)
{
public function extendSearch($oSearch, $oPosition)
{
// Full words can only be a name if they appear at the beginning
- // of the phrase. In structured search the name must forcably in
+ // of the phrase. In structured search the name must forcibly be in
// the first phrase. In unstructured search it may be in a later
// phrase when the first phrase is a house number.
if ($oSearch->hasName()
showUsage($aSpec, $bExitOnError, 'Option \''.$aLine[0].'\' is missing');
}
if ($aCounts[$aLine[0]] > $aLine[3]) {
- showUsage($aSpec, $bExitOnError, 'Option \''.$aLine[0].'\' is pressent too many times');
+ showUsage($aSpec, $bExitOnError, 'Option \''.$aLine[0].'\' is present too many times');
}
if ($aLine[6] == 'bool' && !array_key_exists($aLine[0], $aResult)) {
$aResult[$aLine[0]] = false;
function loadSettings($sProjectDir)
{
@define('CONST_InstallDir', $sProjectDir);
- // Temporary hack to set the direcory via environment instead of
+ // Temporary hack to set the directory via environment instead of
// the installed scripts. Neither setting is part of the official
// set of settings.
defined('CONST_ConfigDir') or define('CONST_ConfigDir', $_SERVER['NOMINATIM_CONFIGDIR']);
$aLinkedLines = $oDB->getAll($sSQL);
}
-// All places this is an imediate parent of
+// All places this is an immediate parent of
$aHierarchyLines = false;
if ($bIncludeHierarchy) {
$sSQL = 'SELECT obj.place_id, osm_type, osm_id, class, type, housenumber,';
centroid GEOMETRY
);
--- feature intersects geoemtry
+-- feature intersects geometry
-- for areas and linestrings they must touch at least along a line
CREATE OR REPLACE FUNCTION is_relevant_geometry(de9im TEXT, geom_type TEXT)
RETURNS BOOLEAN
and rank_search = 30 AND ST_GeometryType(geometry) in ('ST_Polygon','ST_MultiPolygon')
LIMIT 1;
ELSE
- -- See if we can inherit addtional address tags from an interpolation.
+ -- See if we can inherit additional address tags from an interpolation.
-- These will become permanent.
FOR location IN
SELECT (address - 'interpolation'::text - 'housenumber'::text) as address
{% if debug %}RAISE WARNING 'Using full index mode for % %', NEW.osm_type, NEW.osm_id;{% endif %}
IF linked_place is not null THEN
-- Recompute the ranks here as the ones from the linked place might
- -- have been shifted to accomodate surrounding boundaries.
+ -- have been shifted to accommodate surrounding boundaries.
SELECT place_id, osm_id, class, type, extratags,
centroid, geometry,
(compute_place_rank(country_code, osm_type, class, type, admin_level,
THEN
-- Update the list of country names.
-- Only take the name from the largest area for the given country code
- -- in the hope that this is the authoritive one.
+ -- in the hope that this is the authoritative one.
-- Also replace any old names so that all mapping mistakes can
-- be fixed through regular OSM updates.
FOR location IN
NEW.postcode := get_nearest_postcode(NEW.country_code, NEW.geometry);
END IF;
- {% if debug %}RAISE WARNING 'place update % % finsihed.', NEW.osm_type, NEW.osm_id;{% endif %}
+ {% if debug %}RAISE WARNING 'place update % % finished.', NEW.osm_type, NEW.osm_id;{% endif %}
NEW.token_info := token_strip_info(NEW.token_info);
RETURN NEW;
#!/bin/sh
#
-# Plugin to monitor the types of requsts made to the API
+# Plugin to monitor the types of requests made to the API
#
# Can be configured through libpq environment variables, for example
# PGUSER, PGDATABASE, etc. See man page of psql for more information.
Nominatim configuration accessor.
"""
from typing import Dict, Any, List, Mapping, Optional
+import importlib.util
import logging
import os
+import sys
from pathlib import Path
import json
import yaml
data: Path
self.lib_dir = _LibDirs()
+ self._private_plugins: Dict[str, object] = {}
def set_libdirs(self, **kwargs: StrPath) -> None:
config: Optional[str] = None) -> Any:
""" Load additional configuration from a file. `filename` is the name
of the configuration file. The file is first searched in the
- project directory and then in the global settings dirctory.
+ project directory and then in the global settings directory.
If `config` is set, then the name of the configuration file can
be additionally given through a .env configuration option. When
return result
+ def load_plugin_module(self, module_name: str, internal_path: str) -> Any:
+ """ Load a Python module as a plugin.
+
+ The module_name may have three variants:
+
+ * A name without any '.' is assumed to be an internal module
+ and will be searched relative to `internal_path`.
+ * If the name ends in `.py`, module_name is assumed to be a
+ file name relative to the project directory.
+ * Any other name is assumed to be an absolute module name.
+
+ In all three variants the module name must start with a letter.
+ """
+ if not module_name or not module_name[0].isidentifier():
+ raise UsageError(f'Invalid module name {module_name}')
+
+ if '.' not in module_name:
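+ # Internal module names may use dashes; the corresponding file uses underscores.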
+ module_name = module_name.replace('-', '_')
+ full_module = f'{internal_path}.{module_name}'
+ return sys.modules.get(full_module) or importlib.import_module(full_module)
+
+ if module_name.endswith('.py'):
+ if self.project_dir is None or not (self.project_dir / module_name).exists():
+ raise UsageError(f"Cannot find module '{module_name}' in project directory.")
+
+ if module_name in self._private_plugins:
+ return self._private_plugins[module_name]
+
+ file_path = str(self.project_dir / module_name)
+ spec = importlib.util.spec_from_file_location(module_name, file_path)
+ if spec:
+ module = importlib.util.module_from_spec(spec)
+ # Do not add to global modules because there is no standard
+ # module name that Python can resolve.
+ self._private_plugins[module_name] = module
+ assert spec.loader is not None
+ spec.loader.exec_module(module)
+
+ return module
+
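+ # Any other name: return an already loaded module or import it from PYTHONPATH.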
+ return sys.modules.get(module_name) or importlib.import_module(module_name)
+
+
def find_config_file(self, filename: StrPath,
config: Optional[str] = None) -> Path:
""" Resolve the location of a configuration file given a filename and
""" Handler for the '!include' operator in YAML files.
When the filename is relative, then the file is first searched in the
- project directory and then in the global settings dirctory.
+ project directory and then in the global settings directory.
"""
fname = loader.construct_scalar(node)
from typing import Optional, Mapping, Any
class PlaceInfo:
- """ Data class containing all information the tokenizer gets about a
- place it should process the names for.
+ """ This data class contains all information the tokenizer can access
+ about a place.
"""
def __init__(self, info: Mapping[str, Any]) -> None:
@property
def name(self) -> Optional[Mapping[str, str]]:
- """ A dictionary with the names of the place or None if the place
- has no names.
+ """ A dictionary with the names of the place. Keys and values represent
+ the full key and value of the corresponding OSM tag. Which tags
+ are saved as names is determined by the import style.
+ The property may be None if the place has no names.
"""
return self._info.get('name')
@property
def address(self) -> Optional[Mapping[str, str]]:
- """ A dictionary with the address elements of the place
- or None if no address information is available.
+ """ A dictionary with the address elements of the place. They key
+ usually corresponds to the suffix part of the key of an OSM
+ 'addr:*' or 'isin:*' tag. There are also some special keys like
+ `country` or `country_code` which merge OSM keys that contain
+ the same information. See [Import Styles][1] for details.
+
+ The property may be None if the place has no address information.
+
+ [1]: ../customize/Import-Styles.md
"""
return self._info.get('address')
@property
def country_code(self) -> Optional[str]:
""" The country code of the country the place is in. Guaranteed
- to be a two-letter lower-case string or None, if no country
- could be found.
+ to be a two-letter lower-case string. If the place is not inside
+ any country, the property is set to None.
"""
return self._info.get('country_code')
@property
def rank_address(self) -> int:
- """ The computed rank address before rank correction.
+ The [rank address][1] before any rank correction is applied.
+
+ [1]: ../customize/Ranking.md#address-rank
"""
return self._info.get('rank_address', 0)
def is_a(self, key: str, value: str) -> bool:
- """ Check if the place's primary tag corresponds to the given
+ """ Set to True when the place's primary tag corresponds to the given
key and value.
"""
return self._info.get('class') == key and self._info.get('type') == value
def is_country(self) -> bool:
- """ Check if the place is a valid country boundary.
+ """ Set to True when the place is a valid country boundary.
"""
return self.rank_address == 4 \
and self.is_a('boundary', 'administrative') \
--- /dev/null
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Data class for a single name of a place.
+"""
+from typing import Optional, Dict, Mapping
+
+class PlaceName:
+ """ Each name and address part of a place is encapsulated in an object of
+ this class. It saves not only the name proper but also describes the
+ kind of name with two properties:
+
+ * `kind` describes the name of the OSM key used without any suffixes
+ (i.e. the part after the colon removed); for a `name:de` tag, the
+ kind is `name`
+ * `suffix` contains the suffix of the OSM tag, if any. The suffix
+ is the part of the key after the first colon (`de` in the example
+ above).
+
+ In addition to that, a name may have arbitrary additional attributes.
+ How attributes are used depends on the sanitizers and token analysers.
+ The exception is the 'analyzer' attribute. This attribute determines
+ which token analysis module will be used to finalize the treatment of
+ names.
+ """
+
+ def __init__(self, name: str, kind: str, suffix: Optional[str]):
+ self.name = name
+ self.kind = kind
+ self.suffix = suffix
+ self.attr: Dict[str, str] = {}
+
+
+ def __repr__(self) -> str:
+ return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')"
+
+
+ def clone(self, name: Optional[str] = None,
+ kind: Optional[str] = None,
+ suffix: Optional[str] = None,
+ attr: Optional[Mapping[str, str]] = None) -> 'PlaceName':
+ """ Create a deep copy of the place name, optionally with the
+ given parameters replaced. In the attribute list only the given
+ keys are updated. The list is not replaced completely.
+ In particular, the function cannot be used to remove an
+ attribute from a place name.
+ """
+ newobj = PlaceName(name or self.name,
+ kind or self.kind,
+ suffix or self.suffix)
+
+ newobj.attr.update(self.attr)
+ if attr:
+ newobj.attr.update(attr)
+
+ return newobj
+
+
+ def set_attr(self, key: str, value: str) -> None:
+ """ Add the given property to the name. If the property was already
+ set, then the value is overwritten.
+ """
+ self.attr[key] = value
+
+
+ def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]:
+ """ Return the given property or the value of 'default' if it
+ is not set.
+ """
+ return self.attr.get(key, default)
+
+
+ def has_attr(self, key: str) -> bool:
+ """ Check if the given attribute is set.
+ """
+ return key in self.attr
def drop_table(self, name: str, if_exists: bool = True, cascade: bool = False) -> None:
""" Drop the table with the given name.
- Set `if_exists` to False if a non-existant table should raise
+ Set `if_exists` to False if a non-existent table should raise
an exception instead of just being ignored. If 'cascade' is set
to True then all dependent tables are deleted as well.
"""
def drop_table(self, name: str, if_exists: bool = True, cascade: bool = False) -> None:
""" Drop the table with the given name.
- Set `if_exists` to False if a non-existant table should raise
+ Set `if_exists` to False if a non-existent table should raise
an exception instead of just being ignored.
"""
with self.cursor() as cur:
from nominatim.db.connection import Connection
def set_property(conn: Connection, name: str, value: str) -> None:
- """ Add or replace the propery with the given name.
+ """ Add or replace the property with the given name.
"""
with conn.cursor() as cur:
cur.execute('SELECT value FROM nominatim_properties WHERE property = %s',
def index_postcodes(self) -> None:
- """Index the entries ofthe location_postcode table.
+ """Index the entries of the location_postcode table.
"""
LOG.warning("Starting indexing postcodes using %s threads", self.num_threads)
# asynchronously get the next batch
has_more = fetcher.fetch_next_batch(cur, runner)
- # And insert the curent batch
+ # And insert the current batch
for idx in range(0, len(places), batch):
part = places[idx:idx + batch]
LOG.debug("Processing places: %s", str(part))
""" Tracks and prints progress for the indexing process.
`name` is the name of the indexing step being tracked.
`total` sets up the total number of items that need processing.
- `log_interval` denotes the interval in seconds at which progres
+ `log_interval` denotes the interval in seconds at which progress
should be reported.
"""
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
-Abstract class defintions for tokenizers. These base classes are here
+Abstract class definitions for tokenizers. These base classes are here
mainly for documentation purposes.
"""
from abc import ABC, abstractmethod
the search index.
Arguments:
- place: Place information retrived from the database.
+ place: Place information retrieved from the database.
Returns:
A JSON-serialisable structure that will be handed into
init_db: When set to False, then initialisation of database
tables should be skipped. This option is only required for
- migration purposes and can be savely ignored by custom
+ migration purposes and can be safely ignored by custom
tokenizers.
TODO: can we move the init_db parameter somewhere else?
existing database.
A tokenizer is something that is bound to the lifetime of a database. It
-can be choosen and configured before the intial import but then needs to
+can be chosen and configured before the initial import but then needs to
be used consistently when querying and updating the database.
This module provides the functions to create and configure a new tokenizer
-as well as instanciating the appropriate tokenizer for updating an existing
+as well as instantiating the appropriate tokenizer for updating an existing
database.
A tokenizer usually also includes PHP code for querying. The appropriate PHP
Helper class to create ICU rules from a configuration file.
"""
from typing import Mapping, Any, Dict, Optional
-import importlib
import io
import json
import logging
+from icu import Transliterator
+
from nominatim.config import flatten_config_list, Configuration
from nominatim.db.properties import set_property, get_property
from nominatim.db.connection import Connection
from nominatim.errors import UsageError
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
-from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyser
+from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyzer
import nominatim.data.country_info
LOG = logging.getLogger()
"""
def __init__(self, config: Configuration) -> None:
+ self.config = config
rules = config.load_sub_configuration('icu_tokenizer.yaml',
config='TOKENIZER_CONFIG')
def make_sanitizer(self) -> PlaceSanitizer:
""" Create a place sanitizer from the configured rules.
"""
- return PlaceSanitizer(self.sanitizer_rules)
+ return PlaceSanitizer(self.sanitizer_rules, self.config)
def make_token_analysis(self) -> ICUTokenAnalysis:
if not isinstance(self.analysis_rules, list):
raise UsageError("Configuration section 'token-analysis' must be a list.")
+ norm = Transliterator.createFromRules("rule_loader_normalization",
+ self.normalization_rules)
+ trans = Transliterator.createFromRules("rule_loader_transliteration",
+ self.transliteration_rules)
+
for section in self.analysis_rules:
name = section.get('id', None)
if name in self.analysis:
LOG.fatal("ICU tokenizer configuration has two token "
"analyzers with id '%s'.", name)
raise UsageError("Syntax error in ICU tokenizer config.")
- self.analysis[name] = TokenAnalyzerRule(section, self.normalization_rules)
+ self.analysis[name] = TokenAnalyzerRule(section, norm, trans,
+ self.config)
@staticmethod
and creates a new token analyzer on request.
"""
- def __init__(self, rules: Mapping[str, Any], normalization_rules: str) -> None:
- # Find the analysis module
- module_name = 'nominatim.tokenizer.token_analysis.' \
- + _get_section(rules, 'analyzer').replace('-', '_')
- self._analysis_mod: AnalysisModule = importlib.import_module(module_name)
+ def __init__(self, rules: Mapping[str, Any],
+ normalizer: Any, transliterator: Any,
+ config: Configuration) -> None:
+ analyzer_name = _get_section(rules, 'analyzer')
+ if not analyzer_name or not isinstance(analyzer_name, str):
+ raise UsageError("'analyzer' parameter needs to be simple string")
+
+ self._analysis_mod: AnalysisModule = \
+ config.load_plugin_module(analyzer_name, 'nominatim.tokenizer.token_analysis')
+
+ self.config = self._analysis_mod.configure(rules, normalizer,
+ transliterator)
- # Load the configuration.
- self.config = self._analysis_mod.configure(rules, normalization_rules)
- def create(self, normalizer: Any, transliterator: Any) -> Analyser:
+ def create(self, normalizer: Any, transliterator: Any) -> Analyzer:
""" Create a new analyser instance for the given rule.
"""
return self._analysis_mod.create(normalizer, transliterator, self.config)
from typing import Mapping, Optional, TYPE_CHECKING
from icu import Transliterator
-from nominatim.tokenizer.token_analysis.base import Analyser
+from nominatim.tokenizer.token_analysis.base import Analyzer
if TYPE_CHECKING:
from typing import Any
class ICUTokenAnalysis:
""" Container class collecting the transliterators and token analysis
- modules for a single NameAnalyser instance.
+ modules for a single Analyser instance.
"""
def __init__(self, norm_rules: str, trans_rules: str,
for name, arules in analysis_rules.items()}
- def get_analyzer(self, name: Optional[str]) -> Analyser:
+ def get_analyzer(self, name: Optional[str]) -> Analyzer:
""" Return the given named analyzer. If no analyzer with that
name exists, return the default analyzer.
"""
from nominatim.data.place_info import PlaceInfo
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
-from nominatim.tokenizer.sanitizers.base import PlaceName
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
from nominatim.tokenizer.base import AbstractAnalyzer, AbstractTokenizer
class ICUTokenizer(AbstractTokenizer):
- """ This tokenizer uses libICU to covert names and queries to ASCII.
+ """ This tokenizer uses libICU to convert names and queries to ASCII.
Otherwise it uses the same algorithms and data structures as the
normalization routines in Nominatim 3.
"""
postcode_name = place.name.strip().upper()
variant_base = None
else:
- postcode_name = analyzer.normalize(place.name)
+ postcode_name = analyzer.get_canonical_id(place)
variant_base = place.get_attr("variant")
if variant_base:
if analyzer is None:
variants = [term]
else:
- variants = analyzer.get_variants_ascii(variant)
+ variants = analyzer.compute_variants(variant)
if term not in variants:
variants.append(term)
else:
def _remove_special_phrases(self, cursor: Cursor,
new_phrases: Set[Tuple[str, str, str, str]],
existing_phrases: Set[Tuple[str, str, str, str]]) -> int:
- """ Remove all phrases from the databse that are no longer in the
+ """ Remove all phrases from the database that are no longer in the
new phrase list.
"""
to_delete = existing_phrases - new_phrases
# Otherwise use the analyzer to determine the canonical name.
# Per convention we use the first variant as the 'lookup name', the
# name that gets saved in the housenumber field of the place.
- norm_name = analyzer.normalize(hnr.name)
- if norm_name:
- result = self._cache.housenumbers.get(norm_name, result)
+ word_id = analyzer.get_canonical_id(hnr)
+ if word_id:
+ result = self._cache.housenumbers.get(word_id, result)
if result[0] is None:
- variants = analyzer.get_variants_ascii(norm_name)
+ variants = analyzer.compute_variants(word_id)
if variants:
with self.conn.cursor() as cur:
cur.execute("SELECT create_analyzed_hnr_id(%s, %s)",
- (norm_name, list(variants)))
+ (word_id, list(variants)))
result = cur.fetchone()[0], variants[0] # type: ignore[no-untyped-call]
- self._cache.housenumbers[norm_name] = result
+ self._cache.housenumbers[word_id] = result
return result
def _retrieve_full_tokens(self, name: str) -> List[int]:
""" Get the full name token for the given name, if it exists.
- The name is only retrived for the standard analyser.
+ The name is only retrieved for the standard analyser.
"""
assert self.conn is not None
norm_name = self._search_normalized(name)
for name in names:
analyzer_id = name.get_attr('analyzer')
analyzer = self.token_analysis.get_analyzer(analyzer_id)
- norm_name = analyzer.normalize(name.name)
+ word_id = analyzer.get_canonical_id(name)
if analyzer_id is None:
- token_id = norm_name
+ token_id = word_id
else:
- token_id = f'{norm_name}@{analyzer_id}'
+ token_id = f'{word_id}@{analyzer_id}'
full, part = self._cache.names.get(token_id, (None, None))
if full is None:
- variants = analyzer.get_variants_ascii(norm_name)
+ variants = analyzer.compute_variants(word_id)
if not variants:
continue
postcode_name = item.name.strip().upper()
variant_base = None
else:
- postcode_name = analyzer.normalize(item.name)
+ postcode_name = analyzer.get_canonical_id(item)
variant_base = item.get_attr("variant")
if variant_base:
variants = {term}
if analyzer is not None and variant_base:
- variants.update(analyzer.get_variants_ascii(variant_base))
+ variants.update(analyzer.compute_variants(variant_base))
with self.conn.cursor() as cur:
cur.execute("SELECT create_postcode_word(%s, %s)",
is handed to the token analysis.
"""
from typing import Optional, List, Mapping, Sequence, Callable, Any, Tuple
-import importlib
from nominatim.errors import UsageError
+from nominatim.config import Configuration
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
-from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo, PlaceName
+from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo
+from nominatim.data.place_name import PlaceName
from nominatim.data.place_info import PlaceInfo
names and address before they are used by the token analysers.
"""
- def __init__(self, rules: Optional[Sequence[Mapping[str, Any]]]) -> None:
+ def __init__(self, rules: Optional[Sequence[Mapping[str, Any]]],
+ config: Configuration) -> None:
self.handlers: List[Callable[[ProcessInfo], None]] = []
if rules:
for func in rules:
if 'step' not in func:
raise UsageError("Sanitizer rule is missing the 'step' attribute.")
- module_name = 'nominatim.tokenizer.sanitizers.' + func['step'].replace('-', '_')
- handler_module: SanitizerHandler = importlib.import_module(module_name)
- self.handlers.append(handler_module.create(SanitizerConfig(func)))
+ if not isinstance(func['step'], str):
+ raise UsageError("'step' attribute must be a simple string.")
+
+ module: SanitizerHandler = \
+ config.load_plugin_module(func['step'], 'nominatim.tokenizer.sanitizers')
+
+ self.handlers.append(module.create(SanitizerConfig(func)))
def process_names(self, place: PlaceInfo) -> Tuple[List[PlaceName], List[PlaceName]]:
"""
Common data types and protocols for sanitizers.
"""
-from typing import Optional, Dict, List, Mapping, Callable
+from typing import Optional, List, Mapping, Callable
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
from nominatim.data.place_info import PlaceInfo
+from nominatim.data.place_name import PlaceName
from nominatim.typing import Protocol, Final
-class PlaceName:
- """ A searchable name for a place together with properties.
- Every name object saves the name proper and two basic properties:
- * 'kind' describes the name of the OSM key used without any suffixes
- (i.e. the part after the colon removed)
- * 'suffix' contains the suffix of the OSM tag, if any. The suffix
- is the part of the key after the first colon.
- In addition to that, the name may have arbitrary additional attributes.
- Which attributes are used, depends on the token analyser.
- """
-
- def __init__(self, name: str, kind: str, suffix: Optional[str]):
- self.name = name
- self.kind = kind
- self.suffix = suffix
- self.attr: Dict[str, str] = {}
-
-
- def __repr__(self) -> str:
- return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')"
-
-
- def clone(self, name: Optional[str] = None,
- kind: Optional[str] = None,
- suffix: Optional[str] = None,
- attr: Optional[Mapping[str, str]] = None) -> 'PlaceName':
- """ Create a deep copy of the place name, optionally with the
- given parameters replaced. In the attribute list only the given
- keys are updated. The list is not replaced completely.
- In particular, the function cannot to be used to remove an
- attribute from a place name.
- """
- newobj = PlaceName(name or self.name,
- kind or self.kind,
- suffix or self.suffix)
-
- newobj.attr.update(self.attr)
- if attr:
- newobj.attr.update(attr)
-
- return newobj
-
-
- def set_attr(self, key: str, value: str) -> None:
- """ Add the given property to the name. If the property was already
- set, then the value is overwritten.
- """
- self.attr[key] = value
-
-
- def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]:
- """ Return the given property or the value of 'default' if it
- is not set.
- """
- return self.attr.get(key, default)
-
-
- def has_attr(self, key: str) -> bool:
- """ Check if the given attribute is set.
- """
- return key in self.attr
-
class ProcessInfo:
""" Container class for information handed into to handler functions.
def create(self, config: SanitizerConfig) -> Callable[[ProcessInfo], None]:
"""
- A sanitizer must define a single function `create`. It takes the
- dictionary with the configuration information for the sanitizer and
- returns a function that transforms name and address.
+ Create a function for sanitizing a place.
+
+ Arguments:
+ config: A dictionary with the additional configuration options
+ specified in the tokenizer configuration
+
+ Returns:
+ The result must be a callable that takes a place description
+ and transforms name and address as required.
"""
from typing import Callable, Iterator, List
import re
-from nominatim.tokenizer.sanitizers.base import ProcessInfo, PlaceName
+from nominatim.tokenizer.sanitizers.base import ProcessInfo
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
class _HousenumberSanitizer:
def scan(self, postcode: str, country: Optional[str]) -> Optional[Tuple[str, str]]:
""" Check the postcode for correct formatting and return the
normalized version. Returns None if the postcode does not
- correspond to the oficial format of the given country.
+ correspond to the official format of the given country.
"""
match = self.matcher.match(country, postcode)
if match is None:
_BaseUserDict = UserDict
class SanitizerConfig(_BaseUserDict):
- """ Dictionary with configuration options for a sanitizer.
-
- In addition to the usual dictionary function, the class provides
- accessors to standard sanatizer options that are used by many of the
+ """ The `SanitizerConfig` class is a read-only dictionary
+ with configuration options for the sanitizer.
+ In addition to the usual dictionary functions, the class provides
+ accessors to standard sanitizer options that are used by many of the
sanitizers.
"""
def get_string_list(self, param: str, default: Sequence[str] = tuple()) -> Sequence[str]:
""" Extract a configuration parameter as a string list.
- If the parameter value is a simple string, it is returned as a
- one-item list. If the parameter value does not exist, the given
- default is returned. If the parameter value is a list, it is checked
- to contain only strings before being returned.
+
+ Arguments:
+ param: Name of the configuration parameter.
+ default: Value to return, when the parameter is missing.
+
+ Returns:
+ If the parameter value is a simple string, it is returned as a
+ one-item list. If the parameter value does not exist, the given
+ default is returned. If the parameter value is a list, it is
+ checked to contain only strings before being returned.
"""
values = self.data.get(param, None)
def get_bool(self, param: str, default: Optional[bool] = None) -> bool:
""" Extract a configuration parameter as a boolean.
- The parameter must be one of the yaml boolean values or an
- user error will be raised. If `default` is given, then the parameter
- may also be missing or empty.
+
+ Arguments:
+ param: Name of the configuration parameter. The parameter must
+ contain one of the yaml boolean values or an
+ UsageError will be raised.
+ default: Value to return, when the parameter is missing.
+ When set to `None`, the parameter must be defined.
+
+ Returns:
+ Boolean value of the given parameter.
"""
value = self.data.get(param, default)
def get_delimiter(self, default: str = ',;') -> Pattern[str]:
- """ Return the 'delimiter' parameter in the configuration as a
- compiled regular expression that can be used to split the names on the
- delimiters. The regular expression makes sure that the resulting names
- are stripped and that repeated delimiters
- are ignored but it will still create empty fields on occasion. The
- code needs to filter those.
-
- The 'default' parameter defines the delimiter set to be used when
- not explicitly configured.
+ """ Return the 'delimiters' parameter in the configuration as a
+ compiled regular expression that can be used to split strings on
+ these delimiters.
+
+ Arguments:
+ default: Delimiters to be used when 'delimiters' parameter
+ is not explicitly configured.
+
+ Returns:
+ A regular expression pattern which can be used to
+ split a string. The regular expression makes sure that the
+ resulting names are stripped and that repeated delimiters
+ are ignored. It may still create empty fields on occasion. The
+ code needs to filter those.
"""
delimiter_set = set(self.data.get('delimiters', default))
if not delimiter_set:
def get_filter_kind(self, *default: str) -> Callable[[str], bool]:
""" Return a filter function for the name kind from the 'filter-kind'
- config parameter. The filter functions takes a name item and returns
- True when the item passes the filter.
+ config parameter.
- If the parameter is empty, the filter lets all items pass. If the
- paramter is a string, it is interpreted as a single regular expression
- that must match the full kind string. If the parameter is a list then
+ If the 'filter-kind' parameter is empty, the filter lets all items
+ pass. If the parameter is a string, it is interpreted as a single
+ regular expression that must match the full kind string.
+ If the parameter is a list then
any of the regular expressions in the list must match to pass.
+
+ Arguments:
+ default: Filters to be used, when the 'filter-kind' parameter
+ is not specified. If omitted then the default is to
+ let all names pass.
+
+ Returns:
+ A filter function which takes a name string and returns
+ True when the item passes the filter.
"""
filters = self.get_string_list('filter-kind', default)
from typing import Mapping, List, Any
from nominatim.typing import Protocol
+from nominatim.data.place_name import PlaceName
-class Analyser(Protocol):
- """ Instance of the token analyser.
+class Analyzer(Protocol):
+ """ The `create()` function of an analysis module needs to return an
+ object that implements the following functions.
"""
- def normalize(self, name: str) -> str:
- """ Return the normalized form of the name. This is the standard form
- from which possible variants for the name can be derived.
+ def get_canonical_id(self, name: PlaceName) -> str:
+ """ Return the canonical form of the given name. The canonical ID must
+ be unique (the same ID must always yield the same variants) and
+ must be a form from which the variants can be derived.
+
+ Arguments:
+ name: Extended place name description as prepared by
+ the sanitizers.
+
+ Returns:
+ ID string with a canonical form of the name. The string may
+ be empty, when the analyzer cannot analyze the name at all,
+ for example because the character set in use does not match.
"""
- def get_variants_ascii(self, norm_name: str) -> List[str]:
- """ Compute the spelling variants for the given normalized name
- and transliterate the result.
+ def compute_variants(self, canonical_id: str) -> List[str]:
+ """ Compute the transliterated spelling variants for the given
+ canonical ID.
+
+ Arguments:
+ canonical_id: ID string previously computed with
+ `get_canonical_id()`.
+
+ Returns:
+ A list of possible spelling variants. All strings must have
+ been transformed with the global normalizer and
+ transliterator ICU rules. Otherwise they cannot be matched
+ against the input by the query frontend.
+ The list may be empty when there are no useful
+ spelling variants. This may happen when an analyzer
+ usually only outputs additional variants to the canonical
+ spelling and there are no such variants.
"""
+
class AnalysisModule(Protocol):
- """ Protocol for analysis modules.
+ """ The setup of the token analysis is split into two parts:
+ configuration and analyser factory. A token analysis module must
+ therefore implement the two functions described below.
"""
- def configure(self, rules: Mapping[str, Any], normalization_rules: str) -> Any:
+ def configure(self, rules: Mapping[str, Any],
+ normalizer: Any, transliterator: Any) -> Any:
""" Prepare the configuration of the analysis module.
This function should prepare all data that can be shared
between instances of this analyser.
+
+ Arguments:
+ rules: A dictionary with the additional configuration options
+ as specified in the tokenizer configuration.
+ normalizer: an ICU Transliterator with the compiled
+ global normalization rules.
+ transliterator: an ICU Transliterator with the compiled
+ global transliteration rules.
+
+ Returns:
+ A data object with configuration data. This will be handed
+ as is into the `create()` function and may be
+ used freely by the analysis module as needed.
"""
- def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyser:
+ def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyzer:
""" Create a new instance of the analyser.
A separate instance of the analyser is created for each thread
when used in multi-threading context.
+
+ Arguments:
+ normalizer: an ICU Transliterator with the compiled normalization
+ rules.
+ transliterator: an ICU Transliterator with the compiled
+ transliteration rules.
+ config: The object that was returned by the call to configure().
+
+ Returns:
+ A new analyzer instance. This must be an object that implements
+ the Analyzer protocol.
"""
import itertools
import re
-from icu import Transliterator
-
from nominatim.config import flatten_config_list
from nominatim.errors import UsageError
def get_variant_config(in_rules: Any,
- normalization_rules: str) -> Tuple[List[Tuple[str, List[str]]], str]:
+ normalizer: Any) -> Tuple[List[Tuple[str, List[str]]], str]:
""" Convert the variant definition from the configuration into
replacement sets.
vset: Set[ICUVariant] = set()
rules = flatten_config_list(in_rules, 'variants')
- vmaker = _VariantMaker(normalization_rules)
+ vmaker = _VariantMaker(normalizer)
for section in rules:
for rule in (section.get('words') or []):
class _VariantMaker:
- """ Generater for all necessary ICUVariants from a single variant rule.
+ """ Generator for all necessary ICUVariants from a single variant rule.
All text in rules is normalized to make sure the variants match later.
"""
- def __init__(self, norm_rules: Any) -> None:
- self.norm = Transliterator.createFromRules("rule_loader_normalization",
- norm_rules)
+ def __init__(self, normalizer: Any) -> None:
+ self.norm = normalizer
def compute(self, rule: Any) -> Iterator[ICUVariant]:
import datrie
from nominatim.errors import UsageError
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.token_analysis.config_variants import get_variant_config
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> Dict[str, Any]:
+def configure(rules: Mapping[str, Any], normalizer: Any, _: Any) -> Dict[str, Any]:
""" Extract and preprocess the configuration for this module.
"""
config: Dict[str, Any] = {}
config['replacements'], config['chars'] = get_variant_config(rules.get('variants'),
- normalization_rules)
+ normalizer)
config['variant_only'] = rules.get('mode', '') == 'variant-only'
# parse mutation rules
self.mutations = [MutationVariantGenerator(*cfg) for cfg in config['mutations']]
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the normalized form of the name. This is the standard form
from which possible variants for the name can be derived.
"""
- return cast(str, self.norm.transliterate(name)).strip()
+ return cast(str, self.norm.transliterate(name.name)).strip()
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
class MutationVariantGenerator:
""" Generates name variants by applying a regular expression to the name
and replacing it with one or more variants. When the regular expression
- matches more than once, each occurence is replaced with all replacement
+ matches more than once, each occurrence is replaced with all replacement
patterns.
"""
Specialized processor for housenumbers. Analyses common housenumber patterns
and creates variants for them.
"""
-from typing import Mapping, Any, List, cast
+from typing import Any, List, cast
import re
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
RE_NON_DIGIT = re.compile('[^0-9]')
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613
+def configure(*_: Any) -> None:
""" All behaviour is currently hard-coded.
"""
return None
self.mutator = MutationVariantGenerator('␣', (' ', ''))
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the normalized form of the housenumber.
"""
# shortcut for number-only numbers, which make up 90% of the data.
- if RE_NON_DIGIT.search(name) is None:
- return name
+ if RE_NON_DIGIT.search(name.name) is None:
+ return name.name
- norm = cast(str, self.trans.transliterate(self.norm.transliterate(name)))
+ norm = cast(str, self.trans.transliterate(self.norm.transliterate(name.name)))
# If there is a significant non-numeric part, use as is.
if RE_NAMED_PART.search(norm) is None:
# Otherwise add optional spaces between digits and letters.
return norm
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized housenumber.
Generates variants for optional spaces (marked with '␣').
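Putting `get_canonical_id()` and `compute_variants()` together, a hedged example of the intended round trip for housenumbers, with `analyser` standing in for an instance created through this module (exact outputs depend on the configured ICU rules):

```python
# '␣' marks an optional space in the canonical form.
hn = PlaceName(name='3 a', kind='housenumber', suffix=None)
canonical = analyser.get_canonical_id(hn)   # e.g. '3␣a'
analyser.compute_variants(canonical)        # e.g. ['3 a', '3a']
```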
Specialized processor for postcodes. Supports a 'lookup' variant of the
token, which produces variants with optional spaces.
"""
-from typing import Mapping, Any, List
+from typing import Any, List
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
+from nominatim.data.place_name import PlaceName
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613
+def configure(*_: Any) -> None:
""" All behaviour is currently hard-coded.
"""
return None
""" Special normalization and variant generation for postcodes.
This analyser must not be used with anything but postcodes as
- it follows some special rules: `normalize` doesn't necessarily
- need to return a standard form as per normalization rules. It
- needs to return the canonical form of the postcode that is also
- used for output. `get_variants_ascii` then needs to ensure that
+        it follows some special rules: the canonical ID is the form that
+ is used for the output. `compute_variants` then needs to ensure that
the generated variants once more follow the standard normalization
and transliteration, so that postcodes are correctly recognised by
the search algorithm.
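A hedged illustration of that contract, again with `analyser` standing in for an instance created through this module; the expected values follow the test cases further below:

```python
pc = PlaceName(name=' ab-123 ', kind='postcode', suffix=None)
analyser.get_canonical_id(pc)        # -> 'AB-123' (stripped and upper-cased)
analyser.compute_variants('ab 998')  # e.g. ['ab 998', 'ab998']
```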
self.mutator = MutationVariantGenerator(' ', (' ', ''))
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the standard form of the postcode.
"""
- return name.strip().upper()
+ return name.name.strip().upper()
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized postcode.
Takes the canonical form of the postcode, normalizes it using the
return CheckState.FATAL, dict(config=config)
-@_check(hint="""placex table has no data. Did the import finish sucessfully?""")
+@_check(hint="""placex table has no data. Did the import finish successfully?""")
def check_placex_size(conn: Connection, _: Configuration) -> CheckResult:
""" Checking for placex content
"""
tokenizer = tokenizer_factory.get_tokenizer_for_db(config)
except UsageError:
return CheckState.FAIL, dict(msg="""\
- Cannot load tokenizer. Did the import finish sucessfully?""")
+ Cannot load tokenizer. Did the import finish successfully?""")
result = tokenizer.check_database(config)
for version, func in _MIGRATION_FUNCTIONS:
if db_version <= version:
title = func.__doc__ or ''
- LOG.warning("Runnning: %s (%s)", title.split('\n', 1)[0],
+ LOG.warning("Running: %s (%s)", title.split('\n', 1)[0],
version_str(version))
kwargs = dict(conn=conn, config=config, paths=paths)
func(**kwargs)
def add_step_column_for_interpolation(conn: Connection, **_: Any) -> None:
""" Add a new column 'step' to the interpolations table.
- Also convers the data into the stricter format which requires that
+ Also converts the data into the stricter format which requires that
startnumbers comply with the odd/even requirements.
"""
if conn.table_has_column('location_property_osmline', 'step'):
def import_wikipedia_articles(dsn: str, data_path: Path, ignore_errors: bool = False) -> int:
""" Replaces the wikipedia importance tables with new data.
The import is run in a single transaction so that the new data
- is replace seemlessly.
+        is replaced seamlessly.
Returns 0 if all was well and 1 if the importance file could not
be found. Throws an exception if there was an error reading the file.
self.black_list, self.white_list = self._load_white_and_black_lists()
self.sanity_check_pattern = re.compile(r'^\w+$')
# This set will contain all existing phrases to be added.
- # It contains tuples with the following format: (lable, class, type, operator)
+ # It contains tuples with the following format: (label, class, type, operator)
self.word_phrases: Set[Tuple[str, str, str, str]] = set()
        # This set will contain all existing place_classtype tables which don't match any
# special phrases class/type on the wiki.
"""
from typing import Any, Union, Mapping, TypeVar, Sequence, TYPE_CHECKING
-# Generics varaible names do not confirm to naming styles, ignore globally here.
+# Generics variable names do not conform to naming styles, ignore globally here.
# pylint: disable=invalid-name,abstract-method,multiple-statements
# pylint: disable=missing-class-docstring,useless-import-alias
POSTGRESQL_REQUIRED_VERSION = (9, 5)
POSTGIS_REQUIRED_VERSION = (2, 2)
-# Cmake sets a variabe @GIT_HASH@ by executing 'git --log'. It is not run
+# Cmake sets a variable @GIT_HASH@ by executing 'git log'. It is not run
# on every execution of 'make'.
# cmake/tool-installed.tmpl is used to build the binary 'nominatim'. Inside
# there is a call to set the variable value below.
| Triesenberg |
+ Scenario: Array parameters are ignored
+ When sending json search query "Vaduz" with address
+ | countrycodes[] | polygon_svg[] | limit[] | polygon_threshold[] |
+ | IT | 1 | 3 | 3.4 |
+ Then result addresses contain
+ | ID | country_code |
+ | 0 | li |
public function testGetSet()
{
- $this->expectException(\Exception::class);
- $this->expectExceptionMessage("Parameter 'val3' must be one of: foo, bar");
-
$oParams = new ParameterParser(array(
'val1' => 'foo',
'val2' => '',
$this->assertSame('foo', $oParams->getSet('val1', array('foo', 'bar')));
$this->assertSame(false, $oParams->getSet('val2', array('foo', 'bar')));
- $oParams->getSet('val3', array('foo', 'bar'));
+ $this->assertSame(false, $oParams->getSet('val3', array('foo', 'bar')));
}
--- /dev/null
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Test for loading extra Python modules.
+"""
+from pathlib import Path
+import sys
+
+import pytest
+
+from nominatim.config import Configuration
+
+@pytest.fixture
+def test_config(src_dir, tmp_path):
+ """ Create a configuration object with project and config directories
+ in a temporary directory.
+ """
+ (tmp_path / 'project').mkdir()
+ (tmp_path / 'config').mkdir()
+ conf = Configuration(tmp_path / 'project', src_dir / 'settings')
+ conf.config_dir = tmp_path / 'config'
+ return conf
+
+
+def test_load_default_module(test_config):
+ module = test_config.load_plugin_module('version', 'nominatim')
+
+ assert isinstance(module.NOMINATIM_VERSION, tuple)
+
+def test_load_default_module_with_hyphen(test_config):
+ module = test_config.load_plugin_module('place-info', 'nominatim.data')
+
+ assert isinstance(module.PlaceInfo, object)
+
+
+def test_load_plugin_module(test_config, tmp_path):
+ (tmp_path / 'project' / 'testpath').mkdir()
+ (tmp_path / 'project' / 'testpath' / 'mymod.py')\
+ .write_text("def my_test_function():\n return 'gjwitlsSG42TG%'")
+
+ module = test_config.load_plugin_module('testpath/mymod.py', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+    # also test reloading: the previously loaded module stays cached
+ (tmp_path / 'project' / 'testpath' / 'mymod.py')\
+ .write_text("def my_test_function():\n return 'hjothjorhj'")
+
+ module = test_config.load_plugin_module('testpath/mymod.py', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+
+def test_load_external_library_module(test_config, tmp_path, monkeypatch):
+ MODULE_NAME = 'foogurenqodr4'
+ pythonpath = tmp_path / 'priv-python'
+ pythonpath.mkdir()
+ (pythonpath / MODULE_NAME).mkdir()
+ (pythonpath / MODULE_NAME / '__init__.py').write_text('')
+ (pythonpath / MODULE_NAME / 'tester.py')\
+ .write_text("def my_test_function():\n return 'gjwitlsSG42TG%'")
+
+ monkeypatch.syspath_prepend(pythonpath)
+
+ module = test_config.load_plugin_module(f'{MODULE_NAME}.tester', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+    # also test reloading: the previously loaded module stays cached
+ (pythonpath / MODULE_NAME / 'tester.py')\
+ .write_text("def my_test_function():\n return 'dfigjreigj'")
+
+ module = test_config.load_plugin_module(f'{MODULE_NAME}.tester', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+ del sys.modules[f'{MODULE_NAME}.tester']
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
-Tests for specialised conenction and cursor classes.
+Tests for specialised connection and cursor classes.
"""
import pytest
import psycopg2
from nominatim.data.place_info import PlaceInfo
@pytest.fixture
-def sanitize(request):
+def sanitize(request, def_config):
sanitizer_args = {'step': 'clean-housenumbers'}
for mark in request.node.iter_markers(name="sanitizer_params"):
sanitizer_args.update({k.replace('_', '-') : v for k,v in mark.kwargs.items()})
def _run(**kwargs):
place = PlaceInfo({'address': kwargs})
- _, address = PlaceSanitizer([sanitizer_args]).process_names(place)
+ _, address = PlaceSanitizer([sanitizer_args], def_config).process_names(place)
return sorted([(p.kind, p.name) for p in address])
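The recurring change in this and the following test modules is that `PlaceSanitizer` now takes the Nominatim configuration as a second argument. A minimal sketch of the updated call, with `def_config` being the test fixture that provides a `Configuration` object:

```python
san = PlaceSanitizer([{'step': 'clean-housenumbers'}], def_config)
_, address = san.process_names(PlaceInfo({'address': {'housenumber': '3'}}))
```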
@pytest.mark.parametrize('number', ('6523', 'n/a', '4'))
-def test_convert_to_name_converted(number):
+def test_convert_to_name_converted(def_config, number):
sanitizer_args = {'step': 'clean-housenumbers',
'convert-to-name': (r'\d+', 'n/a')}
place = PlaceInfo({'address': {'housenumber': number}})
- names, address = PlaceSanitizer([sanitizer_args]).process_names(place)
+ names, address = PlaceSanitizer([sanitizer_args], def_config).process_names(place)
assert ('housenumber', number) in set((p.kind, p.name) for p in names)
assert 'housenumber' not in set(p.kind for p in address)
@pytest.mark.parametrize('number', ('a54', 'n.a', 'bow'))
-def test_convert_to_name_unconverted(number):
+def test_convert_to_name_unconverted(def_config, number):
sanitizer_args = {'step': 'clean-housenumbers',
'convert-to-name': (r'\d+', 'n/a')}
place = PlaceInfo({'address': {'housenumber': number}})
- names, address = PlaceSanitizer([sanitizer_args]).process_names(place)
+ names, address = PlaceSanitizer([sanitizer_args], def_config).process_names(place)
assert 'housenumber' not in set(p.kind for p in names)
assert ('housenumber', number) in set((p.kind, p.name) for p in address)
if country is not None:
pi['country_code'] = country
- _, address = PlaceSanitizer([sanitizer_args]).process_names(PlaceInfo(pi))
+ _, address = PlaceSanitizer([sanitizer_args], def_config).process_names(PlaceInfo(pi))
return sorted([(p.kind, p.name) for p in address])
from nominatim.errors import UsageError
-def run_sanitizer_on(**kwargs):
- place = PlaceInfo({'name': kwargs})
- name, _ = PlaceSanitizer([{'step': 'split-name-list'}]).process_names(place)
+class TestSplitName:
- return sorted([(p.name, p.kind, p.suffix) for p in name])
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
-def sanitize_with_delimiter(delimiter, name):
- place = PlaceInfo({'name': {'name': name}})
- san = PlaceSanitizer([{'step': 'split-name-list', 'delimiters': delimiter}])
- name, _ = san.process_names(place)
+ def run_sanitizer_on(self, **kwargs):
+ place = PlaceInfo({'name': kwargs})
+ name, _ = PlaceSanitizer([{'step': 'split-name-list'}], self.config).process_names(place)
- return sorted([p.name for p in name])
+ return sorted([(p.name, p.kind, p.suffix) for p in name])
-def test_simple():
- assert run_sanitizer_on(name='ABC') == [('ABC', 'name', None)]
- assert run_sanitizer_on(name='') == [('', 'name', None)]
+ def sanitize_with_delimiter(self, delimiter, name):
+ place = PlaceInfo({'name': {'name': name}})
+ san = PlaceSanitizer([{'step': 'split-name-list', 'delimiters': delimiter}],
+ self.config)
+ name, _ = san.process_names(place)
+ return sorted([p.name for p in name])
-def test_splits():
- assert run_sanitizer_on(name='A;B;C') == [('A', 'name', None),
- ('B', 'name', None),
- ('C', 'name', None)]
- assert run_sanitizer_on(short_name=' House, boat ') == [('House', 'short_name', None),
- ('boat', 'short_name', None)]
+ def test_simple(self):
+ assert self.run_sanitizer_on(name='ABC') == [('ABC', 'name', None)]
+ assert self.run_sanitizer_on(name='') == [('', 'name', None)]
-def test_empty_fields():
- assert run_sanitizer_on(name='A;;B') == [('A', 'name', None),
- ('B', 'name', None)]
- assert run_sanitizer_on(name='A; ,B') == [('A', 'name', None),
- ('B', 'name', None)]
- assert run_sanitizer_on(name=' ;B') == [('B', 'name', None)]
- assert run_sanitizer_on(name='B,') == [('B', 'name', None)]
+ def test_splits(self):
+ assert self.run_sanitizer_on(name='A;B;C') == [('A', 'name', None),
+ ('B', 'name', None),
+ ('C', 'name', None)]
+ assert self.run_sanitizer_on(short_name=' House, boat ') == [('House', 'short_name', None),
+ ('boat', 'short_name', None)]
-def test_custom_delimiters():
- assert sanitize_with_delimiter(':', '12:45,3') == ['12', '45,3']
- assert sanitize_with_delimiter('\\', 'a;\\b!#@ \\') == ['a;', 'b!#@']
- assert sanitize_with_delimiter('[]', 'foo[to]be') == ['be', 'foo', 'to']
- assert sanitize_with_delimiter(' ', 'morning sun') == ['morning', 'sun']
+ def test_empty_fields(self):
+ assert self.run_sanitizer_on(name='A;;B') == [('A', 'name', None),
+ ('B', 'name', None)]
+ assert self.run_sanitizer_on(name='A; ,B') == [('A', 'name', None),
+ ('B', 'name', None)]
+ assert self.run_sanitizer_on(name=' ;B') == [('B', 'name', None)]
+ assert self.run_sanitizer_on(name='B,') == [('B', 'name', None)]
-def test_empty_delimiter_set():
- with pytest.raises(UsageError):
- sanitize_with_delimiter('', 'abc')
+ def test_custom_delimiters(self):
+ assert self.sanitize_with_delimiter(':', '12:45,3') == ['12', '45,3']
+ assert self.sanitize_with_delimiter('\\', 'a;\\b!#@ \\') == ['a;', 'b!#@']
+ assert self.sanitize_with_delimiter('[]', 'foo[to]be') == ['be', 'foo', 'to']
+ assert self.sanitize_with_delimiter(' ', 'morning sun') == ['morning', 'sun']
-def test_no_name_list():
+
+ def test_empty_delimiter_set(self):
+ with pytest.raises(UsageError):
+ self.sanitize_with_delimiter('', 'abc')
+
+
+def test_no_name_list(def_config):
place = PlaceInfo({'address': {'housenumber': '3'}})
- name, address = PlaceSanitizer([{'step': 'split-name-list'}]).process_names(place)
+ name, address = PlaceSanitizer([{'step': 'split-name-list'}], def_config).process_names(place)
assert not name
assert len(address) == 1
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.data.place_info import PlaceInfo
-def run_sanitizer_on(**kwargs):
- place = PlaceInfo({'name': kwargs})
- name, _ = PlaceSanitizer([{'step': 'strip-brace-terms'}]).process_names(place)
+class TestStripBrace:
- return sorted([(p.name, p.kind, p.suffix) for p in name])
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+ def run_sanitizer_on(self, **kwargs):
+ place = PlaceInfo({'name': kwargs})
+ name, _ = PlaceSanitizer([{'step': 'strip-brace-terms'}], self.config).process_names(place)
-def test_no_braces():
- assert run_sanitizer_on(name='foo', ref='23') == [('23', 'ref', None),
- ('foo', 'name', None)]
+ return sorted([(p.name, p.kind, p.suffix) for p in name])
-def test_simple_braces():
- assert run_sanitizer_on(name='Halle (Saale)', ref='3')\
- == [('3', 'ref', None), ('Halle', 'name', None), ('Halle (Saale)', 'name', None)]
- assert run_sanitizer_on(name='ack ( bar')\
- == [('ack', 'name', None), ('ack ( bar', 'name', None)]
+ def test_no_braces(self):
+ assert self.run_sanitizer_on(name='foo', ref='23') == [('23', 'ref', None),
+ ('foo', 'name', None)]
-def test_only_braces():
- assert run_sanitizer_on(name='(maybe)') == [('(maybe)', 'name', None)]
+ def test_simple_braces(self):
+ assert self.run_sanitizer_on(name='Halle (Saale)', ref='3')\
+ == [('3', 'ref', None), ('Halle', 'name', None), ('Halle (Saale)', 'name', None)]
+ assert self.run_sanitizer_on(name='ack ( bar')\
+ == [('ack', 'name', None), ('ack ( bar', 'name', None)]
-def test_double_braces():
- assert run_sanitizer_on(name='a((b))') == [('a', 'name', None),
- ('a((b))', 'name', None)]
- assert run_sanitizer_on(name='a (b) (c)') == [('a', 'name', None),
- ('a (b) (c)', 'name', None)]
+ def test_only_braces(self):
+ assert self.run_sanitizer_on(name='(maybe)') == [('(maybe)', 'name', None)]
-def test_no_names():
+ def test_double_braces(self):
+ assert self.run_sanitizer_on(name='a((b))') == [('a', 'name', None),
+ ('a((b))', 'name', None)]
+ assert self.run_sanitizer_on(name='a (b) (c)') == [('a', 'name', None),
+ ('a (b) (c)', 'name', None)]
+
+
+def test_no_names(def_config):
place = PlaceInfo({'address': {'housenumber': '3'}})
- name, address = PlaceSanitizer([{'step': 'strip-brace-terms'}]).process_names(place)
+ name, address = PlaceSanitizer([{'step': 'strip-brace-terms'}], def_config).process_names(place)
assert not name
assert len(address) == 1
class TestWithDefaults:
- @staticmethod
- def run_sanitizer_on(country, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
- name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}]).process_names(place)
+ name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}],
+ self.config).process_names(place)
return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
class TestFilterKind:
- @staticmethod
- def run_sanitizer_on(filt, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, filt, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': 'de'})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
- 'filter-kind': filt}]).process_names(place)
+ 'filter-kind': filt}],
+ self.config).process_names(place)
return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
@pytest.fixture(autouse=True)
def setup_country(self, def_config):
setup_country_config(def_config)
+ self.config = def_config
+
- @staticmethod
- def run_sanitizer_append(mode, country, **kwargs):
+ def run_sanitizer_append(self, mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
- 'mode': 'append'}]).process_names(place)
+ 'mode': 'append'}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
- @staticmethod
- def run_sanitizer_replace(mode, country, **kwargs):
+ def run_sanitizer_replace(self, mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
- 'mode': 'replace'}]).process_names(place)
+ 'mode': 'replace'}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
place = PlaceInfo({'name': {'name': 'something'}})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': 'all',
- 'mode': 'replace'}]).process_names(place)
+ 'mode': 'replace'}],
+ self.config).process_names(place)
assert len(name) == 1
assert name[0].name == 'something'
class TestCountryWithWhitelist:
- @staticmethod
- def run_sanitizer_on(mode, country, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
'mode': 'replace',
- 'whitelist': ['de', 'fr', 'ru']}]).process_names(place)
+ 'whitelist': ['de', 'fr', 'ru']}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
class TestWhiteList:
- @staticmethod
- def run_sanitizer_on(whitelist, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, whitelist, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'mode': 'replace',
- 'whitelist': whitelist}]).process_names(place)
+ 'whitelist': whitelist}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
assert not place.has_attr('whatever')
-def test_sanitizer_default():
- san = sanitizer.PlaceSanitizer([{'step': 'split-name-list'}])
+def test_sanitizer_default(def_config):
+ san = sanitizer.PlaceSanitizer([{'step': 'split-name-list'}], def_config)
name, address = san.process_names(PlaceInfo({'name': {'name:de:de': '1;2;3'},
'address': {'street': 'Bald'}}))
@pytest.mark.parametrize('rules', [None, []])
-def test_sanitizer_empty_list(rules):
- san = sanitizer.PlaceSanitizer(rules)
+def test_sanitizer_empty_list(def_config, rules):
+ san = sanitizer.PlaceSanitizer(rules, def_config)
name, address = san.process_names(PlaceInfo({'name': {'name:de:de': '1;2;3'}}))
assert all(isinstance(n, sanitizer.PlaceName) for n in name)
-def test_sanitizer_missing_step_definition():
+def test_sanitizer_missing_step_definition(def_config):
with pytest.raises(UsageError):
- san = sanitizer.PlaceSanitizer([{'id': 'split-name-list'}])
+ san = sanitizer.PlaceSanitizer([{'id': 'split-name-list'}], def_config)
from icu import Transliterator
import nominatim.tokenizer.token_analysis.postcodes as module
+from nominatim.data.place_name import PlaceName
from nominatim.errors import UsageError
DEFAULT_NORMALIZATION = """ :: NFD ();
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return proc.get_variants_ascii(norm.transliterate(name).strip())
+ return proc.compute_variants(norm.transliterate(name).strip())
@pytest.mark.parametrize('name,norm', [('12', '12'),
('A 34 ', 'A 34'),
('34-av', '34-AV')])
-def test_normalize(analyser, name, norm):
- assert analyser.normalize(name) == norm
+def test_get_canonical_id(analyser, name, norm):
+ assert analyser.get_canonical_id(PlaceName(name=name, kind='', suffix='')) == norm
@pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}),
('AB-998', {'ab 998', 'ab998'}),
('23 FGH D3', {'23 fgh d3', '23fgh d3',
'23 fghd3', '23fghd3'})])
-def test_get_variants_ascii(analyser, postcode, variants):
- out = analyser.get_variants_ascii(postcode)
+def test_compute_variants(analyser, postcode, variants):
+ out = analyser.compute_variants(postcode)
assert len(out) == len(set(out))
assert set(out) == variants
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
if variant_only:
rules['mode'] = 'variant-only'
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
return module.create(norm, trans, config)
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return proc.get_variants_ascii(norm.transliterate(name).strip())
+ return proc.compute_variants(norm.transliterate(name).strip())
def test_no_variants():
rules = { 'analyzer': 'generic' }
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
proc = module.create(norm, trans, config)
@staticmethod
def configure_rules(*variants):
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
- return module.configure(rules, DEFAULT_NORMALIZATION)
+ trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
+ norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ return module.configure(rules, norm, trans)
def get_replacements(self, *variants):
'mutations': [ {'pattern': m[0], 'replacements': m[1]}
for m in mutations]
}
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
self.analysis = module.create(norm, trans, config)
def variants(self, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return set(self.analysis.get_variants_ascii(norm.transliterate(name).strip()))
+ return set(self.analysis.compute_variants(norm.transliterate(name).strip()))
@pytest.mark.parametrize('pattern', ('(capture)', ['a list']))
+++ /dev/null
-#!/bin/bash -ex
-#
-# *Note:* these installation instructions are also available in executable
-# form for use with vagrant under `vagrant/Install-on-Centos-8.sh`.
-#
-# Installing the Required Software
-# ================================
-#
-# These instructions expect that you have a freshly installed CentOS version 8.
-# Make sure all packages are up-to-date by running:
-#
- sudo dnf update -y
-
-# The standard CentOS repositories don't contain all the required packages,
-# you need to enable the EPEL repository as well. For example for SELinux
-# related redhat-hardened-cc1 package. To enable it on CentOS run:
-
- sudo dnf install -y epel-release redhat-rpm-config
-
-# EPEL contains Postgres 9.6 and 10, but not PostGIS. Postgres 9.4+/10/11/12
-# and PostGIS 2.4/2.5/3.0 are availble from postgresql.org. Enable these
-# repositories and make sure, the binaries can be found:
-
- sudo dnf -qy module disable postgresql
- sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
- export PATH=/usr/pgsql-12/bin:$PATH
-
-# Now you can install all packages needed for Nominatim:
-
-#DOCS: :::sh
- sudo dnf --enablerepo=powertools install -y postgresql12-server \
- postgresql12-contrib postgresql12-devel postgis30_12 \
- wget git cmake make gcc gcc-c++ libtool policycoreutils-python-utils \
- llvm-toolset ccache clang-tools-extra \
- php-pgsql php php-intl php-json libpq-devel \
- bzip2-devel proj-devel boost-devel \
- python3-pip python3-setuptools python3-devel \
- python3-psycopg2 \
- expat-devel zlib-devel libicu-devel
-
- pip3 install --user python-dotenv psutil Jinja2 PyICU datrie pyyaml
-
-
-#
-# System Configuration
-# ====================
-#
-# The following steps are meant to configure a fresh CentOS installation
-# for use with Nominatim. You may skip some of the steps if you have your
-# OS already configured.
-#
-# Creating Dedicated User Accounts
-# --------------------------------
-#
-# Nominatim will run as a global service on your machine. It is therefore
-# best to install it under its own separate user account. In the following
-# we assume this user is called nominatim and the installation will be in
-# /srv/nominatim. To create the user and directory run:
-#
-# sudo useradd -d /srv/nominatim -s /bin/bash -m nominatim
-#
-# You may find a more suitable location if you wish.
-#
-# To be able to copy and paste instructions from this manual, export
-# user name and home directory now like this:
-#
-if [ "x$USERNAME" == "x" ]; then #DOCS:
- export USERNAME=vagrant #DOCS: export USERNAME=nominatim
- export USERHOME=/srv/nominatim
- sudo mkdir -p /srv/nominatim #DOCS:
- sudo chown vagrant /srv/nominatim #DOCS:
-fi #DOCS:
-#
-# **Never, ever run the installation as a root user.** You have been warned.
-#
-# Make sure that system servers can read from the home directory:
-
- chmod a+x $USERHOME
-
-# Setting up PostgreSQL
-# ---------------------
-#
-# CentOS does not automatically create a database cluster. Therefore, start
-# with initializing the database:
-
-if [ "x$NOSYSTEMD" == "xyes" ]; then #DOCS:
- sudo -u postgres /usr/pgsql-12/bin/pg_ctl initdb -D /var/lib/pgsql/12/data #DOCS:
- sudo mkdir /var/log/postgresql #DOCS:
- sudo chown postgres. /var/log/postgresql #DOCS:
-else #DOCS:
- sudo /usr/pgsql-12/bin/postgresql-12-setup initdb
-fi #DOCS:
-#
-# Next tune the postgresql configuration, which is located in
-# `/var/lib/pgsql/12/data/postgresql.conf`. See section *Postgres Tuning* in
-# [the installation page](../admin/Installation.md#postgresql-tuning)
-# for the parameters to change.
-#
-# Now start the postgresql service after updating this config file:
-
-if [ "x$NOSYSTEMD" == "xyes" ]; then #DOCS:
- sudo -u postgres /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/12/data -l /var/log/postgresql/postgresql-12.log start #DOCS:
-else #DOCS:
- sudo systemctl enable postgresql-12
- sudo systemctl restart postgresql-12
-fi #DOCS:
-
-#
-# Finally, we need to add two postgres users: one for the user that does
-# the import and another for the webserver which should access the database
-# only for reading:
-#
-
- sudo -u postgres createuser -s $USERNAME
- sudo -u postgres createuser apache
-
-#
-# Installing Nominatim
-# ====================
-#
-# Building and Configuration
-# --------------------------
-#
-# Get the source code from Github and change into the source directory
-#
-if [ "x$1" == "xyes" ]; then #DOCS: :::sh
- cd $USERHOME
- git clone --recursive https://github.com/openstreetmap/Nominatim.git
- cd Nominatim
-else #DOCS:
- cd $USERHOME/Nominatim #DOCS:
-fi #DOCS:
-
-# When installing the latest source from github, you also need to
-# download the country grid:
-
-if [ ! -f data/country_osm_grid.sql.gz ]; then #DOCS: :::sh
- wget --no-verbose -O data/country_osm_grid.sql.gz https://www.nominatim.org/data/country_grid.sql.gz
-fi #DOCS:
-
-# The code must be built in a separate directory. Create this directory,
-# then configure and build Nominatim in there:
-
-#DOCS: :::sh
- mkdir $USERHOME/build
- cd $USERHOME/build
- cmake $USERHOME/Nominatim
- make
- sudo make install
-
-#
-# Setting up the Apache Webserver
-# -------------------------------
-#
-# The webserver should serve the php scripts from the website directory of your
-# [project directory](../admin/Import.md#creating-the-project-directory).
-# This directory needs to exist when the webserver is configured.
-# Therefore set up a project directory and create the website directory:
-#
- mkdir $USERHOME/nominatim-project
- mkdir $USERHOME/nominatim-project/website
-#
-# You need to create an alias to the website directory in your apache
-# configuration. Add a separate nominatim configuration to your webserver:
-
-#DOCS:```sh
-sudo tee /etc/httpd/conf.d/nominatim.conf << EOFAPACHECONF
-<Directory "$USERHOME/nominatim-project/website">
- Options FollowSymLinks MultiViews
- AddType text/html .php
- DirectoryIndex search.php
- Require all granted
-</Directory>
-
-Alias /nominatim $USERHOME/nominatim-project/website
-EOFAPACHECONF
-#DOCS:```
-
-sudo sed -i 's:#.*::' /etc/httpd/conf.d/nominatim.conf #DOCS:
-
-#
-# Then reload apache:
-#
-
-if [ "x$NOSYSTEMD" == "xyes" ]; then #DOCS:
- sudo httpd #DOCS:
-else #DOCS:
- sudo systemctl enable httpd
- sudo systemctl restart httpd
-fi #DOCS:
-
-#
-# Adding SELinux Security Settings
-# --------------------------------
-#
-# It is a good idea to leave SELinux enabled and enforcing, particularly
-# with a web server accessible from the Internet. At a minimum the
-# following SELinux labeling should be done for Nominatim:
-
-if [ "x$HAVE_SELINUX" != "xno" ]; then #DOCS:
- sudo semanage fcontext -a -t httpd_sys_content_t "/usr/local/nominatim/lib/lib-php(/.*)?"
- sudo semanage fcontext -a -t httpd_sys_content_t "$USERHOME/nominatim-project/website(/.*)?"
- sudo semanage fcontext -a -t lib_t "$USERHOME/nominatim-project/module/nominatim.so"
- sudo restorecon -R -v /usr/local/lib/nominatim
- sudo restorecon -R -v $USERHOME/nominatim-project
-fi #DOCS:
-
-# You need to create a minimal configuration file that tells nominatim
-# the name of your webserver user:
-
-#DOCS:```sh
-echo NOMINATIM_DATABASE_WEBUSER="apache" | tee $USERHOME/nominatim-project/.env
-#DOCS:```
-
-
-# Nominatim is now ready to use. Continue with
-# [importing a database from OSM data](../admin/Import.md).
sudo apt install -y php-cgi
sudo apt install -y build-essential cmake g++ libboost-dev libboost-system-dev \
libboost-filesystem-dev libexpat1-dev zlib1g-dev\
- libbz2-dev libpq-dev libproj-dev \
+ libbz2-dev libpq-dev \
postgresql-10-postgis-2.4 \
postgresql-contrib-10 postgresql-10-postgis-scripts \
- php php-pgsql php-intl libicu-dev python3-pip \
+ php-cli php-pgsql php-intl libicu-dev python3-pip \
python3-psutil python3-jinja2 python3-yaml python3-icu git
# Some of the Python packages that come with Ubuntu 18.04 are too old, so
sudo apt install -y php-cgi
sudo apt install -y build-essential cmake g++ libboost-dev libboost-system-dev \
libboost-filesystem-dev libexpat1-dev zlib1g-dev \
- libbz2-dev libpq-dev libproj-dev \
+ libbz2-dev libpq-dev \
postgresql-12-postgis-3 \
postgresql-contrib-12 postgresql-12-postgis-3-scripts \
- php php-pgsql php-intl libicu-dev python3-dotenv \
+ php-cli php-pgsql php-intl libicu-dev python3-dotenv \
python3-psycopg2 python3-psutil python3-jinja2 \
python3-icu python3-datrie python3-yaml git
sudo apt install -y php-cgi
sudo apt install -y build-essential cmake g++ libboost-dev libboost-system-dev \
libboost-filesystem-dev libexpat1-dev zlib1g-dev \
- libbz2-dev libpq-dev libproj-dev \
+ libbz2-dev libpq-dev \
postgresql-server-dev-14 postgresql-14-postgis-3 \
postgresql-contrib-14 postgresql-14-postgis-3-scripts \
- php php-pgsql php-intl libicu-dev python3-dotenv \
+ php-cli php-pgsql php-intl libicu-dev python3-dotenv \
python3-psycopg2 python3-psutil python3-jinja2 \
python3-icu python3-datrie git