are in the process of consolidating the style. The following rules apply:
* Python code uses the official Python style
- * indention
+ * indentation
* SQL uses 2 spaces
* all other file types use 4 spaces
* [BSD style](https://en.wikipedia.org/wiki/Indent_style#Allman_style) for braces
## Development
Vagrant maps the virtual machine's port 8089 to your host machine. Thus you can
-see Nominatim in action on [locahost:8089](http://localhost:8089/nominatim/).
+see Nominatim in action on [localhost:8089](http://localhost:8089/nominatim/).
You edit code on your host machine in any editor you like. There is no need to
restart any software: just refresh your browser window.
endforeach()
ADD_CUSTOM_TARGET(doc
- COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Centos-8.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Centos-8.md
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Ubuntu-18.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Ubuntu-18.md
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Ubuntu-20.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Ubuntu-20.md
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/bash2md.sh ${PROJECT_SOURCE_DIR}/vagrant/Install-on-Ubuntu-22.sh ${CMAKE_CURRENT_BINARY_DIR}/appendix/Install-on-Ubuntu-22.md
!!! note
The external module is only needed when using the legacy tokenizer.
- If you have choosen the ICU tokenizer, then you can ignore this section
+ If you have chosen the ICU tokenizer, then you can ignore this section
and follow the standard import documentation.
### Option 1: Compiling the library on the database server
### Installing the required packages
-Nginx has no built-in PHP interpreter. You need to use php-fpm as a deamon for
+Nginx has no built-in PHP interpreter. You need to use php-fpm as a daemon for
serving PHP cgi.
On Ubuntu/Debian install nginx and php-fpm with:
* [Ubuntu 20.04](../appendix/Install-on-Ubuntu-20.md)
* [Ubuntu 18.04](../appendix/Install-on-Ubuntu-18.md)
- * [CentOS 8](../appendix/Install-on-Centos-8.md)
These OS-specific instructions can also be found in executable form
in the `vagrant/` directory.
### Software
!!! Warning
- For larger installations you **must have** PostgreSQL 11+ and Postgis 3+
+ For larger installations you **must have** PostgreSQL 11+ and PostGIS 3+
otherwise import and queries will be slow to the point of being unusable.
- Query performance has marked improvements with PostgrSQL 13+ and Postgis 3.2+.
+ Query performance has marked improvements with PostgreSQL 13+ and PostGIS 3.2+.
For compiling:
### Hardware
A minimum of 2GB of RAM is required or installation will fail. For a full
-planet import 64GB of RAM or more are strongly recommended. Do not report
+planet import 128GB of RAM or more are strongly recommended. Do not report
out of memory problems if you have less than 64GB RAM.
-For a full planet install you will need at least 900GB of hard disk space.
+For a full planet install you will need at least 1TB of hard disk space.
Take into account that the OSM database is growing fast.
Fast disks are essential. Using NVME disks is recommended.
fsync = off
full_page_writes = off
-Don't forget to reenable them after the initial import or you risk database
+Don't forget to re-enable them after the initial import or you risk database
corruption.
# If no endpoint is given, then use search.
RewriteRule ^(/|$) "search.php"
- # If format-html is explicity requested, forward to the UI.
+ # If format-html is explicitly requested, forward to the UI.
RewriteCond %{QUERY_STRING} "format=html"
RewriteRule ^([^/]+)(.php)? ui/$1.html [R,END]
a replication source with an update interval that is an order of magnitude
shorter. For example, if you want to update once a day, use an hourly updated
source. This makes sure that you don't miss an entire day of updates when
- the source is unexpectely late to publish its update.
+ the source is unexpectedly late to publish its update.
If you want to use the source with the same update frequency (e.g. a daily
updated source with daily updates), use the
removed and reimported while updating the database with fresh OSM data.
It is thus not useful to treat it as permanent for later use.
-The combination `osm_type`+`osm_id` is slighly better but remember in
+The combination `osm_type`+`osm_id` is slightly better, but remember that in
OpenStreetMap mappers can delete, split, and recreate places (and those
get a new `osm_id`); there is no link between the old and new ids.
Places can also change their meaning without changing their `osm_id`,
* city_district, district, borough, suburb, subdivision
* hamlet, croft, isolated_dwelling
* neighbourhood, allotments, quarter
- * city_block, residental, farm, farmyard, industrial, commercial, retail
+ * city_block, residential, farm, farmyard, industrial, commercial, retail
* road
* house_number, house_name
* emergency, historic, military, natural, landuse, place, railway,
in the [Import section](../admin/Import.md#filtering-imported-data). These
standard styles may be referenced by their name.
-You can also create your own custom syle. Put the style file into your
+You can also create your own custom style. Put the style file into your
project directory and then set `NOMINATIM_IMPORT_STYLE` to the name of the file.
It is always recommended to start with one of the standard styles and customize
those. You find the standard styles under the name `import-<stylename>.style`
Each country is assigned a partition number in the country_name table (see
below) and the data is then split between a set of tables, one for each
partition. Note that Nominatim still manually manages partitioned tables.
-Native support for partitions in PostgreSQL only became useable with version 13.
+Native support for partitions in PostgreSQL only became usable with version 13.
It will be a little while before Nominatim drops support for older versions.

default languages and saves the assignment of countries to partitions.
* `country_osm_grid` provides a fallback for country geometries
-## Auxilary data tables
+## Auxiliary data tables
-Finally there are some table for auxillary data:
+Finally there are some tables for auxiliary data:
* `location_property_tiger` - saves housenumbers from the TIGER import. Its
layout is similar to that of `location_property_osmline`.
# Setting up Nominatim for Development
-This chapter gives an overview how to set up Nominatim for developement
+This chapter gives an overview how to set up Nominatim for development
and how to run tests.
!!! Important
The documentation is built with mkdocs:
* [mkdocs](https://www.mkdocs.org/) >= 1.1.2
-* [mkdocstrings](https://mkdocstrings.github.io/)
+* [mkdocstrings](https://mkdocstrings.github.io/) >= 0.16
+* [mkdocstrings-python-legacy](https://mkdocstrings.github.io/python-legacy/)
### Installing prerequisites on Ubuntu/Debian
--- /dev/null
+# Writing custom sanitizer and token analysis modules for the ICU tokenizer
+
+The [ICU tokenizer](../customize/Tokenizers.md#icu-tokenizer) provides a
+highly customizable method to pre-process and normalize the name information
+of the input data before it is added to the search index. It comes with a
+selection of sanitizers and token analyzers which you can use to adapt your
+installation to your needs. If the provided modules are not enough, you can
+also provide your own implementations. This section describes the API
+of sanitizers and token analysis.
+
+!!! warning
+ This API is currently in early alpha status. While this API is meant to
+ be a public API on which other sanitizers and token analyzers may be
+ implemented, it is not guaranteed to be stable at the moment.
+
+
+## Using non-standard sanitizers and token analyzers
+
+Sanitizer names (in the `step` property) and token analysis names (in the
+`analyzer` property) may refer to externally supplied modules. There are two ways
+to include external modules: through a library or from the project directory.
+
+To include a module from a library, use the absolute import path as name and
+make sure the library can be found in your PYTHONPATH.
+
+To use a custom module without creating a library, you can put the module
+somewhere in your project directory and then use the relative path to the
+file. Include the whole name of the file including the `.py` ending.
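+
+For example, a sanitizer configuration mixing both variants could look like
+this (the module names here are made up for illustration):
+
+``` yaml
+sanitizers:
+    - step: mycompany.nominatim_ext.cleanup  # loaded from PYTHONPATH
+    - step: cleanup.py                       # loaded from the project directory
+```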
+
+## Custom sanitizer modules
+
+A sanitizer module must export a single factory function `create` with the
+following signature:
+
+``` python
+def create(config: SanitizerConfig) -> Callable[[ProcessInfo], None]
+```
+
+The function receives the custom configuration for the sanitizer and must
+return a callable (function or class) that transforms the name and address
+terms of a place. When a place is processed, a `ProcessInfo` object
+is created from the information that was queried from the database. This
+object is sequentially handed to each configured sanitizer, so that each
+sanitizer receives the result of processing from the previous sanitizer.
+After the last sanitizer is finished, the resulting name and address lists
+are forwarded to the token analysis module.
+
+Sanitizer functions are instantiated once and then called for each place
+that is imported or updated. They don't need to be thread-safe.
+If multi-threading is used, each thread creates its own instance of
+the function.
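+
+As a sketch of this lifecycle, the `create()` function can evaluate its
+configuration once and capture the result in the returned callable. The
+`discard-names` option below is invented for illustration; `get_string_list()`
+is part of the `SanitizerConfig` API described in the next section:
+
+``` python
+def create(config):
+    # Read the (hypothetical) 'discard-names' option once, at instantiation time.
+    unwanted = set(config.get_string_list('discard-names'))
+
+    def _filter(obj):
+        # Replacing the list entirely is allowed; here all unwanted names are dropped.
+        obj.names = [name for name in obj.names if name.name not in unwanted]
+
+    return _filter
+```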
+
+### Sanitizer configuration
+
+::: nominatim.tokenizer.sanitizers.config.SanitizerConfig
+ rendering:
+ show_source: no
+ heading_level: 6
+
+### The main filter function of the sanitizer
+
+The filter function receives a single object of type `ProcessInfo`
+which has three members:
+
+ * `place`: read-only information about the place being processed.
+ See PlaceInfo below.
+ * `names`: The current list of names for the place. Each name is a
+ PlaceName object.
+ * `address`: The current list of address names for the place. Each name
+ is a PlaceName object.
+
+While the `place` member is provided for information only, the `names` and
+`address` lists are meant to be manipulated by the sanitizer. It may add and
+remove entries, change information within a single entry (for example by
+adding extra attributes) or completely replace the list with a different one.
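+
+The following sketch adds an extra spelling as a new entry instead of
+modifying an existing one. The replacement rule is made up for illustration;
+`clone()` is part of the `PlaceName` API described below:
+
+``` python
+def create(config):
+    def _add_variants(obj):
+        # Iterate over a copy because the loop appends to the list.
+        for name in list(obj.names):
+            if name.name.startswith('St. '):
+                # Keep the original name and add the expanded spelling as a variant.
+                obj.names.append(name.clone(name='Saint ' + name.name[4:]))
+
+    return _add_variants
+```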
+
+#### PlaceInfo - information about the place
+
+::: nominatim.data.place_info.PlaceInfo
+ rendering:
+ show_source: no
+ heading_level: 6
+
+
+#### PlaceName - extended naming information
+
+::: nominatim.data.place_name.PlaceName
+ rendering:
+ show_source: no
+ heading_level: 6
+
+
+### Example: Filter for US street prefixes
+
+The following sanitizer removes the directional prefixes from street names
+in the US:
+
+``` python
+import re
+
+def _filter_function(obj):
+ if obj.place.country_code == 'us' \
+ and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
+ for name in obj.names:
+ name.name = re.sub(r'^(north|south|west|east) ',
+ '',
+ name.name,
+ flags=re.IGNORECASE)
+
+def create(config):
+ return _filter_function
+```
+
+This is the simplest form of a sanitizer module. It defines a single
+filter function and implements the required `create()` function by returning
+the filter.
+
+The filter function first checks if the object is interesting for the
+sanitizer. Namely, it checks if the place is in the US (through `country_code`)
+and if the place is a street (a `rank_address` of 26 or 27). If the
+conditions are met, then it goes through all available names and
+removes any leading directional prefix using a simple regular expression.
+
+Save the source code in a file in your project directory, for example as
+`us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:
+
+``` yaml
+...
+sanitizers:
+ - step: us_streets.py
+...
+```
+
+!!! warning
+ This example is just a simplified showcase of how to create a sanitizer.
+ It is not really ready for real-world use: while the sanitizer would
+ correctly transform `West 5th Street` into `5th Street`, it would also
+ shorten a simple `North Street` to `Street`.
+
+For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
+They can be found in the directory
+[`nominatim/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/nominatim/tokenizer/sanitizers).
+
+
+## Custom token analysis module
+
+::: nominatim.tokenizer.token_analysis.base.AnalysisModule
+ rendering:
+ show_source: no
+ heading_level: 6
+
+
+::: nominatim.tokenizer.token_analysis.base.Analyzer
+ rendering:
+ show_source: no
+ heading_level: 6
+
+### Example: Creating acronym variants for long names
+
+The following example of a token analysis module creates acronyms from
+very long names and adds them as a variant:
+
+``` python
+class AcronymMaker:
+ """ This class is the actual analyzer.
+ """
+ def __init__(self, norm, trans):
+ self.norm = norm
+ self.trans = trans
+
+
+ def get_canonical_id(self, name):
+ # In simple cases, the normalized name can be used as a canonical id.
+ return self.norm.transliterate(name.name).strip()
+
+
+ def compute_variants(self, name):
+ # The transliterated form of the name always makes up a variant.
+ variants = [self.trans.transliterate(name)]
+
+ # Only create acronyms from very long words.
+ if len(name) > 20:
+ # Take the first letter from each word to form the acronym.
+ acronym = ''.join(w[0] for w in name.split())
+ # If that leads to an acronym with at least three letters,
+ # add the resulting acronym as a variant.
+ if len(acronym) > 2:
+ # Never forget to transliterate the variants before returning them.
+ variants.append(self.trans.transliterate(acronym))
+
+ return variants
+
+# The following two functions are the module interface.
+
+def configure(rules, normalizer, transliterator):
+ # There is no configuration to parse and no data to set up.
+ # Just return an empty configuration.
+ return None
+
+
+def create(normalizer, transliterator, config):
+ # Return a new instance of our token analysis class above.
+ return AcronymMaker(normalizer, transliterator)
+```
+
+Given the name `Trans-Siberian Railway`, the code above would return the full
+name `Trans-Siberian Railway` and the acronym `TSR` as a variant, so that
+searching would work for both.
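+
+To enable the analyzer, save the module in your project directory (the file
+name `acronym_maker.py` is again just an example) and reference it in the
+`token-analysis` section of your `icu_tokenizer.yaml`, for example under its
+own `id`:
+
+``` yaml
+...
+token-analysis:
+    - analyzer: generic
+    - id: "@acronyms"
+      analyzer: acronym_maker.py
+...
+```
+
+Note that an analyzer registered under an `id` is only applied to names where
+a sanitizer has set the corresponding `analyzer` attribute.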
+
+## Sanitizers vs. Token analysis - what to use for variants?
+
+It is not always clear when to implement variations in the sanitizer and
+when to write a token analysis module. Just take the acronym example
+above: it would also have been possible to write a sanitizer which adds the
+acronym as an additional name to the name list. The result would have been
+similar. So which should be used when?
+
+The most important thing to keep in mind is that variants created by the
+token analysis are only saved in the word lookup table. They do not need
+extra space in the search index. If there are many spelling variations, this
+can mean quite a significant amount of space is saved.
+
+When creating additional names with a sanitizer, these names are completely
+independent. In particular, they can be fed into different token analysis
+modules. This gives a much greater flexibility but at the price that the
+additional names increase the size of the search index.
+
If the tokenizer has a default configuration file, it should be saved
under `settings/<NAME>_tokenizer.<SUFFIX>`.
-### Configuration and Persistance
+### Configuration and Persistence
Tokenizers may define custom settings for their configuration. All settings
must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
## US Census TIGER
-For the United States you can choose to import additonal street-level data.
+For the United States you can choose to import additional street-level data.
The data isn't mixed into OSM data but queried as a fallback when no OSM
result can be found.
background-color: #eee;
}
-/* Indentation for mkdocstrings.
-div.doc-contents:not(.first) {
- padding-left: 25px;
- border-left: 4px solid rgba(230, 230, 230);
- margin-bottom: 60px;
-}*/
+.doc-object h6 {
+ margin-bottom: 0.8em;
+ font-size: 120%;
+}
+.doc-object {
+ margin-bottom: 1.3em;
+}
- 'Database Layout' : 'develop/Database-Layout.md'
- 'Indexing' : 'develop/Indexing.md'
- 'Tokenizers' : 'develop/Tokenizers.md'
+ - 'Custom modules for ICU tokenizer': 'develop/ICU-Tokenizer-Modules.md'
- 'Setup for Development' : 'develop/Development-Environment.md'
- 'Testing' : 'develop/Testing.md'
- 'External Data Sources': 'develop/data-sources.md'
- search
- mkdocstrings:
handlers:
- python:
+ python-legacy:
rendering:
show_source: false
show_signature_annotations: false
$this->bFallback = $oParams->getBool('fallback', $this->bFallback);
- // List of excluded Place IDs - used for more acurate pageing
+ // List of excluded Place IDs - used for more accurate paging
$sExcluded = $oParams->getStringList('exclude_place_ids');
if ($sExcluded) {
foreach ($sExcluded as $iExcludedPlaceID) {
public function getBool($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName]) || strlen($this->aParams[$sName]) == 0) {
+ if (!isset($this->aParams[$sName])
+ || !is_string($this->aParams[$sName])
+ || strlen($this->aParams[$sName]) == 0
+ ) {
return $bDefault;
}
public function getInt($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName])) {
+ if (!isset($this->aParams[$sName]) || is_array($this->aParams[$sName])) {
return $bDefault;
}
public function getFloat($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName])) {
+ if (!isset($this->aParams[$sName]) || is_array($this->aParams[$sName])) {
return $bDefault;
}
public function getString($sName, $bDefault = false)
{
- if (!isset($this->aParams[$sName]) || strlen($this->aParams[$sName]) == 0) {
+ if (!isset($this->aParams[$sName])
+ || !is_string($this->aParams[$sName])
+ || strlen($this->aParams[$sName]) == 0
+ ) {
return $bDefault;
}
public function getSet($sName, $aValues, $sDefault = false)
{
- if (!isset($this->aParams[$sName]) || strlen($this->aParams[$sName]) == 0) {
+ if (!isset($this->aParams[$sName])
+ || !is_string($this->aParams[$sName])
+ || strlen($this->aParams[$sName]) == 0
+ ) {
return $sDefault;
}
}
/**
- * Get the orginal phrase of the string.
+ * Get the original phrase of the string.
*/
public function getPhrase()
{
// starts if the search is on POI or street level,
// searches for the nearest POI or street,
// if a street is found and a POI is searched for,
- // the nearest POI which the found street is a parent of is choosen.
+ // the nearest POI which the found street is a parent of is chosen.
$sSQL = 'select place_id,parent_place_id,rank_address,country_code,';
$sSQL .= ' ST_distance('.$sPointSQL.', geometry) as distance';
$sSQL .= ' FROM ';
// We can't reliably go from the closest street to an
// interpolation line because the closest interpolation
// may have a different street segment as a parent.
- // Therefore allow an interpolation line to take precendence
+ // Therefore allow an interpolation line to take precedence
// even when the street is closer.
$fDistance = $iRankAddress < 28 ? 0.001 : $aPlace['distance'];
}
* Add the given full-word token to the list of terms to search for in the
* name.
*
- * @param interger iId ID of term to add.
+ * @param integer iId ID of term to add.
* @param bool bRareName True if the term is infrequent enough to not
* require other constraints for efficient search.
*/
*
* @return mixed[] An array with two fields: IDs contains the list of
* matching place IDs and houseNumber the houseNumber
- * if appicable or -1 if not.
+ * if applicable or -1 if not.
*/
public function query(&$oDB, $iMinRank, $iMaxRank, $iLimit)
{
public function extendSearch($oSearch, $oPosition)
{
// Full words can only be a name if they appear at the beginning
- // of the phrase. In structured search the name must forcably in
+ // of the phrase. In structured search the name must forcibly be in
// the first phrase. In unstructured search it may be in a later
// phrase when the first phrase is a house number.
if ($oSearch->hasName()
showUsage($aSpec, $bExitOnError, 'Option \''.$aLine[0].'\' is missing');
}
if ($aCounts[$aLine[0]] > $aLine[3]) {
- showUsage($aSpec, $bExitOnError, 'Option \''.$aLine[0].'\' is pressent too many times');
+ showUsage($aSpec, $bExitOnError, 'Option \''.$aLine[0].'\' is present too many times');
}
if ($aLine[6] == 'bool' && !array_key_exists($aLine[0], $aResult)) {
$aResult[$aLine[0]] = false;
function loadSettings($sProjectDir)
{
@define('CONST_InstallDir', $sProjectDir);
- // Temporary hack to set the direcory via environment instead of
+ // Temporary hack to set the directory via environment instead of
// the installed scripts. Neither setting is part of the official
// set of settings.
defined('CONST_ConfigDir') or define('CONST_ConfigDir', $_SERVER['NOMINATIM_CONFIGDIR']);
$aLinkedLines = $oDB->getAll($sSQL);
}
-// All places this is an imediate parent of
+// All places this is an immediate parent of
$aHierarchyLines = false;
if ($bIncludeHierarchy) {
$sSQL = 'SELECT obj.place_id, osm_type, osm_id, class, type, housenumber,';
centroid GEOMETRY
);
--- feature intersects geoemtry
+-- feature intersects geometry
-- for areas and linestrings they must touch at least along a line
CREATE OR REPLACE FUNCTION is_relevant_geometry(de9im TEXT, geom_type TEXT)
RETURNS BOOLEAN
and rank_search = 30 AND ST_GeometryType(geometry) in ('ST_Polygon','ST_MultiPolygon')
LIMIT 1;
ELSE
- -- See if we can inherit addtional address tags from an interpolation.
+ -- See if we can inherit additional address tags from an interpolation.
-- These will become permanent.
FOR location IN
SELECT (address - 'interpolation'::text - 'housenumber'::text) as address
{% if debug %}RAISE WARNING 'Using full index mode for % %', NEW.osm_type, NEW.osm_id;{% endif %}
IF linked_place is not null THEN
-- Recompute the ranks here as the ones from the linked place might
- -- have been shifted to accomodate surrounding boundaries.
+ -- have been shifted to accommodate surrounding boundaries.
SELECT place_id, osm_id, class, type, extratags,
centroid, geometry,
(compute_place_rank(country_code, osm_type, class, type, admin_level,
THEN
-- Update the list of country names.
-- Only take the name from the largest area for the given country code
- -- in the hope that this is the authoritive one.
+ -- in the hope that this is the authoritative one.
-- Also replace any old names so that all mapping mistakes can
-- be fixed through regular OSM updates.
FOR location IN
NEW.postcode := get_nearest_postcode(NEW.country_code, NEW.geometry);
END IF;
- {% if debug %}RAISE WARNING 'place update % % finsihed.', NEW.osm_type, NEW.osm_id;{% endif %}
+ {% if debug %}RAISE WARNING 'place update % % finished.', NEW.osm_type, NEW.osm_id;{% endif %}
NEW.token_info := token_strip_info(NEW.token_info);
RETURN NEW;
#!/bin/sh
#
-# Plugin to monitor the types of requsts made to the API
+# Plugin to monitor the types of requests made to the API
#
# Can be configured through libpq environment variables, for example
# PGUSER, PGDATABASE, etc. See man page of psql for more information.
Nominatim configuration accessor.
"""
from typing import Dict, Any, List, Mapping, Optional
+import importlib.util
import logging
import os
+import sys
from pathlib import Path
import json
import yaml
data: Path
self.lib_dir = _LibDirs()
+ self._private_plugins: Dict[str, object] = {}
def set_libdirs(self, **kwargs: StrPath) -> None:
config: Optional[str] = None) -> Any:
""" Load additional configuration from a file. `filename` is the name
of the configuration file. The file is first searched in the
- project directory and then in the global settings dirctory.
+ project directory and then in the global settings directory.
If `config` is set, then the name of the configuration file can
be additionally given through a .env configuration option. When
return result
+ def load_plugin_module(self, module_name: str, internal_path: str) -> Any:
+ """ Load a Python module as a plugin.
+
+ The module_name may have three variants:
+
+ * A name without any '.' is assumed to be an internal module
+ and will be searched relative to `internal_path`.
+ * If the name ends in `.py`, module_name is assumed to be a
+ file name relative to the project directory.
+ * Any other name is assumed to be an absolute module name.
+
+ In all three variants the module name must start with a letter.
+ """
+ if not module_name or not module_name[0].isidentifier():
+ raise UsageError(f'Invalid module name {module_name}')
+
+ if '.' not in module_name:
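+ # Internal module names may use dashes; the corresponding file uses underscores.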
+ module_name = module_name.replace('-', '_')
+ full_module = f'{internal_path}.{module_name}'
+ return sys.modules.get(full_module) or importlib.import_module(full_module)
+
+ if module_name.endswith('.py'):
+ if self.project_dir is None or not (self.project_dir / module_name).exists():
+ raise UsageError(f"Cannot find module '{module_name}' in project directory.")
+
+ if module_name in self._private_plugins:
+ return self._private_plugins[module_name]
+
+ file_path = str(self.project_dir / module_name)
+ spec = importlib.util.spec_from_file_location(module_name, file_path)
+ if spec:
+ module = importlib.util.module_from_spec(spec)
+ # Do not add to global modules because there is no standard
+ # module name that Python can resolve.
+ self._private_plugins[module_name] = module
+ assert spec.loader is not None
+ spec.loader.exec_module(module)
+
+ return module
+
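+ # Any other name: return an already loaded module or import it from PYTHONPATH.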
+ return sys.modules.get(module_name) or importlib.import_module(module_name)
+
+
def find_config_file(self, filename: StrPath,
config: Optional[str] = None) -> Path:
""" Resolve the location of a configuration file given a filename and
""" Handler for the '!include' operator in YAML files.
When the filename is relative, then the file is first searched in the
- project directory and then in the global settings dirctory.
+ project directory and then in the global settings directory.
"""
fname = loader.construct_scalar(node)
from typing import Optional, Mapping, Any
class PlaceInfo:
- """ Data class containing all information the tokenizer gets about a
- place it should process the names for.
+ """ This data class contains all information the tokenizer can access
+ about a place.
"""
def __init__(self, info: Mapping[str, Any]) -> None:
@property
def name(self) -> Optional[Mapping[str, str]]:
- """ A dictionary with the names of the place or None if the place
- has no names.
+ """ A dictionary with the names of the place. Keys and values represent
+ the full key and value of the corresponding OSM tag. Which tags
+ are saved as names is determined by the import style.
+ The property may be None if the place has no names.
"""
return self._info.get('name')
@property
def address(self) -> Optional[Mapping[str, str]]:
- """ A dictionary with the address elements of the place
- or None if no address information is available.
+ """ A dictionary with the address elements of the place. They key
+ usually corresponds to the suffix part of the key of an OSM
+ 'addr:*' or 'isin:*' tag. There are also some special keys like
+ `country` or `country_code` which merge OSM keys that contain
+ the same information. See [Import Styles][1] for details.
+
+ The property may be None if the place has no address information.
+
+ [1]: ../customize/Import-Styles.md
"""
return self._info.get('address')
@property
def country_code(self) -> Optional[str]:
""" The country code of the country the place is in. Guaranteed
- to be a two-letter lower-case string or None, if no country
- could be found.
+ to be a two-letter lower-case string. If the place is not inside
+ any country, the property is set to None.
"""
return self._info.get('country_code')
@property
def rank_address(self) -> int:
- """ The computed rank address before rank correction.
+ The [rank address][1] before any rank correction is applied.
+
+ [1]: ../customize/Ranking.md#address-rank
"""
return self._info.get('rank_address', 0)
def is_a(self, key: str, value: str) -> bool:
- """ Check if the place's primary tag corresponds to the given
+ """ Set to True when the place's primary tag corresponds to the given
key and value.
"""
return self._info.get('class') == key and self._info.get('type') == value
def is_country(self) -> bool:
- """ Check if the place is a valid country boundary.
+ """ Set to True when the place is a valid country boundary.
"""
return self.rank_address == 4 \
and self.is_a('boundary', 'administrative') \
--- /dev/null
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Data class for a single name of a place.
+"""
+from typing import Optional, Dict, Mapping
+
+class PlaceName:
+ """ Each name and address part of a place is encapsulated in an object of
+ this class. It saves not only the name proper but also describes the
+ kind of name with two properties:
+
+ * `kind` describes the name of the OSM key used without any suffixes
+ (i.e. the part after the colon removed); for a `name:de` tag, the
+ kind is `name`
+ * `suffix` contains the suffix of the OSM tag, if any. The suffix
+ is the part of the key after the first colon (`de` in the example
+ above).
+
+ In addition to that, a name may have arbitrary additional attributes.
+ How attributes are used depends on the sanitizers and token analysers.
+ The exception is the 'analyzer' attribute. This attribute determines
+ which token analysis module will be used to finalize the treatment of
+ names.
+ """
+
+ def __init__(self, name: str, kind: str, suffix: Optional[str]):
+ self.name = name
+ self.kind = kind
+ self.suffix = suffix
+ self.attr: Dict[str, str] = {}
+
+
+ def __repr__(self) -> str:
+ return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')"
+
+
+ def clone(self, name: Optional[str] = None,
+ kind: Optional[str] = None,
+ suffix: Optional[str] = None,
+ attr: Optional[Mapping[str, str]] = None) -> 'PlaceName':
+ """ Create a deep copy of the place name, optionally with the
+ given parameters replaced. In the attribute list only the given
+ keys are updated. The list is not replaced completely.
+ In particular, the function cannot be used to remove an
+ attribute from a place name.
+ """
+ newobj = PlaceName(name or self.name,
+ kind or self.kind,
+ suffix or self.suffix)
+
+ newobj.attr.update(self.attr)
+ if attr:
+ newobj.attr.update(attr)
+
+ return newobj
+
+
+ def set_attr(self, key: str, value: str) -> None:
+ """ Add the given property to the name. If the property was already
+ set, then the value is overwritten.
+ """
+ self.attr[key] = value
+
+
+ def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]:
+ """ Return the given property or the value of 'default' if it
+ is not set.
+ """
+ return self.attr.get(key, default)
+
+
+ def has_attr(self, key: str) -> bool:
+ """ Check if the given attribute is set.
+ """
+ return key in self.attr
def drop_table(self, name: str, if_exists: bool = True, cascade: bool = False) -> None:
""" Drop the table with the given name.
- Set `if_exists` to False if a non-existant table should raise
+ Set `if_exists` to False if a non-existent table should raise
an exception instead of just being ignored. If 'cascade' is set
to True then all dependent tables are deleted as well.
"""
def drop_table(self, name: str, if_exists: bool = True, cascade: bool = False) -> None:
""" Drop the table with the given name.
- Set `if_exists` to False if a non-existant table should raise
+ Set `if_exists` to False if a non-existent table should raise
an exception instead of just being ignored.
"""
with self.cursor() as cur:
from nominatim.db.connection import Connection
def set_property(conn: Connection, name: str, value: str) -> None:
- """ Add or replace the propery with the given name.
+ """ Add or replace the property with the given name.
"""
with conn.cursor() as cur:
cur.execute('SELECT value FROM nominatim_properties WHERE property = %s',
def index_postcodes(self) -> None:
- """Index the entries ofthe location_postcode table.
+ """Index the entries of the location_postcode table.
"""
LOG.warning("Starting indexing postcodes using %s threads", self.num_threads)
# asynchronously get the next batch
has_more = fetcher.fetch_next_batch(cur, runner)
- # And insert the curent batch
+ # And insert the current batch
for idx in range(0, len(places), batch):
part = places[idx:idx + batch]
LOG.debug("Processing places: %s", str(part))
""" Tracks and prints progress for the indexing process.
`name` is the name of the indexing step being tracked.
`total` sets up the total number of items that need processing.
- `log_interval` denotes the interval in seconds at which progres
+ `log_interval` denotes the interval in seconds at which progress
should be reported.
"""
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
-Abstract class defintions for tokenizers. These base classes are here
+Abstract class definitions for tokenizers. These base classes are here
mainly for documentation purposes.
"""
from abc import ABC, abstractmethod
the search index.
Arguments:
- place: Place information retrived from the database.
+ place: Place information retrieved from the database.
Returns:
A JSON-serialisable structure that will be handed into
init_db: When set to False, then initialisation of database
tables should be skipped. This option is only required for
- migration purposes and can be savely ignored by custom
+ migration purposes and can be safely ignored by custom
tokenizers.
TODO: can we move the init_db parameter somewhere else?
existing database.
A tokenizer is something that is bound to the lifetime of a database. It
-can be choosen and configured before the intial import but then needs to
+can be chosen and configured before the initial import but then needs to
be used consistently when querying and updating the database.
This module provides the functions to create and configure a new tokenizer
-as well as instanciating the appropriate tokenizer for updating an existing
+as well as instantiating the appropriate tokenizer for updating an existing
database.
A tokenizer usually also includes PHP code for querying. The appropriate PHP
Helper class to create ICU rules from a configuration file.
"""
from typing import Mapping, Any, Dict, Optional
-import importlib
import io
import json
import logging
+from icu import Transliterator
+
from nominatim.config import flatten_config_list, Configuration
from nominatim.db.properties import set_property, get_property
from nominatim.db.connection import Connection
from nominatim.errors import UsageError
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
-from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyser
+from nominatim.tokenizer.token_analysis.base import AnalysisModule, Analyzer
import nominatim.data.country_info
LOG = logging.getLogger()
"""
def __init__(self, config: Configuration) -> None:
+ self.config = config
rules = config.load_sub_configuration('icu_tokenizer.yaml',
config='TOKENIZER_CONFIG')
def make_sanitizer(self) -> PlaceSanitizer:
""" Create a place sanitizer from the configured rules.
"""
- return PlaceSanitizer(self.sanitizer_rules)
+ return PlaceSanitizer(self.sanitizer_rules, self.config)
def make_token_analysis(self) -> ICUTokenAnalysis:
if not isinstance(self.analysis_rules, list):
raise UsageError("Configuration section 'token-analysis' must be a list.")
+ norm = Transliterator.createFromRules("rule_loader_normalization",
+ self.normalization_rules)
+ trans = Transliterator.createFromRules("rule_loader_transliteration",
+ self.transliteration_rules)
+
for section in self.analysis_rules:
name = section.get('id', None)
if name in self.analysis:
LOG.fatal("ICU tokenizer configuration has two token "
"analyzers with id '%s'.", name)
raise UsageError("Syntax error in ICU tokenizer config.")
- self.analysis[name] = TokenAnalyzerRule(section, self.normalization_rules)
+ self.analysis[name] = TokenAnalyzerRule(section, norm, trans,
+ self.config)
@staticmethod
and creates a new token analyzer on request.
"""
- def __init__(self, rules: Mapping[str, Any], normalization_rules: str) -> None:
- # Find the analysis module
- module_name = 'nominatim.tokenizer.token_analysis.' \
- + _get_section(rules, 'analyzer').replace('-', '_')
- self._analysis_mod: AnalysisModule = importlib.import_module(module_name)
+ def __init__(self, rules: Mapping[str, Any],
+ normalizer: Any, transliterator: Any,
+ config: Configuration) -> None:
+ analyzer_name = _get_section(rules, 'analyzer')
+ if not analyzer_name or not isinstance(analyzer_name, str):
+ raise UsageError("'analyzer' parameter needs to be simple string")
+
+ self._analysis_mod: AnalysisModule = \
+ config.load_plugin_module(analyzer_name, 'nominatim.tokenizer.token_analysis')
+
+ self.config = self._analysis_mod.configure(rules, normalizer,
+ transliterator)
- # Load the configuration.
- self.config = self._analysis_mod.configure(rules, normalization_rules)
- def create(self, normalizer: Any, transliterator: Any) -> Analyser:
+ def create(self, normalizer: Any, transliterator: Any) -> Analyzer:
""" Create a new analyser instance for the given rule.
"""
return self._analysis_mod.create(normalizer, transliterator, self.config)
from typing import Mapping, Optional, TYPE_CHECKING
from icu import Transliterator
-from nominatim.tokenizer.token_analysis.base import Analyser
+from nominatim.tokenizer.token_analysis.base import Analyzer
if TYPE_CHECKING:
from typing import Any
class ICUTokenAnalysis:
""" Container class collecting the transliterators and token analysis
- modules for a single NameAnalyser instance.
+ modules for a single Analyser instance.
"""
def __init__(self, norm_rules: str, trans_rules: str,
for name, arules in analysis_rules.items()}
- def get_analyzer(self, name: Optional[str]) -> Analyser:
+ def get_analyzer(self, name: Optional[str]) -> Analyzer:
""" Return the given named analyzer. If no analyzer with that
name exists, return the default analyzer.
"""
from nominatim.data.place_info import PlaceInfo
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
-from nominatim.tokenizer.sanitizers.base import PlaceName
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.icu_token_analysis import ICUTokenAnalysis
from nominatim.tokenizer.base import AbstractAnalyzer, AbstractTokenizer
class ICUTokenizer(AbstractTokenizer):
- """ This tokenizer uses libICU to covert names and queries to ASCII.
+ """ This tokenizer uses libICU to convert names and queries to ASCII.
Otherwise it uses the same algorithms and data structures as the
normalization routines in Nominatim 3.
"""
postcode_name = place.name.strip().upper()
variant_base = None
else:
- postcode_name = analyzer.normalize(place.name)
+ postcode_name = analyzer.get_canonical_id(place)
variant_base = place.get_attr("variant")
if variant_base:
if analyzer is None:
variants = [term]
else:
- variants = analyzer.get_variants_ascii(variant)
+ variants = analyzer.compute_variants(variant)
if term not in variants:
variants.append(term)
else:
def _remove_special_phrases(self, cursor: Cursor,
new_phrases: Set[Tuple[str, str, str, str]],
existing_phrases: Set[Tuple[str, str, str, str]]) -> int:
- """ Remove all phrases from the databse that are no longer in the
+ """ Remove all phrases from the database that are no longer in the
new phrase list.
"""
to_delete = existing_phrases - new_phrases
# Otherwise use the analyzer to determine the canonical name.
# Per convention we use the first variant as the 'lookup name', the
# name that gets saved in the housenumber field of the place.
- norm_name = analyzer.normalize(hnr.name)
- if norm_name:
- result = self._cache.housenumbers.get(norm_name, result)
+ word_id = analyzer.get_canonical_id(hnr)
+ if word_id:
+ result = self._cache.housenumbers.get(word_id, result)
if result[0] is None:
- variants = analyzer.get_variants_ascii(norm_name)
+ variants = analyzer.compute_variants(word_id)
if variants:
with self.conn.cursor() as cur:
cur.execute("SELECT create_analyzed_hnr_id(%s, %s)",
- (norm_name, list(variants)))
+ (word_id, list(variants)))
result = cur.fetchone()[0], variants[0] # type: ignore[no-untyped-call]
- self._cache.housenumbers[norm_name] = result
+ self._cache.housenumbers[word_id] = result
return result
def _retrieve_full_tokens(self, name: str) -> List[int]:
""" Get the full name token for the given name, if it exists.
- The name is only retrived for the standard analyser.
+ The name is only retrieved for the standard analyser.
"""
assert self.conn is not None
norm_name = self._search_normalized(name)
for name in names:
analyzer_id = name.get_attr('analyzer')
analyzer = self.token_analysis.get_analyzer(analyzer_id)
- norm_name = analyzer.normalize(name.name)
+ word_id = analyzer.get_canonical_id(name)
if analyzer_id is None:
- token_id = norm_name
+ token_id = word_id
else:
- token_id = f'{norm_name}@{analyzer_id}'
+ token_id = f'{word_id}@{analyzer_id}'
full, part = self._cache.names.get(token_id, (None, None))
if full is None:
- variants = analyzer.get_variants_ascii(norm_name)
+ variants = analyzer.compute_variants(word_id)
if not variants:
continue
postcode_name = item.name.strip().upper()
variant_base = None
else:
- postcode_name = analyzer.normalize(item.name)
+ postcode_name = analyzer.get_canonical_id(item)
variant_base = item.get_attr("variant")
if variant_base:
variants = {term}
if analyzer is not None and variant_base:
- variants.update(analyzer.get_variants_ascii(variant_base))
+ variants.update(analyzer.compute_variants(variant_base))
with self.conn.cursor() as cur:
cur.execute("SELECT create_postcode_word(%s, %s)",
is handed to the token analysis.
"""
from typing import Optional, List, Mapping, Sequence, Callable, Any, Tuple
-import importlib
from nominatim.errors import UsageError
+from nominatim.config import Configuration
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
-from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo, PlaceName
+from nominatim.tokenizer.sanitizers.base import SanitizerHandler, ProcessInfo
+from nominatim.data.place_name import PlaceName
from nominatim.data.place_info import PlaceInfo
names and address before they are used by the token analysers.
"""
- def __init__(self, rules: Optional[Sequence[Mapping[str, Any]]]) -> None:
+ def __init__(self, rules: Optional[Sequence[Mapping[str, Any]]],
+ config: Configuration) -> None:
self.handlers: List[Callable[[ProcessInfo], None]] = []
if rules:
for func in rules:
if 'step' not in func:
raise UsageError("Sanitizer rule is missing the 'step' attribute.")
- module_name = 'nominatim.tokenizer.sanitizers.' + func['step'].replace('-', '_')
- handler_module: SanitizerHandler = importlib.import_module(module_name)
- self.handlers.append(handler_module.create(SanitizerConfig(func)))
+ if not isinstance(func['step'], str):
+ raise UsageError("'step' attribute must be a simple string.")
+
+ module: SanitizerHandler = \
+ config.load_plugin_module(func['step'], 'nominatim.tokenizer.sanitizers')
+
+ self.handlers.append(module.create(SanitizerConfig(func)))
def process_names(self, place: PlaceInfo) -> Tuple[List[PlaceName], List[PlaceName]]:
"""
Common data types and protocols for sanitizers.
"""
-from typing import Optional, Dict, List, Mapping, Callable
+from typing import Optional, List, Mapping, Callable
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
from nominatim.data.place_info import PlaceInfo
+from nominatim.data.place_name import PlaceName
from nominatim.typing import Protocol, Final
-class PlaceName:
- """ A searchable name for a place together with properties.
- Every name object saves the name proper and two basic properties:
- * 'kind' describes the name of the OSM key used without any suffixes
- (i.e. the part after the colon removed)
- * 'suffix' contains the suffix of the OSM tag, if any. The suffix
- is the part of the key after the first colon.
- In addition to that, the name may have arbitrary additional attributes.
- Which attributes are used, depends on the token analyser.
- """
-
- def __init__(self, name: str, kind: str, suffix: Optional[str]):
- self.name = name
- self.kind = kind
- self.suffix = suffix
- self.attr: Dict[str, str] = {}
-
-
- def __repr__(self) -> str:
- return f"PlaceName(name='{self.name}',kind='{self.kind}',suffix='{self.suffix}')"
-
-
- def clone(self, name: Optional[str] = None,
- kind: Optional[str] = None,
- suffix: Optional[str] = None,
- attr: Optional[Mapping[str, str]] = None) -> 'PlaceName':
- """ Create a deep copy of the place name, optionally with the
- given parameters replaced. In the attribute list only the given
- keys are updated. The list is not replaced completely.
- In particular, the function cannot to be used to remove an
- attribute from a place name.
- """
- newobj = PlaceName(name or self.name,
- kind or self.kind,
- suffix or self.suffix)
-
- newobj.attr.update(self.attr)
- if attr:
- newobj.attr.update(attr)
-
- return newobj
-
-
- def set_attr(self, key: str, value: str) -> None:
- """ Add the given property to the name. If the property was already
- set, then the value is overwritten.
- """
- self.attr[key] = value
-
-
- def get_attr(self, key: str, default: Optional[str] = None) -> Optional[str]:
- """ Return the given property or the value of 'default' if it
- is not set.
- """
- return self.attr.get(key, default)
-
-
- def has_attr(self, key: str) -> bool:
- """ Check if the given attribute is set.
- """
- return key in self.attr
-
class ProcessInfo:
""" Container class for information handed into to handler functions.
def create(self, config: SanitizerConfig) -> Callable[[ProcessInfo], None]:
"""
- A sanitizer must define a single function `create`. It takes the
- dictionary with the configuration information for the sanitizer and
- returns a function that transforms name and address.
+ Create a function for sanitizing a place.
+
+ Arguments:
+ config: A dictionary with the additional configuration options
+ specified in the tokenizer configuration
+
+ Returns:
+ The result must be a callable that takes a place description
+ and transforms name and address as required.
"""
from typing import Callable, Iterator, List
import re
-from nominatim.tokenizer.sanitizers.base import ProcessInfo, PlaceName
+from nominatim.tokenizer.sanitizers.base import ProcessInfo
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
class _HousenumberSanitizer:
def scan(self, postcode: str, country: Optional[str]) -> Optional[Tuple[str, str]]:
""" Check the postcode for correct formatting and return the
normalized version. Returns None if the postcode does not
- correspond to the oficial format of the given country.
+ correspond to the official format of the given country.
"""
match = self.matcher.match(country, postcode)
if match is None:
_BaseUserDict = UserDict
class SanitizerConfig(_BaseUserDict):
- """ Dictionary with configuration options for a sanitizer.
-
- In addition to the usual dictionary function, the class provides
- accessors to standard sanatizer options that are used by many of the
+ """ The `SanitizerConfig` class is a read-only dictionary
+ with configuration options for the sanitizer.
+ In addition to the usual dictionary functions, the class provides
+ accessors to standard sanitizer options that are used by many of the
sanitizers.
"""
def get_string_list(self, param: str, default: Sequence[str] = tuple()) -> Sequence[str]:
""" Extract a configuration parameter as a string list.
- If the parameter value is a simple string, it is returned as a
- one-item list. If the parameter value does not exist, the given
- default is returned. If the parameter value is a list, it is checked
- to contain only strings before being returned.
+
+ Arguments:
+ param: Name of the configuration parameter.
+ default: Value to return, when the parameter is missing.
+
+ Returns:
+ If the parameter value is a simple string, it is returned as a
+ one-item list. If the parameter value does not exist, the given
+ default is returned. If the parameter value is a list, it is
+ checked to contain only strings before being returned.
"""
values = self.data.get(param, None)
def get_bool(self, param: str, default: Optional[bool] = None) -> bool:
""" Extract a configuration parameter as a boolean.
- The parameter must be one of the yaml boolean values or an
- user error will be raised. If `default` is given, then the parameter
- may also be missing or empty.
+
+ Arguments:
+ param: Name of the configuration parameter. The parameter must
+ contain one of the yaml boolean values or an
+ UsageError will be raised.
+ default: Value to return, when the parameter is missing.
+ When set to `None`, the parameter must be defined.
+
+ Returns:
+ Boolean value of the given parameter.
"""
value = self.data.get(param, default)
def get_delimiter(self, default: str = ',;') -> Pattern[str]:
- """ Return the 'delimiter' parameter in the configuration as a
- compiled regular expression that can be used to split the names on the
- delimiters. The regular expression makes sure that the resulting names
- are stripped and that repeated delimiters
- are ignored but it will still create empty fields on occasion. The
- code needs to filter those.
-
- The 'default' parameter defines the delimiter set to be used when
- not explicitly configured.
+ """ Return the 'delimiters' parameter in the configuration as a
+ compiled regular expression that can be used to split strings on
+ these delimiters.
+
+ Arguments:
+ default: Delimiters to be used when 'delimiters' parameter
+ is not explicitly configured.
+
+ Returns:
+ A regular expression pattern which can be used to
+ split a string. The regular expression makes sure that the
+ resulting names are stripped and that repeated delimiters
+ are ignored. It may still create empty fields on occasion. The
+ code needs to filter those.
"""
delimiter_set = set(self.data.get('delimiters', default))
if not delimiter_set:
def get_filter_kind(self, *default: str) -> Callable[[str], bool]:
""" Return a filter function for the name kind from the 'filter-kind'
- config parameter. The filter functions takes a name item and returns
- True when the item passes the filter.
+ config parameter.
- If the parameter is empty, the filter lets all items pass. If the
- paramter is a string, it is interpreted as a single regular expression
- that must match the full kind string. If the parameter is a list then
+ If the 'filter-kind' parameter is empty, the filter lets all items
+ pass. If the parameter is a string, it is interpreted as a single
+ regular expression that must match the full kind string.
+ If the parameter is a list then
any of the regular expressions in the list must match to pass.
+
+ Arguments:
+ default: Filters to be used, when the 'filter-kind' parameter
+ is not specified. If omitted then the default is to
+ let all names pass.
+
+ Returns:
+ A filter function which takes a name string and returns
+ True when the item passes the filter.
"""
filters = self.get_string_list('filter-kind', default)
from typing import Mapping, List, Any
from nominatim.typing import Protocol
+from nominatim.data.place_name import PlaceName
-class Analyser(Protocol):
- """ Instance of the token analyser.
+class Analyzer(Protocol):
+ """ The `create()` function of an analysis module needs to return an
+ object that implements the following functions.
"""
- def normalize(self, name: str) -> str:
- """ Return the normalized form of the name. This is the standard form
- from which possible variants for the name can be derived.
+ def get_canonical_id(self, name: PlaceName) -> str:
+ """ Return the canonical form of the given name. The canonical ID must
+ be unique (the same ID must always yield the same variants) and
+ must be a form from which the variants can be derived.
+
+ Arguments:
+ name: Extended place name description as prepared by
+ the sanitizers.
+
+ Returns:
+ ID string with a canonical form of the name. The string may
+ be empty, when the analyzer cannot analyze the name at all,
+ for example because the character set in use does not match.
"""
- def get_variants_ascii(self, norm_name: str) -> List[str]:
- """ Compute the spelling variants for the given normalized name
- and transliterate the result.
+ def compute_variants(self, canonical_id: str) -> List[str]:
+ """ Compute the transliterated spelling variants for the given
+ canonical ID.
+
+ Arguments:
+ canonical_id: ID string previously computed with
+ `get_canonical_id()`.
+
+ Returns:
+ A list of possible spelling variants. All strings must have
+ been transformed with the global normalizer and
+ transliterator ICU rules. Otherwise they cannot be matched
+ against the input by the query frontend.
+ The list may be empty when there are no useful
+ spelling variants. This may happen when an analyzer
+ usually only outputs additional variants to the canonical
+ spelling and there are no such variants.
"""
+
class AnalysisModule(Protocol):
- """ Protocol for analysis modules.
+ """ The setup of the token analysis is split into two parts:
+ configuration and analyser factory. A token analysis module must
+ therefore implement the two functions described below.
"""
- def configure(self, rules: Mapping[str, Any], normalization_rules: str) -> Any:
+ def configure(self, rules: Mapping[str, Any],
+ normalizer: Any, transliterator: Any) -> Any:
""" Prepare the configuration of the analysis module.
This function should prepare all data that can be shared
between instances of this analyser.
+
+ Arguments:
+ rules: A dictionary with the additional configuration options
+ as specified in the tokenizer configuration.
+ normalizer: an ICU Transliterator with the compiled
+ global normalization rules.
+ transliterator: an ICU Transliterator with the compiled
+ global transliteration rules.
+
+ Returns:
+ A data object with configuration data. This will be handed
+ as is into the `create()` function and may be
+ used freely by the analysis module as needed.
"""
- def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyser:
+ def create(self, normalizer: Any, transliterator: Any, config: Any) -> Analyzer:
""" Create a new instance of the analyser.
A separate instance of the analyser is created for each thread
when used in multi-threading context.
+
+ Arguments:
+ normalizer: an ICU Transliterator with the compiled normalization
+ rules.
+ transliterator: an ICU Transliterator with the compiled
+ transliteration rules.
+ config: The object that was returned by the call to configure().
+
+ Returns:
+ A new analyzer instance. This must be an object that implements
+ the Analyzer protocol.
"""
import itertools
import re
-from icu import Transliterator
-
from nominatim.config import flatten_config_list
from nominatim.errors import UsageError
def get_variant_config(in_rules: Any,
- normalization_rules: str) -> Tuple[List[Tuple[str, List[str]]], str]:
+ normalizer: Any) -> Tuple[List[Tuple[str, List[str]]], str]:
""" Convert the variant definition from the configuration into
replacement sets.
vset: Set[ICUVariant] = set()
rules = flatten_config_list(in_rules, 'variants')
- vmaker = _VariantMaker(normalization_rules)
+ vmaker = _VariantMaker(normalizer)
for section in rules:
for rule in (section.get('words') or []):
class _VariantMaker:
- """ Generater for all necessary ICUVariants from a single variant rule.
+ """ Generator for all necessary ICUVariants from a single variant rule.
All text in rules is normalized to make sure the variants match later.
"""
- def __init__(self, norm_rules: Any) -> None:
- self.norm = Transliterator.createFromRules("rule_loader_normalization",
- norm_rules)
+ def __init__(self, normalizer: Any) -> None:
+ self.norm = normalizer
def compute(self, rule: Any) -> Iterator[ICUVariant]:
import datrie
from nominatim.errors import UsageError
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.token_analysis.config_variants import get_variant_config
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> Dict[str, Any]:
+def configure(rules: Mapping[str, Any], normalizer: Any, _: Any) -> Dict[str, Any]:
""" Extract and preprocess the configuration for this module.
"""
config: Dict[str, Any] = {}
config['replacements'], config['chars'] = get_variant_config(rules.get('variants'),
- normalization_rules)
+ normalizer)
config['variant_only'] = rules.get('mode', '') == 'variant-only'
# parse mutation rules
self.mutations = [MutationVariantGenerator(*cfg) for cfg in config['mutations']]
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the normalized form of the name. This is the standard form
from which possible variants for the name can be derived.
"""
- return cast(str, self.norm.transliterate(name)).strip()
+ return cast(str, self.norm.transliterate(name.name)).strip()
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
class MutationVariantGenerator:
""" Generates name variants by applying a regular expression to the name
and replacing it with one or more variants. When the regular expression
- matches more than once, each occurence is replaced with all replacement
+ matches more than once, each occurrence is replaced with all replacement
patterns.
"""
Specialized processor for housenumbers. Analyses common housenumber patterns
and creates variants for them.
"""
-from typing import Mapping, Any, List, cast
+from typing import Any, List, cast
import re
+from nominatim.data.place_name import PlaceName
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
RE_NON_DIGIT = re.compile('[^0-9]')
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613
+def configure(*_: Any) -> None:
""" All behaviour is currently hard-coded.
"""
return None
self.mutator = MutationVariantGenerator('␣', (' ', ''))
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the normalized form of the housenumber.
"""
# shortcut for number-only numbers, which make up 90% of the data.
- if RE_NON_DIGIT.search(name) is None:
- return name
+ if RE_NON_DIGIT.search(name.name) is None:
+ return name.name
- norm = cast(str, self.trans.transliterate(self.norm.transliterate(name)))
+ norm = cast(str, self.trans.transliterate(self.norm.transliterate(name.name)))
# If there is a significant non-numeric part, use as is.
if RE_NAMED_PART.search(norm) is None:
# Otherwise add optional spaces between digits and letters.
return norm
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized housenumber.
Generates variants for optional spaces (marked with '␣').
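Putting `get_canonical_id()` and `compute_variants()` together, a hedged example of the intended round trip for housenumbers, with `analyser` standing in for an instance created through this module (exact outputs depend on the configured ICU rules):

```python
# '␣' marks an optional space in the canonical form.
hn = PlaceName(name='3 a', kind='housenumber', suffix=None)
canonical = analyser.get_canonical_id(hn)   # e.g. '3␣a'
analyser.compute_variants(canonical)        # e.g. ['3 a', '3a']
```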
Specialized processor for postcodes. Supports a 'lookup' variant of the
token, which produces variants with optional spaces.
"""
-from typing import Mapping, Any, List
+from typing import Any, List
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
+from nominatim.data.place_name import PlaceName
### Configuration section
-def configure(rules: Mapping[str, Any], normalization_rules: str) -> None: # pylint: disable=W0613
+def configure(*_: Any) -> None:
""" All behaviour is currently hard-coded.
"""
return None
""" Special normalization and variant generation for postcodes.
This analyser must not be used with anything but postcodes as
- it follows some special rules: `normalize` doesn't necessarily
- need to return a standard form as per normalization rules. It
- needs to return the canonical form of the postcode that is also
- used for output. `get_variants_ascii` then needs to ensure that
+        it follows some special rules: the canonical ID is the form that
+ is used for the output. `compute_variants` then needs to ensure that
the generated variants once more follow the standard normalization
and transliteration, so that postcodes are correctly recognised by
the search algorithm.
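A hedged illustration of that contract, again with `analyser` standing in for an instance created through this module; the expected values follow the test cases further below:

```python
pc = PlaceName(name=' ab-123 ', kind='postcode', suffix=None)
analyser.get_canonical_id(pc)        # -> 'AB-123' (stripped and upper-cased)
analyser.compute_variants('ab 998')  # e.g. ['ab 998', 'ab998']
```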
self.mutator = MutationVariantGenerator(' ', (' ', ''))
- def normalize(self, name: str) -> str:
+ def get_canonical_id(self, name: PlaceName) -> str:
""" Return the standard form of the postcode.
"""
- return name.strip().upper()
+ return name.name.strip().upper()
- def get_variants_ascii(self, norm_name: str) -> List[str]:
+ def compute_variants(self, norm_name: str) -> List[str]:
""" Compute the spelling variants for the given normalized postcode.
Takes the canonical form of the postcode, normalizes it using the
return CheckState.FATAL, dict(config=config)
-@_check(hint="""placex table has no data. Did the import finish sucessfully?""")
+@_check(hint="""placex table has no data. Did the import finish successfully?""")
def check_placex_size(conn: Connection, _: Configuration) -> CheckResult:
""" Checking for placex content
"""
tokenizer = tokenizer_factory.get_tokenizer_for_db(config)
except UsageError:
return CheckState.FAIL, dict(msg="""\
- Cannot load tokenizer. Did the import finish sucessfully?""")
+ Cannot load tokenizer. Did the import finish successfully?""")
result = tokenizer.check_database(config)
for version, func in _MIGRATION_FUNCTIONS:
if db_version <= version:
title = func.__doc__ or ''
- LOG.warning("Runnning: %s (%s)", title.split('\n', 1)[0],
+ LOG.warning("Running: %s (%s)", title.split('\n', 1)[0],
version_str(version))
kwargs = dict(conn=conn, config=config, paths=paths)
func(**kwargs)
def add_step_column_for_interpolation(conn: Connection, **_: Any) -> None:
""" Add a new column 'step' to the interpolations table.
- Also convers the data into the stricter format which requires that
+ Also converts the data into the stricter format which requires that
startnumbers comply with the odd/even requirements.
"""
if conn.table_has_column('location_property_osmline', 'step'):
def import_wikipedia_articles(dsn: str, data_path: Path, ignore_errors: bool = False) -> int:
""" Replaces the wikipedia importance tables with new data.
The import is run in a single transaction so that the new data
- is replace seemlessly.
+        is replaced seamlessly.
Returns 0 if all was well and 1 if the importance file could not
be found. Throws an exception if there was an error reading the file.
self.black_list, self.white_list = self._load_white_and_black_lists()
self.sanity_check_pattern = re.compile(r'^\w+$')
# This set will contain all existing phrases to be added.
- # It contains tuples with the following format: (lable, class, type, operator)
+ # It contains tuples with the following format: (label, class, type, operator)
self.word_phrases: Set[Tuple[str, str, str, str]] = set()
        # This set will contain all existing place_classtype tables which don't match any
# special phrases class/type on the wiki.
"""
from typing import Any, Union, Mapping, TypeVar, Sequence, TYPE_CHECKING
-# Generics varaible names do not confirm to naming styles, ignore globally here.
+# Generics variable names do not conform to naming styles, ignore globally here.
# pylint: disable=invalid-name,abstract-method,multiple-statements
# pylint: disable=missing-class-docstring,useless-import-alias
POSTGRESQL_REQUIRED_VERSION = (9, 5)
POSTGIS_REQUIRED_VERSION = (2, 2)
-# Cmake sets a variabe @GIT_HASH@ by executing 'git --log'. It is not run
+# Cmake sets a variable @GIT_HASH@ by executing 'git log'. It is not run
# on every execution of 'make'.
# cmake/tool-installed.tmpl is used to build the binary 'nominatim'. Inside
# there is a call to set the variable value below.
| Triesenberg |
+ Scenario: Array parameters are ignored
+ When sending json search query "Vaduz" with address
+ | countrycodes[] | polygon_svg[] | limit[] | polygon_threshold[] |
+ | IT | 1 | 3 | 3.4 |
+ Then result addresses contain
+ | ID | country_code |
+ | 0 | li |
public function testGetSet()
{
- $this->expectException(\Exception::class);
- $this->expectExceptionMessage("Parameter 'val3' must be one of: foo, bar");
-
$oParams = new ParameterParser(array(
'val1' => 'foo',
'val2' => '',
$this->assertSame('foo', $oParams->getSet('val1', array('foo', 'bar')));
$this->assertSame(false, $oParams->getSet('val2', array('foo', 'bar')));
- $oParams->getSet('val3', array('foo', 'bar'));
+ $this->assertSame(false, $oParams->getSet('val3', array('foo', 'bar')));
}
--- /dev/null
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This file is part of Nominatim. (https://nominatim.org)
+#
+# Copyright (C) 2022 by the Nominatim developer community.
+# For a full list of authors see the git log.
+"""
+Test for loading extra Python modules.
+"""
+from pathlib import Path
+import sys
+
+import pytest
+
+from nominatim.config import Configuration
+
+@pytest.fixture
+def test_config(src_dir, tmp_path):
+ """ Create a configuration object with project and config directories
+ in a temporary directory.
+ """
+ (tmp_path / 'project').mkdir()
+ (tmp_path / 'config').mkdir()
+ conf = Configuration(tmp_path / 'project', src_dir / 'settings')
+ conf.config_dir = tmp_path / 'config'
+ return conf
+
+
+def test_load_default_module(test_config):
+ module = test_config.load_plugin_module('version', 'nominatim')
+
+ assert isinstance(module.NOMINATIM_VERSION, tuple)
+
+def test_load_default_module_with_hyphen(test_config):
+ module = test_config.load_plugin_module('place-info', 'nominatim.data')
+
+ assert isinstance(module.PlaceInfo, object)
+
+
+def test_load_plugin_module(test_config, tmp_path):
+ (tmp_path / 'project' / 'testpath').mkdir()
+ (tmp_path / 'project' / 'testpath' / 'mymod.py')\
+ .write_text("def my_test_function():\n return 'gjwitlsSG42TG%'")
+
+ module = test_config.load_plugin_module('testpath/mymod.py', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+    # also test reloading: the previously loaded module stays cached
+ (tmp_path / 'project' / 'testpath' / 'mymod.py')\
+ .write_text("def my_test_function():\n return 'hjothjorhj'")
+
+ module = test_config.load_plugin_module('testpath/mymod.py', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+
+def test_load_external_library_module(test_config, tmp_path, monkeypatch):
+ MODULE_NAME = 'foogurenqodr4'
+ pythonpath = tmp_path / 'priv-python'
+ pythonpath.mkdir()
+ (pythonpath / MODULE_NAME).mkdir()
+ (pythonpath / MODULE_NAME / '__init__.py').write_text('')
+ (pythonpath / MODULE_NAME / 'tester.py')\
+ .write_text("def my_test_function():\n return 'gjwitlsSG42TG%'")
+
+ monkeypatch.syspath_prepend(pythonpath)
+
+ module = test_config.load_plugin_module(f'{MODULE_NAME}.tester', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+    # also test reloading: the previously loaded module stays cached
+ (pythonpath / MODULE_NAME / 'tester.py')\
+ .write_text("def my_test_function():\n return 'dfigjreigj'")
+
+ module = test_config.load_plugin_module(f'{MODULE_NAME}.tester', 'private.something')
+
+ assert module.my_test_function() == 'gjwitlsSG42TG%'
+
+ del sys.modules[f'{MODULE_NAME}.tester']
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
-Tests for specialised conenction and cursor classes.
+Tests for specialised connection and cursor classes.
"""
import pytest
import psycopg2
from nominatim.data.place_info import PlaceInfo
@pytest.fixture
-def sanitize(request):
+def sanitize(request, def_config):
sanitizer_args = {'step': 'clean-housenumbers'}
for mark in request.node.iter_markers(name="sanitizer_params"):
sanitizer_args.update({k.replace('_', '-') : v for k,v in mark.kwargs.items()})
def _run(**kwargs):
place = PlaceInfo({'address': kwargs})
- _, address = PlaceSanitizer([sanitizer_args]).process_names(place)
+ _, address = PlaceSanitizer([sanitizer_args], def_config).process_names(place)
return sorted([(p.kind, p.name) for p in address])
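The recurring change in this and the following test modules is that `PlaceSanitizer` now takes the Nominatim configuration as a second argument. A minimal sketch of the updated call, with `def_config` being the test fixture that provides a `Configuration` object:

```python
san = PlaceSanitizer([{'step': 'clean-housenumbers'}], def_config)
_, address = san.process_names(PlaceInfo({'address': {'housenumber': '3'}}))
```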
@pytest.mark.parametrize('number', ('6523', 'n/a', '4'))
-def test_convert_to_name_converted(number):
+def test_convert_to_name_converted(def_config, number):
sanitizer_args = {'step': 'clean-housenumbers',
'convert-to-name': (r'\d+', 'n/a')}
place = PlaceInfo({'address': {'housenumber': number}})
- names, address = PlaceSanitizer([sanitizer_args]).process_names(place)
+ names, address = PlaceSanitizer([sanitizer_args], def_config).process_names(place)
assert ('housenumber', number) in set((p.kind, p.name) for p in names)
assert 'housenumber' not in set(p.kind for p in address)
@pytest.mark.parametrize('number', ('a54', 'n.a', 'bow'))
-def test_convert_to_name_unconverted(number):
+def test_convert_to_name_unconverted(def_config, number):
sanitizer_args = {'step': 'clean-housenumbers',
'convert-to-name': (r'\d+', 'n/a')}
place = PlaceInfo({'address': {'housenumber': number}})
- names, address = PlaceSanitizer([sanitizer_args]).process_names(place)
+ names, address = PlaceSanitizer([sanitizer_args], def_config).process_names(place)
assert 'housenumber' not in set(p.kind for p in names)
assert ('housenumber', number) in set((p.kind, p.name) for p in address)
if country is not None:
pi['country_code'] = country
- _, address = PlaceSanitizer([sanitizer_args]).process_names(PlaceInfo(pi))
+ _, address = PlaceSanitizer([sanitizer_args], def_config).process_names(PlaceInfo(pi))
return sorted([(p.kind, p.name) for p in address])
from nominatim.errors import UsageError
-def run_sanitizer_on(**kwargs):
- place = PlaceInfo({'name': kwargs})
- name, _ = PlaceSanitizer([{'step': 'split-name-list'}]).process_names(place)
+class TestSplitName:
- return sorted([(p.name, p.kind, p.suffix) for p in name])
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
-def sanitize_with_delimiter(delimiter, name):
- place = PlaceInfo({'name': {'name': name}})
- san = PlaceSanitizer([{'step': 'split-name-list', 'delimiters': delimiter}])
- name, _ = san.process_names(place)
+ def run_sanitizer_on(self, **kwargs):
+ place = PlaceInfo({'name': kwargs})
+ name, _ = PlaceSanitizer([{'step': 'split-name-list'}], self.config).process_names(place)
- return sorted([p.name for p in name])
+ return sorted([(p.name, p.kind, p.suffix) for p in name])
-def test_simple():
- assert run_sanitizer_on(name='ABC') == [('ABC', 'name', None)]
- assert run_sanitizer_on(name='') == [('', 'name', None)]
+ def sanitize_with_delimiter(self, delimiter, name):
+ place = PlaceInfo({'name': {'name': name}})
+ san = PlaceSanitizer([{'step': 'split-name-list', 'delimiters': delimiter}],
+ self.config)
+ name, _ = san.process_names(place)
+ return sorted([p.name for p in name])
-def test_splits():
- assert run_sanitizer_on(name='A;B;C') == [('A', 'name', None),
- ('B', 'name', None),
- ('C', 'name', None)]
- assert run_sanitizer_on(short_name=' House, boat ') == [('House', 'short_name', None),
- ('boat', 'short_name', None)]
+ def test_simple(self):
+ assert self.run_sanitizer_on(name='ABC') == [('ABC', 'name', None)]
+ assert self.run_sanitizer_on(name='') == [('', 'name', None)]
-def test_empty_fields():
- assert run_sanitizer_on(name='A;;B') == [('A', 'name', None),
- ('B', 'name', None)]
- assert run_sanitizer_on(name='A; ,B') == [('A', 'name', None),
- ('B', 'name', None)]
- assert run_sanitizer_on(name=' ;B') == [('B', 'name', None)]
- assert run_sanitizer_on(name='B,') == [('B', 'name', None)]
+ def test_splits(self):
+ assert self.run_sanitizer_on(name='A;B;C') == [('A', 'name', None),
+ ('B', 'name', None),
+ ('C', 'name', None)]
+ assert self.run_sanitizer_on(short_name=' House, boat ') == [('House', 'short_name', None),
+ ('boat', 'short_name', None)]
-def test_custom_delimiters():
- assert sanitize_with_delimiter(':', '12:45,3') == ['12', '45,3']
- assert sanitize_with_delimiter('\\', 'a;\\b!#@ \\') == ['a;', 'b!#@']
- assert sanitize_with_delimiter('[]', 'foo[to]be') == ['be', 'foo', 'to']
- assert sanitize_with_delimiter(' ', 'morning sun') == ['morning', 'sun']
+ def test_empty_fields(self):
+ assert self.run_sanitizer_on(name='A;;B') == [('A', 'name', None),
+ ('B', 'name', None)]
+ assert self.run_sanitizer_on(name='A; ,B') == [('A', 'name', None),
+ ('B', 'name', None)]
+ assert self.run_sanitizer_on(name=' ;B') == [('B', 'name', None)]
+ assert self.run_sanitizer_on(name='B,') == [('B', 'name', None)]
-def test_empty_delimiter_set():
- with pytest.raises(UsageError):
- sanitize_with_delimiter('', 'abc')
+ def test_custom_delimiters(self):
+ assert self.sanitize_with_delimiter(':', '12:45,3') == ['12', '45,3']
+ assert self.sanitize_with_delimiter('\\', 'a;\\b!#@ \\') == ['a;', 'b!#@']
+ assert self.sanitize_with_delimiter('[]', 'foo[to]be') == ['be', 'foo', 'to']
+ assert self.sanitize_with_delimiter(' ', 'morning sun') == ['morning', 'sun']
-def test_no_name_list():
+
+ def test_empty_delimiter_set(self):
+ with pytest.raises(UsageError):
+ self.sanitize_with_delimiter('', 'abc')
+
+
+def test_no_name_list(def_config):
place = PlaceInfo({'address': {'housenumber': '3'}})
- name, address = PlaceSanitizer([{'step': 'split-name-list'}]).process_names(place)
+ name, address = PlaceSanitizer([{'step': 'split-name-list'}], def_config).process_names(place)
assert not name
assert len(address) == 1
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.data.place_info import PlaceInfo
-def run_sanitizer_on(**kwargs):
- place = PlaceInfo({'name': kwargs})
- name, _ = PlaceSanitizer([{'step': 'strip-brace-terms'}]).process_names(place)
+class TestStripBrace:
- return sorted([(p.name, p.kind, p.suffix) for p in name])
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+ def run_sanitizer_on(self, **kwargs):
+ place = PlaceInfo({'name': kwargs})
+ name, _ = PlaceSanitizer([{'step': 'strip-brace-terms'}], self.config).process_names(place)
-def test_no_braces():
- assert run_sanitizer_on(name='foo', ref='23') == [('23', 'ref', None),
- ('foo', 'name', None)]
+ return sorted([(p.name, p.kind, p.suffix) for p in name])
-def test_simple_braces():
- assert run_sanitizer_on(name='Halle (Saale)', ref='3')\
- == [('3', 'ref', None), ('Halle', 'name', None), ('Halle (Saale)', 'name', None)]
- assert run_sanitizer_on(name='ack ( bar')\
- == [('ack', 'name', None), ('ack ( bar', 'name', None)]
+ def test_no_braces(self):
+ assert self.run_sanitizer_on(name='foo', ref='23') == [('23', 'ref', None),
+ ('foo', 'name', None)]
-def test_only_braces():
- assert run_sanitizer_on(name='(maybe)') == [('(maybe)', 'name', None)]
+ def test_simple_braces(self):
+ assert self.run_sanitizer_on(name='Halle (Saale)', ref='3')\
+ == [('3', 'ref', None), ('Halle', 'name', None), ('Halle (Saale)', 'name', None)]
+ assert self.run_sanitizer_on(name='ack ( bar')\
+ == [('ack', 'name', None), ('ack ( bar', 'name', None)]
-def test_double_braces():
- assert run_sanitizer_on(name='a((b))') == [('a', 'name', None),
- ('a((b))', 'name', None)]
- assert run_sanitizer_on(name='a (b) (c)') == [('a', 'name', None),
- ('a (b) (c)', 'name', None)]
+ def test_only_braces(self):
+ assert self.run_sanitizer_on(name='(maybe)') == [('(maybe)', 'name', None)]
-def test_no_names():
+ def test_double_braces(self):
+ assert self.run_sanitizer_on(name='a((b))') == [('a', 'name', None),
+ ('a((b))', 'name', None)]
+ assert self.run_sanitizer_on(name='a (b) (c)') == [('a', 'name', None),
+ ('a (b) (c)', 'name', None)]
+
+
+def test_no_names(def_config):
place = PlaceInfo({'address': {'housenumber': '3'}})
- name, address = PlaceSanitizer([{'step': 'strip-brace-terms'}]).process_names(place)
+ name, address = PlaceSanitizer([{'step': 'strip-brace-terms'}], def_config).process_names(place)
assert not name
assert len(address) == 1
class TestWithDefaults:
- @staticmethod
- def run_sanitizer_on(country, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
- name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}]).process_names(place)
+ name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language'}],
+ self.config).process_names(place)
return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
class TestFilterKind:
- @staticmethod
- def run_sanitizer_on(filt, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, filt, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': 'de'})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
- 'filter-kind': filt}]).process_names(place)
+ 'filter-kind': filt}],
+ self.config).process_names(place)
return sorted([(p.name, p.kind, p.suffix, p.attr) for p in name])
@pytest.fixture(autouse=True)
def setup_country(self, def_config):
setup_country_config(def_config)
+ self.config = def_config
+
- @staticmethod
- def run_sanitizer_append(mode, country, **kwargs):
+ def run_sanitizer_append(self, mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
- 'mode': 'append'}]).process_names(place)
+ 'mode': 'append'}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
return sorted([(p.name, p.attr.get('analyzer', '')) for p in name])
- @staticmethod
- def run_sanitizer_replace(mode, country, **kwargs):
+ def run_sanitizer_replace(self, mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
- 'mode': 'replace'}]).process_names(place)
+ 'mode': 'replace'}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
place = PlaceInfo({'name': {'name': 'something'}})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': 'all',
- 'mode': 'replace'}]).process_names(place)
+ 'mode': 'replace'}],
+ self.config).process_names(place)
assert len(name) == 1
assert name[0].name == 'something'
class TestCountryWithWhitelist:
- @staticmethod
- def run_sanitizer_on(mode, country, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, mode, country, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()},
'country_code': country})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'use-defaults': mode,
'mode': 'replace',
- 'whitelist': ['de', 'fr', 'ru']}]).process_names(place)
+ 'whitelist': ['de', 'fr', 'ru']}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
class TestWhiteList:
- @staticmethod
- def run_sanitizer_on(whitelist, **kwargs):
+ @pytest.fixture(autouse=True)
+ def setup_country(self, def_config):
+ self.config = def_config
+
+
+ def run_sanitizer_on(self, whitelist, **kwargs):
place = PlaceInfo({'name': {k.replace('_', ':'): v for k, v in kwargs.items()}})
name, _ = PlaceSanitizer([{'step': 'tag-analyzer-by-language',
'mode': 'replace',
- 'whitelist': whitelist}]).process_names(place)
+ 'whitelist': whitelist}],
+ self.config).process_names(place)
assert all(isinstance(p.attr, dict) for p in name)
assert all(len(p.attr) <= 1 for p in name)
assert not place.has_attr('whatever')
-def test_sanitizer_default():
- san = sanitizer.PlaceSanitizer([{'step': 'split-name-list'}])
+def test_sanitizer_default(def_config):
+ san = sanitizer.PlaceSanitizer([{'step': 'split-name-list'}], def_config)
name, address = san.process_names(PlaceInfo({'name': {'name:de:de': '1;2;3'},
'address': {'street': 'Bald'}}))
@pytest.mark.parametrize('rules', [None, []])
-def test_sanitizer_empty_list(rules):
- san = sanitizer.PlaceSanitizer(rules)
+def test_sanitizer_empty_list(def_config, rules):
+ san = sanitizer.PlaceSanitizer(rules, def_config)
name, address = san.process_names(PlaceInfo({'name': {'name:de:de': '1;2;3'}}))
assert all(isinstance(n, sanitizer.PlaceName) for n in name)
-def test_sanitizer_missing_step_definition():
+def test_sanitizer_missing_step_definition(def_config):
with pytest.raises(UsageError):
- san = sanitizer.PlaceSanitizer([{'id': 'split-name-list'}])
+ san = sanitizer.PlaceSanitizer([{'id': 'split-name-list'}], def_config)
from icu import Transliterator
import nominatim.tokenizer.token_analysis.postcodes as module
+from nominatim.data.place_name import PlaceName
from nominatim.errors import UsageError
DEFAULT_NORMALIZATION = """ :: NFD ();
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return proc.get_variants_ascii(norm.transliterate(name).strip())
+ return proc.compute_variants(norm.transliterate(name).strip())
@pytest.mark.parametrize('name,norm', [('12', '12'),
('A 34 ', 'A 34'),
('34-av', '34-AV')])
-def test_normalize(analyser, name, norm):
- assert analyser.normalize(name) == norm
+def test_get_canonical_id(analyser, name, norm):
+ assert analyser.get_canonical_id(PlaceName(name=name, kind='', suffix='')) == norm
@pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}),
('AB-998', {'ab 998', 'ab998'}),
('23 FGH D3', {'23 fgh d3', '23fgh d3',
'23 fghd3', '23fghd3'})])
-def test_get_variants_ascii(analyser, postcode, variants):
- out = analyser.get_variants_ascii(postcode)
+def test_compute_variants(analyser, postcode, variants):
+ out = analyser.compute_variants(postcode)
assert len(out) == len(set(out))
assert set(out) == variants
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
if variant_only:
rules['mode'] = 'variant-only'
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
return module.create(norm, trans, config)
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return proc.get_variants_ascii(norm.transliterate(name).strip())
+ return proc.compute_variants(norm.transliterate(name).strip())
def test_no_variants():
rules = { 'analyzer': 'generic' }
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
proc = module.create(norm, trans, config)
@staticmethod
def configure_rules(*variants):
rules = { 'analyzer': 'generic', 'variants': [{'words': variants}]}
- return module.configure(rules, DEFAULT_NORMALIZATION)
+ trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
+ norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ return module.configure(rules, norm, trans)
def get_replacements(self, *variants):
'mutations': [ {'pattern': m[0], 'replacements': m[1]}
for m in mutations]
}
- config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
+ config = module.configure(rules, norm, trans)
self.analysis = module.create(norm, trans, config)
def variants(self, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
- return set(self.analysis.get_variants_ascii(norm.transliterate(name).strip()))
+ return set(self.analysis.compute_variants(norm.transliterate(name).strip()))
@pytest.mark.parametrize('pattern', ('(capture)', ['a list']))
+++ /dev/null
-#!/bin/bash -ex
-#
-# *Note:* these installation instructions are also available in executable
-# form for use with vagrant under `vagrant/Install-on-Centos-8.sh`.
-#
-# Installing the Required Software
-# ================================
-#
-# These instructions expect that you have a freshly installed CentOS version 8.
-# Make sure all packages are up-to-date by running:
-#
- sudo dnf update -y
-
-# The standard CentOS repositories don't contain all the required packages,
-# you need to enable the EPEL repository as well. For example for SELinux
-# related redhat-hardened-cc1 package. To enable it on CentOS run:
-
- sudo dnf install -y epel-release redhat-rpm-config
-
-# EPEL contains Postgres 9.6 and 10, but not PostGIS. Postgres 9.4+/10/11/12
-# and PostGIS 2.4/2.5/3.0 are availble from postgresql.org. Enable these
-# repositories and make sure, the binaries can be found:
-
- sudo dnf -qy module disable postgresql
- sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
- export PATH=/usr/pgsql-12/bin:$PATH
-
-# Now you can install all packages needed for Nominatim:
-
-#DOCS: :::sh
- sudo dnf --enablerepo=powertools install -y postgresql12-server \
- postgresql12-contrib postgresql12-devel postgis30_12 \
- wget git cmake make gcc gcc-c++ libtool policycoreutils-python-utils \
- llvm-toolset ccache clang-tools-extra \
- php-pgsql php php-intl php-json libpq-devel \
- bzip2-devel proj-devel boost-devel \
- python3-pip python3-setuptools python3-devel \
- python3-psycopg2 \
- expat-devel zlib-devel libicu-devel
-
- pip3 install --user python-dotenv psutil Jinja2 PyICU datrie pyyaml
-
-
-#
-# System Configuration
-# ====================
-#
-# The following steps are meant to configure a fresh CentOS installation
-# for use with Nominatim. You may skip some of the steps if you have your
-# OS already configured.
-#
-# Creating Dedicated User Accounts
-# --------------------------------
-#
-# Nominatim will run as a global service on your machine. It is therefore
-# best to install it under its own separate user account. In the following
-# we assume this user is called nominatim and the installation will be in
-# /srv/nominatim. To create the user and directory run:
-#
-# sudo useradd -d /srv/nominatim -s /bin/bash -m nominatim
-#
-# You may find a more suitable location if you wish.
-#
-# To be able to copy and paste instructions from this manual, export
-# user name and home directory now like this:
-#
-if [ "x$USERNAME" == "x" ]; then #DOCS:
- export USERNAME=vagrant #DOCS: export USERNAME=nominatim
- export USERHOME=/srv/nominatim
- sudo mkdir -p /srv/nominatim #DOCS:
- sudo chown vagrant /srv/nominatim #DOCS:
-fi #DOCS:
-#
-# **Never, ever run the installation as a root user.** You have been warned.
-#
-# Make sure that system servers can read from the home directory:
-
- chmod a+x $USERHOME
-
-# Setting up PostgreSQL
-# ---------------------
-#
-# CentOS does not automatically create a database cluster. Therefore, start
-# with initializing the database:
-
-if [ "x$NOSYSTEMD" == "xyes" ]; then #DOCS:
- sudo -u postgres /usr/pgsql-12/bin/pg_ctl initdb -D /var/lib/pgsql/12/data #DOCS:
- sudo mkdir /var/log/postgresql #DOCS:
- sudo chown postgres. /var/log/postgresql #DOCS:
-else #DOCS:
- sudo /usr/pgsql-12/bin/postgresql-12-setup initdb
-fi #DOCS:
-#
-# Next tune the postgresql configuration, which is located in
-# `/var/lib/pgsql/12/data/postgresql.conf`. See section *Postgres Tuning* in
-# [the installation page](../admin/Installation.md#postgresql-tuning)
-# for the parameters to change.
-#
-# Now start the postgresql service after updating this config file:
-
-if [ "x$NOSYSTEMD" == "xyes" ]; then #DOCS:
- sudo -u postgres /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/12/data -l /var/log/postgresql/postgresql-12.log start #DOCS:
-else #DOCS:
- sudo systemctl enable postgresql-12
- sudo systemctl restart postgresql-12
-fi #DOCS:
-
-#
-# Finally, we need to add two postgres users: one for the user that does
-# the import and another for the webserver which should access the database
-# only for reading:
-#
-
- sudo -u postgres createuser -s $USERNAME
- sudo -u postgres createuser apache
-
-#
-# Installing Nominatim
-# ====================
-#
-# Building and Configuration
-# --------------------------
-#
-# Get the source code from Github and change into the source directory
-#
-if [ "x$1" == "xyes" ]; then #DOCS: :::sh
- cd $USERHOME
- git clone --recursive https://github.com/openstreetmap/Nominatim.git
- cd Nominatim
-else #DOCS:
- cd $USERHOME/Nominatim #DOCS:
-fi #DOCS:
-
-# When installing the latest source from github, you also need to
-# download the country grid:
-
-if [ ! -f data/country_osm_grid.sql.gz ]; then #DOCS: :::sh
- wget --no-verbose -O data/country_osm_grid.sql.gz https://www.nominatim.org/data/country_grid.sql.gz
-fi #DOCS:
-
-# The code must be built in a separate directory. Create this directory,
-# then configure and build Nominatim in there:
-
-#DOCS: :::sh
- mkdir $USERHOME/build
- cd $USERHOME/build
- cmake $USERHOME/Nominatim
- make
- sudo make install
-
-#
-# Setting up the Apache Webserver
-# -------------------------------
-#
-# The webserver should serve the php scripts from the website directory of your
-# [project directory](../admin/Import.md#creating-the-project-directory).
-# This directory needs to exist when the webserver is configured.
-# Therefore set up a project directory and create the website directory:
-#
- mkdir $USERHOME/nominatim-project
- mkdir $USERHOME/nominatim-project/website
-#
-# You need to create an alias to the website directory in your apache
-# configuration. Add a separate nominatim configuration to your webserver:
-
-#DOCS:```sh
-sudo tee /etc/httpd/conf.d/nominatim.conf << EOFAPACHECONF
-<Directory "$USERHOME/nominatim-project/website">
- Options FollowSymLinks MultiViews
- AddType text/html .php
- DirectoryIndex search.php
- Require all granted
-</Directory>
-
-Alias /nominatim $USERHOME/nominatim-project/website
-EOFAPACHECONF
-#DOCS:```
-
-sudo sed -i 's:#.*::' /etc/httpd/conf.d/nominatim.conf #DOCS:
-
-#
-# Then reload apache:
-#
-
-if [ "x$NOSYSTEMD" == "xyes" ]; then #DOCS:
- sudo httpd #DOCS:
-else #DOCS:
- sudo systemctl enable httpd
- sudo systemctl restart httpd
-fi #DOCS:
-
-#
-# Adding SELinux Security Settings
-# --------------------------------
-#
-# It is a good idea to leave SELinux enabled and enforcing, particularly
-# with a web server accessible from the Internet. At a minimum the
-# following SELinux labeling should be done for Nominatim:
-
-if [ "x$HAVE_SELINUX" != "xno" ]; then #DOCS:
- sudo semanage fcontext -a -t httpd_sys_content_t "/usr/local/nominatim/lib/lib-php(/.*)?"
- sudo semanage fcontext -a -t httpd_sys_content_t "$USERHOME/nominatim-project/website(/.*)?"
- sudo semanage fcontext -a -t lib_t "$USERHOME/nominatim-project/module/nominatim.so"
- sudo restorecon -R -v /usr/local/lib/nominatim
- sudo restorecon -R -v $USERHOME/nominatim-project
-fi #DOCS:
-
-# You need to create a minimal configuration file that tells nominatim
-# the name of your webserver user:
-
-#DOCS:```sh
-echo NOMINATIM_DATABASE_WEBUSER="apache" | tee $USERHOME/nominatim-project/.env
-#DOCS:```
-
-
-# Nominatim is now ready to use. Continue with
-# [importing a database from OSM data](../admin/Import.md).
sudo apt install -y php-cgi
sudo apt install -y build-essential cmake g++ libboost-dev libboost-system-dev \
libboost-filesystem-dev libexpat1-dev zlib1g-dev\
- libbz2-dev libpq-dev libproj-dev \
+ libbz2-dev libpq-dev \
postgresql-10-postgis-2.4 \
postgresql-contrib-10 postgresql-10-postgis-scripts \
- php php-pgsql php-intl libicu-dev python3-pip \
+ php-cli php-pgsql php-intl libicu-dev python3-pip \
python3-psutil python3-jinja2 python3-yaml python3-icu git
# Some of the Python packages that come with Ubuntu 18.04 are too old, so
sudo apt install -y php-cgi
sudo apt install -y build-essential cmake g++ libboost-dev libboost-system-dev \
libboost-filesystem-dev libexpat1-dev zlib1g-dev \
- libbz2-dev libpq-dev libproj-dev \
+ libbz2-dev libpq-dev \
postgresql-12-postgis-3 \
postgresql-contrib-12 postgresql-12-postgis-3-scripts \
- php php-pgsql php-intl libicu-dev python3-dotenv \
+ php-cli php-pgsql php-intl libicu-dev python3-dotenv \
python3-psycopg2 python3-psutil python3-jinja2 \
python3-icu python3-datrie python3-yaml git
sudo apt install -y php-cgi
sudo apt install -y build-essential cmake g++ libboost-dev libboost-system-dev \
libboost-filesystem-dev libexpat1-dev zlib1g-dev \
- libbz2-dev libpq-dev libproj-dev \
+ libbz2-dev libpq-dev \
postgresql-server-dev-14 postgresql-14-postgis-3 \
postgresql-contrib-14 postgresql-14-postgis-3-scripts \
- php php-pgsql php-intl libicu-dev python3-dotenv \
+ php-cli php-pgsql php-intl libicu-dev python3-dotenv \
python3-psycopg2 python3-psutil python3-jinja2 \
python3-icu python3-datrie git