Merge remote-tracking branch 'upstream/master'

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md

index 529315e4431dd1b2d08097ef7ed92491989c2de1..050371771c27eb21ffa7158efbc261e8ae35154d 100644
--- a/docs/develop/Tokenizers.md
+++ b/docs/develop/Tokenizers.md
@@ -6,7 +6,7 @@ tokenizers that use different strategies for normalisation. This page describes
  how tokenizers are expected to work and the public API that needs to be
  implemented when creating a new tokenizer. For information on how to configure
  a specific tokenizer for a database see the
-[tokenizer chapter in the administration guide](../admin/Tokenizers.md).
+[tokenizer chapter in the Customization Guide](../customize/Tokenizers.md).
  
  ## Generic Architecture
  
@@ -93,7 +93,7 @@ for a custom tokenizer implementation.
  
  Nominatim expects two files for a tokenizer:
  
-* `nominiatim/tokenizer/<NAME>_tokenizer.py` containing the Python part of the
+* `nominatim/tokenizer/<NAME>_tokenizer.py` containing the Python part of the
    implementation
  * `lib-php/tokenizer/<NAME>_tokenizer.php` with the PHP part of the
    implementation
@@ -105,7 +105,7 @@ functions. By convention, these should be placed in `lib-sql/tokenizer`.
  If the tokenizer has a default configuration file, this should be saved in
  the `settings/<NAME>_tokenizer.<SUFFIX>`.
  
-### Configuration and Persistance
+### Configuration and Persistence
  
  Tokenizers may define custom settings for their configuration. All settings
  must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
@@ -134,14 +134,14 @@ All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
  and implement the abstract functions defined there.
  
  ::: nominatim.tokenizer.base.AbstractTokenizer
-    rendering:
-        heading_level: 4
+    options:
+        heading_level: 6
  
  ### Python Analyzer Class
  
  ::: nominatim.tokenizer.base.AbstractAnalyzer
-    rendering:
-        heading_level: 4
+    options:
+        heading_level: 6
  
  ### PL/pgSQL Functions
  
@@ -190,22 +190,43 @@ be listed with a semicolon as delimiter. Must be NULL when the place has no
  house numbers.
  
  ```sql
-FUNCTION token_addr_street_match_tokens(info JSONB) RETURNS INTEGER[]
+FUNCTION token_is_street_address(info JSONB) RETURNS BOOLEAN
  ```
  
-Return the match token IDs by which to search a matching street from the
-`addr:street` tag. These IDs will be matched against the IDs supplied by
-`token_get_name_match_tokens`. Must be NULL when the place has no `addr:street`
-tag.
+Return true if this is an object that should be parented against a street.
+Only relevant for objects with address rank 30.
  
  ```sql
-FUNCTION token_addr_place_match_tokens(info JSONB) RETURNS INTEGER[]
+FUNCTION token_has_addr_street(info JSONB) RETURNS BOOLEAN
  ```
  
-Return the match token IDs by which to search a matching place from the
-`addr:place` tag. These IDs will be matched against the IDs supplied by
-`token_get_name_match_tokens`. Must be NULL when the place has no `addr:place`
-tag.
+Return true if there are street names to match against for finding the
+parent of the object.
+
+
+```sql
+FUNCTION token_has_addr_place(info JSONB) RETURNS BOOLEAN
+```
+
+Return true if there are place names to match against for finding the
+parent of the object.
+
+```sql
+FUNCTION token_matches_street(info JSONB, street_tokens INTEGER[]) RETURNS BOOLEAN
+```
+
+Check if the given tokens (previously saved from `token_get_name_match_tokens()`)
+match against the `addr:street` tag name. Must return either NULL or FALSE
+when the place has no `addr:street` tag.
+
+```sql
+FUNCTION token_matches_place(info JSONB, place_tokens INTEGER[]) RETURNS BOOLEAN
+```
+
+Check if the given tokens (previously saved from `token_get_name_match_tokens()`)
+match against the `addr:place` tag name. Must return either NULL or FALSE
+when the place has no `addr:place` tag.
+
  
  ```sql
  FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
@@ -216,33 +237,41 @@ are used for searches by address when no matching place can be found in the
  database. Must be NULL when the place has no `addr:place` tag.
  
  ```sql
-CREATE TYPE token_addresstoken AS (
-  key TEXT,
-  match_tokens INT[],
-  search_tokens INT[]
-);
+FUNCTION token_get_address_keys(info JSONB) RETURNS SETOF TEXT
+```
+
+Return the set of keys for which address information is provided. This
+should correspond to the list of (relevant) `addr:*` tags with the `addr:`
+prefix removed or the keys used in the `address` dictionary of the place info.
+
+```sql
+FUNCTION token_get_address_search_tokens(info JSONB, key TEXT) RETURNS INTEGER[]
+```
  
-FUNCTION token_get_address_tokens(info JSONB) RETURNS SETOF token_addresstoken
+Return the array of search tokens for the given address part. `key` can be
+expected to be one of those returned with `token_get_address_keys()`. The
+search tokens are added to the address search vector of the place, when no
+corresponding OSM object could be found for the given address part from which
+to copy the name information.
+
+```sql
+FUNCTION token_matches_address(info JSONB, key TEXT, tokens INTEGER[]) RETURNS BOOLEAN
  ```
  
-Return the match and search token IDs for explicit `addr:*` tags for the place
-other than `addr:street` and `addr:place`. For each address item there are
-three pieces of information returned:
+Check if the given tokens match against the address part `key`.
  
- * _key_ contains the type of address item (city, county, etc.). This is the
-   key handed in with the `address` dictionary.
- * *match_tokens* is the list of token IDs used to find the corresponding
-   place object for the address part. The list is matched against the IDs
-   from `token_get_name_match_tokens`.
- * *search_tokens* is the list of token IDs under which to search the address
-   item. It is used when no corresponding place object was found.
+__Warning:__ the tokens that are handed in are the lists previously saved
+from `token_get_name_search_tokens()`, _not_ from the match token list. This
+is an historical oddity which will be fixed at some point in the future.
+Currently, tokenizers are encouraged to make sure that matching works against
+both the search token list and the match token list.
  
  ```sql
-FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
  ```
  
-Return the normalized version of the given postcode. This function must return
-the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
+Return the postcode for the object, if any exists. The postcode must be in
+the form that should also be presented to the end-user.
  
  ```sql
  FUNCTION token_strip_info(info JSONB) RETURNS JSONB