Merge pull request #2425 from lonvia/tokenizer-documentation

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md

index b860ed36dc41142423d34160bc90744e4f8c7c0b..529315e4431dd1b2d08097ef7ed92491989c2de1 100644
--- a/docs/develop/Tokenizers.md
+++ b/docs/develop/Tokenizers.md
@@ -50,7 +50,7 @@ tokenizer's internal token lists and creating a list of all token IDs for
  the specific place. This list is later needed in the PL/pgSQL part where the
  indexer needs to add the token IDs to the appropriate search tables. To be
  able to communicate the list between the Python part and the pl/pgSQL trigger,
-the placex table contains a special JSONB column `token_info` which is there
+the `placex` table contains a special JSONB column `token_info` which is there
  for the exclusive use of the tokenizer.
  
  The Python part of the tokenizer returns structured information about the
@@ -67,12 +67,17 @@ consequently not create any special indexes on it.
  
  ### Querying
  
-The tokenizer is responsible for the initial parsing of the query. It needs
-to split the query into appropriate words and terms and match them against
-the saved tokens in the database. It then returns the list of possibly matching
-tokens and the list of possible splits to the query parser. The parser uses
-this information to compute all possible interpretations of the query and
-rank them accordingly.
+At query time, Nominatim builds up multiple _interpretations_ of the search
+query. Each of these interpretations is tried against the database in order
+of how likely it is to match the search query. The first
+interpretation that yields results wins.
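
The "first interpretation that yields results wins" strategy can be sketched as follows. This is only an illustration in Python: the `Interpretation` type, its `penalty` field and the `run_query` callback are hypothetical names and not part of Nominatim.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Interpretation:
    """Hypothetical stand-in for one interpretation of the query."""
    description: str
    penalty: float  # assumption: a lower penalty means a more likely match


def first_match(interpretations: Iterable[Interpretation],
                run_query: Callable[[Interpretation], List[str]]) -> List[str]:
    """Try interpretations from most to least likely; stop at the first hit."""
    for interp in sorted(interpretations, key=lambda i: i.penalty):
        results = run_query(interp)
        if results:
            return results
    return []
```

Less likely interpretations are never sent to the database once an earlier one has produced results.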
+
+The interpretations are encapsulated in the `SearchDescription` class. An
+instance of this class is created by applying a sequence of
+_search tokens_ to an initially empty `SearchDescription`. It is the
+responsibility of the tokenizer to derive all possible sequences of search
+tokens. To that end the tokenizer needs to parse the search query and look up
+matching words in its own data structures.
  
  ## Tokenizer API
  
@@ -88,7 +93,7 @@ for a custom tokenizer implementation.
  
  Nominatim expects two files for a tokenizer:
  
-* `nominiatim/tokenizer/<NAME>_tokenizer.py` containing the Pythonpart of the
+* `nominatim/tokenizer/<NAME>_tokenizer.py` containing the Python part of the
    implementation
  * `lib-php/tokenizer/<NAME>_tokenizer.php` with the PHP part of the
    implementation
@@ -137,3 +142,184 @@ and implement the abstract functions defined there.
  ::: nominatim.tokenizer.base.AbstractAnalyzer
      rendering:
          heading_level: 4
+
+### PL/pgSQL Functions
+
+The tokenizer must provide access functions for the `token_info` column
+to the indexer which extracts the necessary information for the global
+search tables. If the tokenizer needs additional SQL functions for private
+use, then these functions must be prefixed with `token_` in order to ensure
+that there are no naming conflicts with the SQL indexer code.
+
+The following functions are expected:
+
+```sql
+FUNCTION token_get_name_search_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return an array of token IDs of search terms that should match
+the name(s) for the given place. These tokens are used to look up the place
+by name and, where the place functions as part of an address for another place,
+by address. Must return NULL when the place has no name.
+
+```sql
+FUNCTION token_get_name_match_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return an array of token IDs of full names of the place that should be used
+to match addresses. The list of match tokens is usually more strict than
+search tokens as it is used to find a match between two OSM tag values which
+are expected to contain matching full names. Partial terms should not be
+used for match tokens. Must return NULL when the place has no name.
+
+```sql
+FUNCTION token_get_housenumber_search_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return an array of token IDs of house number tokens that apply to the place.
+Note that a place may have multiple house numbers, for example when apartments
+each have their own number. Must be NULL when the place has no house numbers.
+
+```sql
+FUNCTION token_normalized_housenumber(info JSONB) RETURNS TEXT
+```
+
+Return the house number(s) in the normalized form that can be matched against
+a house number token text. If a place has multiple house numbers they must
+be listed with a semicolon as delimiter. Must be NULL when the place has no
+house numbers.
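
To illustrate the expected return format, here is a minimal Python sketch of the semicolon-delimited form. The helper name `join_housenumbers` is hypothetical, and the normalization shown (trimming and upper-casing) is only an assumption; a real tokenizer defines its own normalization.

```python
from typing import Iterable, Optional


def join_housenumbers(numbers: Iterable[str]) -> Optional[str]:
    """Join normalized house numbers with ';' or return None (SQL NULL)."""
    # Assumption: normalization here is just trimming and upper-casing.
    normalized = [n.strip().upper() for n in numbers if n.strip()]
    return ';'.join(normalized) if normalized else None
```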
+
+```sql
+FUNCTION token_addr_street_match_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return the match token IDs by which to search a matching street from the
+`addr:street` tag. These IDs will be matched against the IDs supplied by
+`token_get_name_match_tokens`. Must be NULL when the place has no `addr:street`
+tag.
+
+```sql
+FUNCTION token_addr_place_match_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return the match token IDs by which to search a matching place from the
+`addr:place` tag. These IDs will be matched against the IDs supplied by
+`token_get_name_match_tokens`. Must be NULL when the place has no `addr:place`
+tag.
+
+```sql
+FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
+```
+
+Return the search token IDs extracted from the `addr:place` tag. These tokens
+are used for searches by address when no matching place can be found in the
+database. Must be NULL when the place has no `addr:place` tag.
+
+```sql
+CREATE TYPE token_addresstoken AS (
+  key TEXT,
+  match_tokens INT[],
+  search_tokens INT[]
+);
+
+FUNCTION token_get_address_tokens(info JSONB) RETURNS SETOF token_addresstoken
+```
+
+Return the match and search token IDs for explicit `addr:*` tags for the place
+other than `addr:street` and `addr:place`. For each address item there are
+three pieces of information returned:
+
+ * _key_ contains the type of address item (city, county, etc.). This is the
+   key handed in with the `address` dictionary.
+ * *match_tokens* is the list of token IDs used to find the corresponding
+   place object for the address part. The list is matched against the IDs
+   from `token_get_name_match_tokens`.
+ * *search_tokens* is the list of token IDs under which to search the address
+   item. It is used when no corresponding place object was found.
+
+```sql
+FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
+```
+
+Return the normalized version of the given postcode. This function must return
+the same value as the Python function `AbstractAnalyzer.normalize_postcode()`.
+
+```sql
+FUNCTION token_strip_info(info JSONB) RETURNS JSONB
+```
+
+Return the part of the `token_info` field that should be stored in the database
+permanently. The indexer calls this function when all processing is done and
+replaces the content of the `token_info` column with the returned value before
+the trigger stores the information in the database. May return NULL if no
+information should be stored permanently.
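
The intended behaviour can be modelled in Python as below. This is a sketch only: the real function is written in PL/pgSQL, and the key name `hnr` and the `PERMANENT_KEYS` set are hypothetical examples of what a tokenizer might decide to keep.

```python
from typing import Any, Dict, Optional

# Assumption: only these keys are still needed after indexing has finished.
PERMANENT_KEYS = {'hnr'}


def strip_info(info: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Keep only the permanent keys; None models SQL NULL."""
    kept = {k: v for k, v in info.items() if k in PERMANENT_KEYS}
    return kept or None
```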
+
+### PHP Tokenizer class
+
+The PHP tokenizer class is instantiated once per request and is responsible for
+analyzing the incoming query. Multiple requests may be in flight in
+parallel.
+
+The class is expected to be found under the
+name of `\Nominatim\Tokenizer`. To find the class the PHP code includes the file
+`tokenizer/tokenizer.php` in the project directory. This file must be created
+when the tokenizer is first set up on import. The file should initialize any
+configuration variables by setting PHP constants and then require the file
+with the actual implementation of the tokenizer.
+
+The tokenizer class must implement the following functions:
+
+```php
+public function __construct(object &$oDB)
+```
+
+The constructor of the class receives a database connection that can be used
+to query persistent data in the database.
+
+```php
+public function checkStatus()
+```
+
+Check that the tokenizer can access its persistent data structures. If there
+is an issue, throw an `\Exception`.
+
+```php
+public function normalizeString(string $sTerm) : string
+```
+
+Normalize the string to a form that is used for comparisons when reordering
+results. Nominatim reweighs results by how well the final display string matches
+the actual query. Before comparing result and query, names and query are
+normalised with this function. The tokenizer can thus remove all properties that
+should not be taken into account for reweighing, e.g. special characters or case.
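
A normalization of this kind might look like the following Python sketch. It is not the normalization of any existing tokenizer: the function name and the choice to also drop accents are assumptions made for illustration.

```python
import unicodedata


def normalize_string(term: str) -> str:
    """Lower-case the term and replace non-alphanumeric characters by spaces."""
    # Decompose accented characters so combining marks can be dropped below.
    decomposed = unicodedata.normalize('NFKD', term)
    kept = ''.join(c if c.isalnum() else ' ' for c in decomposed
                   if not unicodedata.combining(c))
    # Collapse runs of whitespace left behind by the replaced characters.
    return ' '.join(kept.lower().split())
```

With such a function, a query and a display name that differ only in case or punctuation compare as equal.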
+
+```php
+public function tokensForSpecialTerm(string $sTerm) : array
+```
+
+Return the list of special term tokens that match the given term.
+
+```php
+public function extractTokensFromPhrases(array &$aPhrases) : TokenList
+```
+
+Parse the given phrases, split them into word lists and retrieve the
+matching tokens.
+
+The phrase array may take on two forms. In unstructured searches (using the `q=`
+parameter) the search query is split at the commas and the elements are
+put into a sorted list. For structured searches the phrase array is an
+associative array where the key designates the type of the term (street, city,
+county, etc.). The tokenizer may ignore the phrase type at this stage in parsing.
+Matching phrase type and appropriate search token type will be done later
+when the `SearchDescription` is built.
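
The two input forms can be modelled in Python as a list and a dictionary. The function names below are hypothetical, and dropping empty elements after the split is an assumption, not documented behaviour.

```python
from typing import Dict, List, Union

Phrases = Union[List[str], Dict[str, str]]


def phrases_from_free_text(query: str) -> List[str]:
    """Unstructured search: split at commas, keeping the original order."""
    # Assumption: surrounding whitespace and empty elements are discarded.
    return [part.strip() for part in query.split(',') if part.strip()]


def phrase_strings(phrases: Phrases) -> List[str]:
    """Return the phrase strings, ignoring the phrase type if there is one."""
    if isinstance(phrases, dict):
        return list(phrases.values())
    return list(phrases)
```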
+
+For each phrase in the list of phrases, the function must analyse the phrase
+string and then call `setWordSets()` to communicate the result of the analysis.
+A word set is a list of strings, where each string refers to a search token.
+A phrase may have multiple interpretations. Therefore a list of word sets is
+usually attached to the phrase. The search tokens themselves are returned
+by the function in an associative array, where the key corresponds to the
+strings given in the word sets. The value is a list of search tokens. Thus
+a single string in the list of word sets may refer to multiple search tokens.
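
The relationship between word sets and the returned token array can be sketched in Python as follows. All names and token IDs here are invented for illustration, and enumerating every grouping of adjacent words is just one possible way to produce word sets; an actual tokenizer may split phrases differently.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table: word string -> list of search token IDs.
KNOWN_TOKENS: Dict[str, List[int]] = {
    'new': [1],
    'york': [2],
    'new york': [3, 4],
}


def word_sets(words: List[str]) -> List[List[str]]:
    """Return every way to split `words` into runs of adjacent words."""
    if not words:
        return [[]]
    result = []
    for i in range(1, len(words) + 1):
        head = ' '.join(words[:i])
        for rest in word_sets(words[i:]):
            result.append([head] + rest)
    return result


def analyse_phrase(phrase: str) -> Tuple[List[List[str]], Dict[str, List[int]]]:
    """Return the word sets of a phrase and the tokens keyed by word string."""
    sets = word_sets(phrase.lower().split())
    tokens = {w: KNOWN_TOKENS[w] for ws in sets for w in ws if w in KNOWN_TOKENS}
    return sets, tokens
```

In this model the first return value corresponds to what would be passed to `setWordSets()` and the second to the associative array returned by the function; the string `new york` maps to two search tokens.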
+