X-Git-Url: https://git.openstreetmap.org./nominatim.git/blobdiff_plain/c4b8a3b7680ae51da4c7f0ac0849ecd9fe3d5660..8e1d4818aca13a7a0a992af7ba8c5b9ff47f65ed:/docs/develop/Tokenizers.md

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md
index 743e637c..529315e4 100644
--- a/docs/develop/Tokenizers.md
+++ b/docs/develop/Tokenizers.md
@@ -50,7 +50,7 @@ tokenizer's internal token lists and creating a list of all token IDs for
 the specific place. This list is later needed in the PL/pgSQL part where the
 indexer needs to add the token IDs to the appropriate search tables. To be
 able to communicate the list between the Python part and the pl/pgSQL trigger,
-the placex table contains a special JSONB column `token_info` which is there
+the `placex` table contains a special JSONB column `token_info` which is there
 for the exclusive use of the tokenizer.
 
 The Python part of the tokenizer returns a structured information about the
@@ -67,12 +67,17 @@ consequently not create any special indexes on it.
 
 ### Querying
 
-The tokenizer is responsible for the initial parsing of the query. It needs
-to split the query into appropriate words and terms and match them against
-the saved tokens in the database. It then returns the list of possibly matching
-tokens and the list of possible splits to the query parser. The parser uses
-this information to compute all possible interpretations of the query and
-rank them accordingly.
+At query time, Nominatim builds up multiple _interpretations_ of the search
+query. Each of these interpretations is tried against the database in order
+of the likelihood with which they match to the search query. The first
+interpretation that yields results wins.
+
+The interpretations are encapsulated in the `SearchDescription` class. An
+instance of this class is created by applying a sequence of
+_search tokens_ to an initially empty SearchDescription. It is the
+responsibility of the tokenizer to parse the search query and derive all
+possible sequences of search tokens. To that end the tokenizer needs to parse
+the search query and look up matching words in its own data structures.
 
 ## Tokenizer API
 
@@ -301,6 +306,14 @@ public function extractTokensFromPhrases(array &$aPhrases) : TokenList
 Parse the given phrases, splitting them into word lists and retrieve the
 matching tokens.
 
+The phrase array may take on two forms. In unstructured searches (using `q=`
+parameter) the search query is split at the commas and the elements are
+put into a sorted list. For structured searches the phrase array is an
+associative array where the key designates the type of the term (street, city,
+county etc.) The tokenizer may ignore the phrase type at this stage in parsing.
+Matching phrase type and appropriate search token type will be done later
+when the SearchDescription is built.
+
 For each phrase in the list of phrases, the function must analyse the phrase
 string and then call `setWordSets()` to communicate the result of the analysis.
 A word set is a list of strings, where each string refers to a search token.
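
The two forms of the phrase array described in the last hunk can be illustrated with a minimal Python sketch. This is not Nominatim code: the real entry point is the PHP function `extractTokensFromPhrases()`, and the names `make_phrases`, `word_sets` and `analyse` are hypothetical. Two assumptions are made. First, the "sorted list" is read as a list that keeps the order of the comma-separated parts. Second, a phrase's word sets are taken to be every way of cutting the phrase into contiguous word groups. The text above only says that a word set is a list of strings referring to search tokens, so a real tokenizer may derive them differently.

```python
from typing import Dict, List, Union

# Unstructured search: plain list. Structured search: mapping of type -> text.
Phrases = Union[List[str], Dict[str, str]]


def make_phrases(query: str) -> List[str]:
    """Split an unstructured q= value at the commas, keeping the order."""
    return [part.strip() for part in query.split(",") if part.strip()]


def word_sets(phrase: str) -> List[List[str]]:
    """Assumed analysis: all ways to cut the phrase into contiguous word groups."""
    words = phrase.split()
    if not words:
        return []
    results = []
    # Bit i of `mask` decides whether there is a cut after word i.
    for mask in range(1 << (len(words) - 1)):
        groups, current = [], [words[0]]
        for i, word in enumerate(words[1:]):
            if mask & (1 << i):
                groups.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        groups.append(" ".join(current))
        results.append(groups)
    return results


def analyse(phrases: Phrases) -> Dict[Union[int, str], List[List[str]]]:
    """Handle both forms; the phrase type (dict key) is kept but not interpreted."""
    items = phrases.items() if isinstance(phrases, dict) else enumerate(phrases)
    return {key: word_sets(text) for key, text in items}


unstructured = analyse(make_phrases("Main Street, Springfield"))
structured = analyse({"street": "Main Street", "city": "Springfield"})
```

Both calls yield the same word sets per phrase. Only the keys differ: positions `0` and `1` for the unstructured query, `street` and `city` for the structured one. This mirrors the statement that the tokenizer may ignore the phrase type at this stage.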