From 5b86b2078a19d68d897c2465adcb73e1c6683d9e Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann <lonvia@denofr.de>
Date: Mon, 1 Nov 2021 11:04:03 +0100
Subject: [PATCH] docs: add overview over indexing

---
 docs/develop/Indexing.md             | 152 +++++++++++++++++++++++++++
 docs/develop/parenting-flow.plantuml |  31 ++++++
 docs/develop/parenting-flow.svg      |  41 ++++++++
 docs/mkdocs.yml                      |   1 +
 4 files changed, 225 insertions(+)
 create mode 100644 docs/develop/Indexing.md
 create mode 100644 docs/develop/parenting-flow.plantuml
 create mode 100644 docs/develop/parenting-flow.svg

diff --git a/docs/develop/Indexing.md b/docs/develop/Indexing.md
new file mode 100644
index 00000000..22959e22
--- /dev/null
+++ b/docs/develop/Indexing.md
@@ -0,0 +1,152 @@
+# Indexing Places
+
+In Nominatim, the word __indexing__ refers to the process that takes the raw
+OpenStreetMap data from the place table, enriches it with address information
+and creates the search indexes. This section explains the basic data flow.
+
+
+## Initial import
+
+After osm2pgsql has loaded the raw OSM data into the place table,
+the data is copied to the final search tables placex and location_property_osmline.
+During the copy, some basic properties are added:
+
+ * country_code, geometry_sector and partition
+ * initial search and address rank
+
+In addition, the column `indexed_status` is set to `1`, marking the place as one
+that needs to be indexed.
+
+All this happens in the triggers `placex_insert` and `osmline_insert`.
+
+## Indexing
+
+The main workhorse of the data import is the indexing step, where Nominatim
+takes every place from the placex and location_property_osmline tables where
+`indexed_status != 0` and computes the search terms and the address parts
+of the place.
+
+The indexing happens in three major steps:
+
+1. **Data preparation** - The indexer gets the data for the place to be indexed
+   from the database.
+
+2. **Search name processing** - The prepared data is given to the
+   tokenizer which computes the search terms from the names
+   and potentially other information.
+
+3. **Address processing** - The indexer then hands the prepared data and the
+   tokenizer information back to the database via an `UPDATE` statement which
+   also sets the indexed_status to `0`. This fires the update triggers
+   `placex_update`/`osmline_update` which do the work of computing address
+   parts and filling all the search tables.
+
+When computing the address terms of a place, Nominatim relies on the processed
+search names of all the address parts. That is why places are processed in rank
+order, from smallest rank to largest. To ensure correct handling of linked
+place nodes, administrative boundaries are processed before all other places.
+
+Apart from these restrictions, each place can be indexed independently
+from the others. This allows a large degree of parallelization during the indexing.
+It also means that the indexing process can be interrupted at any time and
+will simply pick up where it left off when restarted.
+
+### Data preparation
+
+The data preparation step computes and retrieves all data for a place that
+might be needed for the next step of processing the search name. That includes
+
+* location information (country code)
+* place classification (class, type, ranks)
+* names (including names of linked places)
+* address information (`addr:*` tags)
+
+Data preparation is implemented in PL/pgSQL, mostly in the functions
+`placex_indexing_prepare()` and `get_interpolation_address()`.
+
+#### `addr:*` tag inheritance
+
+Nominatim has limited support for inheriting address tags from a building
+to POIs inside the building. This only works when the address tags are on the
+building outline. Any rank 30 object inside such a building or on its outline
+inherits all address tags when it does not have any address tags of its own.
+
+The inheritance is computed in the data preparation step.
+
+### Search name processing
+
+The prepared place information is handed to the tokenizer next. This is a
+Python module responsible for processing the names from both name and address
+terms and building up the word index from them. The process is explained in
+more detail in the [Tokenizer chapter](Tokenizers.md).
+
+### Address processing
+
+Finally, the preprocessed place information and the results of the search name
+processing are written back to the database. At this point the update triggers
+of the placex/location_property_osmline tables take over and fill all the
+dependent tables. This is the most work-intensive part of the indexing.
+
+Nominatim distinguishes between dependent and independent places.
+**Dependent places** are all places on rank 30: house numbers, POIs etc. These
+places don't have a full address of their own. Instead they are attached to
+a parent street or place and use the information of the parent for searching
+and displaying information. All other places are **independent places**: streets,
+parks, water bodies, suburbs, cities, states etc. They receive a full address
+of their own.
+
+The address processing differs considerably between the two types of places.
+
+#### Independent places
+
+To compute the address of an independent place, Nominatim searches for all
+places that cover it at least partially.
+For places with an area, that area is used to check for coverage. For place
+nodes an artificial square area is computed according to the rank of
+the place. The lower the rank, the larger the area. The `location_area_large_X`
+tables are there to facilitate the lookup. All places that can function as
+the address of another place are saved in those tables.
+
+`addr:*` and `isin:*` tags are taken into account to compute the address, too.
+Nominatim will give preference to places with the same name as in these tags
+when looking for places in the vicinity. If there are no matching place names
+at all, then the tags are at least added to the search index. That means that
+the names will not be shown in the result as the 'address' of the place, but
+searching by them still works.
+
+Independent places are always added to the global search index `search_name`.
+
+#### Dependent places
+
+Dependent places skip the full address computation for performance reasons.
+Instead they just find a parent place to attach themselves to.
+
+![parenting of dependent places](parenting-flow.svg)
+
+By default a POI
+or house number will be attached to the closest street. That can be any major
+or minor street indexed by Nominatim. In the default configuration that means
+that it can attach itself to a footway, but only when the footway has a name.
+
+When the dependent place has an `addr:street` tag, then Nominatim will first
+try to find a street with the same name before falling back to the closest
+street.
+
+There are also addresses in OSM where the house number does not belong
+to a street at all. These have an `addr:place` tag. For these places, Nominatim
+tries to find a place with the given name in the indexed places with an
+address rank between 16 and 25. If none is found, then the dependent place
+is attached to the closest place in that category and the `addr:place` name is
+added as an *unlisted* place, which indicates to Nominatim that it needs to add
+it to the address output, no matter what. This special case is necessary to
+cover addresses that don't really refer to an existing object.
+
+When an address has both the `addr:street` and `addr:place` tags, Nominatim
+assumes that the `addr:place` tag in fact should be the city part of the address
+and gives the POI the usual street number address.
+
+Dependent places are only added to the global search index `search_name` when
+they have either a name themselves or when they have address tags that are not
+covered by the places that make up their address. The latter ensures that
+addresses are always searchable by those address tags.
+
diff --git a/docs/develop/parenting-flow.plantuml b/docs/develop/parenting-flow.plantuml
new file mode 100644
index 00000000..ade927c6
--- /dev/null
+++ b/docs/develop/parenting-flow.plantuml
@@ -0,0 +1,31 @@
+@startuml
+skinparam monochrome true
+
+start
+
+if (has 'addr:street'?) then (yes)
+  if (street with that name\n nearby?) then (yes)
+    :**Use closest street**
+     **with same name**;
+     kill
+  else (no)
+    :** Use closest**\n**street**;
+     kill
+  endif
+elseif (has 'addr:place'?) then (yes)
+  if (place with that name\n nearby?) then (yes)
+    :**Use closest place**
+     **with same name**;
+     kill
+  else (no)
+    :add addr:place to address;
+    :**Use closest place**\n**rank 16 to 25**;
+     kill
+  endif
+else (otherwise)
+ :**Use closest**\n**street**;
+ kill
+endif
+
+
+@enduml
diff --git a/docs/develop/parenting-flow.svg b/docs/develop/parenting-flow.svg
new file mode 100644
index 00000000..7e8271a9
--- /dev/null
+++ b/docs/develop/parenting-flow.svg
@@ -0,0 +1,41 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" contentScriptType="application/ecmascript" contentStyleType="text/css" height="275px" preserveAspectRatio="none" style="width:785px;height:275px;background:#FFFFFF;" version="1.1" viewBox="0 0 785 275" width="785px" zoomAndPan="magnify"><defs><filter height="300%" id="f1b513ppngo123" width="300%" x="-1" y="-1"><feGaussianBlur result="blurOut" stdDeviation="2.0"/><feColorMatrix in="blurOut" result="blurOut2" type="matrix" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .4 0"/><feOffset dx="4.0" dy="4.0" in="blurOut2" result="blurOut3"/><feBlend in="SourceGraphic" in2="blurOut3" mode="normal"/></filter></defs><g><ellipse cx="379.5" cy="20" fill="#000000" filter="url(#f1b513ppngo123)" rx="10" ry="10" style="stroke:none;stroke-width:1.0;"/><polygon fill="#F8F8F8" filter="url(#f1b513ppngo123)" points="118,50,218,50,230,62,218,74,118,74,106,62,118,50" style="stroke:#383838;stroke-width:1.5;"/><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="20" x="172" y="84.2104">yes</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="100" x="118" y="65.8081">has 'addr:street'?</text><polygon fill="#F8F8F8" filter="url(#f1b513ppngo123)" points="108,105.7104,228,105.7104,240,118.5151,228,131.3198,108,131.3198,96,118.5151,108,105.7104" style="stroke:#383838;stroke-width:1.5;"/><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="120" x="108" y="115.9209">street with that name</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="45" x="111" y="128.7256">nearby?</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="20" x="76" y="115.9209">yes</text><text fill="#000000" font-family="sans-serif" font-size="11" 
lengthAdjust="spacing" textLength="14" x="240" y="115.9209">no</text><rect fill="#F8F8F8" filter="url(#f1b513ppngo123)" height="47.9375" rx="12.5" ry="12.5" style="stroke:#383838;stroke-width:1.5;" width="150" x="11" y="141.3198"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="130" x="21" y="162.4585">Use closest street</text><text fill="#000000" font-family="sans-serif" font-size="12" lengthAdjust="spacing" textLength="0" x="25" y="176.4272"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="116" x="25" y="176.4272">with same name</text><rect fill="#F8F8F8" filter="url(#f1b513ppngo123)" height="47.9375" rx="12.5" ry="12.5" style="stroke:#383838;stroke-width:1.5;" width="106" x="197" y="141.3198"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="82" x="211" y="162.4585">Use closest</text><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="44" x="207" y="176.4272">street</text><polygon fill="#F8F8F8" filter="url(#f1b513ppngo123)" points="427.75,50,523.75,50,535.75,62,523.75,74,427.75,74,415.75,62,427.75,50" style="stroke:#383838;stroke-width:1.5;"/><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="20" x="479.75" y="84.2104">yes</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="96" x="427.75" y="65.8081">has 'addr:place'?</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="56" x="535.75" y="59.4058">otherwise</text><polygon fill="#F8F8F8" filter="url(#f1b513ppngo123)" points="417.75,105.7104,533.75,105.7104,545.75,118.5151,533.75,131.3198,417.75,131.3198,405.75,118.5151,417.75,105.7104" style="stroke:#383838;stroke-width:1.5;"/><text fill="#000000" 
font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="116" x="417.75" y="115.9209">place with that name</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="45" x="420.75" y="128.7256">nearby?</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="20" x="385.75" y="115.9209">yes</text><text fill="#000000" font-family="sans-serif" font-size="11" lengthAdjust="spacing" textLength="14" x="545.75" y="115.9209">no</text><rect fill="#F8F8F8" filter="url(#f1b513ppngo123)" height="47.9375" rx="12.5" ry="12.5" style="stroke:#383838;stroke-width:1.5;" width="144" x="313" y="141.3198"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="124" x="323" y="162.4585">Use closest place</text><text fill="#000000" font-family="sans-serif" font-size="12" lengthAdjust="spacing" textLength="0" x="327" y="176.4272"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="116" x="327" y="176.4272">with same name</text><rect fill="#F8F8F8" filter="url(#f1b513ppngo123)" height="33.9688" rx="12.5" ry="12.5" style="stroke:#383838;stroke-width:1.5;" width="179" x="477" y="141.3198"/><text fill="#000000" font-family="sans-serif" font-size="12" lengthAdjust="spacing" textLength="159" x="487" y="162.4585">add addr:place to adress</text><rect fill="#F8F8F8" filter="url(#f1b513ppngo123)" height="47.9375" rx="12.5" ry="12.5" style="stroke:#383838;stroke-width:1.5;" width="144" x="494.5" y="210.2886"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="124" x="504.5" y="231.4272">Use closest place</text><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="91" x="504.5" y="245.396">rank 16 to 25</text><rect fill="#F8F8F8" 
filter="url(#f1b513ppngo123)" height="47.9375" rx="12.5" ry="12.5" style="stroke:#383838;stroke-width:1.5;" width="102" x="666" y="157.5972"/><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="82" x="676" y="178.7358">Use closest</text><text fill="#000000" font-family="sans-serif" font-size="12" font-weight="bold" lengthAdjust="spacing" textLength="44" x="676" y="192.7046">street</text><line style="stroke:#383838;stroke-width:1.5;" x1="96" x2="86" y1="118.5151" y2="118.5151"/><line style="stroke:#383838;stroke-width:1.5;" x1="86" x2="86" y1="118.5151" y2="141.3198"/><polygon fill="#383838" points="82,131.3198,86,141.3198,90,131.3198,86,135.3198" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="240" x2="250" y1="118.5151" y2="118.5151"/><line style="stroke:#383838;stroke-width:1.5;" x1="250" x2="250" y1="118.5151" y2="141.3198"/><polygon fill="#383838" points="246,131.3198,250,141.3198,254,131.3198,250,135.3198" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="566.5" x2="566.5" y1="175.2886" y2="210.2886"/><polygon fill="#383838" points="562.5,200.2886,566.5,210.2886,570.5,200.2886,566.5,204.2886" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="405.75" x2="385" y1="118.5151" y2="118.5151"/><line style="stroke:#383838;stroke-width:1.5;" x1="385" x2="385" y1="118.5151" y2="141.3198"/><polygon fill="#383838" points="381,131.3198,385,141.3198,389,131.3198,385,135.3198" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="545.75" x2="566.5" y1="118.5151" y2="118.5151"/><line style="stroke:#383838;stroke-width:1.5;" x1="566.5" x2="566.5" y1="118.5151" y2="141.3198"/><polygon fill="#383838" points="562.5,131.3198,566.5,141.3198,570.5,131.3198,566.5,135.3198" style="stroke:#383838;stroke-width:1.0;"/><line 
style="stroke:#383838;stroke-width:1.5;" x1="168" x2="168" y1="74" y2="105.7104"/><polygon fill="#383838" points="164,95.7104,168,105.7104,172,95.7104,168,99.7104" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="475.75" x2="475.75" y1="74" y2="105.7104"/><polygon fill="#383838" points="471.75,95.7104,475.75,105.7104,479.75,95.7104,475.75,99.7104" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="230" x2="415.75" y1="62" y2="62"/><polygon fill="#383838" points="405.75,58,415.75,62,405.75,66,409.75,62" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="379.5" x2="379.5" y1="30" y2="35"/><line style="stroke:#383838;stroke-width:1.5;" x1="379.5" x2="168" y1="35" y2="35"/><line style="stroke:#383838;stroke-width:1.5;" x1="168" x2="168" y1="35" y2="50"/><polygon fill="#383838" points="164,40,168,50,172,40,168,44" style="stroke:#383838;stroke-width:1.0;"/><line style="stroke:#383838;stroke-width:1.5;" x1="535.75" x2="717" y1="62" y2="62"/><line style="stroke:#383838;stroke-width:1.5;" x1="717" x2="717" y1="62" y2="157.5972"/><polygon fill="#383838" points="713,147.5972,717,157.5972,721,147.5972,717,151.5972" style="stroke:#383838;stroke-width:1.0;"/><!--MD5=[e03d31a5684b671bb715075c57004ccb]
+@startuml
+skinparam monochrome true
+
+start
+
+if (has 'addr:street'?) then (yes)
+  if (street with that name\n nearby?) then (yes)
+    :**Use closest street**
+     **with same name**;
+     kill
+  else (no)
+    :** Use closest**\n**street**;
+     kill
+  endif
+elseif (has 'addr:place'?) then (yes)
+  if (place with that name\n nearby?) then (yes)
+    :**Use closest place**
+     **with same name**;
+     kill
+  else (no)
+    :add addr:place to adress;
+    :**Use closest place**\n**rank 16 to 25**;
+     kill
+  endif
+else (otherwise)
+ :**Use closest**\n**street**;
+ kill
+endif
+
+
+@enduml
+
+PlantUML version 1.2021.12(Tue Oct 05 18:01:58 CEST 2021)
+(GPL source distribution)
+Java Runtime: OpenJDK Runtime Environment
+JVM: OpenJDK 64-Bit Server VM
+Default Encoding: UTF-8
+Language: en
+Country: US
+--></g></svg>
\ No newline at end of file
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 22a9f9fe..3fae95b7 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -36,6 +36,7 @@ pages:
     - 'Developers Guide':
         - 'Architecture Overview' : 'develop/overview.md'
         - 'Database Layout' : 'develop/Database-Layout.md'
+        - 'Indexing' : 'develop/Indexing.md'
         - 'Tokenizers' : 'develop/Tokenizers.md'
         - 'Setup for Development' : 'develop/Development-Environment.md'
         - 'Testing' : 'develop/Testing.md'
-- 
2.39.5