From: Sarah Hoffmann
Date: Thu, 20 Jan 2022 14:49:32 +0000 (+0100)
Subject: complete documentation for new clean-houseunubmers sanatizer
X-Git-Tag: v4.1.0~94^2~2
X-Git-Url: https://git.openstreetmap.org./nominatim.git/commitdiff_plain/f3c9578bcaf8a1981b160b14809e9dc1377cfb37

complete documentation for new clean-houseunubmers sanatizer
---

diff --git a/docs/customize/Tokenizers.md b/docs/customize/Tokenizers.md
index 5c766f50..f75bc6a5 100644
--- a/docs/customize/Tokenizers.md
+++ b/docs/customize/Tokenizers.md
@@ -181,6 +181,13 @@ The following is a list of sanitizers that are shipped with Nominatim.
     rendering:
         heading_level: 6
 
+##### clean-housenumbers
+
+::: nominatim.tokenizer.sanitizers.clean_housenumbers
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
 
 
 #### Token Analysis
diff --git a/nominatim/tokenizer/sanitizers/clean_housenumbers.py b/nominatim/tokenizer/sanitizers/clean_housenumbers.py
index 9777a7fc..85af903b 100644
--- a/nominatim/tokenizer/sanitizers/clean_housenumbers.py
+++ b/nominatim/tokenizer/sanitizers/clean_housenumbers.py
@@ -5,7 +5,11 @@
 # Copyright (C) 2022 by the Nominatim developer community.
 # For a full list of authors see the git log.
 """
-Sanitizer that cleans and normalizes house numbers.
+Sanitizer that preprocesses address tags for house numbers. The sanitizer
+allows to
+
+* define which tags are to be considered house numbers (see 'filter-kind')
+* split house number lists into individual numbers (see 'delimiters')
 
 Arguments:
     delimiters: Define the set of characters to be used for
diff --git a/settings/icu_tokenizer.yaml b/settings/icu_tokenizer.yaml
index d00cffb9..bf51f563 100644
--- a/settings/icu_tokenizer.yaml
+++ b/settings/icu_tokenizer.yaml
@@ -28,6 +28,10 @@ sanitizers:
     - step: split-name-list
     - step: strip-brace-terms
     - step: clean-housenumbers
+      filter-kind:
+        - housenumber
+        - conscriptionnumber
+        - streetnumber
     - step: tag-analyzer-by-language
       filter-kind: [".*name.*"]
       whitelist: [bg,ca,cs,da,de,el,en,es,et,eu,fi,fr,gl,hu,it,ja,mg,ms,nl,no,pl,pt,ro,ru,sk,sl,sv,tr,uk,vi]
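
The two behaviours named in the new docstring (choosing which address tags count as house numbers via `filter-kind`, and splitting house number lists via `delimiters`) can be illustrated with a small standalone sketch. This is not the code in `clean_housenumbers.py`: the function `clean_housenumber_tags`, its `(kind, value)` tuple input, the default delimiter set `",;"`, and the relabelling of matching tags to the kind `housenumber` are all assumptions made for illustration only.

```python
import re
from typing import Iterable, List, Tuple


def clean_housenumber_tags(
    tags: Iterable[Tuple[str, str]],
    filter_kind: Iterable[str] = ("housenumber",),
    delimiters: str = ",;",
) -> List[Tuple[str, str]]:
    """Hypothetical helper, not part of Nominatim.

    Tags whose kind is listed in `filter_kind` are treated as house
    numbers: their value is split on any character in `delimiters`
    and each non-empty part is emitted with the kind 'housenumber'
    (an assumption of this sketch). All other tags pass through
    unchanged.
    """
    kinds = set(filter_kind)
    # Build a character class from the delimiters; an empty delimiter
    # string means "do not split at all".
    splitter = re.compile("[" + re.escape(delimiters) + "]") if delimiters else None

    result: List[Tuple[str, str]] = []
    for kind, value in tags:
        if kind not in kinds:
            # Not a house number tag: keep as-is.
            result.append((kind, value))
            continue
        parts = splitter.split(value) if splitter else [value]
        for part in parts:
            part = part.strip()
            if part:
                result.append(("housenumber", part))
    return result


if __name__ == "__main__":
    example = [
        ("housenumber", "3, 5;7"),
        ("street", "Main St, East"),
        ("conscriptionnumber", "123"),
    ]
    # Same kinds as the filter-kind list added to icu_tokenizer.yaml above.
    kinds = ("housenumber", "conscriptionnumber", "streetnumber")
    for tag in clean_housenumber_tags(example, filter_kind=kinds):
        print(tag)
```

With the `filter-kind` list from the YAML hunk, the `street` tag is left untouched even though it contains a comma, while the `housenumber` and `conscriptionnumber` values are split and emitted as individual house numbers.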