Elasticsearch pattern_capture filter also emits tokens that do not match the pattern


I have a case where I have to extract the domain part of emails found in text. I used the uax_url_email tokenizer to keep each email as a single token, and I have a pattern_capture filter that emits the capture group of the pattern "@(.+)". However, uax_url_email also returns the words that are not emails, and the pattern_capture filter does not filter those out. Any suggestions?

"custom_analyzer": {
  "tokenizer": "uax_url_email",
  "filter": [
    "email_domain_filter"
  ]
}

"filter": {
  "email_domain_filter": {
    "type": "pattern_capture",
    "preserve_original": false,
    "patterns": [
      "@(.+)"
    ]
  }
}

Input string: "my email id is xyz@gmail.com"

Output tokens: my, email, id, is, gmail.com

But I need only gmail.com.

This is the documented behaviour of the filter:

"If none of the patterns match, or if preserveOriginal is true, the original token will be preserved."

https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html

Try adding a pattern that matches the other tokens but does not contain a capture group (e.g. ".*").
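To make the quoted rule concrete, here is a small Python sketch that mimics it on the tokens from the question. It is only a toy model under stated assumptions: the function name `pattern_capture` and its arguments are hypothetical, and this is not Lucene's actual implementation. It simply passes a token through unchanged whenever nothing is captured, which is why the plain words survive the filter.

```python
import re

def pattern_capture(tokens, patterns, preserve_original=False):
    """Toy model of the documented fall-through rule (hypothetical helper,
    not Lucene's code): emit captured groups, else keep the original token."""
    compiled = [re.compile(p) for p in patterns]
    out = []
    for token in tokens:
        captures = []
        for rx in compiled:
            for match in rx.finditer(token):
                # Collect every non-empty capture group of this match.
                captures.extend(g for g in match.groups() if g)
        # Keep the original token if asked to, or if nothing was captured.
        if preserve_original or not captures:
            out.append(token)
        out.extend(captures)
    return out

# Tokens as the uax_url_email tokenizer would split the question's input.
tokens = ["my", "email", "id", "is", "xyz@gmail.com"]
result = pattern_capture(tokens, ["@(.+)"])
print(result)
```

Only the last token contains "@", so it is the only one replaced by its captured domain; every other token falls through untouched, matching the output tokens reported in the question.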

