Using an NLTK regex example in scikit-learn CountVectorizer


I am trying to use an example regex pattern from the NLTK book inside scikit-learn's CountVectorizer. I have seen examples with simple regexes, but not with one like this:

    from sklearn.feature_extraction.text import CountVectorizer

    pattern = r'''(?x)          # set flag to allow verbose regexps
         ([A-Z]\.)+             # abbreviations (e.g. U.S.A.)
       | \w+(-\w+)*             # words with optional internal hyphens
       | \$?\d+(\.\d+)?%?       # currency & percentages
       | \.\.\.                 # ellipses
    '''

    text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
    vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
    analyze = vectorizer.build_analyzer()
    analyze(text)

This produces:

    [(u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'', u''),
     (u'', u'-ridden', u''),
     (u'', u'', u''),
     (u'', u'', u'')]

With NLTK, I get something entirely different:

    nltk.regexp_tokenize(text, pattern)

    ['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']

Is there a way to get scikit-learn's CountVectorizer to output the same thing? I was hoping to use some of the other handy features that are incorporated in the same function call.

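TL;DR

    from functools import partial
    from nltk.tokenize import regexp_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

is a vectorizer that uses the NLTK tokenizer.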

Now for the actual problem: apparently nltk.regexp_tokenize treats its pattern in a quite special way, whereas scikit-learn simply calls re.findall with the pattern you give it, and findall doesn't like this pattern:

    In [33]: re.findall(pattern, text)
    Out[33]:
    [('', '', ''),
     ('', '', ''),
     ('C.', '', ''),
     ('', '', ''),
     ('', '', ''),
     ('', '', ''),
     ('', '', ''),
     ('', '', ''),
     ('', '', ''),
     ('', '-ridden', ''),
     ('', '', ''),
     ('', '', '')]
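The tuples are a consequence of the capturing groups: when a pattern contains more than one group, re.findall returns a tuple of group contents for each match instead of the matched text, and groups that did not take part in the match show up as empty strings. A minimal sketch using only the standard library (the two-alternative toy patterns and the sample string are illustrative, not taken from the question):

```python
import re

sample = "N.Y.C. traffic-ridden"

# Two capturing groups: findall yields one tuple per match, holding only
# what each group captured last, not the whole matched token.
capturing = r"([A-Z]\.)+|\w+(-\w+)*"

# Same alternatives with non-capturing groups (?:...): findall yields
# the full matched text.
non_capturing = r"(?:[A-Z]\.)+|\w+(?:-\w+)*"

groups_only = re.findall(capturing, sample)
full_matches = re.findall(non_capturing, sample)

print(groups_only)   # [('C.', ''), ('', '-ridden')]
print(full_matches)  # ['N.Y.C.', 'traffic-ridden']
```

This is why only fragments such as 'C.' and '-ridden' survive in the output above.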

You'll either have to rewrite the pattern to make it work in scikit-learn style, or plug the NLTK tokenizer into scikit-learn:

    In [41]: from functools import partial

    In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

    In [43]: v.build_analyzer()(text)
    Out[43]:
    ['I',
     'love',
     'N.Y.C.',
     '100',
     'even',
     'with',
     'all',
     'of',
     'its',
     'traffic-ridden',
     'streets',
     '...']
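For the other option, rewriting the pattern, one possible approach is to turn every group into a non-capturing one so that re.findall returns whole tokens. The sketch below checks this with the standard library only. The sample sentence is assumed to be the one implied by the tokenizer output above, and the name findall_pattern is made up for this example.

```python
import re

# The NLTK book pattern with each (...) rewritten as (?:...).
findall_pattern = r'''(?x)        # set flag to allow verbose regexps
      (?:[A-Z]\.)+                # abbreviations (e.g. U.S.A.)
    | \w+(?:-\w+)*                # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?          # currency & percentages
    | \.\.\.                      # ellipses
'''

text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'

tokens = re.findall(findall_pattern, text)
print(tokens)
# ['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its',
#  'traffic-ridden', 'streets', '...']
```

A pattern in this form should be usable as token_pattern in CountVectorizer. Because the vectorizer lowercases the text before tokenizing by default, the [A-Z] abbreviation branch would presumably also need lowercase=False to match anything, and stop_words='english' would still filter some of these tokens afterwards.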
