Using an NLTK regex example in scikit-learn's CountVectorizer
I am trying to use a regex-pattern example from the NLTK book inside scikit-learn's CountVectorizer. I have seen examples with simple regexes, but nothing like this:
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # set flag to allow verbose regexps
      ([a-z]\.)+            # abbreviations, e.g. u.s.a.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages
    | \.\.\.                # ellipses
'''

text = 'i love n.y.c. 100% even with all of its traffic-ridden streets...'

vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'-ridden', u''), (u'', u'', u''), (u'', u'', u'')]
With NLTK, I get something entirely different:
nltk.regexp_tokenize(text, pattern)
['i', 'love', 'n.y.c.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
Is there a way to get the scikit-learn CountVectorizer to output the same thing? I am hoping to use some of the other handy features that are incorporated into the same function call.
tl;dr
from functools import partial
from nltk.tokenize import regexp_tokenize

CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
Now for your actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does a re.findall with whatever pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]: 
[('', '', ''),
 ('', '', ''),
 ('c.', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '-ridden', ''),
 ('', '', ''),
 ('', '', '')]
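The empty tuples come from how re.findall handles capturing groups: when the pattern contains groups, findall returns what the groups captured (for a repeated group, only its last capture) instead of the whole match. A minimal demonstration, mine rather than the original answer's:

import re

# Capturing group: findall returns the group's last capture, which is
# where the stray 'c.' and the empty strings above come from.
print(re.findall(r'([a-z]\.)+', 'n.y.c.'))    # ['c.']

# Non-capturing group: findall returns the whole match.
print(re.findall(r'(?:[a-z]\.)+', 'n.y.c.'))  # ['n.y.c.']

As far as I can tell, nltk.regexp_tokenize sidesteps this by neutralizing the capturing groups before matching, while scikit-learn passes your pattern to findall untouched.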
You'll either have to rewrite the pattern to make it work scikit-learn style (a sketch of such a rewrite follows), or plug the NLTK tokenizer into scikit-learn (shown after that).
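For the rewrite, turning every capturing group into a non-capturing (?:...) group should be enough, since findall then returns whole matches. A sketch under that assumption, not code from the original answer:

from sklearn.feature_extraction.text import CountVectorizer

# The same pattern with every group made non-capturing:
pattern = r'''(?x)          # set flag to allow verbose regexps
      (?:[a-z]\.)+          # abbreviations, e.g. u.s.a.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages
    | \.\.\.                # ellipses
'''

text = 'i love n.y.c. 100% even with all of its traffic-ridden streets...'

v = CountVectorizer(token_pattern=pattern)
print(v.build_analyzer()(text))
# ['i', 'love', 'n.y.c.', '100', 'even', 'with', 'all', 'of', 'its',
#  'traffic-ridden', 'streets', '...']

(Recent scikit-learn versions also complain about token patterns with multiple capturing groups, which is another reason to prefer the non-capturing form.) Plugging the NLTK tokenizer in instead looks like this: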
In [41]: from functools import partial

In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

In [43]: v.build_analyzer()(text)
Out[43]: 
['i', 'love', 'n.y.c.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
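One caveat, my addition rather than the original answer's: when analyzer is a callable, scikit-learn hands it the raw document and skips its own preprocessing, so options such as stop_words and lowercase are ignored. If you want those handy features back, apply them inside the callable. A minimal sketch, assuming scikit-learn's built-in English stop list is close enough to what you want:

from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

pattern = r'''(?x)          # verbose regexp, as in the question
      ([a-z]\.)+            # abbreviations, e.g. u.s.a.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages
    | \.\.\.                # ellipses
'''
text = 'i love n.y.c. 100% even with all of its traffic-ridden streets...'

def analyzer(doc):
    # Lowercase and drop stop words ourselves, because a callable
    # analyzer bypasses CountVectorizer's own preprocessing.
    return [tok for tok in regexp_tokenize(doc.lower(), pattern)
            if tok not in ENGLISH_STOP_WORDS]

v = CountVectorizer(analyzer=analyzer)
v.fit_transform([text])
print(sorted(v.vocabulary_))
# ['...', '100', 'love', 'n.y.c.', 'streets', 'traffic-ridden']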