Using an NLTK regex example in scikit-learn's CountVectorizer
I am trying to use a regex-pattern example from the NLTK book inside scikit-learn's CountVectorizer. I have seen examples with simple regexes, but nothing like this:
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # set flag to allow verbose regexps
      ([a-z]\.)+            # abbreviations, e.g. u.s.a.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages
    | \.\.\.                # ellipses
'''

text = 'i love n.y.c. 100% even with all of its traffic-ridden streets...'

vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'', u''), (u'', u'-ridden', u''), (u'', u'', u''), (u'', u'', u'')]
With NLTK, I get something entirely different:
nltk.regexp_tokenize(text, pattern)
['i', 'love', 'n.y.c.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
Is there a way to get the scikit-learn CountVectorizer to output the same thing? I am hoping to use some of the other handy features that are incorporated into the same function call.
tl;dr
from functools import partial
from nltk.tokenize import regexp_tokenize

CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
Now for your actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does a re.findall with whatever pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]: 
[('', '', ''),
 ('', '', ''),
 ('c.', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '-ridden', ''),
 ('', '', ''),
 ('', '', '')]
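The empty tuples come from how re.findall handles capturing groups: when the pattern contains groups, findall returns what the groups captured (for a repeated group, only its last capture) instead of the whole match. A minimal demonstration, mine rather than the original answer's:

import re

# Capturing group: findall returns the group's last capture, which is
# where the stray 'c.' and the empty strings above come from.
print(re.findall(r'([a-z]\.)+', 'n.y.c.'))    # ['c.']

# Non-capturing group: findall returns the whole match.
print(re.findall(r'(?:[a-z]\.)+', 'n.y.c.'))  # ['n.y.c.']

As far as I can tell, nltk.regexp_tokenize sidesteps this by neutralizing the capturing groups before matching, while scikit-learn passes your pattern to findall untouched.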
You'll either have to rewrite the pattern to make it work scikit-learn style (a sketch of such a rewrite follows), or plug the NLTK tokenizer into scikit-learn (shown after that).
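For the rewrite, turning every capturing group into a non-capturing (?:...) group should be enough, since findall then returns whole matches. A sketch under that assumption, not code from the original answer:

from sklearn.feature_extraction.text import CountVectorizer

# The same pattern with every group made non-capturing:
pattern = r'''(?x)          # set flag to allow verbose regexps
      (?:[a-z]\.)+          # abbreviations, e.g. u.s.a.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages
    | \.\.\.                # ellipses
'''

text = 'i love n.y.c. 100% even with all of its traffic-ridden streets...'

v = CountVectorizer(token_pattern=pattern)
print(v.build_analyzer()(text))
# ['i', 'love', 'n.y.c.', '100', 'even', 'with', 'all', 'of', 'its',
#  'traffic-ridden', 'streets', '...']

(Recent scikit-learn versions also complain about token patterns with multiple capturing groups, which is another reason to prefer the non-capturing form.) Plugging the NLTK tokenizer in instead looks like this: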
In [41]: from functools import partial

In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

In [43]: v.build_analyzer()(text)
Out[43]: 
['i', 'love', 'n.y.c.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
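One caveat, my addition rather than the original answer's: when analyzer is a callable, scikit-learn hands it the raw document and skips its own preprocessing, so options such as stop_words and lowercase are ignored. If you want those handy features back, apply them inside the callable. A minimal sketch, assuming scikit-learn's built-in English stop list is close enough to what you want:

from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

pattern = r'''(?x)          # verbose regexp, as in the question
      ([a-z]\.)+            # abbreviations, e.g. u.s.a.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages
    | \.\.\.                # ellipses
'''
text = 'i love n.y.c. 100% even with all of its traffic-ridden streets...'

def analyzer(doc):
    # Lowercase and drop stop words ourselves, because a callable
    # analyzer bypasses CountVectorizer's own preprocessing.
    return [tok for tok in regexp_tokenize(doc.lower(), pattern)
            if tok not in ENGLISH_STOP_WORDS]

v = CountVectorizer(analyzer=analyzer)
v.fit_transform([text])
print(sorted(v.vocabulary_))
# ['...', '100', 'love', 'n.y.c.', 'streets', 'traffic-ridden']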