python - Training NLTK Brill tagger but using a txt file as an input


Hi everyone. I'm doing a final year project named "Part-of-speech tagger for Malay language using Brill tagger".

I want to ask how to train on tagged sentences that I have saved in a txt file. The input should be txt files that are used to train the Brill tagger. After that, I will use another txt file as the test data. But I am stuck on the training part. Can anyone help me?

Here is some of my code.

    import nltk

    f = open('gayahidupsihat_tagged.txt')
    malay_tagged = f.read()

    def train_brill_tagger(train_data):
        # Modules for creating the templates.
        from nltk.tag import UnigramTagger
        from nltk.tag.brill import SymmetricProximateTokensTemplate, ProximateTokensTemplate
        from nltk.tag.brill import ProximateTagsRule, ProximateWordsRule
        # The Brill tagger module in NLTK.
        from nltk.tag.brill import FastBrillTaggerTrainer
        unigram_tagger = UnigramTagger(train_data)
        templates = [SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
                     SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
                     SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
                     SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
                     SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
                     SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
                     SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
                     SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
                     ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
                     ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1))]

        trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger,
                                         templates=templates, trace=3,
                                         deterministic=True)
        brill_tagger = trainer.train(train_data, max_rules=10)
        print
        return brill_tagger

    malay_train = (malay_tagged[:10])
    malay_test = (malay_tagged[10:15])
    malay20 = malay_tagged[20]
    mt = train_brill_tagger(malay_train)
    print mt.tag(malay20)

Actually, I want to train on a tagged paragraph and then test on another paragraph. After that, I want to use the tagged sentences to evaluate the Brill tagger's result.

Example:

I train on (gayahidupsihat_train.txt), which is really just one line of input:

gaya\nn hidup\nn sihat\vb boleh\md lah\uh ditakrifkan\vbz sebagai\dt satu\cd amalan\vbz kehidupan\nn yang\dt membawa\vbz impak\nn positif\nn kepada\to diri\nn seseorang\nn ,\, keluarganya\nn dan\cc masyarakat\nn. antara\in contoh\nn kehidupan\nn yang\dt sihat\vb ialah\dt individu\nn tersebut\ex hidup\vb dengan\dt penuh\rb ceria\rb tanpa\nn mengalami\vbz sebarang\nn masalah\nn yang\dt boleh\md menjejaskan\vbz kehidupannya\nn untuk\to satu\cd tempoh\nn tertentu\ex pula\dt .\. sudah\ex pasti\rb dalam\dt kehidupan\nn era\nn moden\nn yang\dt begitu\dt banyak\rb tekanan\vb ini\dt gaya\nn hidup\nn sihat\vb menjadi\vbz satu\num matlamat\nn yang\dt perlu\md dicapai\vbz segera\vb. oleh\pdt itu\dt ,\, terdapat\ex pelbagai\nn tindakan\vbz yang\dt boleh\md dilakukan\vbz untuk\to mencapai\vbz matlamat\nn ini\dt .\. 

Then I want to test on (gayahidupsihat_test.txt):

tindakan\vbp awal\vb ialah\dt seseorang\nn itu\dt perlu\md mengamalkan\vbd satu\cd bentuk\nn pemakanan\nn yang\dt seimbang\nn dalam\in kehidupannya\vbz .\.dalam\in keadaan\nn kehidupan\nn sebenar\jj ,\, orang\nn ramai\jj lebih\jjr suka\vb mengambil\vbz makanan\nn yang\dt bersifat\vbz mudah\jj seperti\dt mengamalkan\vbz pengambilan\vbd makanan\nn ringan\jj ataupun\cc makanan\nn segera\nn .\. tidak\dt kurang\jjr juga\dt masyarakat\nn kita\prp hari\nn ini\dt yang\dt lupa\vb kesan\nn pengambilan\vbz makanan\nn berlemak\jjr ataupun\cc makanan\nn yang\dt mempunyai\vbz kandungan\nn garam\nn ,\. gula\nn atau\dt sodium\fw glutamit\fw yang\dt tinggi\jj .\. hal\in ini\dt boleh\md mendatangkan\vbz pelbagai\nn penyakit\nn kronik\jj seperti\dt sakit\jj jantung\nn ,\, darah\nn tinggi\jj ataupun\cc kencing\nn manis\jj yang\dt juga\dt menjadi\md punca\nn kematian\nn tertinggi\jjs di\in negara\nn kita\prp .\.  

After that, I want to use tagged_words to try the tagger and evaluate it.

The English version shows output like this:

    Training Brill tagger on 500 sentences...
    Finding initial useful rules...
        Found 10210 useful rules.

               B      |
       S   F   r   O  |        Score = Fixed - Broken
       c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
       o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
       r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
       e   d   n   r  |  e
    ------------------+-------------------------------------------------------
      46  46   0   0  | TO -> IN if the tag of the following word is 'AT'
      18  20   2   0  | TO -> IN if the tag of words i+1...i+3 is 'CD'
      14  14   0   0  | IN -> IN-TL if the tag of the preceding word is
                      |   'NN-TL', and the tag of the following word is
                      |   'NN-TL'
      11  11   0   1  | TO -> IN if the tag of the following word is 'NNS'
      10  10   0   0  | TO -> IN if the tag of the following word is 'JJ'
       8   8   0   0  | , -> ,-HL if the tag of the preceding word is 'NP-
                      |   HL'
       7   7   0   1  | NN -> VB if the tag of the preceding word is 'MD'
       7  13   6   0  | NN -> VB if the tag of the preceding word is 'TO'
       7   7   0   0  | NP-TL -> NP if the tag of words i+1...i+2 is 'NNS'
       7   7   0   0  | VBN -> VBD if the tag of the preceding word is
                      |   'NP'

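You need to parse your input files (both train and test) into a format that the NLTK toolchain recognizes: a file is a list (or sequence) of sentences, a sentence is a list of tagged words, and a tagged word is a tuple of two strings, (word, tag). In your code, malay_tagged is simply a string (i.e., a sequence of characters), so a slice such as malay_tagged[:10] selects the first ten characters, not the first ten sentences.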
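The nested structure described above can be built with plain Python. What follows is only a rough sketch: the function name parse_tagged_text, the sample string, and the choice to end a sentence at any token tagged "." are my own assumptions for illustration, not anything NLTK prescribes.

```python
def parse_tagged_text(text, sep="\\", sentence_end_tag="."):
    """Turn 'word\\tag word\\tag ...' text into a list of sentences,
    where each sentence is a list of (word, tag) tuples."""
    sentences, current = [], []
    for token in text.split():
        # Split on the LAST separator, so a word may itself contain one.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator in this token: keep the word, mark it untagged.
            word, tag = token, None
        current.append((word, tag))
        if tag == sentence_end_tag:
            # Assumed convention: a token tagged "." closes the sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing words that were not closed by an end-of-sentence tag.
        sentences.append(current)
    return sentences


# Hypothetical two-sentence sample in the question's word\tag format.
sample = "gaya\\nn hidup\\nn sihat\\vb .\\. tindakan\\vbp awal\\vb .\\."
tagged_sents = parse_tagged_text(sample)

# Slicing now works on sentences rather than on characters.
train_sents, test_sents = tagged_sents[:1], tagged_sents[1:]
print(train_sents)
```

Real files may need extra cleanup first. In the test file above, for example, one token runs straight into the next (".\.dalam\in"), and this sketch would treat that as a single word.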

It's not hard to do this yourself, but NLTK's nltk.corpus.reader.TaggedCorpusReader can parse the file for you. Just make sure to tell it that the word-tag separator in your file is a backslash ("\\").
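Here is a minimal sketch of the reader approach, assuming NLTK is installed and the file holds one sentence per line (as far as I know, a newline is the reader's default sentence boundary). The file name sample_tagged.txt and the sample text are made up for illustration; they are not the files from the question.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample in the question's word\tag format, one sentence per line.
sample = (
    "gaya\\nn hidup\\nn sihat\\vb .\\.\n"
    "tindakan\\vbp awal\\vb .\\.\n"
)

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample_tagged.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(sample)

    # sep="\\" tells the reader that words and tags are joined by a backslash.
    reader = TaggedCorpusReader(root, ["sample_tagged.txt"], sep="\\")

    # The reader returns lazy views over the file, so copy the sentences into
    # plain lists before the temporary directory is removed.
    tagged_sents = [list(sent) for sent in reader.tagged_sents()]

print(tagged_sents)
```

With these default settings, a file that keeps everything on one line, like the training file in the question, would probably come back as a single long sentence, so it may help to put one sentence per line first. The reader also appears to upper-case the tags it reads.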

