python - Training NLTK Brill tagger but using a txt file as an input
Hi everyone. I'm doing a final year project named "Part-of-Speech Tagger for Malay Language using Brill Tagger".
I want to ask how to train on tagged sentences that I have saved in a txt file. The input should be txt files that are trained on using the Brill tagger. After that, I want to use another txt file as test data. But I'm stuck on the training part. Can anyone help me?
Here is some of my code.
import nltk

f = open('gayahidupsihat_tagged.txt')
malay_tagged = f.read()

def train_brill_tagger(train_data):
    # Modules for creating the templates.
    from nltk.tag import UnigramTagger
    from nltk.tag.brill import SymmetricProximateTokensTemplate, ProximateTokensTemplate
    from nltk.tag.brill import ProximateTagsRule, ProximateWordsRule
    # The Brill tagger module in NLTK.
    from nltk.tag.brill import FastBrillTaggerTrainer

    unigram_tagger = UnigramTagger(train_data)
    templates = [SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
                 ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
                 ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1))]
    trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger,
                                     templates=templates, trace=3,
                                     deterministic=True)
    brill_tagger = trainer.train(train_data, max_rules=10)
    print
    return brill_tagger

malay_train = (malay_tagged[:10])
malay_test = (malay_tagged[10:15])
malay20 = malay_tagged[20]

mt = train_brill_tagger(malay_train)
print mt.tag(malay20)
Actually, I want to train on a tagged paragraph and then test using another paragraph. After that, I can use the tagged sentences to evaluate the Brill tagger's result.
Example:
I train on this (gayahidupsihat_train.txt), which is really a single line of input:
gaya\nn hidup\nn sihat\vb boleh\md lah\uh ditakrifkan\vbz sebagai\dt satu\cd amalan\vbz kehidupan\nn yang\dt membawa\vbz impak\nn positif\nn kepada\to diri\nn seseorang\nn ,\, keluarganya\nn dan\cc masyarakat\nn. antara\in contoh\nn kehidupan\nn yang\dt sihat\vb ialah\dt individu\nn tersebut\ex hidup\vb dengan\dt penuh\rb ceria\rb tanpa\nn mengalami\vbz sebarang\nn masalah\nn yang\dt boleh\md menjejaskan\vbz kehidupannya\nn untuk\to satu\cd tempoh\nn tertentu\ex pula\dt .\. sudah\ex pasti\rb dalam\dt kehidupan\nn era\nn moden\nn yang\dt begitu\dt banyak\rb tekanan\vb ini\dt gaya\nn hidup\nn sihat\vb menjadi\vbz satu\num matlamat\nn yang\dt perlu\md dicapai\vbz segera\vb. oleh\pdt itu\dt ,\, terdapat\ex pelbagai\nn tindakan\vbz yang\dt boleh\md dilakukan\vbz untuk\to mencapai\vbz matlamat\nn ini\dt .\.
Then I want to test on this (gayahidupsihat_test.txt):
tindakan\vbp awal\vb ialah\dt seseorang\nn itu\dt perlu\md mengamalkan\vbd satu\cd bentuk\nn pemakanan\nn yang\dt seimbang\nn dalam\in kehidupannya\vbz .\.dalam\in keadaan\nn kehidupan\nn sebenar\jj ,\, orang\nn ramai\jj lebih\jjr suka\vb mengambil\vbz makanan\nn yang\dt bersifat\vbz mudah\jj seperti\dt mengamalkan\vbz pengambilan\vbd makanan\nn ringan\jj ataupun\cc makanan\nn segera\nn .\. tidak\dt kurang\jjr juga\dt masyarakat\nn kita\prp hari\nn ini\dt yang\dt lupa\vb kesan\nn pengambilan\vbz makanan\nn berlemak\jjr ataupun\cc makanan\nn yang\dt mempunyai\vbz kandungan\nn garam\nn ,\. gula\nn atau\dt sodium\fw glutamit\fw yang\dt tinggi\jj .\. hal\in ini\dt boleh\md mendatangkan\vbz pelbagai\nn penyakit\nn kronik\jj seperti\dt sakit\jj jantung\nn ,\, darah\nn tinggi\jj ataupun\cc kencing\nn manis\jj yang\dt juga\dt menjadi\md punca\nn kematian\nn tertinggi\jjs di\in negara\nn kita\prp .\.
After that, I can use the tagged_words to try the tagger and evaluate it.
The English version shows output like this:
Training Brill tagger on 500 sentences...
Finding initial useful rules...
    Found 10210 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  46  46   0   0  | TO -> IN if the tag of the following word is 'AT'
  18  20   2   0  | TO -> IN if the tag of words i+1...i+3 is 'CD'
  14  14   0   0  | IN -> IN-TL if the tag of the preceding word is
                  |   'NN-TL', and the tag of the following word is
                  |   'NN-TL'
  11  11   0   1  | TO -> IN if the tag of the following word is 'NNS'
  10  10   0   0  | TO -> IN if the tag of the following word is 'JJ'
   8   8   0   0  | , -> ,-HL if the tag of the preceding word is 'NP-
                  |   HL'
   7   7   0   1  | NN -> VB if the tag of the preceding word is 'MD'
   7  13   6   0  | NN -> VB if the tag of the preceding word is 'TO'
   7   7   0   0  | NP-TL -> NP if the tag of words i+1...i+2 is 'NNS'
   7   7   0   0  | VBN -> VBD if the tag of the preceding word is
                  |   'NP'
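You need to parse your input files (both train and test) into a format that the NLTK toolchain recognizes: a file is a list (or sequence) of sentences, a sentence is a list of tagged words, and a tagged word is a tuple of two strings, (word, tag). In your code, malay_tagged is a simple string (i.e., a sequence of characters), so a slice like malay_tagged[:10] gives you the first ten characters, not the first ten sentences.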
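To make that nested structure concrete, here is a minimal sketch that builds it by hand from a string in your word\tag format. The function name parse_tagged_text and the rule that a token whose word is "." ends a sentence are my own assumptions for illustration; your data may need a different sentence-splitting rule.

```python
def parse_tagged_text(text, sep="\\"):
    """Parse 'word<sep>tag' tokens into a list of sentences of (word, tag) tuples."""
    sentences, current = [], []
    for token in text.split():
        # Split on the last separator; tokens without one get a tag of None.
        word, found, tag = token.rpartition(sep)
        if not found:
            word, tag = token, None
        current.append((word, tag))
        # Assumption: a token whose word is "." closes the sentence.
        if word == ".":
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence that lacks a final "."
        sentences.append(current)
    return sentences


# A short made-up sample in the same format as the question's files.
sample = r"gaya\nn hidup\nn sihat\vb .\. oleh\pdt itu\dt ,\, terdapat\ex"
malay_sents = parse_tagged_text(sample)
print(len(malay_sents))   # 2
print(malay_sents[0][0])  # ('gaya', 'nn')
```

With data in this shape, a slice such as malay_sents[:10] selects sentences, which is what the tagger trainers expect.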
It's not hard to do this yourself, but NLTK's nltk.corpus.reader.TaggedCorpusReader can parse the file for you. Just make sure to tell it that the word-tag separator in your file is a backslash ("\\").
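A rough sketch of the TaggedCorpusReader route, assuming a reasonably recent NLTK. The file name sample_tagged.txt and its two-line contents are made up so the example runs on its own; in practice you would point the reader at the directory holding your own files. As far as I know, the reader's default sentence tokenizer splits on line breaks, so a file that is really one line comes back as a single sentence unless you put one sentence per line or pass your own sent_tokenizer.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Made-up sample: one sentence per line, backslash between word and tag.
sample = r"gaya\nn hidup\nn sihat\vb .\." + "\n" + r"oleh\pdt itu\dt ,\, terdapat\ex"

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample_tagged.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(sample)

    # sep tells the reader which character separates a word from its tag.
    reader = TaggedCorpusReader(root, ["sample_tagged.txt"], sep="\\")

    # The reader returns lazy views, so copy them into plain lists
    # while the temporary file still exists.
    tagged_sents = [list(sent) for sent in reader.tagged_sents()]
    tagged_words = list(reader.tagged_words())

print(len(tagged_sents))  # number of sentences (one per line here)
print(tagged_sents[0])    # list of (word, tag) tuples
```

The tagged_sents list is the shape to pass as training data, and a second reader over the test file gives you the gold-standard sentences to evaluate against.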