perl - Encoding broken after using HTML::TreeBuilder as_HTML
Suppose I have the following files:
test.html
<!doctype html>
<html>
<head>
<title>Евгений Онегин</title>
<meta charset="utf-8">
</head>
<body>
<p><cite>Евгений Онегин</cite></p>
<pre>
Не мысля гордый свет забавить,
Вниманье дружбы возлюбя,
Хотел бы я тебе представить
Залог достойнее тебя,
</pre>
</body>
</html>
I want to get the contents of the body tag in HTML format, using a parser:
<p><cite>Евгений Онегин</cite></p>
<pre>
Не мысля гордый свет забавить,
Вниманье дружбы возлюбя,
Хотел бы я тебе представить
Залог достойнее тебя,
</pre>
parser.pl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file('test.html');    # file name only: no decoding layer is applied
my $body = $root->find('body');
print $body->as_HTML;
When I save the output to an HTML file and view it in a browser as Unicode, the encoding is broken: instead of "Евгений Онегин" I see "Евгений Онегин".
Correct work
When the HTML is stored inside the Perl file, it works correctly:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use utf8;
use Data::Dumper;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);         # read the HTML that follows __END__
my $body = $root->find('body');
print $body->as_HTML;
__END__
<!doctype html>
<html>
<head>
<title>Евгений Онегин</title>
<meta charset="utf-8">
</head>
<body>
<p><cite>Евгений Онегин</cite></p>
<pre>
Не мысля гордый свет забавить,
Вниманье дружбы возлюбя,
Хотел бы я тебе представить
Залог достойнее тебя,
</pre>
</body>
</html>
So the error occurs when HTML::TreeBuilder reads the file.
Questions:
- How do I fix the encoding?
- The module encodes every Russian character as an entity: &#x415;. Is it possible to keep the character as Е?
The parse_file method takes either a file name or a file handle, so the simplest solution is to open the file with an open call using :utf8 mode, and pass the file handle to be parsed.
It looks like this. I have used the new_from_file constructor because it saves a statement. It has the same effect as your own code.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::TreeBuilder;

my $file = 'test.html';
# Open with a UTF-8 input layer so the parser receives characters, not raw bytes
open my $fh, '<:utf8', $file or die qq{Unable to open "$file" for parsing: $!};
my $root = HTML::TreeBuilder->new_from_file($fh);
my $body = $root->find('body');
print $body->as_HTML;
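To see why the input layer matters, here is a minimal sketch using only the core Encode module. The sample word and the variable names are made up for illustration, and it assumes the file really is UTF-8, as the meta tag in test.html declares. Without decoding, Perl treats every byte as a separate character; decoding, which is roughly what the :utf8 layer does on input, gives one character per letter.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                          # string literals in this script are UTF-8
use Encode qw(encode decode);

# Illustrative sample: the bytes a UTF-8 file on disk holds for one word
my $bytes = encode('UTF-8', 'Онегин');

# Read without a decoding layer, each byte counts as one "character"
my $raw_length = length $bytes;

# After decoding, each Cyrillic letter is a single character again
my $chars          = decode('UTF-8', $bytes);
my $decoded_length = length $chars;

binmode STDOUT, ':encoding(UTF-8)';
print "undecoded length: $raw_length, decoded length: $decoded_length\n";
```

When the parser is handed the undecoded bytes, as_HTML escapes each byte separately, which is where the garbled text in the browser comes from.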
As for changing the entities to letters, I'm not clear what you mean. Do you want to remove the hex entities and replace them with the equivalent character? You may get some mileage out of the HTML::Entities module.
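If the goal is to turn the hex entities back into literal characters, something along these lines may help. This is only a sketch: the sample strings are invented for illustration, and it assumes HTML::Entities (part of the HTML-Parser distribution) is installed. decode_entities replaces entities with the characters they stand for, and encode_entities with an explicit character set escapes only those characters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(decode_entities encode_entities);

# Illustrative input: two letters written as hex entities, as in the as_HTML output
my $escaped = '<cite>&#x415;&#x432;</cite>';

# Replace every entity with the character it stands for
my $decoded = decode_entities($escaped);

# Escape only <, > and &, so the Cyrillic text passes through untouched
my $minimal = encode_entities('Онегин & <Ко>', '<>&');

# Literal non-ASCII characters need an encoding layer on output
binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n$minimal\n";
```

The as_HTML method of HTML::Element also accepts an optional first argument naming the characters to escape, so calling $body->as_HTML('<>&') should leave the Russian letters as they are; it is worth checking the documentation for your installed version. In that case STDOUT needs the same binmode call, because the output then contains characters outside ASCII.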