Perl - Encoding broken after using HTML::TreeBuilder as_HTML


Suppose I have the following file:

test.html

    <!doctype html>
    <html>
      <head>
        <title>Евгений Онегин</title>
        <meta charset="utf-8">
      </head>
      <body>
        <p><cite>Евгений Онегин</cite></p>
        <pre>
          Не мысля гордый свет забавить,
          Вниманье дружбы возлюбя,
          Хотел бы я тебе представить
          Залог достойнее тебя,
        </pre>
    </body>
    </html>

I want to get the contents of the body tag as HTML, using a parser:

    <p><cite>Евгений Онегин</cite></p>
    <pre>
      Не мысля гордый свет забавить,
      Вниманье дружбы возлюбя,
      Хотел бы я тебе представить
      Залог достойнее тебя,
    </pre>

parser.pl

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use 5.010;
    use utf8;

    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new;
    $root->parse_file('test.html');

    my $body = $root->find('body');
    print $body->as_HTML;

When I save the output to an HTML file and view it in a browser as Unicode, the encoding is broken: instead of "Евгений Онегин" I get garbled characters.

Working case

When the HTML is stored inside the Perl file, it works correctly:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use 5.010;
    use utf8;

    use Data::Dumper;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new;
    $root->parse_file(\*DATA);

    my $body = $root->find('body');
    print $body->as_HTML;

    __END__
    <!doctype html>
    <html>
      <head>
        <title>Евгений Онегин</title>
        <meta charset="utf-8">
      </head>
      <body>
        <p><cite>Евгений Онегин</cite></p>
        <pre>
          Не мысля гордый свет забавить,
          Вниманье дружбы возлюбя,
          Хотел бы я тебе представить
          Залог достойнее тебя,
        </pre>
    </body>
    </html>

So the error occurs when HTML::TreeBuilder reads the file.

Questions:

  1. How do I fix the encoding?
  2. The module encodes every Russian character as an entity: &#x415;. Is it possible to keep the character Е instead?

The parse_file method will take either a file name or a file handle, so the simplest solution is to open the file yourself with an open call using :utf8 mode, and pass the file handle to be parsed.

It looks like this. I have used the new_from_file constructor because it saves a statement. It has the same effect as your own code.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use 5.010;
    use utf8;

    use HTML::TreeBuilder;

    my $file = 'test.html';

    # Open the file with a UTF-8 layer so the parser receives characters, not raw bytes
    open my $fh, '<:utf8', $file or die qq{Unable to open "$file" for parsing: $!};
    my $root = HTML::TreeBuilder->new_from_file($fh);

    my $body = $root->find('body');
    print $body->as_HTML;
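To illustrate why the input layer matters, here is a minimal sketch that uses only the core Encode module and does not touch HTML::TreeBuilder. The variable names are made up for the illustration. It shows that the same text is 27 "characters" when read as raw bytes but 14 once decoded, and that difference is presumably what causes each byte to be escaped separately when the file is parsed by name.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# The title as it is stored on disk: a UTF-8 encoded byte string
my $bytes = encode('UTF-8', 'Евгений Онегин');

# Without a decoding layer, Perl treats every byte as one character
my $raw_length = length $bytes;

# After decoding, there is one character per letter
my $chars          = decode('UTF-8', $bytes);
my $decoded_length = length $chars;

binmode STDOUT, ':encoding(UTF-8)';
printf "bytes: %d, characters: %d\n", $raw_length, $decoded_length;
```

Opening the file with a UTF-8 layer performs the decode step for you before the parser sees the data.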

As for changing entities to letters, I'm not clear what you mean. Do you want to remove the hex entities and replace them with the equivalent characters? You may get some mileage out of the HTML::Entities module.
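If the goal is simply to keep the Cyrillic letters in the output, one possible approach is sketched below. It assumes that restricting the set of escaped characters is acceptable: as_HTML accepts an optional first argument naming the characters to encode as entities, so passing '<>&' leaves everything else alone. The inline HTML string and the variable names are illustrative only, and decode_entities from HTML::Entities is shown for the case where the entities already exist in a string.

```perl
use strict;
use warnings;
use utf8;

use HTML::TreeBuilder;
use HTML::Entities qw(decode_entities);

# Illustrative input; with "use utf8" this literal is a character string
my $root = HTML::TreeBuilder->new_from_content(
    '<html><body><p><cite>Евгений Онегин</cite></p></body></html>'
);

my $body = $root->find('body');

# Escape only the characters that are special in HTML markup
my $html = $body->as_HTML('<>&');

# Alternatively, turn existing entities back into characters
my $decoded = decode_entities('&#x415;вгений');

# Encode the characters as UTF-8 on the way out
binmode STDOUT, ':encoding(UTF-8)';
print $html, "\n";
print $decoded, "\n";

$root->delete;
```

Setting the output layer with binmode matters here: once the letters are no longer escaped, they have to be encoded as UTF-8 when printed.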

