python - Parsing a long HTML page using BeautifulSoup failed with half-parsed output
I used the following script to parse the price of a particular fund:
    import pandas as pd
    from bs4 import BeautifulSoup
    from ghost import Ghost

    ghost = Ghost()
    # Open the fund page, accept the disclaimer, then switch to the price/dividend view
    page, resources = ghost.open('http://bank.hangseng.com/1/pa_1_1_p1/comsvlet_minisite_eng_gif?app=einvcfunddetailsov&pri_fund_code=u44217')
    page, resources = ghost.evaluate("agree()", expect_loading=True)
    page, resources = ghost.evaluate("mm_changeview('einvcfundpricedividend')", expect_loading=True)
    # ghost.capture_to("hangseng.png")
    # No parser is named here, so BeautifulSoup picks its default
    soup = BeautifulSoup(page.content)
    soup
The output soup is OK for the first half, but then the tags turn uppercase and BeautifulSoup cannot parse them, like the one below:
    <td class="lightgrey" valign="top"><font class="content">22-07-2014</font></td>
    <td class="lightgrey" valign="top"><font class="content">10.95000</font></td>
    <td class="lightgrey" valign="top"><font class="content">11.39000</font></td>
    <td class="lightgrey" valign="top"><font class="content">10.95000</font></td>
    </tr>
    < T R   v a l i g n = " t o p "   a l i g n = " c e n t e r " >
    < T D   c l a s s = " l i g h t g r e y "   v a l i g n = " t o p " > < F O N T   c l a s s = " c o n t e n t " > 2 1 - 0 7 - 2 0 1 4 < / F O N T > < / T D >
    < T D   c l a s s = " l i g h t g r e y "   v a l i g n = " t o p " > < F O N T   c l a s s = " c o n t e n t " > 1 0 . 9 6 0 0 0 < / F O N T > < / T D >
    < T D   c l a s s = " l i g h t g r e y "   v a l i g n = " t o p " > < F O N T   c l a s s = " c o n t e n t " > 1 1 . 4 0 0 0 0 < / F O N T > < / T D >
    < T D   c l a s s = " l i g h t g r e y "   v a l i g n = " t o p " > < F O N T   c l a s s = " c o n t e n t " > 1 0 . 9 6 0 0 0 < / F O N T > < / T D >
    < / T R >
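You can see that the output becomes garbage after the row dated 22-07-2014: from that point on, the markup comes out as text with a space between every character.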
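To find where the output goes wrong without reading it by eye, one option is to search for the first long run of single characters separated by spaces. This is only a rough diagnostic sketch: the helper name find_spaced_run, the min_run threshold, and the sample string are all made up for illustration and are not part of the original script.

```python
import re

def find_spaced_run(text, min_run=8):
    """Return the index where a run of space-separated single characters
    begins, or -1 if no such run is found.

    A "run" is at least `min_run` non-space characters in a row, each
    followed by one to three spaces, starting at a whitespace boundary.
    """
    pattern = re.compile(r"(?<!\S)(?:\S {1,3}){%d,}" % min_run)
    match = pattern.search(text)
    return match.start() if match else -1

# Hypothetical sample: a normally parsed cell followed by spaced-out markup
clean_part = '<td class="lightgrey">10.95000</td> </tr> '
spaced_part = '< T R   V A L I G N = " t o p " >'
sample = clean_part + spaced_part

start = find_spaced_run(sample)
print(start, repr(sample[start:start + 11]))
```

Running the same helper over str(soup) would give the offset at which the parsed output stops looking like markup, which makes it easier to inspect the raw HTML at that position.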
What happened?
I found a solution for the spaced output from BeautifulSoup: pass page.content to BeautifulSoup and name the parser explicitly.

    soup = BeautifulSoup(page.content, 'html.parser')
Now it works perfectly.
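A minimal, self-contained sketch of the same fix on a small inline sample, for anyone who wants to try it without Ghost. The SAMPLE_HTML string and the extract_rows helper are hypothetical stand-ins for page.content and for whatever is done with the soup afterwards. The sample mixes lowercase and uppercase tags like the real page, but it is not guaranteed to reproduce the original failure, which depends on which parser BeautifulSoup picks by default on a given machine.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content: one lowercase row, one uppercase row
SAMPLE_HTML = """
<table>
<tr valign="top"><td class="lightgrey"><font class="content">22-07-2014</font></td><td class="lightgrey"><font class="content">10.95000</font></td></tr>
<TR VALIGN="top"><TD CLASS="lightgrey"><FONT CLASS="content">21-07-2014</FONT></TD><TD CLASS="lightgrey"><FONT CLASS="content">10.96000</FONT></TD></TR>
</table>
"""

def extract_rows(html):
    """Parse with the explicitly named built-in parser and return the
    text of every <font class="content"> cell, grouped by table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [font.get_text(strip=True)
                 for font in tr.find_all("font", class_="content")]
        if cells:
            rows.append(cells)
    return rows

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Naming the parser also makes the script behave the same way on every machine, since BeautifulSoup otherwise chooses whichever parser library happens to be installed.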