Python - Regular expressions: extract text between two markers
I'm trying to write a Python parser to extract some information from HTML pages.
It should extract the text between <p itemprop="xxx"> and </p>.
I use this regular expression:
m = re.search(ur'p>(?P<text>[^<]*)</p>', html)
But it can't parse the file if there are other tags between them. Example:
<p itemprop="xxx"> text <br/> text </p>
As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?
You can use:
m = re.search(ur'p>(?P<text>.*?)</p>', html)
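This is a lazy match: it consumes as little as possible, so it stops at the first </p>. Note that . does not match newlines unless you pass re.DOTALL. You should also consider using an HTML parser such as BeautifulSoup which, after installation, can be used with CSS selectors like this: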
from bs4 import BeautifulSoup
# 'html.parser' is the parser bundled with Python; naming it avoids a warning
soup = BeautifulSoup(html, 'html.parser')
m = soup.select('p[itemprop="xxx"]')
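
For comparison, here is a rough sketch of how the three regex variants behave on a small sample. The sample string and the variable names are made up for illustration, and the patterns are anchored on the full opening tag, which is my assumption about what you want rather than something taken from your code. It is written for Python 3, where the ur'' prefix is not accepted, so plain r'' is used instead.

```python
import re

# Hypothetical sample: the target paragraph contains a <br/> and a newline,
# and a second, unrelated paragraph follows it.
html = '<p itemprop="xxx"> text <br/>\n text </p>\n<p itemprop="yyy"> other </p>'

# Pattern from the question: [^<]* cannot cross the '<' of <br/>,
# so nothing matches on this sample.
strict = re.search(r'p>(?P<text>[^<]*)</p>', html)

# Lazy match anchored on the attribute; re.DOTALL lets '.' match newlines.
lazy = re.search(r'<p itemprop="xxx">(?P<text>.*?)</p>', html, re.DOTALL)

# Greedy match for contrast: it runs on to the LAST </p> in the string.
greedy = re.search(r'<p itemprop="xxx">(?P<text>.*)</p>', html, re.DOTALL)

print(strict)                        # None
print(repr(lazy.group('text')))      # ' text <br/>\n text '
print(repr(greedy.group('text')))    # also swallows the second paragraph
```

The lazy version still returns the inner markup (the <br/> stays in the result), and it will break on nested <p>-like structures or unusual attribute ordering, which is the main reason to prefer a real HTML parser.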