Python - regular expressions: extract text between two markers
I'm trying to write a Python parser to extract some information from HTML pages.
It should extract the text between <p itemprop="xxx"> and </p>.
I use this regular expression:
m = re.search(ur'p>(?P<text>[^<]*)</p>', html)
But it can't parse the file if there are other tags between them. For example:
<p itemprop="xxx"> text <br/> text </p>
As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?
You can use:
m = re.search(ur'p>(?P<text>.*?)</p>', html)
This is a lazy match: it matches as little as possible, so it stops at the first </p>. You should also consider using an HTML parser such as BeautifulSoup, which, after installation, can be used with CSS selectors like this:
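from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
m = soup.select('p[itemprop="xxx"]')  # list of matching <p> elements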
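To see how the negated character class and the lazy match behave differently, here is a minimal sketch run against the sample paragraph from the question. The variable names `strict` and `lazy` are only illustrative, and two things differ from the patterns above by assumption: the lazy pattern is anchored on the full opening tag, and `re.DOTALL` is added in case the paragraph spans several lines. The `r''` prefix is used because Python 3 does not accept `ur''`.

```python
import re

# Sample markup taken from the question.
html = '<p itemprop="xxx"> text <br/> text </p>'

# Negated character class: [^<]* cannot cross the "<" of the inner <br/>,
# and "p>" does not occur at the end of the opening tag, so nothing matches.
strict = re.search(r'p>(?P<text>[^<]*)</p>', html)

# Lazy match: ".*?" consumes as little as possible, stopping at the first "</p>".
# re.DOTALL (an addition, not in the original answer) lets "." match newlines too.
lazy = re.search(r'<p itemprop="xxx">(?P<text>.*?)</p>', html, re.DOTALL)

print(strict)
print(repr(lazy.group('text')))
```

The captured text still contains the inner `<br/>` tag, which is one reason an HTML parser tends to be the more robust choice once the markup gets more complicated.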