Python - Regular expressions: extract text between two markers

August 15, 2013

I'm trying to write a Python parser to extract some information from HTML pages.

It should extract the text between <p itemprop="xxx"> and </p>.

I use this regular expression:

m = re.search(r'p>(?P<text>[^<]*)</p>', html)

But it can't parse the text if there are other tags between them. For example:

<p itemprop="xxx"> text <br/> text </p>

As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?

You can use:

m = re.search(r'p>(?P<text>.*?)</p>', html)

This is a lazy match: it consumes as little as possible, so it stops at the first </p>. You should also consider using an HTML parser such as BeautifulSoup, which, after installation, can be used with CSS selectors like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
m = soup.select('p[itemprop="xxx"]')       # list of matching <p> elements
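To see the difference between the two regular expressions, here is a minimal sketch using only the standard library. The `html` string is a made-up sample modeled on the example in the question, and the pattern is anchored on the full opening tag rather than just `p>`, which is my own assumption about what is wanted. `re.DOTALL` is added in case the paragraph spans several lines.

```python
import re

# Hypothetical sample input, modeled on the example in the question.
html = '<p itemprop="xxx"> text <br/> text </p><p>other</p>'

# [^<]* cannot cross any "<", so the inner <br/> makes this pattern fail.
strict = re.search(r'<p itemprop="xxx">(?P<text>[^<]*)</p>', html)

# .*? is lazy: it grows one character at a time until </p> can match.
# re.DOTALL lets "." match newlines as well.
lazy = re.search(r'<p itemprop="xxx">(?P<text>.*?)</p>', html, re.DOTALL)

print(strict)                     # None
print(repr(lazy.group('text')))   # ' text <br/> text '
```

Note that the lazy version still returns the inner `<br/>` as part of the text, and it will misbehave on nested or malformed markup, which is one reason the parser approach is usually the safer choice.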
