Python - regular expressions: extract text between two markers


I'm trying to write a Python parser to extract some information from HTML pages.

It should extract the text between <p itemprop="xxx"> and </p>.

I use this regular expression:

m = re.search(ur'p>(?P<text>[^<]*)</p>', html)

But it can't parse the text if there are other tags between them. For example:

<p itemprop="xxx"> text <br/> text </p> 

As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?

You can use:

m = re.search(ur'p>(?P<text>.*?)</p>', html)

This is a lazy match: it matches everything up to the first </p>. But you should consider using an HTML parser such as BeautifulSoup which, after installation, can be used with CSS selectors like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
m = soup.select('p[itemprop="xxx"]')
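
To make the difference between the two regular expressions concrete, here is a minimal sketch in Python 3, which uses the r'' prefix because ur'' is not accepted there. The sample string and the OPEN constant are assumptions made for illustration. The sketch also spells out the full opening tag, since a bare p> prefix does not match a tag that carries attributes such as <p itemprop="xxx">. re.DOTALL is added so that . can cross line breaks inside the paragraph.

```python
import re

# Hypothetical sample input: a paragraph with a nested tag and a line break,
# followed by a second paragraph.
html = '<p itemprop="xxx"> text <br/>\n text </p><p>other</p>'

# Assumed opening tag, written out in full so the pattern matches the attribute.
OPEN = r'<p itemprop="xxx">'

# Negated class: [^<]* stops at the first '<', so the nested <br/> defeats it.
negated = re.search(OPEN + r'(?P<text>[^<]*)</p>', html)

# Lazy quantifier: .*? expands only as far as the first </p>.
lazy = re.search(OPEN + r'(?P<text>.*?)</p>', html, re.DOTALL)

# Greedy quantifier, for contrast: .* runs on to the last </p> in the input.
greedy = re.search(OPEN + r'(?P<text>.*)</p>', html, re.DOTALL)

print(negated)                      # None
print(repr(lazy.group('text')))     # ' text <br/>\n text '
print(repr(greedy.group('text')))   # ' text <br/>\n text </p><p>other'
```

The lazy form is usually what is wanted here. It still treats the HTML as flat text, though, which is one reason a real parser is the safer choice for anything beyond simple pages.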
