Scrapy: how to have a response parsed by multiple parser functions?


I'd like to do something special with each of the landing URLs in start_urls, and then have the spider follow the next pages and crawl deeper. My code is roughly this:

    def init_parse(self, response):
        item = myitem()

        # extract info from the landing url and populate the item fields here...

        yield self.newly_parse(response)
        yield item
        return

    parse_start_url = init_parse

    def newly_parse(self, response):
        item = myitem2()
        newly_soup = BeautifulSoup(response.body)

        # parse, return or yield items

        return item

The code won't work because a spider callback is only allowed to return an item, a request or None, whereas I yield self.newly_parse. How can I achieve this in Scrapy?

My not-so-elegant solution:

Put the init_parse logic inside newly_parse and implement an is_start_url check at the beginning: if response.url is in start_urls, we go through the init_parse procedure.
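
A minimal sketch of that first workaround, assuming plain dicts stand in for items and a SimpleNamespace stands in for the response so it runs without Scrapy installed. The class name LandingAwareSpider, the example URLs and the "kind" field are made up for illustration.

```python
from types import SimpleNamespace


class LandingAwareSpider:
    # Hypothetical stand-in for a spider class; not a real scrapy.Spider.
    start_urls = ["http://example.com/a", "http://example.com/b"]

    def is_start_url(self, url):
        # Exact string match against the configured landing URLs.
        return url in self.start_urls

    def newly_parse(self, response):
        if self.is_start_url(response.url):
            # Landing-page-only work (the former init_parse body) goes here.
            yield {"kind": "landing", "url": response.url}
        # Work shared by every page, landing or not.
        yield {"kind": "page", "url": response.url}


spider = LandingAwareSpider()
landing = list(spider.newly_parse(SimpleNamespace(url="http://example.com/a")))
deeper = list(spider.newly_parse(SimpleNamespace(url="http://example.com/a?page=2")))
```

One caveat with this approach: an exact comparison on response.url can miss a landing page if the request was redirected, since the response then carries the final URL.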

Another ugly solution:

Separate out the code where "# parse, return or yield items" happens, make it a class method or generator, and call that method or generator from both init_parse and newly_parse.
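
The second workaround could look roughly like this. It is only a sketch: extract_items is a hypothetical helper name, plain dicts stand in for items, and a SimpleNamespace stands in for the response so the snippet runs on its own.

```python
from types import SimpleNamespace


class SharedHelperSpider:
    # Hypothetical stand-in for a spider class; not a real scrapy.Spider.

    def extract_items(self, response):
        # The shared "# parse, return or yield items" logic lives here.
        yield {"kind": "page", "url": response.url}

    def init_parse(self, response):
        # Landing-page-only item first, then the shared items.
        yield {"kind": "landing", "url": response.url}
        for item in self.extract_items(response):
            yield item

    # Same aliasing as in the question's code.
    parse_start_url = init_parse

    def newly_parse(self, response):
        for item in self.extract_items(response):
            yield item


spider = SharedHelperSpider()
response = SimpleNamespace(url="http://example.com/a")
from_landing = list(spider.parse_start_url(response))
from_deeper = list(spider.newly_parse(response))
```

Both callbacks stay small, and the landing-page logic never has to know which URLs are start URLs.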

If you're going to yield multiple items from newly_parse, the corresponding line in init_parse should be:

    for item in self.newly_parse(response):
        yield item

This is because self.newly_parse then returns a generator, which you need to iterate through yourself; if you yield the generator object directly, Scrapy won't recognize it.
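
To make the difference concrete, here is a small sketch that runs without Scrapy. It assumes plain dicts stand in for items and a SimpleNamespace stands in for the response; the class and method names other than init_parse and newly_parse are made up. On Python 3.3 and later, `yield from` does the same job as the explicit loop.

```python
from types import GeneratorType, SimpleNamespace


class SketchSpider:
    # Hypothetical stand-in for a spider class; not a real scrapy.Spider.

    def newly_parse(self, response):
        # A generator callback that produces several items.
        yield {"kind": "deep", "url": response.url}
        yield {"kind": "deep-2", "url": response.url}

    def init_parse_wrong(self, response):
        # Hands out the generator object itself instead of its items.
        yield self.newly_parse(response)
        yield {"kind": "landing", "url": response.url}

    def init_parse(self, response):
        # Re-yields each item; same effect as the for loop in the answer.
        yield from self.newly_parse(response)
        yield {"kind": "landing", "url": response.url}


spider = SketchSpider()
response = SimpleNamespace(url="http://example.com/a")
wrong = list(spider.init_parse_wrong(response))
right = list(spider.init_parse(response))
```

Here `wrong[0]` is a generator object rather than an item, while `right` contains the two deeper items followed by the landing item.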

