Scrapy: how to have a response parsed by multiple parser functions?
I'd like to do some special handling on each of the landing URLs in start_urls, and then have the spider follow the next pages and crawl deeper. My code looks like this:
    def init_parse(self, response):
        item = MyItem()
        # extract info from the landing url and populate item fields here...
        yield self.newly_parse(response)
        yield item
        return

    parse_start_url = init_parse

    def newly_parse(self, response):
        item = MyItem2()
        newly_soup = BeautifulSoup(response.body)
        # parse, then return or yield items
        return item
This code won't work: a spider callback may only return an Item, a Request, or None, but yield self.newly_parse(response) yields a generator object instead. How can I achieve this in Scrapy?
My not-so-elegant solution: move the init_parse logic inside newly_parse and add an is_start_url check at the beginning; if response.url is in start_urls, go through the init_parse procedure.
Another ugly solution: separate out the code where the "parse, then return or yield items" part happens into its own class method or generator, and call that method or generator from both init_parse and newly_parse.
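A sketch of that shared-generator variant, again with plain-Python stand-ins (extract_items is a hypothetical name, and the response body is faked as a string):

    def extract_items(body):
        # the shared "parse, then return or yield items" code, factored out
        for i, word in enumerate(body.split()):
            yield {"index": i, "text": word}

    def init_parse(response):
        # landing-page item, plus everything the shared generator produces
        yield {"kind": "landing"}
        for item in extract_items(response):
            yield item

    def newly_parse(response):
        for item in extract_items(response):
            yield item

    from_landing = list(init_parse("alpha beta"))
    from_deep = list(newly_parse("alpha beta"))
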
If you're going to yield multiple items under newly_parse, the corresponding line under init_parse should be:

    for item in self.newly_parse(response):
        yield item

Since self.newly_parse returns a generator, you need to iterate through it and re-yield each item; Scrapy won't recognize the generator object itself.
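The difference between yielding the generator and iterating over it can be seen with plain generators, no Scrapy required (all names below are illustrative):

    def newly_parse(response):
        # pretend each page yields two parsed items
        yield {"page": response, "part": 1}
        yield {"page": response, "part": 2}

    def broken_init_parse(response):
        # BUG: yields the generator object itself, which Scrapy rejects
        yield newly_parse(response)

    def init_parse(response):
        # correct: iterate over the sub-generator, re-yielding each item
        for item in newly_parse(response):
            yield item
        yield {"page": response, "kind": "landing"}

    bad = list(broken_init_parse("p1"))
    good = list(init_parse("p1"))

On Python 3.3+ the loop can be shortened to yield from self.newly_parse(response).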