Scrapy: how to have a response parsed by multiple parser functions?
I'd like to do something special for each one of the landing URLs in start_urls, and then have the spider follow the next pages and crawl deeper. The code is like this:
    def init_parse(self, response):
        item = MyItem()
        # extract info from the landing url and populate item fields here...
        yield self.newly_parse(response)
        yield item
        return

    parse_start_url = init_parse

    def newly_parse(self, response):
        item = MyItem2()
        newly_soup = BeautifulSoup(response.body)
        # parse, then return or yield items
        return item

The code won't work, because a spider callback only allows returning an Item, a Request, or None, yet here I yield self.newly_parse(response). How can I achieve this in Scrapy?
My not-so-elegant solution:
Put the init_parse code inside newly_parse and implement an is_start_url check at the beginning: if response.url is in start_urls, go through the init_parse procedure first (see the sketch below).
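A minimal sketch of that workaround, assuming the same spider class and imports as the question (MyItem, MyItem2, and the field extraction are placeholders):

    def newly_parse(self, response):
        # is_start_url check: run the landing-page logic only when this
        # response came from one of the start URLs
        if response.url in self.start_urls:
            landing_item = MyItem()
            # extract info from the landing url and populate fields here...
            yield landing_item
        item = MyItem2()
        newly_soup = BeautifulSoup(response.body, "html.parser")
        # parse, then return or yield items
        yield item

Note that response.url has to match a start URL exactly, so any redirect silently breaks the check, which is part of why this feels fragile.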
Another ugly solution:
Separate out the code where # parse, return or yield items happens, make it a class method or generator, and call that method or generator from both init_parse and newly_parse, as sketched below.
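A sketch of that refactoring, with extract_items as a hypothetical name for the shared generator (everything else is carried over from the question):

    def extract_items(self, response):
        # the shared "parse, return or yield items" code lives here
        soup = BeautifulSoup(response.body, "html.parser")
        item = MyItem2()
        # populate item from soup here...
        yield item

    def init_parse(self, response):
        item = MyItem()
        # extract info from the landing url and populate item fields here...
        yield item
        # re-yield whatever the shared generator produces
        for extracted in self.extract_items(response):
            yield extracted

    parse_start_url = init_parse

    def newly_parse(self, response):
        for extracted in self.extract_items(response):
            yield extracted

This duplicates the delegation loop in both callbacks, which is why it reads as ugly, but it does keep the parsing logic in one place.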
If you're going to yield multiple items under newly_parse, the line under init_parse should be:

    for item in self.newly_parse(response):
        yield item

since self.newly_parse returns a generator, and you need to iterate through it yourself because Scrapy won't recognize it on its own.
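In context, init_parse would then look like this (a sketch within the same spider class; MyItem and the extraction logic are stand-ins from the question):

    def init_parse(self, response):
        item = MyItem()
        # extract info from the landing url and populate item fields here...
        # delegate to newly_parse and re-yield each item it produces
        for new_item in self.newly_parse(response):
            yield new_item
        yield item

On Python 3.3 or later, the loop can be collapsed to yield from self.newly_parse(response).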