Multithreading in Python/BeautifulSoup scraping doesn't speed up at all -
i have csv file ("somesitevalidurls.csv") listed links need scrape. code working , go through urls in csv, scrape information , record/save in csv file ("output.csv"). however, since planning large portion of site (for >10,000,000 pages), speed important. each link, takes 1s crawl , save info csv, slow magnitude of project. have incorporated multithreading module , surprise doesn't speed @ all, still takes 1s person link. did wrong? there other way speed processing speed?
without multithreading:
import urllib2 import csv bs4 import beautifulsoup import threading def crawltocsv(filename): open(filename, "rb") f: urlrecords in f: opensomesiteurl = urllib2.urlopen(urlrecords) soup_somesite = beautifulsoup(opensomesiteurl, "lxml") opensomesiteurl.close() tbodytags = soup_somesite.find("tbody") trtags = tbodytags.find_all("tr", class_="result-item ") placeholder = [] trtag in trtags: tdtags = trtag.find("td", class_="result-value") tdtags_string = tdtags.string placeholder.append(tdtags_string) open("output.csv", "ab") f: writefile = csv.writer(f) writefile.writerow(placeholder) crawltocsv("somesitevalidurls.csv") with multithreading:
import urllib2 import csv bs4 import beautifulsoup import threading def crawltocsv(filename): open(filename, "rb") f: urlrecords in f: opensomesiteurl = urllib2.urlopen(urlrecords) soup_somesite = beautifulsoup(opensomesiteurl, "lxml") opensomesiteurl.close() tbodytags = soup_somesite.find("tbody") trtags = tbodytags.find_all("tr", class_="result-item ") placeholder = [] trtag in trtags: tdtags = trtag.find("td", class_="result-value") tdtags_string = tdtags.string placeholder.append(tdtags_string) open("output.csv", "ab") f: writefile = csv.writer(f) writefile.writerow(placeholder) filename = "somesitevalidurls.csv" if __name__ == "__main__": t = threading.thread(target=crawltocsv, args=(filename, )) t.start() t.join()
you're not parallelizing properly. want have work being done inside loop happen concurrently across many workers. right you're moving all work 1 background thread, whole thing synchronously. that's not going improve performance @ (it hurt it, actually).
here's example uses threadpool parallelize network operation , parsing. it's not safe try write csv file across many threads @ once, instead return data have been written parent, , have parent write results file @ end.
import urllib2 import csv bs4 import beautifulsoup multiprocessing.dummy import pool # thread-based pool multiprocessing import cpu_count def crawltocsv(urlrecord): opensomesiteurl = urllib2.urlopen(urlrecord) soup_somesite = beautifulsoup(opensomesiteurl, "lxml") opensomesiteurl.close() tbodytags = soup_somesite.find("tbody") trtags = tbodytags.find_all("tr", class_="result-item ") placeholder = [] trtag in trtags: tdtags = trtag.find("td", class_="result-value") tdtags_string = tdtags.string placeholder.append(tdtags_string) return placeholder if __name__ == "__main__": filename = "somesitevalidurls.csv" pool = pool(cpu_count() * 2) # creates pool cpu_count * 2 threads. open(filename, "rb") f: results = pool.map(crawltocsv, f) # results list of placeholder lists returned each call crawltocsv open("output.csv", "ab") f: writefile = csv.writer(f) result in results: writefile.writerow(result) note in python, threads speed i/o operations - because of gil, cpu-bound operations (like parsing/searching beautifulsoup doing) can't done in parallel via threads, because 1 thread can cpu-based operations @ time. still may not see speed hoping approach. when need speed cpu-bound operations in python, need use multiple processes instead of threads. luckily, can see how script performs multiple processes instead of multiple threads; change from multiprocessing.dummy import pool from multiprocessing import pool. no other changes required.
edit:
if need scale file 10,000,000 lines, you're going need adjust code bit - pool.map converts iterable pass list prior sending off workers, isn't going work 10,000,000 entry list; having whole thing in memory going bog down system. same issue storing results in list. instead, should use pool.imap:
imap(func, iterable[, chunksize])
a lazier version of map().
the chunksize argument same 1 used map() method. long iterables using large value chunksize can make job complete faster using default value of 1.
if __name__ == "__main__": filename = "somesitevalidurls.csv" file_lines = 10,000,000 num_workers = cpu_count() * 2 chunksize = file_lines // num_workers * 4 # try chunksize. you're going have tweak this, though. try smaller , lower values , see how performance changes. pool = pool(num_workers) open(filename, "rb") f: result_iter = pool.imap(crawltocsv, f) open("output.csv", "ab") f: writefile = csv.writer(f) result in result_iter: # lazily iterate on results. writefile.writerow(result) with imap, never put of f memory @ once, nor store results in memory @ once. ever have in memory chunksize lines of f, should more manageable.
Comments
Post a Comment