Multithreading in Python/BeautifulSoup scraping doesn't speed up at all -


i have csv file ("somesitevalidurls.csv") listed links need scrape. code working , go through urls in csv, scrape information , record/save in csv file ("output.csv"). however, since planning large portion of site (for >10,000,000 pages), speed important. each link, takes 1s crawl , save info csv, slow magnitude of project. have incorporated multithreading module , surprise doesn't speed @ all, still takes 1s person link. did wrong? there other way speed processing speed?

without multithreading:

import urllib2 import csv bs4 import beautifulsoup import threading  def crawltocsv(filename):      open(filename, "rb") f:         urlrecords in f:              opensomesiteurl = urllib2.urlopen(urlrecords)             soup_somesite = beautifulsoup(opensomesiteurl, "lxml")             opensomesiteurl.close()              tbodytags = soup_somesite.find("tbody")             trtags = tbodytags.find_all("tr", class_="result-item ")              placeholder = []              trtag in trtags:                 tdtags = trtag.find("td", class_="result-value")                 tdtags_string = tdtags.string                 placeholder.append(tdtags_string)              open("output.csv", "ab") f:                 writefile = csv.writer(f)                 writefile.writerow(placeholder)  crawltocsv("somesitevalidurls.csv") 

with multithreading:

import urllib2 import csv bs4 import beautifulsoup import threading  def crawltocsv(filename):      open(filename, "rb") f:         urlrecords in f:              opensomesiteurl = urllib2.urlopen(urlrecords)             soup_somesite = beautifulsoup(opensomesiteurl, "lxml")             opensomesiteurl.close()              tbodytags = soup_somesite.find("tbody")             trtags = tbodytags.find_all("tr", class_="result-item ")              placeholder = []              trtag in trtags:                 tdtags = trtag.find("td", class_="result-value")                 tdtags_string = tdtags.string                 placeholder.append(tdtags_string)              open("output.csv", "ab") f:                 writefile = csv.writer(f)                 writefile.writerow(placeholder)  filename = "somesitevalidurls.csv"  if __name__ == "__main__":     t = threading.thread(target=crawltocsv, args=(filename, ))     t.start()     t.join() 

you're not parallelizing properly. want have work being done inside loop happen concurrently across many workers. right you're moving all work 1 background thread, whole thing synchronously. that's not going improve performance @ (it hurt it, actually).

here's example uses threadpool parallelize network operation , parsing. it's not safe try write csv file across many threads @ once, instead return data have been written parent, , have parent write results file @ end.

import urllib2 import csv bs4 import beautifulsoup multiprocessing.dummy import pool  # thread-based pool multiprocessing import cpu_count  def crawltocsv(urlrecord):     opensomesiteurl = urllib2.urlopen(urlrecord)     soup_somesite = beautifulsoup(opensomesiteurl, "lxml")     opensomesiteurl.close()      tbodytags = soup_somesite.find("tbody")     trtags = tbodytags.find_all("tr", class_="result-item ")      placeholder = []      trtag in trtags:         tdtags = trtag.find("td", class_="result-value")         tdtags_string = tdtags.string         placeholder.append(tdtags_string)      return placeholder   if __name__ == "__main__":     filename = "somesitevalidurls.csv"     pool = pool(cpu_count() * 2)  # creates pool cpu_count * 2 threads.     open(filename, "rb") f:         results = pool.map(crawltocsv, f)  # results list of placeholder lists returned each call crawltocsv     open("output.csv", "ab") f:         writefile = csv.writer(f)         result in results:             writefile.writerow(result) 

note in python, threads speed i/o operations - because of gil, cpu-bound operations (like parsing/searching beautifulsoup doing) can't done in parallel via threads, because 1 thread can cpu-based operations @ time. still may not see speed hoping approach. when need speed cpu-bound operations in python, need use multiple processes instead of threads. luckily, can see how script performs multiple processes instead of multiple threads; change from multiprocessing.dummy import pool from multiprocessing import pool. no other changes required.

edit:

if need scale file 10,000,000 lines, you're going need adjust code bit - pool.map converts iterable pass list prior sending off workers, isn't going work 10,000,000 entry list; having whole thing in memory going bog down system. same issue storing results in list. instead, should use pool.imap:

imap(func, iterable[, chunksize])

a lazier version of map().

the chunksize argument same 1 used map() method. long iterables using large value chunksize can make job complete faster using default value of 1.

if __name__ == "__main__":     filename = "somesitevalidurls.csv"     file_lines = 10,000,000     num_workers = cpu_count() * 2     chunksize = file_lines // num_workers * 4  # try chunksize. you're going have tweak this, though. try smaller , lower values , see how performance changes.     pool = pool(num_workers)      open(filename, "rb") f:         result_iter = pool.imap(crawltocsv, f)     open("output.csv", "ab") f:         writefile = csv.writer(f)         result in result_iter:  # lazily iterate on results.             writefile.writerow(result) 

with imap, never put of f memory @ once, nor store results in memory @ once. ever have in memory chunksize lines of f, should more manageable.


Comments

Popular posts from this blog

java - How to specify maven bin in eclipse maven plugin? -

single sign on - Logging into Plone site with credentials passed through HTTP -

php - Why does AJAX not process login form? -