python - Why does zip truncate the data in PySpark?


I'm experiencing strange behavior using zip. I'm trying to build an RDD of key-value pairs where the value is an index. For example, I initialize an RDD 'f':

    f = sc.parallelize(tokenizer('a fox jumped on rabbit')).flatMap(lambda x: ngrams(x))
    f.count()
    52

Then I do:

    ind = sc.parallelize(range(f.count()))
    ind.count()
    52

But zipping the two gives fewer elements:

    f_ind = f.zip(ind)
    f_ind.count()
    48

I don't understand why elements are being lost.

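The problem is that Spark's RDD zip operation requires the two RDDs to have the same number of elements and the same number of elements per partition. The latter requirement is violated in the case above: 'f' comes out of a flatMap, so its partitions can hold different numbers of elements, while 'ind' is split by parallelize without regard to how 'f' is laid out. There doesn't seem to be a fix within zip itself (but see e.g. http://www.adamcrume.com/blog/archive/2014/02/19/fixing-sparks-rdd-zip).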
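To make the per-partition requirement concrete, here is a minimal pure-Python sketch that needs no Spark installation. It models partitions as plain lists, which is an assumption for illustration and not how Spark stores data. The helper names (`split_evenly`, `zip_partitions`, `zip_with_index`, `count`) and the partition sizes 17, 13, 13 and 9 are hypothetical; the sizes were chosen only so the totals match the counts in the question, and the real layout of 'f' is unknown.

```python
# Pure-Python model of a partition-wise zip; partitions are plain lists.
# All names and partition sizes here are hypothetical, for illustration only.

def split_evenly(items, num_partitions):
    """Split items into contiguous slices of near-equal size."""
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

def zip_partitions(left, right):
    """Pair partitions positionally; Python's zip stops at the shorter one."""
    return [list(zip(a, b)) for a, b in zip(left, right)]

def zip_with_index(partitions):
    """Pair every element with a global index using per-partition offsets."""
    result, offset = [], 0
    for part in partitions:
        result.append([(item, offset + i) for i, item in enumerate(part)])
        offset += len(part)
    return result

def count(partitions):
    """Total number of elements across all partitions."""
    return sum(len(p) for p in partitions)

# Uneven partitions, as a flatMap might produce (sizes are made up).
f_parts = [["x"] * 17, ["x"] * 13, ["x"] * 13, ["x"] * 9]
# An index range of the same total length, split into equal slices of 13.
ind_parts = split_evenly(list(range(count(f_parts))), 4)

zipped = zip_partitions(f_parts, ind_parts)
indexed = zip_with_index(f_parts)

print(count(f_parts), count(ind_parts), count(zipped))  # 52 52 48
print(count(indexed))                                   # 52
```

In this model the first partition loses 4 elements because its counterpart holds only 13 indices, and the last partition leaves 4 indices unused, so the zipped total drops to 48 even though both inputs hold 52. Computing the index from per-partition offsets, as `zip_with_index` does, avoids the mismatch because no second collection has to line up. PySpark exposes a built-in `RDD.zipWithIndex()` that attaches indices directly, so `f.zipWithIndex()` may be a simpler route where it is available.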

