python - why does zip truncate the data in pyspark?
I'm experiencing strange behavior using zip: I'm trying to build an RDD of key-value pairs where the value is the element's index. E.g., I initialize an RDD f:
>>> f = sc.parallelize(tokenizer('a fox jumped on rabbit')).flatMap(lambda x: ngrams(x))
>>> f.count()
52
and do:
>>> ind = sc.parallelize(range(f.count()))
>>> ind.count()
52
but
>>> f_ind = f.zip(ind)
>>> f_ind.count()
48
I don't understand why elements are being lost.
The problem is that Spark's RDD zip operation requires the two RDDs to have the same number of elements *and* the same number of elements per partition. The second requirement is violated in the case above: f is produced by a flatMap, so its partition sizes are irregular, while ind is sliced evenly by parallelize. When the per-partition counts disagree, zip truncates each pair of partitions to the shorter one, which is where the missing elements go. There doesn't seem to be a built-in fix (but see e.g. http://www.adamcrume.com/blog/archive/2014/02/19/fixing-sparks-rdd-zip).
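If the goal is simply to pair each element with its index, a cleaner route than zipping two independently partitioned RDDs is zipWithIndex(), available in newer PySpark versions. Below is a minimal sketch, assuming a live SparkContext named sc (as in the pyspark shell); the literal phrase list is a stand-in for the tokenizer/ngrams pipeline from the question:

# Build an RDD whose per-partition counts are irregular, as flatMap tends
# to produce: the partition sizes here are [2, 1, 2].
words = sc.parallelize(['a fox', 'jumped', 'on rabbit'], 3).flatMap(str.split)

# parallelize slices range(5) evenly, giving partition sizes [1, 2, 2].
ind = sc.parallelize(range(words.count()), 3)

# glom() turns each partition into a list, exposing the mismatch that
# breaks zip(): [2, 1, 2] vs. [1, 2, 2].
print(words.glom().map(len).collect())
print(ind.glom().map(len).collect())

# zipWithIndex() attaches indices in a single pass over one RDD, so no
# partition alignment between two RDDs is needed.
print(words.zipWithIndex().collect())
# [('a', 0), ('fox', 1), ('jumped', 2), ('on', 3), ('rabbit', 4)]

With the mismatched partitions above, words.zip(ind) would keep only min(2,1) + min(1,2) + min(2,2) = 4 of the 5 elements, which mirrors the 48-of-52 count in the question.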