hadoop - Spark job taking a long time to finish for Levenshtein algorithm processing 200MB in EMR 12-node cluster?


I have a 12-node cluster in AWS EMR, with 8 virtual cores and 12 GB of memory per node. The job processes roughly 200 MB of data against a 30 MB file, applying the Levenshtein algorithm to the data. The command used is:

nohup spark-submit --num-executors 48 --executor-cores 2 --executor-memory 2g textmatcher.py > outputcluster &

The code used is:

from fuzzywuzzy import process  # assuming extractOne comes from the fuzzywuzzy package

exportrdd = sc.parallelize(exportlist)
# for each export name, find the closest match in registerlist
matchrdd = exportrdd.map(lambda x: (x, process.extractOne(x, registerlist)[0]))
# map each matched pair back to ids/names via the lookup dicts
outputrdd = matchrdd.map(lambda xy: (exportnamedict[xy[0]], establismentid[xy[1]], registerednamesdict[xy[1]]))
outputrdd.saveAsTextFile(output_file)
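For comparison, here is a minimal sketch of the same matching step using a Spark broadcast variable for registerlist, so the roughly 30 MB reference list is shipped to each executor once instead of being serialized into every task closure. The variable names and the fuzzywuzzy import are assumed from the snippet above, and the partition count is an assumption, not a tuned value:

from fuzzywuzzy import process

# ship the (assumed ~30 MB) reference list to each executor once
registerlist_bc = sc.broadcast(registerlist)

# 96 partitions is an assumed value matching the 96 requested
# executor cores (48 executors x 2 cores), so every core gets work
exportrdd = sc.parallelize(exportlist, numSlices=96)
matchrdd = exportrdd.map(
    lambda x: (x, process.extractOne(x, registerlist_bc.value)[0])
)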

Can you tell me how to optimize it? Please let me know if any other information is required.

