hadoop - Spark job taking a long time to finish for Levenshtein algorithm processing 200MB in EMR 12 node cluster?
I have a 12-node cluster in AWS EMR, each node with 8 virtual cores and 12 GB of memory. The data processed is about 200 MB, matched against a 30 MB file. I am applying the Levenshtein algorithm to the data. Command used:
nohup spark-submit --num-executors 48 --executor-cores 2 --executor-memory 2g textmatcher.py > outputcluster &
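For reference, here is a quick back-of-the-envelope check of what that submit command requests against the cluster described above; this is a hedged sketch using only the figures from the question, and it ignores YARN overhead and memory reserved for the driver and OS:

# Hypothetical resource check using the numbers from the question:
# 12 nodes x 8 vcores x 12 GB, vs. --num-executors 48 --executor-cores 2 --executor-memory 2g
nodes, vcores_per_node, mem_per_node_gb = 12, 8, 12
num_executors, exec_cores, exec_mem_gb = 48, 2, 2
print(num_executors * exec_cores, "cores requested of", nodes * vcores_per_node)  # 96 of 96
print(num_executors * exec_mem_gb, "GB requested of", nodes * mem_per_node_gb)    # 96 of 144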
The code used is:
exportrdd = sc.parallelize(exportlist)
# extractOne returns (best_match, score); [0] keeps the matched string
matchrdd = exportrdd.map(lambda x: (x, process.extractOne(x, registerlist)[0]))
outputrdd = matchrdd.map(lambda (x, y): (exportnamedict[x], establismentid[y], registerednamesdict[y]))
outputrdd.saveAsTextFile(output_file)
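For context, process.extractOne appears to come from the fuzzywuzzy library. Below is a minimal self-contained sketch of the matching step under that assumption, with made-up sample data; it also broadcasts the register list so the whole list is not serialized into every task closure (the broadcast is a suggested variation, not part of the original code):

# Minimal sketch, assuming process.extractOne is fuzzywuzzy's and that
# exportlist/registerlist are plain lists of strings.
from pyspark import SparkContext
from fuzzywuzzy import process

sc = SparkContext(appName="textmatcher-sketch")

exportlist = ["ACME Corp", "Globex"]                 # hypothetical sample data
registerlist = ["Acme Corporation", "Globex Inc"]    # hypothetical sample data

# Broadcast once per executor instead of shipping the list with every task.
registerlist_bc = sc.broadcast(registerlist)

exportrdd = sc.parallelize(exportlist)
matchrdd = exportrdd.map(
    lambda x: (x, process.extractOne(x, registerlist_bc.value)[0]))
print(matchrdd.collect())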
Can you tell me how to optimize it? Please let me know if any other information is required.