python - pyspark on cluster, make sure all nodes are used -


deployment info: "pyspark --master yarn-client --num-executors 16 --driver-memory 16g --executor-memory 2g "

i turning 100,000 line text file (in hdfs dfs format) rdd object corpus = sc.textfile("my_file_name"). when execute corpus.count() 100000. realize these steps performed on master node.

now, question when perform action new_corpus=corpus.map(some_function), job automatically distributed pyspark among available slaves (16 in case)? or have specify something?

notes:

  • i don't think gets distributed (or @ least not on 16 nodes) because when new_corpus.count(), prints out [stage some_number:> (0+2)/2], not [stage some_number:> (0+16)/16]
  • i don't think doing corpus = sc.textfile("my_file_name",16) solution me because function want apply works @ line level , therefore should applied 100,000 times (the goal of parallelization speed process, having each slave taking 100000/16 lines). should not applied 16 times on 16 subsets of original text file.

your observations not correct. stages not "executors". in spark have jobs, tasks , stages. job kicked off master driver , task assigned different worker nodes stage collection of task has same shuffling dependencies. in case shuffling happens once.

to check if executors 16, have resource manager. @ port 4040 since using yarn.

also if use rdd.map(), should parallelize according defined partitions , not executors set in sc.textfile("my_file_name", numpartitions).

here overview again: https://spark.apache.org/docs/1.6.0/cluster-overview.html


Comments

Popular posts from this blog

Hatching array of circles in AutoCAD using c# -

ios - UITEXTFIELD InputView Uipicker not working in swift -