python - PySpark on a cluster: making sure all nodes are used
Deployment info: "pyspark --master yarn-client --num-executors 16 --driver-memory 16g --executor-memory 2g"
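Just to confirm these settings are actually picked up, they can be read back from inside the shell (a quick sketch; spark.executor.instances is the configuration key that --num-executors sets):

conf = sc.getConf()
print(conf.get("spark.executor.instances", "not set"))   # expect '16'
print(conf.get("spark.executor.memory", "not set"))      # expect '2g'
print(sc.defaultParallelism)                              # default parallelism hint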
I am turning a 100,000-line text file (stored in HDFS) into an RDD object with corpus = sc.textFile("my_file_name"). When I execute corpus.count(), I get 100000. I realize that these steps are performed on the master node.

Now, my question is: when I perform the action new_corpus = corpus.map(some_function), will the job automatically be distributed by PySpark among all the available slaves (16 in my case)? Or do I have to specify something?
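For reference, the whole sequence is roughly the following (some_function here is just a stand-in for my real line-level function):

corpus = sc.textFile("my_file_name")      # 100,000-line text file on HDFS
print(corpus.count())                     # -> 100000

def some_function(line):
    return line.upper()                   # placeholder for the real per-line work

new_corpus = corpus.map(some_function)    # transformation, evaluated lazily
print(new_corpus.count())                 # action that actually runs the job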
Notes:
- I don't think it gets distributed (or at least not across the 16 nodes), because when I run new_corpus.count(), what prints out is [Stage some_number:> (0+2)/2], not [Stage some_number:> (0+16)/16] (see the partition check after these notes).
- I don't think doing corpus = sc.textFile("my_file_name", 16) is the solution for me, because the function I want to apply works at the line level and therefore should be applied 100,000 times (the goal of the parallelization being to speed up the process by having each slave take 100000/16 lines). It should not be applied 16 times on 16 subsets of the original text file.
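For reference, here is the partition count of the RDD as I currently load it (a sketch, same sc as above):

corpus = sc.textFile("my_file_name")
print(corpus.getNumPartitions())          # 2 in my case, matching the (0+2)/2 in the progress bar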
Your observations are not correct. Stages are not "executors". In Spark there are jobs, tasks and stages. A job is kicked off by the driver on the master, its tasks are assigned to different worker nodes, and a stage is a collection of tasks that share the same shuffling dependencies. In your case shuffling happens only once.
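To make the terminology concrete, a rough sketch of where stage and task boundaries come from (the file name is a placeholder):

rdd = sc.textFile("my_file_name", 16)                    # at least 16 partitions -> that many tasks per stage
pairs = rdd.map(lambda line: (line.split(" ")[0], 1))    # narrow transformation, stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)           # shuffle dependency -> a second stage
counts.count()                                           # one job, two stages, tasks spread over the executors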
To check whether you really have 16 executors, you have to look at the resource manager. Since you are using YARN, the Spark web UI is at port 4040 of the driver.
Also, if you use rdd.map(), it parallelizes according to the partitions you defined, not according to the executors; you set the partitions in sc.textFile("my_file_name", numPartitions).
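For example, to get 16-way parallelism for a line-level map(), raise the number of partitions; each partition becomes one task, and the function is still applied once per line (a sketch reusing the names from the question):

corpus = sc.textFile("my_file_name", 16)      # ask for at least 16 input partitions
# or: corpus = sc.textFile("my_file_name").repartition(16)
print(corpus.getNumPartitions())              # number of tasks per stage
new_corpus = corpus.map(some_function)        # still one call per line, 100,000 in total
print(new_corpus.count())                     # progress bar now shows (0+16)/16 (or more, since 16 is a minimum)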
Here is an overview again: https://spark.apache.org/docs/1.6.0/cluster-overview.html