Spark reading data from MySQL in parallel


I'm trying to read data from MySQL and write it as a Parquet file to S3, in specific partitions, as follows:

    from pyspark.sql.functions import to_date

    df = sqlContext.read.format('jdbc') \
        .options(driver='com.mysql.jdbc.Driver',
                 url="jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
                 dbtable='tbl',
                 numPartitions=4) \
        .load()

    df2 = df.withColumn('updated_date', to_date(df.updated_at))

    df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])

My problem is that Spark opens only one connection to MySQL (instead of 4), and it doesn't start writing Parquet until it has fetched all the data from MySQL. Because the table in MySQL is huge (100M rows), the process fails with an OutOfMemory error.

is there way configure spark open more 1 connection mysql , write partial data parquet?

You should set these properties:

    partitionColumn, lowerBound, upperBound, numPartitions

as documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
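For example, a minimal sketch of the read side: the column name id and the bound values below are assumptions (you'd use a numeric column from your own table and its actual minimum and maximum); numPartitions then controls how many concurrent JDBC connections Spark opens, each reading one slice of that range.

    # Sketch only: `id`, lowerBound, and upperBound are assumed values --
    # replace them with a real numeric column and its actual min/max.
    df = (sqlContext.read.format('jdbc')
          .options(driver='com.mysql.jdbc.Driver',
                   url="jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
                   dbtable='tbl',
                   partitionColumn='id',     # numeric column to split the reads on
                   lowerBound='1',           # approximate minimum of that column
                   upperBound='100000000',   # approximate maximum of that column
                   numPartitions=4)          # 4 concurrent JDBC connections
          .load())

With all four options set, each of the 4 partitions fetches roughly a quarter of the id range, so the data is read (and can be written out) in parallel instead of through a single connection.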

