Spark reading data from MySQL in parallel
I'm trying to read data from MySQL and write it to a Parquet file in S3 with specific partitions, as follows:
    from pyspark.sql.functions import to_date

    df = (sqlContext.read.format('jdbc')
          .options(driver='com.mysql.jdbc.Driver',
                   url="""jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>""",
                   dbtable='tbl',
                   numPartitions=4)
          .load())

    df2 = df.withColumn('updated_date', to_date(df.updated_at))
    df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])
My problem is that Spark opens only one connection to MySQL (instead of 4) and doesn't write anything to Parquet until it has fetched all the data from MySQL. Because the table in MySQL is huge (100M rows), the process fails with an OutOfMemory error.
Is there a way to configure Spark so that it opens more than one connection to MySQL and writes partial data to Parquet?
You should set these properties:

partitionColumn, lowerBound, upperBound, numPartitions
as documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
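For example, here is a minimal sketch of the question's read with those properties set. It assumes `id` is a numeric (ideally indexed) column of `tbl` whose values fall roughly between 1 and 100,000,000; the column name and the bounds are placeholders you'd replace with values that match your own table:

    from pyspark.sql.functions import to_date

    # Sketch: 'id', lowerBound and upperBound are assumptions -- adjust them
    # to a numeric column and value range that exist in your table.
    df = (sqlContext.read.format('jdbc')
          .options(driver='com.mysql.jdbc.Driver',
                   url="""jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>""",
                   dbtable='tbl',
                   partitionColumn='id',      # numeric column Spark splits the reads on
                   lowerBound='1',            # smallest value of partitionColumn
                   upperBound='100000000',    # largest value of partitionColumn
                   numPartitions='4')         # 4 partitions -> 4 parallel JDBC connections
          .load())

    df2 = df.withColumn('updated_date', to_date(df.updated_at))
    df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])

With these options, Spark splits the range [lowerBound, upperBound) into numPartitions chunks and issues one query (and one connection) per chunk, so each partition can be fetched and written out independently instead of the whole table being pulled through a single connection.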