apache spark - Transforming an RDD to DataFrame


Hi, I'm new to Spark and I'm trying to transform an RDD into a DataFrame. The RDD points to a folder with many .txt files in it, and each one of them holds a paragraph of text. Assume the RDD is created like this:

val data = sc.textFile("data")

I want to transform data into a DataFrame like this:

+------------+------+
|text        | code |
+------------+------+
|data of txt1|  1.0 |
|data of txt2|  1.0 |
+------------+------+

So the column "text" should hold the raw data of each txt file, and the column "code" should be 1.0. Any help is appreciated.
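Since each .txt file holds a single paragraph, one way to build exactly this DataFrame is `SparkContext.wholeTextFiles`, which yields one (path, content) pair per file. The sketch below assumes a Spark 1.x shell where `sc` and `sqlContext` are already in scope (as in the answer), and a folder path "data"; it is an illustration, not a tested solution:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// wholeTextFiles returns one (path, fileContent) pair per file,
// so each file's whole paragraph becomes a single record
val files = sc.wholeTextFiles("data")

// Keep only the content, paired with the constant 1.0 for "code"
val rows = files.map { case (_, content) => Row(content, 1.0) }

val schema = StructType(Seq(
  StructField("text", StringType, true),
  StructField("code", DoubleType, false)))

val df = sqlContext.createDataFrame(rows, schema)
df.show()
```

Unlike `textFile`, which splits input into lines, `wholeTextFiles` keeps each file intact, which matches the one-row-per-file layout asked for here.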

val data = sc.textFile("data.txt")

// The schema is encoded in a string
val schemaString = "text code"

// Import Row
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (data) to Rows
val rowRDD = data.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD
val dataDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table
dataDataFrame.registerTempTable("data")

// SQL statements can be run using the sql methods provided by sqlContext
val results = sqlContext.sql("SELECT text FROM data")

Adding all the data files at once is not a good idea, as all the data would be loaded into memory. Going one file at a time is a better way.

But then again, depending on your use case, if you do need all the data files, you need to append the RDDs somehow.
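Appending RDDs can be done with `RDD.union`, which concatenates two RDDs of the same element type. A minimal sketch, assuming `sc` is in scope and the file paths are placeholders:

```scala
// Hypothetical paths, shown only for illustration
val part1 = sc.textFile("data/txt1.txt")
val part2 = sc.textFile("data/txt2.txt")

// union concatenates the two RDDs lazily, without shuffling the data
val combined = part1.union(part2)
```

Because RDDs are lazy, the union itself is cheap; the memory cost only materializes when an action (like `collect` or `createDataFrame` plus a query) pulls the data in.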

Hope this answers your question! Cheers! :)

