apache spark - Transforming an RDD to DataFrame -
Hi, I'm new to Spark and I'm trying to transform an RDD into a DataFrame. The RDD is built from a folder containing many .txt files, and each one of them holds a paragraph of text. Assume the RDD is created like this:
val data = sc.textFile("data")
I want to transform data into a DataFrame like this:
+------------+------+
|text        | code |
+------------+------+
|data of txt1| 1.0  |
|data of txt2| 1.0  |
+------------+------+
So the column "text" should hold the raw data of each .txt file, and the column "code" should be 1.0. Any help is appreciated.
val data = sc.textFile("data.txt")

// The schema is encoded in a string
val schemaString = "text code"

// Import Row
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (data) to Rows
val rowRDD = data.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD
val dataDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table
dataDataFrame.registerTempTable("data")

// SQL statements can be run using the sql methods provided by sqlContext
val results = sqlContext.sql("SELECT text FROM data")
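Note that splitting each line on "," only works if the input already contains text,code pairs per line. Since the question has one paragraph per .txt file, a closer fit is sc.wholeTextFiles, which yields one (path, content) record per file. This is only a sketch, assuming Spark 1.x with a sqlContext in scope, and "data" is the folder path from the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// One record per file: (file path, full file content)
val files = sc.wholeTextFiles("data")

// Build a Row per file: the raw text plus the constant code 1.0
val rowRDD = files.map { case (_, content) => Row(content, 1.0) }

val schema = StructType(Seq(
  StructField("text", StringType, true),
  StructField("code", DoubleType, true)))

val dataDataFrame = sqlContext.createDataFrame(rowRDD, schema)

This produces exactly the two-column DataFrame asked for, with one row per .txt file.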
Adding all the data to one file is not a good idea, as all the data would be loaded into memory. Going one file at a time is a better way.
But again, depending on your use case, if you need the data from all the files, you will need to append the RDDs somehow.
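For appending, RDDs can be combined with union. A minimal sketch (the paths "data1" and "data2" are hypothetical):

// Combine RDDs loaded from separate directories
val part1 = sc.textFile("data1")
val part2 = sc.textFile("data2")
val all = part1.union(part2)
// For many RDDs at once: sc.union(Seq(part1, part2, ...))

union is a cheap, lazy operation; the files are still read one partition at a time when an action runs.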
Hope this answers your question! Cheers! :)