How do I convert Array[Row] to RDD[Row]?
I have a scenario where I want to convert the result of a dataframe which is in the format Array[Row] to RDD[Row]. I have tried using parallelize, but I don't want to use it as it needs to contain entire data in a single system which is not feasible in production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives this error:
value rdd is not a member of Array[(String, String)]
Bid, which you've created here, is not a DataFrame; it is an Array[Row], which is why you can't call .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
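If the end goal is the (String, String) pairs from the question, the same mapping can be applied to the RDD rather than to a collected array. The following is only a minimal sketch: the local SparkSession, the object and method names, and the small in-memory stand-in for the stb table are assumptions made for illustration, and both columns are assumed to be strings.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object DataFrameToRdd {
  // Returns the query result as an RDD[Row]; no collect(), so the data stays distributed.
  def distinctRows(spark: SparkSession): RDD[Row] =
    spark.sql("select distinct DeviceId, ButtonName from stb").rdd

  // Same mapping as in the question, applied to the RDD instead of an Array.
  // Assumes both columns are of string type.
  def distinctPairs(spark: SparkSession): RDD[(String, String)] =
    distinctRows(spark).map(r => (r.getString(0), r.getString(1)))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("df-to-rdd").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the stb table from the question.
    Seq(("d1", "play"), ("d1", "play"), ("d2", "pause"))
      .toDF("DeviceId", "ButtonName")
      .createOrReplaceTempView("stb")

    // take(n) brings only a few rows to the driver, unlike collect().
    distinctPairs(spark).take(10).foreach(println)
    spark.stop()
  }
}
```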
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into driver memory; it is not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you call collect on the DataFrame, you have already pulled the entire data set into a single JVM's memory (the driver's). So parallelize is not the issue; collect is.
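For completeness, if you really do start from an Array[Row] that is already on the driver (and is therefore small enough to fit there), parallelize is the way to turn it back into an RDD[Row]. A minimal sketch, assuming a local SparkSession and hypothetical sample rows; the object and method names are illustrative only:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object ArrayToRdd {
  // Distributes an already-collected Array[Row] across the cluster.
  // The array is in driver memory at this point, so it has to fit there.
  def toRdd(spark: SparkSession, rows: Array[Row]): RDD[Row] =
    spark.sparkContext.parallelize(rows.toSeq)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("array-to-rdd").getOrCreate()

    // Hypothetical rows standing in for the result of collect().
    val collected: Array[Row] = Array(Row("d1", "play"), Row("d2", "pause"))

    val rdd: RDD[Row] = toRdd(spark, collected)
    println(rdd.count())
    spark.stop()
  }
}
```

This only makes sense for data that was meant to live on the driver in the first place; for a large query result, skipping collect and calling .rdd on the DataFrame avoids the round trip entirely.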