Convert CSV to Avro file in Java or Scala

Is there any library for converting a CSV file to Avro in Java or Scala?

I tried googling it, but was not able to find any library for it.

3 answers

  • answered 2018-01-14 10:55 pedrorijo91

    By googling I found this article: https://dzone.com/articles/convert-csv-data-avro-data

    quoting:

    To convert CSV data to Avro data using Hive, we need to follow the steps below:

    1. Create a Hive table stored as a text file, specifying your CSV delimiter.
    2. Load the CSV file into the above table using the "load data" command.
    3. Create another Hive table using AvroSerDe.
    4. Insert data from the former table into the new Avro Hive table using the "insert overwrite" command.

    Example: using a CSV (student_id, subject_id, marks)

    --1. Create a Hive table stored as textfile
    USE test;
    CREATE TABLE csv_table (
      student_id INT,
      subject_id INT,
      marks INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    
    --2. Load csv_table with student.csv data
    LOAD DATA LOCAL INPATH "/path/to/student.csv" OVERWRITE INTO TABLE test.csv_table;
    
    --3. Create another Hive table using AvroSerDe
    CREATE TABLE avro_table
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    TBLPROPERTIES (
        'avro.schema.literal'='{
          "namespace": "com.rishav.avro",
          "name": "student_marks",
          "type": "record",
          "fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}]
        }');
    
    --4. Load avro_table with data from csv_table
    INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;
    

  • answered 2018-01-14 10:55 BluEOS

    You can do it easily by:
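
    (The snippet that followed is missing, so this is a stand-in sketch rather than the original answer. One plausible approach is the plain Apache Avro Java library, used here from Scala; the schema, field names, and file paths are illustrative, matching the question's (student_id, subject_id, marks) CSV.)

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
    import scala.io.Source

    object CsvToAvro {
      def main(args: Array[String]): Unit = {
        // Illustrative schema for a (student_id, subject_id, marks) CSV
        val schemaJson =
          """{
            |  "type": "record",
            |  "name": "student_marks",
            |  "fields": [
            |    {"name": "student_id", "type": "int"},
            |    {"name": "subject_id", "type": "int"},
            |    {"name": "marks", "type": "int"}
            |  ]
            |}""".stripMargin
        val schema = new Schema.Parser().parse(schemaJson)

        // An Avro container file carries its schema in the file header
        val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
        writer.create(schema, new File("student.avro"))

        // Naive CSV parsing: split on commas, no quoting or escaping handled
        for (line <- Source.fromFile("student.csv").getLines()) {
          val Array(studentId, subjectId, marks) = line.split(",").map(_.trim)
          val record = new GenericData.Record(schema)
          record.put("student_id", studentId.toInt)
          record.put("subject_id", subjectId.toInt)
          record.put("marks", marks.toInt)
          writer.append(record)
        }
        writer.close()
      }
    }

    This only needs the org.apache.avro:avro artifact on the classpath; for real-world CSV (quoted fields, embedded commas) swap the naive split for a proper CSV parser.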

  • answered 2018-01-14 10:55 Bala

    You could try it this way (Spark 1.6).

    people.csv
    
    Michael, 29
    Andy, 30
    Justin, 19
    

    PySpark

    file = sc.textFile("people.csv")
    df = file.map(lambda line: line.split(',')).toDF(['name','age'])
    
    >>> df.show()
    +-------+---+
    |   name|age|
    +-------+---+
    |Michael| 29|
    |   Andy| 30|
    | Justin| 19|
    +-------+---+
    
    df.write.format("com.databricks.spark.avro").save("peopleavro")
    

    peopleavro

    {u'age': u' 29', u'name': u'Michael'}
    {u'age': u' 30', u'name': u'Andy'}
    {u'age': u' 19', u'name': u'Justin'}
    

    Should you need to maintain data types, create a schema and pass it when converting to a DataFrame, as sketched below.
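
    The example above is PySpark; since the question asks for Java or Scala, here is a hedged Scala sketch of the same idea with an explicit schema. It uses the same spark-avro package as the answer; everything else is stock Spark 1.6, and the paths are the answer's illustrative ones.

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)

    // Explicit schema so age is written as an int instead of a string
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    val rows = sc.textFile("people.csv")
      .map(_.split(","))
      .map(f => Row(f(0).trim, f(1).trim.toInt)) // trim the padded " 29" values

    val df = sqlContext.createDataFrame(rows, schema)
    df.write.format("com.databricks.spark.avro").save("peopleavro")

    With the schema in place, the records come out with age as an Avro int rather than a string.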