Spark solution for multiline csv which has EOLs in text column

Multiline support for CSV is slated for Spark 2.2 (tracked in a Spark JIRA issue). Until then, if you are running into this problem while processing CSV files, you can try the steps below:
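Once you are on Spark 2.2 or later, the built-in CSV reader handles this directly via the `multiLine` option. A minimal sketch (the path and header option are placeholders; adapt them to your data):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("MultilineCsv")
        .getOrCreate();

// multiLine tells the CSV parser that quoted fields may span
// physical lines, so embedded EOLs no longer split records
Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("multiLine", "true")
        .csv("/path/to/file.csv");
```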

Get the InputFormat and record-reader classes from the Git repository into your code base and use them as follows:
Java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

//implementation


JavaPairRDD<LongWritable, Text> rdd = context.newAPIHadoopFile(
        <CSV file path>,
        FileCleaningInputFormat.class,
        LongWritable.class,
        Text.class,
        new Configuration());

JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());
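The reason an input format like this works is that its record reader only ends a record when it is outside a quoted field. The core idea can be sketched in plain Java (a hypothetical helper, not the actual class from the repository): count double quotes per physical line, and if a line leaves an odd number of quotes open, the newline is inside a field and the next physical line belongs to the same logical record.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRecordJoiner {

    // Returns true if this fragment flips the quote state, i.e. it
    // contains an odd number of double-quote characters. Escaped
    // quotes ("") contribute an even count, so they do not flip it.
    private static boolean togglesQuoteState(String fragment) {
        int quotes = 0;
        for (int i = 0; i < fragment.length(); i++) {
            if (fragment.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }

    // Joins physical lines into logical CSV records: a newline that
    // falls inside an open quoted field is part of the field value.
    public static List<String> joinRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : physicalLines) {
            if (insideQuotes) {
                current.append('\n').append(line);
            } else {
                current = new StringBuilder(line);
            }
            if (togglesQuoteState(line)) {
                insideQuotes = !insideQuotes;
            }
            if (!insideQuotes) {
                records.add(current.toString());
            }
        }
        return records;
    }
}
```

For example, the physical lines `a,"hello`, `world",b`, and `x,y` come back as the two logical records `a,"hello\nworld",b` and `x,y`.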

Another solution to this problem is the Apache Crunch CSV reader, which can be plugged in the same way as the FileCleaningInputFormat implementation above.
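Assuming Crunch is on your classpath, its CSV input format (the class and package names here are from memory; verify them against your Crunch version) can be dropped into the same `newAPIHadoopFile` call:

```java
import org.apache.crunch.io.text.csv.CSVInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Same pattern as before, only the input format class changes;
// Crunch's reader also keeps quoted multiline fields in one record
JavaPairRDD<LongWritable, Text> rdd = context.newAPIHadoopFile(
        "/path/to/file.csv",
        CSVInputFormat.class,
        LongWritable.class,
        Text.class,
        new Configuration());

JavaRDD<String> records = rdd.map(s -> s._2().toString());
```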
