Spark solution for multiline CSV with EOLs in a text column

Multi-line support for CSV will be added in Spark version 2.2 (see the JIRA). For now, if you are facing issues while processing CSV files whose text columns contain line breaks, you can try the steps below:

Copy the InputFormat and record reader classes from git into your code base and use them as follows:
Java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

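// implementation

// "context" is an existing JavaSparkContext and "path" is the CSV location.
// FileCleaningInputFormat is the class copied from git; it is assumed here
// to produce <LongWritable, Text> records, so the key and value classes are
// passed explicitly instead of null.
JavaPairRDD<LongWritable, Text> rdd =
    context.newAPIHadoopFile(path, FileCleaningInputFormat.class,
        LongWritable.class, Text.class, new Configuration());
// Convert each Text value to a String, one element per logical CSV record.
JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());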
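
The InputFormat and reader classes themselves are not shown in this post. As a rough illustration of the kind of logic a multiline-aware reader needs, here is a minimal plain-Java sketch that merges physical lines into logical CSV records by tracking whether a double-quoted field is still open. The class name MultilineCsvJoiner and the method joinRecords are hypothetical and are not taken from the git classes; the sketch also assumes fields are quoted with double quotes and that embedded quotes are escaped by doubling them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark or of the classes referenced above.
class MultilineCsvJoiner {

    // Counts double-quote characters in a line. An escaped quote ("") adds
    // two to the count, so it does not change the parity.
    private static int countQuotes(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // Merges physical lines into logical records. A record stays open while
    // an odd number of quotes has been seen, meaning a quoted field has not
    // been closed yet and the line break belongs to the field's text.
    static List<String> joinRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : physicalLines) {
            if (insideQuotes) {
                // Restore the EOL that was part of the quoted text.
                current.append('\n');
            }
            current.append(line);
            if (countQuotes(line) % 2 != 0) {
                insideQuotes = !insideQuotes;
            }
            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (insideQuotes) {
            // Unterminated quoted field at end of input: keep what was read.
            records.add(current.toString());
        }
        return records;
    }
}
```

For example, passing the four physical lines id,text / 1,"hello / world" / 2,plain to joinRecords yields three records, with the second and third lines joined into one record that still contains the line break. A real Hadoop record reader would also have to deal with input split boundaries, which this sketch ignores.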
