Spark in Java code samples

Writing Spark code in Java can be difficult: most samples are written in Scala, and converting Scala to Java is often not straightforward. In this blog, I will share the most common Spark operations in Java. Spark init: SparkConf conf = new SparkConf(); conf.setMaster("local[2]"); conf.setAppName("Test"); SparkSession session =…

How to save Spark RDD output in a single file with a header using Java

The code snippet below shows how to save RDD output into a single file with a header: SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]"); JavaSparkContext jsc = new JavaSparkContext(conf); JavaRDD headerRDD = jsc.parallelize(Arrays.asList(new String[]{"name,address,city"}), 1); JavaRDD dataRDD=….; //Make sure s.toString and header are in sync dataRDD = dataRDD.map(s -> s.toString()); //Joined RDD JavaRDD joinedRDD = headerRDD.union(dataRDD);…
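
If pulling all the data into one partition is not practical, one alternative is to let Spark write its usual part-* files and merge them afterwards. The following is a minimal sketch in plain Java of that post-processing step; it is not the method from the post above. The class name PartFileMerger, the method mergeWithHeader, and the directory layout are hypothetical, and the sketch assumes the part files are uncompressed UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: merges the part-* files of a text output directory
// into one file that starts with a header line.
public class PartFileMerger {

    // Writes `header`, then the lines of every part-* file found directly in
    // `outputDir` (in file-name order), to `target`.
    // Returns the number of data lines written (the header is not counted).
    public static long mergeWithHeader(Path outputDir, String header, Path target)
            throws IOException {
        // The glob matches names that start with "part-", so marker files
        // such as _SUCCESS and hidden checksum files are skipped.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        // Sort so that part-00000 comes before part-00001, and so on.
        Collections.sort(parts);

        long dataLines = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(header);
            writer.newLine();
            for (Path part : parts) {
                // Stream each part line by line to avoid loading it fully into memory.
                try (BufferedReader reader = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        writer.write(line);
                        writer.newLine();
                        dataLines++;
                    }
                }
            }
        }
        return dataLines;
    }
}
```

A call such as PartFileMerger.mergeWithHeader(Paths.get("out"), "name,address,city", Paths.get("result.csv")) would then produce a single CSV file, where "out" and "result.csv" are placeholder paths. As in the post, the header columns have to match the order of the fields in each data line.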

Spark Dataset Operations in Java

In this post I will demonstrate, step by step, how to set up a Spark project and explore a few basic Spark Dataset operations in Java. Create a Maven project with this POM: <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.ts.spark</groupId> <artifactId>api</artifactId> <version>1.0-SNAPSHOT</version> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version>2.0.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.11</artifactId>…

Apache Beam Spark Runner example using Maven

In this post I will show you how to create an Apache Beam Spark Runner project using Maven. Tools/frameworks used: Java 8, Apache Spark, Maven, IntelliJ, Apache Beam. Add the Cloudera repository in the Maven settings.xml: <repository> <id>cloudera</id> <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url> </repository> Full settings.xml file: <settings></settings><profiles><profile> <id>cld</id> <repositories> <repository> <id>cloudera</id>…