java

How to save Spark RDD output in a single file with a header using Java

The code snippet below shows how to save RDD output in a single file with a header:

```java
SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Header as a single-partition RDD so it comes first in the union
JavaRDD<String> headerRDD =
        jsc.parallelize(Arrays.asList(new String[]{"name,address,city"}), 1);

JavaRDD dataRDD = …;

// Make sure s.toString() and the header columns are in sync
dataRDD = dataRDD.map(s -> s.toString());

// Joined RDD
JavaRDD joinedRDD = headerRDD.union(dataRDD);
…
```
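Putting the pieces together, a minimal end-to-end sketch might look like the following. The sample rows and the output path are hypothetical stand-ins for the real `dataRDD`; `coalesce(1)` (without shuffle) merges partitions in order, so the header partition stays first and Spark writes a single part file. Running it requires Spark on the classpath.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SingleFileWithHeader {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Header in its own single partition so it sorts first in the union
        JavaRDD<String> headerRDD =
                jsc.parallelize(Arrays.asList("name,address,city"), 1);

        // Hypothetical sample data standing in for the real dataRDD
        JavaRDD<String> dataRDD = jsc.parallelize(Arrays.asList(
                "john,1 main st,springfield",
                "jane,2 oak ave,shelbyville"));

        // union keeps headerRDD's partition first; coalesce(1) forces one output file
        headerRDD.union(dataRDD)
                .coalesce(1)
                .saveAsTextFile("output/csv-with-header"); // hypothetical path

        jsc.stop();
    }
}
```

Note that `coalesce(1)` pulls all data through a single task, so this pattern is only practical for outputs small enough to fit comfortably on one executor.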

Spark Dataset operations in Java

In this post I will demonstrate a step-by-step setup of a Spark project and explore a few basic Spark Dataset operations in Java. Create a Maven project with this POM:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ts.spark</groupId>
    <artifactId>api</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
…
```
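For reference, a complete minimal POM along these lines might look like the sketch below. The `spark-sql` version is an assumption (matched to the `spark-core` version `2.0.0` shown above), and the compiler-plugin section is my addition to pin Java 8, not part of the original excerpt.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ts.spark</groupId>
    <artifactId>api</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.0</version> <!-- assumed to match spark-core -->
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Added here to pin Java 8; not shown in the original excerpt -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
```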

Apache Beam Spark Runner example using Maven

In this post I will show you how to create an Apache Beam Spark Runner project using Maven. Tools/frameworks used: Java 8, Apache Spark, Maven, IntelliJ, Apache Beam. Add the Cloudera repository in Maven's settings.xml:

```xml
<repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
```

Full settings.xml file:

```xml
<settings>
    <profiles>
        <profile>
            <id>cld</id>
            <repositories>
                <repository>
                    <id>cloudera</id>
…
```
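A complete settings.xml built around the `cld` profile above could be sketched as follows. The `<activeProfiles>` section is my assumption about how the profile gets enabled; the excerpt only shows the profile definition itself.

```xml
<settings>
    <profiles>
        <profile>
            <id>cld</id>
            <repositories>
                <repository>
                    <id>cloudera</id>
                    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
                </repository>
            </repositories>
        </profile>
    </profiles>

    <!-- Assumed: activate the profile by default so the repository is always visible -->
    <activeProfiles>
        <activeProfile>cld</activeProfile>
    </activeProfiles>
</settings>
```

Alternatively, the profile can be enabled per invocation with `mvn -P cld …` instead of listing it under `<activeProfiles>`.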

Debugging custom libraries in Hive: update logging to console

When launching the Hive CLI, set the root logger (or your library's logger) to DEBUG or INFO and print to the console:

```shell
hive --hiveconf hive.root.logger=<INFO|DEBUG>,console
```

To check MapReduce job logs, open http://jobtracker:<job tracker port e.g. 50030>/jobdetails.jsp?jobid=<job id>, then go to map or reduce…