Spark Dataset Operations in Java

In this post I walk through setting up a Spark project step by step and explore a few basic Spark Dataset operations in Java.

Create a Maven project with the following POM:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ts.spark</groupId>
    <artifactId>api</artifactId>
    <version>1.0-SNAPSHOT</version>
 
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Project structure:

Create the bean definition:

import java.io.Serializable;

// JavaBean that represents one record of people.json
public class People implements Serializable {
    private String name;
    private Long age;

    // No-argument constructor, required for a JavaBean
    public People() {
    }

    public People(String name, Long age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Long getAge() {
        return age;
    }

    public void setAge(Long age) {
        this.age = age;
    }
}
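The bean needs a getter and a setter for every field because, as far as I understand it, Spark's bean encoder builds its schema from the JavaBean properties rather than from the private fields. The sketch below is not Spark code: it uses the JDK's own `java.beans.Introspector` to list the properties such a bean exposes. The class name `BeanPropertiesDemo` and the helper `describe` are made up for this illustration, and the nested `People` copy is only there so the snippet compiles on its own.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.Map;
import java.util.TreeMap;

public class BeanPropertiesDemo {

    // Same shape as the People bean above, nested here so the sketch is self-contained.
    public static class People implements Serializable {
        private String name;
        private Long age;

        public People() { }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Long getAge() { return age; }
        public void setAge(Long age) { this.age = age; }
    }

    // Hypothetical helper: returns property name -> simple type name, sorted by property name.
    public static Map<String, String> describe(Class<?> beanClass) {
        try {
            // Stop at Object so the inherited "class" property is not reported.
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            Map<String, String> properties = new TreeMap<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                // Keep only properties that have both a getter and a setter.
                if (pd.getReadMethod() != null && pd.getWriteMethod() != null) {
                    properties.put(pd.getName(), pd.getPropertyType().getSimpleName());
                }
            }
            return properties;
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + beanClass, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(People.class)); // prints {age=Long, name=String}
    }
}
```

If a setter is removed from the bean, the matching property drops out of this list, which is a quick way to spot a field that would not round-trip as a bean property.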

Create a people.json file in the resources directory (src/main/resources in a standard Maven layout), with one JSON object per line:
{"name":"PhilipHester","age":88}
{"name":"DylanBecker","age":64}
{"name":"JohnCarpenter","age":60}
{"name":"DeclanBarton","age":64}
{"name":"KennedySutton","age":91}
{"name":"DolanRowland","age":96}
{"name":"JonahWhitaker","age":41}

Filter the contents of the dataset:

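import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class Application {
    public static void main(String[] args) {
        // master("local[*]") is an assumption so the example runs from an IDE;
        // drop it when launching with spark-submit, which supplies the master.
        SparkSession session = SparkSession.builder()
                .appName("dataset example")
                .master("local[*]")
                .getOrCreate();

        // Encoder that converts People beans to and from Spark's internal binary format
        Encoder<People> encoder = Encoders.bean(People.class);

        // Resolve people.json from the classpath and load it as a typed Dataset
        String path = Thread.currentThread().getContextClassLoader()
                .getResource("people.json").getPath();
        Dataset<People> ds = session.read().json(path).as(encoder);

        // Keep only the people older than 30 and print them
        ds.filter((FilterFunction<People>) p -> p.getAge() > 30).show();

        session.stop();
    }
}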
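The condition passed to filter is ordinary Java, so it can be sanity-checked against the sample data without starting a Spark session. The sketch below does that with a plain list. `FilterPredicateCheck`, `Person`, `sample` and `namesOlderThan` are names invented for this illustration and are not part of the Spark API; the rows are copied by hand from people.json.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FilterPredicateCheck {

    // Minimal stand-in for the People bean, kept local so the sketch is self-contained.
    public static class Person {
        final String name;
        final long age;

        Person(String name, long age) {
            this.name = name;
            this.age = age;
        }
    }

    // The seven rows of people.json, copied by hand.
    public static List<Person> sample() {
        return Arrays.asList(
                new Person("PhilipHester", 88),
                new Person("DylanBecker", 64),
                new Person("JohnCarpenter", 60),
                new Person("DeclanBarton", 64),
                new Person("KennedySutton", 91),
                new Person("DolanRowland", 96),
                new Person("JonahWhitaker", 41));
    }

    // Returns the names of the people whose age is strictly greater than minAge.
    public static List<String> namesOlderThan(List<Person> people, long minAge) {
        Predicate<Person> older = p -> p.age > minAge;
        List<String> names = new ArrayList<>();
        for (Person p : people) {
            if (older.test(p)) {
                names.add(p.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesOlderThan(sample(), 30).size()); // prints 7
        System.out.println(namesOlderThan(sample(), 64));
        // prints [PhilipHester, KennedySutton, DolanRowland]
    }
}
```

Every person in the sample file is older than 30, so the filter above keeps all seven rows. Raising the threshold to 64 makes the effect of the filter visible.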
