Resilient Distributed Data Sets (RDDs)
DataFrames
SQL Tables/Views
Datasets

An RDD is characterized by the following properties:

- A list of partitions

Specification of custom partitions may provide significant performance improvements when using key-value RDDs
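
As a minimal sketch of custom partitioning, the following assigns a hand-written partitioner to a key-value RDD. The class name FirstLetterPartitioner, the sample pairs, and the choice of two partitions are assumptions made for illustration only; Partitioner, partitionBy, and glom are part of the Spark RDD API.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

// Hypothetical partitioner: routes each key by the character code of its first letter.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    key.toString.headOption.map(_.toInt % numPartitions).getOrElse(0)
}

// In spark-shell a session named spark already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()

// Illustrative key-value pairs (item, quantity).
val pairs = spark.sparkContext.parallelize(
  Seq(("bolt", 45), ("drill", 1), ("screw", 3), ("bolt", 5)))

// Redistribute the pairs so that all records with the same key share a partition.
val partitioned = pairs.partitionBy(new FirstLetterPartitioner(2))

// Collect the set of keys held by each partition to inspect the layout.
val keysPerPartition = partitioned.glom().map(_.map(_._1).toSet).collect()
```

Because every record with a given key lands in one known partition, later per-key operations on the partitioned RDD may be able to avoid an additional shuffle.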

Resilient Distributed Data Sets (RDDs)

val rdset = spark.sparkContext.parallelize(strings);
rdset: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at :26

Listing RDD

Creating RDD

val lines = spark.sparkContext.textFile("sales.txt")
lines: org.apache.spark.rdd.RDD[String] = sales.txt MapPartitionsRDD[1] at textFile at :24

Listing RDD

lines.collect()
res0: Array[String] = Array(bolt 45, bolt 5, drill 1, drill 1, screw 1, screw 2, screw 3)
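
One possible next step with such an RDD is to turn each line into a key-value pair and aggregate by key. The sketch below is an assumption about how the records might be processed, not something shown on the slides; it inlines the seven records listed above so that it does not depend on sales.txt.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named spark already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()

// Same records as in the listing above, inlined instead of read from a file.
val lines = spark.sparkContext.parallelize(
  Seq("bolt 45", "bolt 5", "drill 1", "drill 1", "screw 1", "screw 2", "screw 3"))

// Split each line into an (item, quantity) pair.
val pairs = lines.map { line =>
  val Array(item, qty) = line.split(" ")
  (item, qty.toInt)
}

// Sum the quantities per item and bring the result back to the driver.
val totals = pairs.reduceByKey(_ + _).collect().toMap
```

Here map is a transformation that is evaluated lazily, while collect is the action that triggers the computation.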

Introduction to Spark
Outline

Resilient Distributed Data Sets (RDDs)
DataFrames
SQL Tables/Views
Datasets

DataFrames

A DataFrame can be created in the following way

dataFrame.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
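
The creation step for the DataFrame displayed above is not shown. One way to obtain a DataFrame with the same rows, assuming the data is built in memory rather than loaded from a file, is sketched here; the variable rows and the inlined values are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named spark already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
import spark.implicits._

// Hypothetical in-memory rows; None becomes null in the age column.
val rows: Seq[(Option[Long], String)] =
  Seq((None, "Michael"), (Some(30L), "Andy"), (Some(19L), "Justin"))

// Name the columns while converting the local collection to a DataFrame.
val dataFrame = rows.toDF("age", "name")

dataFrame.show()         // prints the rows as a text table
dataFrame.printSchema()  // prints the column names and types
```

A DataFrame can also be read from external storage, for example with spark.read.json, in which case the schema is inferred from the data.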

SQL Tables/Views

CREATE VIEW

CREATE TEMP VIEW just_usa_global AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'
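
The statement above can be submitted from Scala through spark.sql. The sketch below assumes a tiny, made-up flights table registered as a temporary view, and uses CREATE OR REPLACE so that it can be rerun in the same session; the three sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named spark already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("view-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the flights table.
val flights = Seq(
  ("United States", "Romania", 15L),
  ("United States", "Croatia", 1L),
  ("Egypt", "United States", 15L)
).toDF("dest_country_name", "origin_country_name", "count")
flights.createOrReplaceTempView("flights")

// Define the temporary view; it exists only for the current session.
spark.sql("""
  CREATE OR REPLACE TEMP VIEW just_usa_global AS
  SELECT *
  FROM flights
  WHERE dest_country_name = 'United States'
""")

// A view is queried like any other table.
val usaFlights = spark.table("just_usa_global")
```

The view stores no data of its own; the filter is applied to flights each time the view is queried.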

Spark SQL includes three core complex types: structs, lists, and maps; collect_list and collect_set both aggregate values into lists, the latter without duplicates

CREATE VIEW

SELECT DEST_COUNTRY_NAME as new_name,
collect_list(count) as flight_counts,
collect_set(ORIGIN_COUNTRY_NAME) as origin_set
FROM flights
GROUP BY DEST_COUNTRY_NAME
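
To see what this aggregation returns, the query can be run against a small table. The sketch below assumes three made-up flights rows registered as a temporary view; only the query text comes from the slides.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named spark already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("collect-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: one destination, a repeated origin.
Seq(
  ("United States", "Romania", 15L),
  ("United States", "Romania", 3L),
  ("United States", "Croatia", 1L)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
  .createOrReplaceTempView("flights")

val aggregated = spark.sql("""
  SELECT DEST_COUNTRY_NAME as new_name,
         collect_list(count) as flight_counts,
         collect_set(ORIGIN_COUNTRY_NAME) as origin_set
  FROM flights
  GROUP BY DEST_COUNTRY_NAME
""")

// Both aggregates come back as array columns; element order is not guaranteed.
val row = aggregated.collect()(0)
val flightCounts = row.getAs[Seq[Long]]("flight_counts")
val originSet = row.getAs[Seq[String]]("origin_set")
```

With this data collect_list keeps all three counts, while collect_set keeps each origin only once.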

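- Managed table is a table for which Spark manages both the data and the metadata; it is equivalent to an internal table in Hive

- Unmanaged table is a table for which Spark manages only the metadata while the data stays in external files; it is equivalent to an external table in Hive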

DEST_COUNTRY_NAME STRING,

ORIGIN_COUNTRY_NAME STRING,

Datasets

A Dataset can be defined, created, and used in the following way

Using a Dataset

caseClassDS.select($"name").show()

Results

+-----+
| name|
+-----+
|James|
+-----+
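
The definition and creation steps for caseClassDS are not shown. A minimal sketch, assuming a case class named Person with a name and an age field (both the class name and the age value are hypothetical; only the name column appears in the result above), could look as follows.

```scala
import org.apache.spark.sql.SparkSession

// Defining: a hypothetical record type that gives the Dataset its schema.
// The case class must be declared outside of any method so Spark can derive an encoder.
case class Person(name: String, age: Long)

// In spark-shell a session named spark already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

// Creating: convert a local collection of Person objects to a Dataset[Person].
val caseClassDS = Seq(Person("James", 32L)).toDS()

// Using: untyped column selection, as on the slide.
caseClassDS.select($"name").show()

// Using: typed access to fields, checked at compile time.
val names = caseClassDS.map(_.name).collect()
```

Unlike a DataFrame, whose rows are generic Row objects, a Dataset keeps the Person type, so a lambda such as _.name is checked by the compiler.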