Resilient Distributed Datasets (RDDs)
DataFrames
SQL Tables/Views
Datasets
An RDD is characterized by the following properties:
- A list of partitions
Specifying custom partitioning can significantly improve performance
when using key-value RDDs
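The partitioning remark above can be sketched as follows. This is a minimal illustration meant to be pasted into spark-shell, not a tuned setup: the built-in HashPartitioner stands in for a custom partitioner, and the pair data, the partition count of 4, and the names sales, partitioned, and totals are assumptions made for the example.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Reuses the shell's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("partitioning-sketch").getOrCreate()

// Hypothetical (product, quantity) pairs, used only for illustration.
val sales = spark.sparkContext.parallelize(
  Seq(("bolt", 45), ("bolt", 5), ("drill", 1), ("screw", 2)))

// Hash-partition by key into 4 partitions and cache the result so that
// later key-based operations can reuse the same layout.
val partitioned = sales.partitionBy(new HashPartitioner(4)).cache()

// reduceByKey on an already partitioned RDD keeps its partitioner,
// so this step does not need to move records between partitions again.
val totals = partitioned.reduceByKey(_ + _)
```

The gain, where there is one, comes from repeated key-based operations (joins, aggregations) on the same cached RDD, since records with the same key already sit in the same partition.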
Resilient Distributed Datasets (RDDs)
val rdset = spark.sparkContext.parallelize(strings);
rdset: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:26

Creating RDD
val lines = spark.sparkContext.textFile("sales.txt")
lines: org.apache.spark.rdd.RDD[String] = sales.txt MapPartitionsRDD[1] at textFile at <console>:24

Listing RDD
lines.collect()
res0: Array[String] = Array(bolt 45, bolt 5, drill 1, drill 1, screw 1, screw 2, screw 3)
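As a possible next step on this RDD, the sketch below totals the quantity per product with a pair RDD. It is an illustration under assumptions, not part of the original listing: the records are inlined so that no sales.txt file is needed, and the names pairs and counts are chosen for the example.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the shell's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()

// Same records as the sales.txt listing, inlined so the sketch needs no file.
val lines = spark.sparkContext.parallelize(
  Seq("bolt 45", "bolt 5", "drill 1", "drill 1", "screw 1", "screw 2", "screw 3"))

// Split each line into a (product, quantity) pair.
val pairs = lines.map { line =>
  val Array(product, quantity) = line.split(" ")
  (product, quantity.toInt)
}

// Sum the quantities per product.
val counts = pairs.reduceByKey(_ + _)

counts.collect().sorted.foreach(println)
// (bolt,50)
// (drill,2)
// (screw,6)
```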
Introduction to Spark
DataFrames
A DataFrame can be created in the following way
dataFrame.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
root
|-- age: long (nullable = true)
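The creation step itself is not shown in the listing above. One way to obtain a DataFrame with this content and schema is sketched below, assuming JSON input; the three records are hypothetical and were chosen only to match the output shown, and the name records is an assumption.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the shell's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
import spark.implicits._

// Hypothetical JSON records chosen to match the output shown above.
val records = Seq(
  """{"name":"Michael"}""",
  """{"name":"Andy","age":30}""",
  """{"name":"Justin","age":19}""").toDS()

// Spark infers the schema (age: long, name: string) from the JSON records;
// a record without an age field yields null in that column.
val dataFrame = spark.read.json(records)

dataFrame.show()
dataFrame.printSchema()
```

Reading from a file works the same way with spark.read.json and a path instead of an in-memory Dataset of strings.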
SQL Tables/Views
CREATE VIEW
CREATE TEMP VIEW just_usa_global AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'
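From Scala, the same view can be created and queried through spark.sql. The sketch below is an illustration under assumptions: the three-row flights DataFrame is a hypothetical stand-in for the real table, and OR REPLACE is added only so that the snippet can be re-run in the same session.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the shell's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("view-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the flights table.
val flights = Seq(
  ("United States", "Romania", 15L),
  ("United States", "Ireland", 344L),
  ("Egypt", "United States", 24L)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
flights.createOrReplaceTempView("flights")

// Same definition as in the SQL listing; OR REPLACE makes the sketch re-runnable.
spark.sql("""
  CREATE OR REPLACE TEMP VIEW just_usa_global AS
  SELECT *
  FROM flights
  WHERE dest_country_name = 'United States'
""")

// A view is queried like a table; it is evaluated when the query runs.
val usa = spark.sql("SELECT * FROM just_usa_global")
usa.show()
```

A temporary view exists only for the lifetime of the current session.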
Spark SQL includes three core complex types: structs, lists, and maps. The aggregates collect_list and collect_set both build lists; collect_set drops duplicate values
CREATE VIEW
SELECT DEST_COUNTRY_NAME as new_name,
collect_list(count) as flight_counts,
collect_set(ORIGIN_COUNTRY_NAME) as origin_set
FROM flights
GROUP BY DEST_COUNTRY_NAME
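The same aggregation can be written with the DataFrame API. The sketch below is a minimal example under assumptions: the flights rows are hypothetical and the name grouped is chosen for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

// Reuses the shell's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("complex-types-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the flights table.
val flights = Seq(
  ("United States", "Romania", 15L),
  ("United States", "Romania", 20L),
  ("United States", "Ireland", 344L)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

// One row per destination: all counts as a list, distinct origins as a set.
val grouped = flights
  .groupBy($"DEST_COUNTRY_NAME".as("new_name"))
  .agg(
    collect_list($"count").as("flight_counts"),
    collect_set($"ORIGIN_COUNTRY_NAME").as("origin_set"))

grouped.show(truncate = false)
```

Both result columns have an array type; the order of elements inside them is not guaranteed.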
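- A managed table is a table for which Spark manages both the data and the metadata; it is equivalent to an internal table in Hive
- An unmanaged table is a table for which Spark manages only the metadata, while the data stays at an external location; it is equivalent to an external table in Hive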
DEST_COUNTRY_NAME STRING,
ORIGIN_COUNTRY_NAME STRING,
Datasets
A Dataset can be defined, created, and used in the following way

Using a Dataset
caseClassDS.select($"name").show()

Results
+-----+
| name|
+-----+
|James|
+-----+
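The definition and creation steps are not shown in the listing above. A minimal sketch that would produce the same result follows; the case class Person, its age field, and the value 32 are assumptions, since only the name column appears in the output.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; only the name field appears in the output above.
case class Person(name: String, age: Long)

// Reuses the shell's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

// Creating a Dataset from a local collection of case class instances.
val caseClassDS = Seq(Person("James", 32)).toDS()

// Untyped, column-based access, as in the listing above.
caseClassDS.select($"name").show()

// Typed access: the lambda receives Person objects, checked at compile time.
val names = caseClassDS.map(_.name).collect()
```

In a compiled application rather than spark-shell, the case class has to be declared outside the method that uses it so that Spark can derive its encoder.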