Advanced
Apache Spark Introduction
An approachable first look at distributed data processing with Spark.
What Is Spark?
Apache Spark is an open-source engine for large-scale data processing. It distributes work across many machines, so it can handle datasets that are too large to store or process efficiently on a single computer.
Why Distributed Processing?
When data grows beyond what one machine can handle in reasonable time, splitting the work across a cluster of machines lets each one process a portion in parallel, dramatically reducing total processing time.
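The split-then-combine idea can be sketched on a single machine with the Python standard library. This is only an analogy, not Spark code: worker threads stand in for cluster machines, and the names `split_into_chunks`, `parallel_sum`, and `order_amounts` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, num_chunks):
    """Split a list into at most num_chunks contiguous pieces of similar size."""
    chunk_size = -(-len(values) // num_chunks)  # ceiling division
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def parallel_sum(values, num_workers=4):
    """Sum each chunk in its own worker, then combine the partial sums."""
    if not values:
        return 0
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker only ever sees its own portion of the data
        partial_sums = list(pool.map(sum, chunks))
    # A final, cheap step merges the per-chunk results
    return sum(partial_sums)


# Hypothetical data: 1,000 order amounts
order_amounts = list(range(1, 1001))
total = parallel_sum(order_amounts)
print(total)
```

On a real cluster each chunk would live on a different machine, so the speedup comes from those machines working at the same time. Python threads here only illustrate the structure of the computation.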
A Simple Spark Example
PYTHON
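from pyspark.sql import SparkSession

# Start a Spark session, or reuse one that is already running
spark = SparkSession.builder.appName('OrdersExample').getOrCreate()

# Load the CSV, using the first row as column names and inferring column types
df = spark.read.csv('orders.csv', header=True, inferSchema=True)

# Total the order amounts for each customer
summary = (
    df.groupBy('customer_id')
    .sum('order_amount')
    .withColumnRenamed('sum(order_amount)', 'total_revenue')
)

# Print the result (show() displays up to 20 rows by default)
summary.show()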
This looks similar to a pandas script, but Spark can transparently scale the same logic across a cluster of machines as your data grows.
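To make the scaling idea more concrete, here is a conceptual sketch in plain Python of how a per-customer total can be computed when the rows are spread across several partitions. It is not Spark's actual implementation. The sample rows and the helper names `partial_totals` and `merge_totals` are assumptions made up for this illustration.

```python
# Hypothetical (customer_id, order_amount) rows, already split into two partitions
partitions = [
    [("c1", 20.0), ("c2", 5.0), ("c1", 10.0)],
    [("c2", 15.0), ("c3", 7.5)],
]


def partial_totals(rows):
    """Total the order amounts per customer within a single partition."""
    totals = {}
    for customer_id, amount in rows:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals


def merge_totals(partials):
    """Combine the per-partition totals into one overall result."""
    merged = {}
    for partial in partials:
        for customer_id, amount in partial.items():
            merged[customer_id] = merged.get(customer_id, 0.0) + amount
    return merged


# Each partition could be processed on a different machine before merging
total_revenue = merge_totals(partial_totals(rows) for rows in partitions)
print(total_revenue)
```

Because each partition is summarized independently, adding more data mostly means adding more partitions, while the logic itself stays the same.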