Advanced

Apache Spark Introduction

An approachable first look at distributed data processing with Spark.

What Is Spark?

Apache Spark is an open-source engine for large-scale data processing. It distributes work across many machines, so it can handle datasets that are too large to fit on a single computer or too slow to process on one.

Why Distributed Processing?

When data grows beyond what one machine can handle in a reasonable time, the work can be split across a cluster of machines. Each machine processes its own portion of the data in parallel, and the partial results are then combined, which can cut total processing time substantially.
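The split, process, and combine idea can be sketched in plain Python. This is only an illustration of the pattern, not how Spark is implemented: threads stand in for the machines of a cluster, and the sample data and the helper names (`split`, `partial_totals`, `combine`) are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: (customer_id, order_amount) pairs.
orders = [
    ("c1", 10.0), ("c2", 5.5), ("c1", 4.5),
    ("c3", 2.0), ("c2", 1.5), ("c3", 8.0),
]

def split(rows, n_parts):
    """Divide rows into n_parts chunks, dealing them out round-robin."""
    return [rows[i::n_parts] for i in range(n_parts)]

def partial_totals(chunk):
    """Work done by one worker: total amount per customer within its chunk."""
    totals = Counter()
    for customer_id, amount in chunk:
        totals[customer_id] += amount
    return totals

def combine(partials):
    """Merge the per-worker totals into one final result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values key by key
    return dict(merged)

chunks = split(orders, 3)

# Each chunk is handled by a separate worker; threads play the role of machines here.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_totals, chunks))

totals = combine(partials)
print(totals)
```

No worker needs to see the whole dataset: each one only reads its own chunk, and only the small partial totals are brought together at the end.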

A Simple Spark Example

PYTHON
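from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session, or reuse one that is already running.
spark = SparkSession.builder.appName('OrdersExample').getOrCreate()

# Load the orders file; inferSchema makes Spark scan the data to guess column types.
df = spark.read.csv('orders.csv', header=True, inferSchema=True)

# Total order amount per customer, named directly with alias().
summary = (
    df.groupBy('customer_id')
      .agg(F.sum('order_amount').alias('total_revenue'))
)

# Print the first rows of the result (show() displays 20 rows by default).
summary.show()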

This looks similar to a pandas script. The difference is that the same code runs whether Spark is on a single machine or a cluster, so the logic does not need to be rewritten as your data grows.
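For comparison, a pandas version of the same per-customer aggregation might look like the sketch below. The inline CSV text is made-up sample data standing in for `orders.csv`, and it assumes the file has `customer_id` and `order_amount` columns as in the Spark example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for orders.csv.
csv_text = """customer_id,order_amount
c1,10.0
c2,5.5
c1,4.5
"""

# pandas loads the whole file into the memory of one machine.
df = pd.read_csv(io.StringIO(csv_text))

# Total order amount per customer, renamed to match the Spark example.
summary = (
    df.groupby("customer_id", as_index=False)["order_amount"]
      .sum()
      .rename(columns={"order_amount": "total_revenue"})
)

print(summary)
```

The shape of the two scripts is nearly the same: read, group, sum, rename. The pandas version works on data held in one machine's memory, while the Spark version expresses the same steps in a form that can be spread over a cluster.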