Crunching 120 Million Flight Records: Optimizing Pandas for Big Data
#Python

LavX Team
2 min read

Discover how to process 22 years of airline data on consumer hardware by benchmarking pure Python, PyPy, pandas optimizations, and PyArrow. Learn memory-efficient strategies for handling datasets too large to load into RAM at once, without distributed systems.

Processing massive datasets on commodity hardware remains a fundamental challenge in data engineering. A recent analysis of 120 million flight records spanning 1987-2008 demonstrates how Python developers can overcome memory constraints while optimizing performance – without resorting to distributed systems.

The Big Data Problem

The task: identify top 5 US airports by average domestic flight delay using 13GB of CSV files (22 years of Department of Transportation data) on a laptop with 16GB RAM. A naive pandas approach fails catastrophically:

import pandas

# Loads all 22 CSV files into a single DataFrame at once -- MemoryError guaranteed!
df = pandas.concat(pandas.read_csv(f'{year}.csv') for year in range(1987, 2009))

Figure: Execution time breakdown shows CSV parsing dominates processing

Memory-Efficient Approaches

Pure Python Solution

Stream processing with Python's csv module keeps peak memory usage to roughly 1MB (the Origin and ArrDelay column names below assume the standard DOT file layout):

import csv

airports = {}  # airport code -> [flight count, total delay]
for year in range(1987, 2009):
    with open(f'{year}.csv', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            airport = row['Origin']
            if row['ArrDelay'] in ('', 'NA'):
                continue  # skip cancelled flights with no recorded delay
            delay = float(row['ArrDelay'])
            if airport not in airports:
                airports[airport] = [1, delay]  # [count, total_delay]
            else:
                airports[airport][0] += 1
                airports[airport][1] += delay
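
A final pass over the accumulator turns the running totals into averages and picks out the worst offenders (a sketch, reusing the airports dict built above):

# Average delay per airport, then the five highest averages
averages = {code: total / count for code, (count, total) in airports.items()}
top5 = sorted(averages, key=averages.get, reverse=True)[:5]
print(top5)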

Result: 7 minutes runtime (CPython), 4m40s on PyPy – viable but slow.

Optimized Pandas

Strategic optimizations cut memory usage by 50%:

  • Column pruning (usecols)
  • Efficient dtypes (category, uint8)
  • PyArrow CSV engine

df = pandas.read_csv(...,
                     engine='pyarrow',
                     dtype={'Origin': 'category', 'Month': 'uint8'})

Result: 2m45s (pandas) → 1m10s (pandas+PyArrow)
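
Putting the three optimizations together, a fuller call might look like the sketch below (column names such as ArrDelay, and the per-year file name, are assumptions about the DOT files rather than details from the original benchmark):

import pandas

# Sketch: read only the columns the analysis needs, with compact dtypes,
# and let PyArrow's multithreaded parser do the heavy lifting
df = pandas.read_csv(
    '2008.csv',                                     # one year per file
    engine='pyarrow',
    usecols=['Origin', 'Month', 'ArrDelay'],        # column pruning
    dtype={'Origin': 'category', 'Month': 'uint8'},
)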

Figure: Runtime comparison across implementations (lower is better)

PyArrow Hybrid Approach

Processing year-by-year avoids monolithic DataFrames:

import functools
import pyarrow.csv

results = []
for year in range(1987, 2009):
    df = pyarrow.csv.read_csv(f'{year}.csv').to_pandas()
    # Yearly partial aggregates: total and count of delays per origin airport
    results.append(df.groupby('Origin')['delay'].agg(['sum', 'count']))

# Combine partial results; fill_value=0 handles airports missing in some years
final = functools.reduce(lambda x, y: x.add(y, fill_value=0), results)
final['mean'] = final['sum'] / final['count']

Result: 37 seconds runtime at roughly 900MB peak RAM – the fastest approach, with minimal memory overhead.
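
From here, the top-5 ranking the task asks for is a short follow-up (a sketch, operating on the final frame computed above):

# Rank airports by average delay and keep the five worst
top5 = final.nlargest(5, 'mean')
print(top5[['mean', 'count']])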

Figure: Memory consumption comparison across approaches (log scale)

Performance Insights

Key discoveries from benchmarking:

  1. PyArrow accelerates I/O: Multithreaded parsing cut CSV load times by 60%
  2. Columnar memory matters: Dictionary-encoded categoricals reduced Origin storage by 95% (illustrated in the sketch after this list)
  3. Chunking beats scaling: Yearly processing with aggregation outperformed parallel multiprocessing (53s)
  4. Hardware bottlenecks: NVMe SSDs enabled threading gains; spinning disks may prefer compression
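
To see where insight 2 comes from, here is a minimal, self-contained illustration (the airport codes and row count are made up; the real dataset has roughly 120 million rows):

import pandas

# A repetitive string column: millions of rows, only a handful of distinct values
origins = pandas.Series(['ORD', 'ATL', 'DFW', 'MKG'] * 1_000_000)

as_object = origins.memory_usage(deep=True)                        # one Python str per row
as_category = origins.astype('category').memory_usage(deep=True)   # integer codes + tiny dictionary

print(f'object:   {as_object / 1e6:.1f} MB')
print(f'category: {as_category / 1e6:.1f} MB')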

Beyond Pandas

While this analysis used pandas and PyArrow, the techniques generalize:

  • Disk formats: Parquet/ORC could reduce I/O further (see the sketch after this list)
  • Compute engines: Polars/Vaex offer alternative optimizations
  • JIT compilation: Numba could accelerate datetime operations
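
As a taste of the first idea, a one-off conversion of the yearly CSVs to Parquet with PyArrow might look like this sketch (file and column names are assumptions; this variant was not part of the benchmark above):

import pyarrow.csv
import pyarrow.parquet

# One-off conversion: Parquet is columnar and compressed, so later runs can
# read just the needed columns instead of re-parsing 13GB of CSV
for year in range(1987, 2009):
    table = pyarrow.csv.read_csv(f'{year}.csv')
    pyarrow.parquet.write_table(table, f'{year}.parquet')

# Subsequent analyses load only the relevant columns
table = pyarrow.parquet.read_table('2008.parquet', columns=['Origin', 'ArrDelay'])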

The winning solution identified Muskegon County Airport (MKG) as the delay leader – travelers should pack extra reading material for its average 6-hour delays.

Source: DataPythonista | Flight data: Harvard Dataverse
