Discover how to process 22 years of airline data on consumer hardware by benchmarking pure Python, PyPy, optimized pandas, and PyArrow. Learn memory-efficient strategies for handling datasets that won't fit in RAM, without distributed systems.
Processing massive datasets on commodity hardware remains a fundamental challenge in data engineering. A recent analysis of 120 million flight records spanning 1987-2008 demonstrates how Python developers can overcome memory constraints while optimizing performance – without resorting to distributed systems.
The Big Data Problem
The task: identify top 5 US airports by average domestic flight delay using 13GB of CSV files (22 years of Department of Transportation data) on a laptop with 16GB RAM. A naive pandas approach fails catastrophically:
import pandas

# MemoryError guaranteed: this tries to hold all 22 years in memory at once
df = pandas.concat(pandas.read_csv(f'{year}.csv') for year in range(1987, 2009))
Figure: Execution time breakdown shows CSV parsing dominates processing
Memory-Efficient Approaches
Pure Python Solution
Stream processing with Python's csv module keeps memory use to roughly 1MB:
import csv

airports = {}  # airport code -> [flight count, total delay]
for year in range(1987, 2009):
    with open(f'{year}.csv') as f:
        for row in csv.DictReader(f):
            # 'DepDelay' and 'Origin' are columns in the DOT on-time data;
            # the original analysis may define the per-flight delay differently
            delay = row['DepDelay']
            if delay in ('', 'NA'):  # skip cancelled or missing records
                continue
            airport = row['Origin']
            if airport not in airports:
                airports[airport] = [1, float(delay)]  # [count, total_delay]
            else:
                airports[airport][0] += 1
                airports[airport][1] += float(delay)
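The snippet stops at the running totals; ranking them to answer the top-5 question could look like this (a minimal sketch reusing the airports dictionary built above):

# Average delay per airport, then the five worst offenders
top5 = sorted(
    ((total / count, airport) for airport, (count, total) in airports.items()),
    reverse=True,
)[:5]
for mean_delay, airport in top5:
    print(airport, round(mean_delay, 1))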
Result: 7 minutes runtime (CPython), 4m40s on PyPy – viable but slow.
Optimized Pandas
Strategic optimizations cut memory usage by 50%:
- Column pruning (usecols)
- Efficient dtypes (category, uint8)
- PyArrow CSV engine
df = pandas.read_csv(...,
                     engine='pyarrow',
                     dtype={'Origin': 'category', 'Month': 'uint8'})
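To see what these options buy in practice, one year's file can be loaded with and without them and the footprints compared. A minimal sketch, assuming the per-year file naming used above and the standard DOT column names ('Origin', 'DepDelay', 'Month'):

import pandas

# Baseline: every column, default dtypes, default parser
baseline = pandas.read_csv('2008.csv')

# Optimized: only the needed columns, compact dtypes, multithreaded PyArrow parser
optimized = pandas.read_csv(
    '2008.csv',
    engine='pyarrow',
    usecols=['Origin', 'DepDelay', 'Month'],
    dtype={'Origin': 'category', 'Month': 'uint8'},
)

for name, df in (('baseline', baseline), ('optimized', optimized)):
    print(name, df.memory_usage(deep=True).sum() // 2**20, 'MiB')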
Result: 2m45s (pandas) → 1m10s (pandas+PyArrow)
Figure: Runtime comparison across implementations (lower is better)
PyArrow Hybrid Approach
Processing the data year by year avoids ever building one monolithic DataFrame:
import functools
import pyarrow.csv

results = []
for year in range(1987, 2009):
    df = pyarrow.csv.read_csv(f'{year}.csv').to_pandas()
    # Yearly aggregates: total delay and flight count per origin airport
    # ('DepDelay' is the departure-delay column in the DOT data)
    results.append(df.groupby('Origin')['DepDelay'].agg(['sum', 'count']))
# Combine partial results; airports absent in a given year count as zero
final = functools.reduce(lambda x, y: x.add(y, fill_value=0), results)
final['mean'] = final['sum'] / final['count']
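From the combined table, the top five airports by average delay can then be read off, for example with:

print(final['mean'].nlargest(5))  # five worst airports by mean delay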
Result: 37 seconds runtime with a 900MB peak RAM footprint – fastest, with minimal memory overhead.
Figure: Memory consumption comparison across approaches (log scale)
Performance Insights
Key discoveries from benchmarking:
- PyArrow accelerates I/O: Multithreaded parsing cut CSV load times by 60%
- Columnar memory matters: Dictionary-encoded categoricals reduced Origin storage by 95%
- Chunking beats scaling: Yearly processing with aggregation outperformed parallel multiprocessing (53s); a sketch of that variant follows this list
- Hardware bottlenecks: NVMe SSDs enabled threading gains; spinning disks may prefer compression
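For reference, the multiprocessing variant mentioned in the third bullet could look like the sketch below; the pool layout and the 'Origin'/'DepDelay' columns are assumptions, not the benchmark's actual code:

import functools
from concurrent.futures import ProcessPoolExecutor

import pandas

def aggregate_year(year):
    # Each worker parses one year's file and returns partial sums and counts
    df = pandas.read_csv(f'{year}.csv', usecols=['Origin', 'DepDelay'])
    return df.groupby('Origin')['DepDelay'].agg(['sum', 'count'])

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(aggregate_year, range(1987, 2009)))
    final = functools.reduce(lambda x, y: x.add(y, fill_value=0), partials)
    print((final['sum'] / final['count']).nlargest(5))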
Beyond Pandas
While this analysis used pandas and PyArrow, the techniques generalize:
- Disk formats: Parquet/ORC could reduce I/O further (sketched after this list)
- Compute engines: Polars/Vaex offer alternative optimizations
- JIT compilation: Numba could accelerate datetime operations
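As a sketch of the first point, converting the per-year CSVs to Parquet once lets later runs skip CSV parsing and read only the needed columns (file names follow the pattern used above; the compression codec is an arbitrary choice):

import pyarrow.csv
import pyarrow.parquet

for year in range(1987, 2009):
    # Parse each CSV once and persist it in a compressed columnar format
    table = pyarrow.csv.read_csv(f'{year}.csv')
    pyarrow.parquet.write_table(table, f'{year}.parquet', compression='zstd')

# Later runs load only the columns they need, far faster than re-parsing CSV
table = pyarrow.parquet.read_table('2008.parquet', columns=['Origin', 'DepDelay'])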
The winning solution identified Muskegon County Airport (MKG) as the delay leader – travelers should pack extra reading material for its average 6-hour delays.
Source: DataPythonista | Flight data: Harvard Dataverse
