For work, I often need to parse relatively large (tens to hundreds of GB) CSV files. These files are too large to fit in memory, but not large enough to justify “big data” tooling like Hadoop, so I prefer to process them on a single machine.

Problem Description

The CSV format is relatively simple:

  • pipe-delimited
  • no quotes
  • no escapes

The above means that one row occupies a single line, and the delimiter never occurs inside a cell value.
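
In other words, each line can be parsed with a plain split. Here is a minimal sketch (the file name is a placeholder); the csv module handles the same dialect too, as long as you tell it not to expect quoting:

    import csv

    # With no quotes and no escapes, str.split is a correct (and fast) parser:
    with open("data.csv") as f:
        for line in f:
            row = line.rstrip("\n").split("|")

    # The csv module parses the same dialect once quoting is switched off:
    with open("data.csv", newline="") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            pass  # process row here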

The tasks I need to solve are, in approximate order of increasing difficulty:

  • Histogram of row size (how many columns each row has)
  • For each column:
    • Number and ratio of non-empty values
    • The maximum, minimum and mean lengths of the values
    • Number of unique values (this is hard)
    • Top 20 most frequent values (this is hard)
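
As a rough illustration of the easier tasks, a single streaming pass can build the row-size histogram plus the per-column counts and length statistics in constant memory. This is a sketch, not the write-up's actual code; the file path and output layout are made up:

    from collections import Counter

    row_sizes = Counter()  # histogram: number of columns -> number of rows
    col_stats = {}         # per-column running aggregates

    with open("data.csv") as f:
        for line in f:
            row = line.rstrip("\n").split("|")
            row_sizes[len(row)] += 1
            for i, value in enumerate(row):
                s = col_stats.setdefault(
                    i, {"rows": 0, "non_empty": 0, "min": None, "max": 0, "len_sum": 0}
                )
                n = len(value)
                s["rows"] += 1
                s["len_sum"] += n
                if value:
                    s["non_empty"] += 1
                if s["min"] is None or n < s["min"]:
                    s["min"] = n
                if n > s["max"]:
                    s["max"] = n

    print(dict(row_sizes))
    for i, s in sorted(col_stats.items()):
        print(i, s["non_empty"], s["non_empty"] / s["rows"],
              s["min"], s["max"], s["len_sum"] / s["rows"])

The last two tasks are what make the problem hard: keeping a collections.Counter per column gives exact unique counts and a most_common(20), but its memory grows with the number of distinct values - which is where the tools in the summary below come in.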

Write-up

For those playing along at home, there’s a big write-up with full code examples here. It takes you on a wild journey through multiprocessing, pipes, threading, line_profiler, pympler and friends. For the lazy, a summary is below.

Summary

  • There’s more than one way to parse CSV - the best method depends on your application
  • CSV parsing is CPU bound, but
  • Splitting CSV files is I/O bound
  • multiprocessing helps work around CPU-bound problems on multi-core machines (see the sketch after this list)
  • threading helps work around I/O-bound problems (also sketched below)
  • pympler is helpful for memory profiling (snippet below)
  • line_profiler is helpful for CPU usage profiling (snippet below)
  • The pipes module is helpful for building shell pipelines within your Python programs (snippet below)
  • Watch this video for an intro to Python profiling
  • Watch this video for an intro to Python 3’s asyncio
  • Russian speakers: watch this video for a good intro to memory and Python
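
To make the summary a bit more concrete, here are a few minimal sketches. They are illustrative rather than the write-up’s actual code; file names, chunk sizes and commands are made up. First, multiprocessing for the CPU-bound side: farm lines out to a pool of worker processes so the parsing cost spreads across cores.

    import multiprocessing

    def parse(line):
        # CPU-bound work goes here; a split stands in for the real parsing
        return len(line.rstrip("\n").split("|"))

    if __name__ == "__main__":
        with multiprocessing.Pool() as pool, open("data.csv") as f:
            # a large chunksize keeps inter-process communication overhead down
            for width in pool.imap_unordered(parse, f, chunksize=10_000):
                pass  # aggregate results here

Shipping lines to the workers one at a time can cost more than the parsing itself, hence the large chunksize.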
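
Next, threading for the I/O-bound side. When splitting a big file into chunks, one thread can read while another writes, so the two kinds of I/O overlap instead of alternating:

    import queue
    import threading

    def split(path, lines_per_chunk=1_000_000):
        q = queue.Queue(maxsize=8)  # bounded, so the reader can't outrun the writer

        def writer():
            n = 0
            while True:
                chunk = q.get()
                if chunk is None:  # sentinel: nothing left to write
                    return
                with open(f"chunk_{n:04d}.csv", "w") as out:
                    out.writelines(chunk)
                n += 1

        t = threading.Thread(target=writer)
        t.start()
        with open(path) as f:
            chunk = []
            for line in f:
                chunk.append(line)
                if len(chunk) == lines_per_chunk:
                    q.put(chunk)
                    chunk = []
            if chunk:
                q.put(chunk)
        q.put(None)
        t.join()

    split("data.csv")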
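
pympler’s asizeof reports the deep size of a Python object, which is exactly what you want once those per-column counters start eating memory:

    from collections import Counter
    from pympler import asizeof

    counts = Counter()
    with open("data.csv") as f:
        for line in f:
            counts[line.split("|", 1)[0]] += 1  # values in the first column

    print(asizeof.asizeof(counts))  # deep size in bytes, keys and values included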
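
line_profiler attributes time to individual lines: decorate the hot function with @profile and run the script under kernprof (the decorator only exists when kernprof injects it):

    # Run under kernprof, which injects the `profile` decorator into builtins:
    #   kernprof -l -v stats.py
    @profile
    def parse_all(path):
        total = 0
        for line in open(path):
            total += len(line.rstrip("\n").split("|"))
        return total

    parse_all("data.csv")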
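
Finally, the pipes module (standard library until its removal in Python 3.13) builds shell pipelines you can read from like a file. A hypothetical pipeline over the first column:

    import pipes

    t = pipes.Template()
    # '--' marks a step that reads stdin and writes stdout
    t.append("cut -d '|' -f 1", "--")  # first column only
    t.append("sort", "--")
    t.append("uniq -c", "--")
    with t.open("data.csv", "r") as f:
        for line in f:
            print(line, end="")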