For work, I often need to parse relatively large (tens to hundreds of GB) CSV files. These files are too large to fit in memory, but not really large enough for “big data” - Hadoop, etc. So, I prefer to process them on a single machine.
The CSV format is relatively simple:
- no quotes
- no escapes
The above means that one row occupies a single line, and the delimiter never occurs inside a cell value.
The tasks I need to solve are, in approximate order of increasing difficulty:
- Histogram of row size (how many columns each row has)
- For each column:
  - Number and ratio of non-empty values
  - The maximum, minimum and mean lengths of the values
  - Number of unique values (this is hard)
  - Top 20 most frequent values (this is hard)
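The easier statistics all fit in a single streaming pass. A hedged sketch (the function and variable names are mine, not from the write-up); the last two tasks are hard precisely because an exact answer needs a counter per column, which can blow up in memory for high-cardinality data:

```python
import collections

def column_stats(rows):
    """One streaming pass over parsed rows: row-size histogram plus
    per-column non-empty counts and value-length min/max/mean."""
    size_hist = collections.Counter()    # row width -> number of rows
    non_empty = collections.Counter()    # column index -> non-empty count
    length_sum = collections.Counter()   # column index -> total value length
    length_min, length_max = {}, {}
    total_rows = 0
    for row in rows:
        total_rows += 1
        size_hist[len(row)] += 1
        for i, value in enumerate(row):
            if value:
                non_empty[i] += 1
                n = len(value)
                length_sum[i] += n
                length_min[i] = min(length_min.get(i, n), n)
                length_max[i] = max(length_max.get(i, n), n)
    mean_len = {i: length_sum[i] / non_empty[i] for i in non_empty}
    ratio = {i: non_empty[i] / total_rows for i in non_empty}
    return size_hist, non_empty, ratio, length_min, length_max, mean_len
```

All of the state here is a handful of small dicts, so this pass is cheap; it is counting distinct values that forces a different approach.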
For those playing along at home, there’s a longer write-up with full code examples here. It takes you on a wild journey through multiprocessing, pipes, threading, line_profiler, pympler and friends. For the lazy, a summary is below.
- More than one way to parse CSV - the best method depends on your application
- CSV parsing is CPU bound, but
- Splitting CSV files is I/O bound
- multiprocessing helps work around CPU-bound problems on multi-core machines
- threading helps work around I/O-bound problems
- pympler is helpful for memory profiling
- line_profiler is helpful for CPU usage profiling
- pipes module is helpful for using pipes within your Python programs
- Watch this video for an intro to Python profiling
- Watch this video for an intro to Python 3’s asyncio
- Russian speakers: watch this video for a good intro to memory and Python