Read Large Text File with Generator

June 16, 2020

Introduction: Use a generator to reduce memory overhead when reading large text files.


This is a quick note on reducing memory overhead when reading large text files. I ran into a problem where I needed to process large text files between a few hundred MB and a few GB in size, and the traditional way of loading the whole file into memory was significantly slowing down the computer. Sometimes the computer became unresponsive while the file was being loaded.

After some searching and experimenting, a Python generator function seems to solve the problem. This post combines the two solutions offered in the following links:

https://www.journaldev.com/32059/read-large-text-files-in-python

https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

Before getting to the solution itself, I need to mention that one of the mental blocks I had was due to the fact that my original source file did not have clean line breaks. I was using a scroll function to retrieve 10,000 Elasticsearch records at a time and simply writing them to another file before uploading it to an S3 bucket.

The files would look something like this:

{{...}{...}{...}{...}{...}{...}{...}{...}{...}{...}}{{...}{...}{...}{...}{...}{...}{...}{...}{...}{...}}{{...}{...}{...}{...}{...}{...}{...}{...}{...}{...}}

The first step was to make the records clearly delimited by line breaks in the original file. This added to the size but makes parsing a lot easier (a minimal sketch of that export step follows the example below):

{...}

{...}

{...}

{...}
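For reference, the export step can be done roughly like this so that each record lands on its own line. This is only a sketch, not my exact script: the host, index name, query, and output path are placeholders, and it assumes the official elasticsearch Python client, whose scan helper wraps the scroll API:

#!/usr/bin/env python

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch('http://localhost:9200')  # placeholder host

with open('largefile.txt', 'w') as out:
    # scan() pages through all matching records using the scroll API
    for hit in scan(es, index='my-index', query={'query': {'match_all': {}}}, size=10000):
        # write one JSON record per line so the file is easy to parse later
        out.write(json.dumps(hit['_source']) + '\n')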

The first script uses the usual with open() statement to open the file and loops over the lines:

#!/usr/bin/env python

import os
import resource

filename = 'largefile.txt'

print(f'File Size is {os.stat(filename).st_size / (1024 * 1024)} MB')

line_count = 0

with open(filename, 'r') as f:
    for line in f:
        line_count += 1

print(f'line count: {line_count}')
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

The second script uses the yield keyword to turn the read loop into a generator, returning each line for processing before loading the next one:

#!/usr/bin/env python

import os
import resource

filename = 'largefile.txt'

print(f'File Size is {os.stat(filename).st_size / (1024 * 1024)} MB')

def read_large_file(filename):
    # Generator function: yields one line at a time instead of
    # holding the whole file in memory
    with open(filename, 'r') as f:
        for line in f:
            yield line

if __name__ == "__main__":
    # Iterate over the generator so every line is actually read;
    # calling read_large_file() by itself only creates the generator object
    for line in read_large_file(filename):
        pass
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
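In practice, the loop body would do the real work on each yielded line. Since each line in my file is one JSON record, something like the following would fit; this is only a hypothetical sketch, and process_record is a placeholder, not part of the original scripts:

import json

def process_record(record):
    # placeholder for whatever per-record work is needed
    return len(record)

for line in read_large_file('largefile.txt'):
    record = json.loads(line)  # each line holds one JSON record
    process_record(record)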

Here are the performance results. The memory numbers are the maximum resident set size (ru_maxrss) in kilobytes (https://manpages.debian.org/buster/manpages-dev/getrusage.2.en.html):

$ python read_attemp_1.py
File Size is 213.77720069885254 MB
line count: 159020
7364608

$ python read_attemp_2.py
File Size is 213.77720069885254 MB
6709248

As the Stack Overflow link above shows, we can also use f.read(chunk_size) to load a chunk of the file into a buffer first if the file is not structured with line breaks.
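For completeness, here is a minimal sketch of that chunked approach along the lines of the Stack Overflow answer; the 1 MB chunk size is arbitrary:

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # yield fixed-size chunks until the end of the file is reached
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('largefile.txt', 'r') as f:
    for chunk in read_in_chunks(f):
        pass  # process each chunk here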

I hope this offers some value to people who need to process large text files. I know I will come back to this post from time to time when the need arises.

Happy coding,

Eric
