Don’t Slurp: How to Read Files in Python

j_baker · on Sept 27, 2013

I like that you can slurp a file in two lines of Python. And for someone just learning Python, the author's solution is just unnecessarily complicated. How many people learning Python are going to have a need to optimize file reading to this level?

And besides that, the author's solution is just good for one situation: when you want to read something line-by-line, which isn't always the case. For binary files, you may want to do something like this (untested code):

    data = infile.read(256)
    while data:
        do_something(data)
        data = infile.read(256)
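
For what it's worth, a self-contained version of that loop might look like the sketch below. `read_chunks` is a made-up helper name, and the `io.BytesIO` object only stands in for a file opened in binary mode.

```python
import io
from functools import partial

def read_chunks(infile, chunk_size=256):
    """Yield successive chunks until read() returns an empty bytes object."""
    # iter(callable, sentinel) keeps calling infile.read(chunk_size)
    # and stops once it returns b"" (end of file).
    for chunk in iter(partial(infile.read, chunk_size), b""):
        yield chunk

# Demo on an in-memory binary stream standing in for a real file.
sizes = [len(chunk) for chunk in read_chunks(io.BytesIO(b"x" * 600))]
print(sizes)  # [256, 256, 88]
```

Only one chunk is held at a time, so memory use stays flat regardless of file size.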

Also, it seems like the author hasn't heard of enumerate: http://docs.python.org/2/library/functions.html#enumerate

paulgb · on Sept 27, 2013

I agree that it's good to have an easy way to read a whole file in, but when I think about it I can't think of any case where I've had to write code to read a file that I didn't want to process line-by-line, in which case the non-slurping method is actually less code.

j_baker · on Sept 28, 2013

It sounds like you've only read in line-by-line files then. I mean, it doesn't make sense to read in a JSON or XML document line-by-line. Nor does it make sense to read in most binary files line-by-line.

Some formats, such as CSV, do make sense to read in line-by-line though.

paulgb · on Sept 30, 2013

I've read in lots of formats, including many XML and binary files. 99% of the time there is a library that already handles the low-level interaction with those files. The only times I've had to write original Python code that reads files directly have typically been things that could be processed line by line (e.g. various record-based data formats). I'm trying to think of a counterexample but I'm coming up blank.

mh- · on Sept 28, 2013

depends on how much you care about performance/memory usage.

for one, variance in the line lengths (by byte count) will probably force the usage of inefficient buffer sizes

also, the parent comment gave one universal example: binary files

repsilat · on Sept 28, 2013

Even with binary files, the only reasonable ways I can think to process them are stream-wise and with mmap. (In fact, why would you ever slurp a file when you can mmap it instead?)
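
A rough sketch of the mmap approach, assuming a non-empty file (mapping an empty file raises an error). `find_in_file` is a hypothetical helper and the temp file exists only for the demo.

```python
import mmap
import os
import tempfile

def find_in_file(path, needle):
    """Return the byte offset of needle in the file, or -1 if absent."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)

# Demo with a throwaway temp file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header\x00\x01payload")
offset = find_in_file(tmp.name, b"payload")
os.remove(tmp.name)
print(offset)  # 8
```

The OS pages data in on demand, so the file never has to be copied into a Python bytes object up front.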

vacri · on Sept 28, 2013

>And for someone just learning Python

The context of the question and answers was 'a LinkedIn group for professional Python programmers'.

j_baker · on Sept 28, 2013

If you don't know how to slurp a file, you're not a professional Python programmer.

vacri · on Sept 29, 2013

The context of the answer was an expert talking to the experts who answered the newbie.

rlwolfcastle · on Sept 28, 2013

1. Get it to work

2. Test that it works

3. Optimize (if needed)

If your file is small enough and is a trivial percentage of the overall run time, who cares how it is read?

gwu78 · on Sept 27, 2013

Assuming this is true (filters are faster than slurping), how do we explain the perception of Perl and the many similar interpreters that followed as being "faster than the shell" (i.e. UNIX standard utilities and pipes)? Since the dawn of Perl, and through the Python era, one could conclude that the shell and UNIX utilities have been all but abandoned for doing work with large files, in favor of using other interpreters and their myriad helper libraries.

My guess is that hardware improved and made slurping easier to do. Available RAM increased as hardware improved. This allowed slurping to displace filters as the choice way to work with files.

In a resource constrained environment, I still prefer filters to slurping. But how many developers or users today perceive their environment as resource constrained?

Terretta · on Sept 27, 2013

Except that Perl is filter mode. In Perl

    while (<FILE>) { print $_; }

is line by line and efficient. Perl programmers avoid

    my @entire_file=<$yourhandle>;

But even that loads the array line by line.

As a side note, good Perl has always done things while there are things to do, in a list processing kind of approach feeling more functional than imperative.

// Slurping the whole file into a single thing is actually a chore: http://stackoverflow.com/questions/206661/what-is-the-best-w...

mct · on Sept 28, 2013

If you're writing a filter, an even better pattern is:

        while (<>) {
                ...
        }

By omitting the file handle, "<>" will go line by line through each filename specified on the command line. If no filename arguments were passed, it reads from STDIN.

shabble · on Sept 28, 2013

I'm not sure if it's still the case (I recall some discussion about optimising it) but using the more obvious[1] idiom of

    foreach my $line (<$FH>) { ... }

slurps the entire file because the foreach is forcing list context on the diamond operator.

The

    while(<$FH>) { ... }

that you describe works as desired though.

http://stackoverflow.com/questions/585341/ appears to cover it in some detail.

[1] To me, anyway.

pnathan · on Sept 27, 2013

Actually, when I was using Perl a while ago (couple years now), `while(<$fh>) { ... }` was really the memory efficient way to do file work.

scott_s · on Sept 27, 2013

Others have already pointed out that looping over the lines in the file (which uses iterators) is the more obvious way to do it. But, there's an even better way. Check out "Generator Tricks for System Programmers": http://www.dabeaz.com/generators/
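
To give a flavor of the generator style, here is a small sketch of a lazy pipeline. It is an illustration only, not code taken from the linked material; `grep` and `field` are made-up helper names and the log format is invented for the demo.

```python
import io

def grep(pattern, lines):
    """Yield only the lines containing pattern."""
    return (line for line in lines if pattern in line)

def field(lines, index):
    """Yield one whitespace-separated column from each line."""
    return (line.split()[index] for line in lines)

# Invented log format: method, path, status, bytes sent.
log = io.StringIO("GET /a 200 512\nGET /b 404 0\nGET /c 200 1024\n")

ok_lines = grep(" 200 ", log)                # nothing has been read yet
sizes = (int(x) for x in field(ok_lines, -1))
total = sum(sizes)                           # sum() pulls lines through one at a time
print(total)  # 1536
```

Each stage handles one line at a time, so the whole chain runs in constant memory.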

It has been submitted to HN many times: https://www.hnsearch.com/search#request/all&q=generator+tric...

jdnier · on Sept 27, 2013

It would also be more idiomatic to write

    for i, line in enumerate(sys.stdin):
        print '{:>6} {}'.format(i, line[:-1])

riskable · on Sept 27, 2013

Unless you want your line numbers to start at 0, do this instead:

    import sys
    for lineno, line in enumerate(sys.stdin, 1):
        print('{:>6} {}'.format(lineno, line[:-1]))

The second argument to enumerate() is the 'start' (can also be a keyword). So by passing a 1 we start there instead of 0.

cpjk · on Sept 27, 2013

Forgive me, but doesn't file.readline() provide the same functionality for reading a single line at a time from a file?

chaosphere2112 · on Sept 27, 2013

Yeah, but aren't we actually supposed to do this?

  for line in file:

scott_s · on Sept 27, 2013

Yes, and that is the idiomatic way to read files in Python. I guess me and the author are looking at different Python programs, because slurping the whole file is not "by far the most common way" I have seen files read in the wild.

rprospero · on Sept 27, 2013

Most of my code slurps the entire file at a time. I can't actually think of any code we have that streams the data.

Then again, most of my files are three dimensional matrices. There's nothing I can really do on a line by line basis.

gcr · on Sept 27, 2013

Yes, this is essentially looping over successive calls to file.readline()
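
A quick way to convince yourself, using an in-memory `io.StringIO` as a stand-in for a real file (this only shows that the lines come out the same, not anything about buffering):

```python
import io

text = "first\nsecond\nthird\n"

# Explicit readline() loop: an empty string signals end of file.
f = io.StringIO(text)
via_readline = []
line = f.readline()
while line:
    via_readline.append(line)
    line = f.readline()

# Iterating over the file object yields the same lines.
via_iteration = list(io.StringIO(text))

print(via_readline == via_iteration)  # True
```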

tjgq · on Sept 28, 2013

With at least one small (and usually irrelevant) difference: "for line in file" has its own internal buffering which cannot be turned off with python -u.

I learned this the hard way when debugging a Python script that read from tail -f output...

gcr · on Sept 28, 2013

I thought a function that reads a line from a file would read the entire line and block until doing so. How would turning off line buffering change that? Do you expect 'for line in file' to yield as many bytes as is available if a full line can't be read yet?

mistercow · on Sept 28, 2013

>It also happens to nearly always be the wrong way to read a file

It's the wrong way in many cases, but it's the right way in a very large number of cases, if not in the majority of cases.

Often you need to read in a relatively small file, then do something trivial with it, or toss it through a couple of library-provided string processing functions. Or maybe the files are a bit bigger, and you're writing a script to automate some grunt work. You expect to run this script once. Or, more generally, maybe you're writing some code that gets called once a month and takes less than a second to complete.

In any of those situations, it would be silly to do anything other than slurping. String manipulation is easy to reason about. Stream processing is not.

Also note that most OSes cache files in memory, so if you are reading the same file often, the slowdown from reading the data into memory is drastically reduced.

rsobers · on Sept 28, 2013

Yup. There are literally millions of instances of code that slurps and Just Works and the users don't care and the investors don't care and the servers don't care and the programmer just moved on with life and knocked out the next feature. And nobody cared and it never mattered that it was "wrong."

pyre · on Sept 28, 2013

It could be used as a known vector to crash the program. Turn the file that it reads into something that will fill up RAM unless Python has some size checks before slurping.
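
As far as I know a plain `read()` does no such check, so a program that cares would have to add its own. A minimal sketch, where `slurp_bounded` and `MAX_BYTES` are hypothetical names and the limit is arbitrary:

```python
import os
import tempfile

MAX_BYTES = 1024  # hypothetical cap; pick one that suits the program

def slurp_bounded(path, limit=MAX_BYTES):
    """Read a whole file, refusing anything larger than limit bytes."""
    with open(path, "rb") as f:
        # Ask for one byte more than the limit so oversize files are
        # detected without ever holding more than limit + 1 bytes.
        data = f.read(limit + 1)
    if len(data) > limit:
        raise ValueError("%s exceeds %d bytes" % (path, limit))
    return data

# Demo: a 2000-byte file is rejected at the default limit, accepted at 4096.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2000)
try:
    slurp_bounded(tmp.name)
    rejected = False
except ValueError:
    rejected = True
small = slurp_bounded(tmp.name, limit=4096)
os.remove(tmp.name)
print(rejected, len(small))  # True 2000
```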

mistercow · on Sept 28, 2013

An attacker with that access can also DoS a program that doesn't slurp using the same technique. Also, on a 64-bit system, a DoS is all you can realistically get from this vector anyway. I don't know what happens if you actually fill Python's available address space (it seems pretty difficult), but I'd be shocked if it were a crash, and not an exception.

kg · on Sept 27, 2013

Generally good advice, but note that in some cases 'slurping' is actually much better.

The most obvious one is where you can exploit parallelism (either CPU parallelism or storage parallelism) by fetching multiple entire files at once and preparing them in memory. This allows you to start spending CPU time processing one loaded file, while other files load in the background. When you stream a file one line at a time, other than some basic optimistic lookahead, it's not really possible for the OS to do as much to help you there, so you're going to be effectively single-threaded. If the computation you're doing on the data is significant, you can end up being unable to even maximize your use of a single core on computation.
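
A sketch of that idea using a thread pool, assuming the files are small enough to hold in memory. `slurp` and `process_all` are hypothetical names, and the actual speedup depends on the storage and the workload.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def slurp(path):
    """Read one whole file into memory."""
    with open(path, "rb") as f:
        return f.read()

def process_all(paths, process, max_workers=4):
    """Load files in worker threads and process each result in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits every read up front, so later files load
        # while earlier ones are being processed in this thread.
        return [process(data) for data in pool.map(slurp, paths)]

# Demo: three small temp files, "processed" by taking their length.
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i in range(3):
        p = os.path.join(d, "f%d.bin" % i)
        with open(p, "wb") as f:
            f.write(b"x" * (i + 1))
        paths.append(p)
    lengths = process_all(paths, len)
print(lengths)  # [1, 2, 3]
```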

bcoates · on Sept 28, 2013

If you're writing a filter, you probably want to use fileinput, which does most of the stuff you'd want in a "read lines from a bunch of files and do something" text-processing program.

http://docs.python.org/2/library/fileinput.html
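
A small sketch of the line-numbering filter built on fileinput. `number_lines` is a made-up name; the file list is passed explicitly here so the demo never waits on stdin, whereas a real filter would typically call `fileinput.input()` with no arguments and let it fall back to `sys.argv[1:]` or stdin.

```python
import fileinput
import os
import tempfile

def number_lines(paths):
    """Return numbered lines across all files, numbering continuously."""
    out = []
    with fileinput.input(files=paths) as stream:
        for line in stream:
            # lineno() is the cumulative line number across all files.
            out.append("{:>6} {}".format(stream.lineno(), line.rstrip("\n")))
    return out

# Demo with two throwaway files.
with tempfile.TemporaryDirectory() as d:
    paths = [os.path.join(d, "a.txt"), os.path.join(d, "b.txt")]
    for path, text in zip(paths, ["a\nb\n", "c\n"]):
        with open(path, "w") as f:
            f.write(text)
    numbered = number_lines(paths)
print("\n".join(numbered))
```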

riskable · on Sept 27, 2013

His example is OK but I was able to improve it significantly with a few minor changes:

    "A simple filter that prepends line numbers" # <-- Docstring
    import sys
    for fname in sys.argv[1:]: # ./program.py file1.txt file2.txt ...
        with open(fname) as f:
            # This reads in one line at a time from the file
            for lineno, line in enumerate(f, 1): # Start at 1
                print '{:>6} {}'.format(lineno, line[:-1])

My way lets you pass as many files as you want on the command line, has a proper docstring, and uses the enumerate() function (so you don't need the silly `lineno = 0` and `lineno += 1` lines).

fredsanford · on Sept 28, 2013

And yours doesn't work with stdin...

What's with all the nitpicking?

nobodysfool · on Sept 28, 2013

nah, way better to use a generator...

    with open("a.txt") as f:
        c = ["{0} : {1}".format(x,y) for x,y in enumerate(f,1) ]
    for x in c:
        print x,

prutschman · on Sept 28, 2013

c is assigned a list, not a generator. Switching the square brackets for parentheses creates a generator, but attempting to access the elements will fail because f was closed after exiting the 'with' scope.

  <ipython-input-8-e2c5ebe72b17> in <module>()
  ----> 1 for x in c:
        2     print x,
        3 
  
  <ipython-input-7-9460e3a04a4e> in <genexpr>(***failed resolving arguments***)
        1 with open("/tmp/foo.txt") as f:
  ----> 2     c = ("{0} : {1}".format(x,y) for x,y in enumerate(f,1))
        3 
  
  ValueError: I/O operation on closed file
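
One way around this, sketched below, is to put the `with` block inside a generator function so the file stays open for as long as the generator is being consumed. `numbered_lines` is a hypothetical name and the temp file is only there for the demo.

```python
import os
import tempfile

def numbered_lines(path):
    """Lazily yield 'N : line' strings; the file stays open while iterating."""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            yield "{0} : {1}".format(lineno, line)

# Demo with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\n")
result = list(numbered_lines(tmp.name))
os.remove(tmp.name)
print(result)  # ['1 : alpha\n', '2 : beta\n']
```

The file is closed when the generator is exhausted. If the generator is abandoned partway through, the file stays open until the generator is closed or collected.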

gargh · on Sept 27, 2013

Doesn't the io module (http://docs.python.org/2/library/io.html) do this without resorting to non-typical python code?

richardjs · on Sept 27, 2013

Is it non-idiomatic to do "for [line] in [file object]"? I use "for line in open('file')" all the time, for similar reasons as presented in the article. "for line in sys.stdin" is basically the same pattern, just with a different file object.

Edit: The idiom's mentioned in the Python docs on IO [1] as "memory efficient, fast, and leads to simple code"

[1] http://docs.python.org/2/tutorial/inputoutput.html#methods-o...

icebraining · on Sept 27, 2013

Nitpick: it's better to do "with open('file') as file: for line in file: ..." instead, but otherwise yes, iterating over file objects is great.

Another option is using mmap[1], particularly when the file is already in memory or you need more random access to it. It worked well when I was trying to parse some lines from the end of an open log file.

[1] http://docs.python.org/2/library/mmap.html

richardjs · on Sept 27, 2013

Is there a difference between the two? I assumed that without the "with", the file would still be closed once the loop was exited (and thus the reference to the file is dropped), but I'm open to the possibility that I'm mistaken.

tjgq · on Sept 27, 2013

Without the context manager (with statement), the underlying file is closed when the file object is garbage-collected.

In CPython, since reference counting is used for GC, this occurs when the loop exits. However, other implementations (e.g. PyPy) may use different schemes that do not guarantee collection as soon as objects go out of scope. As an extreme, a valid and occasionally useful GC strategy is to never collect anything at all [0]!

Hence, if you want to portably ensure the file is closed, you should either use the context manager or call close() explicitly.
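
Both portable options side by side, as a minimal sketch (the temp file is only scaffolding for the demo):

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("one\ntwo\n")

# Context manager: the file is closed as soon as the block exits.
with open(tmp.name) as f:
    count = sum(1 for _ in f)
closed_after_with = f.closed

# Explicit close() in a finally block gives the same guarantee.
g = open(tmp.name)
try:
    first = g.readline()
finally:
    g.close()
closed_after_close = g.closed

os.remove(tmp.name)
print(count, closed_after_with, closed_after_close)  # 2 True True
```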

[0] http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047...

ijl · on Sept 27, 2013

Relying on GC to close the file isn't safe. See http://stackoverflow.com/questions/1834556/does-a-file-objec...

dfc · on Sept 27, 2013

Was anyone else unfamiliar with `/usr/bin/jot`? It looks like it is an obscure way of doing:

    $ seq 1 10000000

I think a lot of people forget about the beauty and power of coreutils.

mssaxm · on Sept 28, 2013

the example was generated on a mac, which uses the FreeBSD userland utilities. seq is not included in non-GNU user-land utilities (as it is not POSIX), jot is the (more-or-less) equivalent of seq for BSD systems.

286c8cb04bda · on Sept 28, 2013

  $ uname -s
  Darwin
  $ type seq
  seq is /usr/bin/seq

The man page says --

The seq command first appeared in Plan 9 from Bell Labs. A seq command appeared in NetBSD 3.0, and ported to FreeBSD 9.0. This command was based on the command of the same name in Plan 9 from Bell Labs and the GNU core utilities. The GNU seq command first appeared in the 1.13 shell utilities release.

webhat · on Sept 28, 2013

    So the moral of the story is that Python makes it simple and elegant to write stream-processors on line-buffered data-streams.

I thought every language had this or a similar method as a best practice when processing 'large' files.

jlujan · on Sept 27, 2013

Another nitpick. The article doesn't specifically say text file. Who is to say it has newlines at any reasonable spacing? Why is this on the front page?

samspenc · on Sept 27, 2013

Very interesting. This is how Hadoop streaming handles file I/O as well.

keypusher · on Sept 28, 2013

This is why you don't ask programming questions to LinkedIn.

wfunction · on Sept 28, 2013

How in the world is this "faster" as stated?

Eiwatah4 · on Sept 28, 2013

If you read the whole file, then process that file, the program first blocks on disk I/O for the time it takes to read the file, then it blocks the CPU for the time it takes to process it.

By streaming the data, you can do both at the same time. (The OS will already fetch the next line from the disk while you are still processing the previous one.)

pyre · on Sept 28, 2013

If you're processing something one line at a time, and outputting something based on each line, then you don't need to read-and-process the entire file before printing everything out.

Other than that, it's more memory efficient.

wfunction · on Sept 28, 2013

That's not "faster" (i.e. it doesn't take less time to run), it just has less delay from when it sends back the first piece of output.

pyre · on Sept 28, 2013

As part of a larger system, removing the delay can cause other parts of the system to do their processing in parallel. While this doesn't reduce the amount of time that it takes that piece of the pipeline to run, it will reduce the total runtime of the system.

It can also be faster if you're searching for something in the file, because you can short-circuit reading the rest of the file when you find it.
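
A sketch of the short-circuit case. `first_match` is a made-up helper, and `io.StringIO` stands in for a real file so the demo can show what was left unread.

```python
import io

def first_match(lines, needle):
    """Return (line_number, line) for the first line containing needle, else None."""
    for lineno, line in enumerate(lines, 1):
        if needle in line:
            return lineno, line  # stop here; later lines are never read
    return None

stream = io.StringIO("alpha\nbeta\ngamma\ndelta\n")
hit = first_match(stream, "beta")
rest = stream.read()  # whatever the search never touched
print(hit)         # (2, 'beta\n')
print(repr(rest))  # 'gamma\ndelta\n'
```

A slurping version would have read all four lines before looking at any of them.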