Don’t Slurp: How to Read Files in Python (axialcorps.com)
47 points by mssaxm on Sept 27, 2013 | 53 comments


I like that you can slurp a file in two lines of Python. And for someone just learning Python, the author's solution is just unnecessarily complicated. How many people learning Python are going to have a need to optimize file reading to this level?

And besides that, the author's solution is just good for one situation: when you want to read something line-by-line, which isn't always the case. For binary files, you may want to do something like this (untested code):

    data = infile.read(256)
    while data:
        do_something(data)
        data = infile.read(256)
Also, it seems like the author hasn't heard of enumerate: http://docs.python.org/2/library/functions.html#enumerate
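For what it's worth, the chunked loop above can also be written with the two-argument form of iter(). This is only a sketch, assuming Python 3 and a file opened in binary mode; `iter_chunks` is a made-up helper name, and the BytesIO object just stands in for a real file.

```python
import io
from functools import partial

def iter_chunks(infile, size=256):
    """Yield successive blocks of at most `size` bytes until EOF."""
    # iter(callable, sentinel) keeps calling infile.read(size) until it
    # returns the sentinel b"", which is what read() gives at end of file.
    return iter(partial(infile.read, size), b"")

# In-memory stand-in for a file opened with mode "rb".
sample = io.BytesIO(b"x" * 600)
sizes = [len(chunk) for chunk in iter_chunks(sample)]
print(sizes)
```

The last chunk is simply shorter than the rest, so do_something() has to cope with a partial block.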


I agree that it's good to have an easy way to read a whole file in, but when I think about it I can't think of any case where I've had to write code to read a file that I didn't want to process line-by-line, in which case the non-slurping method is actually less code.


It sounds like you've only read in line-by-line files then. I mean, it doesn't make sense to read in a JSON or XML document line-by-line. Nor does it make sense to read in most binary files line-by-line.

Some formats, such as CSV, do make sense to read in line-by-line though.


I've read in lots of formats, including many XML and binary files. 99% of the time there is a library that already handles the low-level interaction with those files. The only times I've had to write original Python code that reads files directly have typically been things that could be processed line by line (eg various record-based data formats). I'm trying to think of a counter example but I'm coming up blank.


depends on how much you care about performance/memory usage.

for one, variance in the line lengths (by byte count) will probably force the usage of inefficient buffer sizes

also, the parent comment gave one universal example: binary files


Even with binary files, the only reasonable ways I can think to process them are stream-wise and with mmap. (In fact, why would you ever slurp a file when you can mmap it instead?)


>And for someone just learning Python

The context of the question and answers was 'a LinkedIn group for professional Python programmers'.


If you don't know how to slurp a file, you're not a professional Python programmer.


The context of the answer was an expert talking to the experts who answered the newbie.


1. Get it to work

2. Test that it works

3. Optimize (if needed)

If your file is small enough and is a trivial percentage of the overall run time, who cares how it is read?


Assuming this is true (filters are faster than slurping), how do we explain the perception of Perl and the many similar interpreters that followed as being "faster than the shell" (i.e. UNIX standard utilities and pipes)? Since the dawn of Perl, and through the Python era, one could conclude that the shell and UNIX utilities have been all but abandoned for doing work with large files, in favor of using other interpreters and their myriad helper libraries.

My guess is that hardware improved and made slurping easier to do. Available RAM increased as hardware improved. This allowed slurping to displace filters as the choice way to work with files.

In a resource constrained environment, I still prefer filters to slurping. But how many developers or users today perceive their environment as resource constrained?


Except that Perl is filter mode. In Perl

    while (<FILE>) { print $_; }
is line by line and efficient. Perl programmers avoid

    my @entire_file=<$yourhandle>;
But even that loads the array line by line.

As a side note, good Perl has always done things while there are things to do, in a list processing kind of approach feeling more functional than imperative.

// Slurping the whole file into a single thing is actually a chore: http://stackoverflow.com/questions/206661/what-is-the-best-w...


If you're writing a filter, an even better pattern is:

        while (<>) {
                ...
        }
By omitting the file handle, "<>" will go line by line through each filename specified on the command line. If no filename arguments were passed, it reads from STDIN.


I'm not sure if it's still the case (I recall some discussion about optimising it) but using the more obvious[1] idiom of

    foreach my $line (<$FH>) { ... }
slurps the entire file because the foreach is forcing list context on the diamond operator.

The

    while(<$FH>) { ... }
that you describe works as desired though.

http://stackoverflow.com/questions/585341/ appears to cover it in some detail.

[1] To me, anyway.


Actually, when I was using Perl a while ago (couple years now), `while(<$fh>) { ... }` was really the memory efficient way to do file work.


Others have already pointed out that looping over the lines in the file (which uses iterators) is the more obvious way to do it. But, there's an even better way. Check out "Generator Tricks for System Programmers": http://www.dabeaz.com/generators/

It has been submitted to HN many times: https://www.hnsearch.com/search#request/all&q=generator+tric...
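To give a flavour of the style without reproducing the slides, here is a rough sketch in Python 3 of chaining generators into a pipeline. The function names `grep` and `numbered` are made up for this example and are not taken from the presentation; the list of strings stands in for an open file.

```python
def grep(pattern, lines):
    """Lazily pass through only the lines that contain `pattern`."""
    return (line for line in lines if pattern in line)

def numbered(lines, start=1):
    """Lazily prefix each line with a right-aligned counter."""
    for lineno, line in enumerate(lines, start):
        yield "{:>6} {}".format(lineno, line.rstrip("\n"))

# Any iterable of lines works here, including an open file or sys.stdin.
sample = ["alpha\n", "beta\n", "alphabet\n"]
result = list(numbered(grep("alpha", sample)))
print(result)
```

Each stage pulls one line at a time from the stage before it, so nothing is read until the final consumer asks for it.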


It would also be more idiomatic to write

    for i, line in enumerate(sys.stdin):
        print '{:>6} {}'.format(i, line[:-1])


Unless you want your line numbers to start at 0, do this instead:

    import sys
    for lineno, line in enumerate(sys.stdin, 1):
        print('{:>6} {}'.format(lineno, line[:-1]))
The second argument to enumerate() is the 'start' (can also be a keyword). So by passing a 1 we start there instead of 0.


Forgive me, but doesn't file.readline() provide the same functionality for reading a single line at a time from a file?


Yeah, but aren't we actually supposed to do this?

  for line in file:


Yes, and that is the idiomatic way to read files in Python. I guess the author and I are looking at different Python programs, because slurping the whole file is not "by far the most common way" I have seen files read in the wild.


Most of my code slurps the entire file at a time. I can't actually think of any code we have that streams the data.

Then again, most of my files are three dimensional matrices. There's nothing I can really do on a line by line basis.


Yes, this is essentially looping over successive calls to file.readline()


With at least one small (and usually irrelevant) difference: "for line in file" has its own internal buffering which cannot be turned off with python -u.

I learned this the hard way when debugging a Python script that read from tail -f output...
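A workaround that is often suggested for the Python 2 read-ahead buffer is to call readline() yourself through two-argument iter(). The sketch below is written for Python 3, where the buffering details differ, and uses a StringIO in place of the tail -f pipe; `unbuffered_lines` is just an illustrative name.

```python
import io

def unbuffered_lines(stream):
    """Yield lines by calling stream.readline() until it returns ""."""
    # readline() hands back each line as soon as it is complete, rather
    # than going through the file iterator's own read-ahead.
    return iter(stream.readline, "")

# Stand-in for sys.stdin fed by something like `tail -f`.
lines = list(unbuffered_lines(io.StringIO("a\nb\n")))
print(lines)
```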


I thought a function that reads a line from a file would read the entire line and block until doing so. How would turning off line buffering change that? Do you expect 'for line in file' to yield as many bytes as is available if a full line can't be read yet?


>It also happens to nearly always be the wrong way to read a file

It's the wrong way in many cases, but it's the right way in a very large number of cases, if not in the majority of cases.

Often you need to read in a relatively small file, then do something trivial with it, or toss it through a couple of library-provided string processing functions. Or maybe the files are a bit bigger, and you're writing a script to automate some grunt work. You expect to run this script once. Or, more generally, maybe you're writing some code that gets called once a month and takes less than a second to complete.

In any of those situations, it would be silly to do anything other than slurping. String manipulation is easy to reason about. Stream processing is not.

Also note that most OSes cache files in memory, so if you are reading the same file often, the slowdown from reading the data into memory is drastically reduced.


Yup. There are literally millions of instances of code that slurps and Just Works and the users don't care and the investors don't care and the servers don't care and the programmer just moved on with life and knocked out the next feature. And nobody cared and it never mattered that it was "wrong."


It could be used as a known vector to crash the program. Turn the file that it reads into something that will fill up RAM unless Python has some size checks before slurping.


An attacker with that access can also DoS a program that doesn't slurp using the same technique. Also, on a 64-bit system, a DoS is all you can realistically get from this vector anyway. I don't know what happens if you actually fill Python's available address space (it seems pretty difficult), but I'd be shocked if it were a crash, and not an exception.


Generally good advice, but note that in some cases 'slurping' is actually much better.

The most obvious one is where you can exploit parallelism (either CPU parallelism or storage parallelism) by fetching multiple entire files at once and preparing them in memory. This allows you to start spending CPU time processing one loaded file, while other files load in the background. When you stream a file one line at a time, other than some basic optimistic lookahead, it's not really possible for the OS to do as much to help you there, so you're going to be effectively single-threaded. If the computation you're doing on the data is significant, you can end up being unable to even maximize your use of a single core on computation.
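As a rough illustration of that idea (not a benchmark), the sketch below slurps files on a thread pool while the main thread processes whichever file has already arrived. It assumes Python 3; `slurp` and `process_files` are names invented here, and the tiny temporary files stand in for large inputs.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import tempfile

def slurp(path):
    """Read one whole file into memory as bytes."""
    with open(path, "rb") as f:
        return f.read()

def process_files(paths, process, max_workers=4):
    """Slurp files on worker threads and process them in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits every read immediately and yields results in input
        # order, so later files keep loading while an earlier one is processed.
        return [process(data) for data in pool.map(slurp, paths)]

# Demo with small temporary files standing in for large inputs.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, "part{}.bin".format(i))
    with open(path, "wb") as f:
        f.write(b"x" * (i + 1) * 10)
    paths.append(path)

sizes = process_files(paths, len)
print(sizes)
```

Whether this actually helps depends on the storage and on how heavy `process` is, so measure before committing to it.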


If you're writing a filter, you probably want to use fileinput, which does most of the stuff you'd want in a "read lines from a bunch of files and do something" text-processing program.

http://docs.python.org/2/library/fileinput.html
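A minimal sketch of the line-numbering filter on top of fileinput, assuming Python 3. `number_lines` is a made-up name; with files=None, fileinput falls back to sys.argv[1:] and then to stdin, while the demo passes temporary files explicitly so it can run on its own.

```python
import fileinput
import os
import tempfile

def number_lines(files=None):
    """Yield numbered lines from `files`, or from sys.argv[1:]/stdin if None."""
    with fileinput.input(files=files) as stream:
        for line in stream:
            # stream.lineno() is cumulative across all the input files.
            yield "{:>6} {}".format(stream.lineno(), line.rstrip("\n"))

# Demo: two small temporary files in place of command-line arguments.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

numbered = list(number_lines(paths))
print("\n".join(numbered))
```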


His example is OK but I was able to improve it significantly with a few minor changes:

    "A simple filter that prepends line numbers" # <-- Docstring
    import sys
    for fname in sys.argv[1:]: # ./program.py file1.txt file2.txt ...
        with open(fname) as f:
            # This reads in one line at a time from stdin
            for lineno, line in enumerate(f, 1): # Start at 1
                print '{:>6} {}'.format(lineno, line[:-1])
My way lets you pass as many files as you want to stdin, has a proper docstring, and uses the enumerate() function (so you don't need the silly `lineno = 0` and `lineno += 1` lines).


And yours doesn't work with stdin...

What's with all the nitpicking?


nah, way better to use a generator...

    with open("a.txt") as f:
        c = ["{0} : {1}".format(x,y) for x,y in enumerate(f,1) ]
    for x in c:
        print x,


c is assigned a list, not a generator. Switching the square brackets for parentheses creates a generator, but attempting to access the elements will fail because f was closed after exiting the 'with' scope.

  <ipython-input-8-e2c5ebe72b17> in <module>()
  ----> 1 for x in c:
        2     print x,
        3 
  
  <ipython-input-7-9460e3a04a4e> in <genexpr>(***failed resolving arguments***)
        1 with open("/tmp/foo.txt") as f:
  ----> 2     c = ("{0} : {1}".format(x,y) for x,y in enumerate(f,1))
        3 
  
  ValueError: I/O operation on closed file
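One way around the closed-file error is to move the 'with' inside a generator function, so the file stays open for as long as the generator is being consumed. This is a sketch in Python 3; `numbered_lines` is an illustrative name and the temporary file stands in for a.txt.

```python
import os
import tempfile

def numbered_lines(path):
    """Lazily yield 'N : line' strings; the file is open only while iterating."""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            yield "{0} : {1}".format(lineno, line)
    # Reaching this point (the loop is exhausted) has already closed f.

# Demo with a temporary file.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("a\nb\n")

out = list(numbered_lines(path))
os.remove(path)
print(out)
```

Note that if the consumer abandons the generator part-way through, the file is closed only when the generator is closed or collected.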


Doesn't the io module (http://docs.python.org/2/library/io.html) do this without resorting to non-typical python code?


Is it non-idiomatic to do "for [line] in [file object]"? I use "for line in open('file')" all the time, for similar reasons as presented in the article. "for line in sys.stdin" is basically the same pattern, just with a different file object.

Edit: The idiom's mentioned in the Python docs on IO [1] as "memory efficient, fast, and leads to simple code"

[1] http://docs.python.org/2/tutorial/inputoutput.html#methods-o...


Nitpick: it's better to do "with open('file') as file: for line in file: ..." instead, but otherwise yes, iterating over file objects is great.

Another option is using mmap[1], particularly when the file is already in memory or you need more random access to it. It worked well when I was trying to parse some lines from the end of an open log file.

[1] http://docs.python.org/2/library/mmap.html
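Something along these lines is what I mean for reading the end of a log. It is a sketch only, assuming Python 3 and a non-empty file (mmap refuses to map an empty one); `last_lines` is a made-up helper.

```python
import mmap
import os
import tempfile

def last_lines(path, count):
    """Return the last `count` lines of a non-empty file as bytes objects."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            # Ignore a single trailing newline so it doesn't count as a line.
            if mm[end - 1:end] == b"\n":
                end -= 1
            pos = end
            # Walk backwards one newline at a time, without touching the
            # beginning of the file.
            for _ in range(count):
                pos = mm.rfind(b"\n", 0, pos)
                if pos == -1:
                    break
            return mm[pos + 1:end].split(b"\n")

# Demo with a temporary "log" file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"l1\nl2\nl3\nl4\n")

tail = last_lines(path, 2)
everything = last_lines(path, 10)
os.remove(path)
print(tail)
```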


Is there a difference between the two? I assumed that without the "with", the file would still be closed once the loop was exited (and thus the reference to the file is dropped), but I'm open to the possibility that I'm mistaken.


Without the context manager (with statement), the underlying file is closed when the file object is garbage-collected.

In CPython, since reference counting is used for GC, this occurs when the loop exits. However, other implementations (e.g. PyPy) may use different schemes that do not guarantee collection as soon as objects go out of scope. As an extreme, a valid and occasionally useful GC strategy is to never collect anything at all [0]!

Hence, if you want to portably ensure the file is closed, you should either use the context manager or call close() explicitly.
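A small sketch of both portable options, assuming Python 3 and using a temporary file as a stand-in for real input:

```python
import os
import tempfile

# Create a small file to read back.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as out:
    out.write("first\nsecond\n")

# Option 1: the context manager closes the file when the block exits,
# whatever the garbage collector does.
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]
closed_after_with = f.closed

# Option 2: call close() explicitly, in a finally clause so it also runs
# if reading raises.
g = open(path)
try:
    first = next(g)
finally:
    g.close()

os.remove(path)
print(closed_after_with, g.closed)
```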

[0] http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047...


Relying on GC to close the file isn't safe. See http://stackoverflow.com/questions/1834556/does-a-file-objec...


Was anyone else unfamiliar with `/usr/bin/jot`? It looks like it is an obscure way of doing:

    $ seq 1 10000000
I think a lot of people forget about the beauty and power of coreutils.


The example was generated on a Mac, which uses the FreeBSD userland utilities. seq is not included in non-GNU userland utilities (as it is not POSIX); jot is the (more-or-less) equivalent of seq for BSD systems.


  $ uname -s
  Darwin
  $ type seq
  seq is /usr/bin/seq
The man page says --

The seq command first appeared in Plan 9 from Bell Labs. A seq command appeared in NetBSD 3.0, and ported to FreeBSD 9.0. This command was based on the command of the same name in Plan 9 from Bell Labs and the GNU core utilities. The GNU seq command first appeared in the 1.13 shell utilities release.


    So the moral of the story is that Python makes it simple and elegant to write stream-processors on line-buffered data-streams.
I thought every language had this or a similar method as a best practice when processing 'large' files.


Another nitpick. The article doesn't specifically say text file. Who is to say it has newlines at any reasonable spacing? Why is this on the front page?


Very interesting. This is how Hadoop streaming handles file I/O as well.


This is why you don't ask programming questions on LinkedIn.


How in the world is this "faster" as stated?


If you read the whole file, then process that file, the program first blocks on disk I/O for the time it takes to read the file, then it blocks the CPU for the time it takes to process it.

By streaming the data, you can do both at the same time. (The OS will already fetch the next line from the disk while you are still processing the previous one.)


If you're processing something one line at a time, and outputting something based on each line, then you don't need to read-and-process the entire file before printing everything out.

Other than that, it's more memory efficient.


That's not "faster" (i.e. it doesn't take less time to run), it just has less delay before it sends back the first piece of output.


As part of a larger system, removing the delay can cause other parts of the system to do their processing in parallel. While this doesn't reduce the amount of time that it takes that piece of the pipeline to run, it will reduce the total runtime of the system.

It can also be faster if you're searching for something in the file, because you can short-circuit reading the rest of the file when you find it.
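The short-circuit case might look like the sketch below (Python 3). `first_match` is an illustrative name, and the StringIO object stands in for a large file so that what remains unread can be inspected.

```python
import io

def first_match(lines, needle):
    """Return (line number, line) for the first line containing `needle`."""
    for lineno, line in enumerate(lines, 1):
        if needle in line:
            # Returning here stops the iteration; later lines are never read.
            return lineno, line.rstrip("\n")
    return None

stream = io.StringIO("a\nneedle here\nc\n")
hit = first_match(stream, "needle")
rest = stream.read()  # whatever the search never had to look at
print(hit, repr(rest))
```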



