I like that you can slurp a file in two lines of Python. And for someone just learning Python, the author's solution is just unnecessarily complicated. How many people learning Python are going to have a need to optimize file reading to this level?
And besides that, the author's solution is just good for one situation: when you want to read something line-by-line, which isn't always the case. For binary files, you may want to do something like this (untested code):
data = infile.read(256)
while data:
    do_something(data)
    data = infile.read(256)
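A minimal sketch of the same chunked loop in runnable form (Python 3), assuming a file opened in binary mode. `read_chunks` is a made-up helper name, and the in-memory `io.BytesIO` stream only stands in for a real file:

```python
import io
from functools import partial


def read_chunks(infile, chunk_size=256):
    """Return an iterator over successive chunks of a binary file object."""
    # iter() with a sentinel calls infile.read(chunk_size) repeatedly
    # and stops as soon as it returns b"" (end of file).
    return iter(partial(infile.read, chunk_size), b"")


# Demo on an in-memory stream standing in for a real binary file.
demo = io.BytesIO(b"x" * 600)
sizes = [len(chunk) for chunk in read_chunks(demo)]
print(sizes)  # [256, 256, 88]
```

The two-argument form of iter() avoids writing the read call twice, which is the main awkwardness of the while loop.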
I agree that it's good to have an easy way to read a whole file in, but when I think about it I can't think of any case where I've had to write code to read a file that I didn't want to process line-by-line, in which case the non-slurping method is actually less code.
It sounds like you've only read in line-by-line files then. I mean, it doesn't make sense to read in a JSON or XML document line-by-line. Nor does it make sense to read in most binary files line-by-line.
Some formats, such as CSV, do make sense to read in line-by-line though.
I've read in lots of formats, including many XML and binary files. 99% of the time there is a library that already handles the low-level interaction with those files. The only times I've had to write original Python code that reads files directly have typically been things that could be processed line by line (e.g. various record-based data formats). I'm trying to think of a counterexample but I'm coming up blank.
Even with binary files, the only reasonable ways I can think to process them are stream-wise and with mmap. (In fact, why would you ever slurp a file when you can mmap it instead?)
Assuming this is true (filters are faster than slurping), how do we explain the perception of Perl and the many similar interpreters that followed as being "faster than the shell" (i.e. UNIX standard utilities and pipes)? Since the dawn of Perl, and through the Python era, one could conclude that the shell and UNIX utilities have been all but abandoned for doing work with large files, in favor of using other interpreters and their myriad helper libraries.
My guess is that hardware improved and made slurping easier to do. Available RAM increased as hardware improved. This allowed slurping to displace filters as the choice way to work with files.
In a resource-constrained environment, I still prefer filters to slurping. But how many developers or users today perceive their environment as resource-constrained?
is line by line and efficient. Perl programmers avoid
my @entire_file=<$yourhandle>;
But even that loads the array line by line.
As a side note, good Perl has always done things while there are things to do, in a list processing kind of approach feeling more functional than imperative.
If you're writing a filter, an even better pattern is:
while (<>) {
    ...
}
By omitting the file handle, "<>" will go line by line through each filename specified on the command line. If no filename arguments were passed, it reads from STDIN.
Others have already pointed out that looping over the lines in the file (which uses iterators) is the more obvious way to do it. But, there's an even better way. Check out "Generator Tricks for System Programmers": http://www.dabeaz.com/generators/
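For anyone who hasn't seen the style, here is a rough sketch of a generator pipeline (Python 3). The function names and the sample log lines are my own illustration, not taken from the linked material:

```python
import io


def read_lines(f):
    """Yield lines from a file object with the trailing newline removed."""
    for line in f:
        yield line.rstrip("\n")


def grep(pattern, lines):
    """Pass through only the lines that contain the substring `pattern`."""
    return (line for line in lines if pattern in line)


def field(lines, index):
    """Yield one whitespace-separated column from each line."""
    for line in lines:
        yield line.split()[index]


# Each stage pulls one line at a time from the stage before it,
# so the whole input is never held in memory at once.
log = io.StringIO("GET /a 200\nGET /b 404\nPOST /c 200\n")
paths = list(field(grep("200", read_lines(log)), 1))
print(paths)  # ['/a', '/c']
```

The stages compose like a shell pipeline, which is why this style suits filter-type programs.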
Yes, and that is the idiomatic way to read files in Python. I guess me and the author are looking at different Python programs, because slurping the whole file is not "by far the most common way" I have seen files read in the wild.
With at least one small (and usually irrelevant) difference: "for line in file" has its own internal buffering which cannot be turned off with python -u.
I learned this the hard way when debugging a Python script that read from tail -f output...
I thought a function that reads a line from a file would read the entire line and block until doing so. How would turning off line buffering change that? Do you expect 'for line in file' to yield as many bytes as is available if a full line can't be read yet?
>It also happens to nearly always be the wrong way to read a file
It's the wrong way in many cases, but it's the right way in a very large number of cases, if not in the majority of cases.
Often you need to read in a relatively small file, then do something trivial with it, or toss it through a couple of library-provided string processing functions. Or maybe the files are a bit bigger, and you're writing a script to automate some grunt work. You expect to run this script once. Or, more generally, maybe you're writing some code that gets called once a month and takes less than a second to complete.
In any of those situations, it would be silly to do anything other than slurping. String manipulation is easy to reason about. Stream processing is not.
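To make that concrete, a small sketch of the slurp-then-string-functions style (Python 3). `count_word` and the sample text are made up for illustration:

```python
import os
import tempfile


def count_word(path, word):
    """Slurp a small text file and count case-insensitive occurrences of a word."""
    with open(path) as f:
        text = f.read()  # whole file in memory; fine for small inputs
    return text.lower().split().count(word.lower())


# Demo on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("the cat and the hat\nThe end\n")
word_count = count_word(tmp.name, "the")
os.remove(tmp.name)
print(word_count)  # 3
```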
Also note that most OSes cache files in memory, so if you are reading the same file often, the slowdown from reading the data into memory is drastically reduced.
Yup. There are literally millions of instances of code that slurps and Just Works and the users don't care and the investors don't care and the servers don't care and the programmer just moved on with life and knocked out the next feature. And nobody cared and it never mattered that it was "wrong."
It could be used as a known vector to crash the program. Turn the file that it reads into something that will fill up RAM unless Python has some size checks before slurping.
An attacker with that access can also DoS a program that doesn't slurp using the same technique. Also, on a 64-bit system, a DoS is all you can realistically get from this vector anyway. I don't know what happens if you actually fill Python's available address space (it seems pretty difficult), but I'd be shocked if it were a crash, and not an exception.
Generally good advice, but note that in some cases 'slurping' is actually much better.
The most obvious one is where you can exploit parallelism (either CPU parallelism or storage parallelism) by fetching multiple entire files at once and preparing them in memory. This allows you to start spending CPU time processing one loaded file, while other files load in the background. When you stream a file one line at a time, other than some basic optimistic lookahead, it's not really possible for the OS to do as much to help you there, so you're going to be effectively single-threaded. If the computation you're doing on the data is significant, you can end up being unable to even maximize your use of a single core on computation.
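A rough sketch of what I mean, assuming Python 3 and a thread pool for the background loads. `slurp` and `process_all` are hypothetical names, and whether this actually wins depends on your storage and on how heavy the processing is:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def slurp(path):
    """Read one whole file into memory as bytes."""
    with open(path, "rb") as f:
        return f.read()


def process_all(paths, process, max_workers=4):
    """Load files in background threads and process them in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits every read up front and yields results in input
        # order, so later files keep loading while earlier ones are
        # being processed in this thread.
        return [process(data) for data in pool.map(slurp, paths)]


# Demo: three small throwaway files, "processed" by taking their length.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_paths = []
    for i in range(3):
        p = os.path.join(tmpdir, "f%d.bin" % i)
        with open(p, "wb") as out:
            out.write(b"x" * (i + 1))
        demo_paths.append(p)
    loaded_sizes = process_all(demo_paths, len)
print(loaded_sizes)  # [1, 2, 3]
```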
If you're writing a filter, you probably want to use fileinput, which does most of the stuff you'd want in a "read lines from a bunch of files and do something" text-processing program.
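A minimal sketch of a line-numbering filter built on fileinput (Python 3). `number_lines` is a made-up name. With `files=None`, fileinput falls back to the filenames in `sys.argv[1:]`, or to stdin if there are none:

```python
import fileinput
import os
import tempfile


def number_lines(files=None):
    """Yield numbered lines from the given files (or argv/stdin if None)."""
    with fileinput.FileInput(files=files) as stream:
        for line in stream:
            # lineno() is cumulative across all input files.
            yield "{:>6} {}".format(stream.lineno(), line.rstrip("\n"))


# Demo on a throwaway file instead of real command-line arguments.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")
    with open(path, "w") as out:
        out.write("alpha\nbeta\n")
    numbered_lines = list(number_lines([path]))
print(numbered_lines)  # ['     1 alpha', '     2 beta']
```

In a real filter you would call `number_lines()` with no arguments and print each result.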
His example is OK but I was able to improve it significantly with a few minor changes:
"A simple filter that prepends line numbers"  # <-- Docstring
import sys
for fname in sys.argv[1:]:  # ./program.py file1.txt file2.txt ...
    with open(fname) as f:
        # This reads one line at a time from the current file
        for lineno, line in enumerate(f, 1):  # Start at 1
            print('{:>6} {}'.format(lineno, line.rstrip('\n')))
My way lets you pass as many files as you want on the command line, has a proper docstring, and uses the enumerate() function (so you don't need the silly `lineno = 0` and `lineno += 1` lines).
c is assigned a list, not a generator. Switching the square brackets for parentheses creates a generator, but attempting to access the elements will fail because f was closed after exiting the 'with' scope.
<ipython-input-8-e2c5ebe72b17> in <module>()
----> 1 for x in c:
      2     print x,
      3

<ipython-input-7-9460e3a04a4e> in <genexpr>(***failed resolving arguments***)
      1 with open("/tmp/foo.txt") as f:
----> 2     c = ("{0} : {1}".format(x,y) for x,y in enumerate(f,1))
      3

ValueError: I/O operation on closed file
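One way around that error, sketched in Python 3, is to put the `with` inside a generator function so the file stays open for as long as lines are being consumed. `numbered` is a made-up name, and the file contents are invented for the demo:

```python
import os
import tempfile


def numbered(path):
    """Yield 'N : line' strings; the file stays open while the generator runs."""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            yield "{0} : {1}".format(lineno, line.rstrip("\n"))
    # The file is closed here, once the generator has been exhausted.


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "foo.txt")
    with open(path, "w") as out:
        out.write("spam\neggs\n")
    result = list(numbered(path))
print(result)  # ['1 : spam', '2 : eggs']
```

The alternative is simply to consume the generator expression inside the original `with` block.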
Is it non-idiomatic to do "for [line] in [file object]"? I use "for line in open('file')" all the time, for similar reasons as presented in the article. "for line in sys.stdin" is basically the same pattern, just with a different file object.
Edit: The idiom's mentioned in the Python docs on IO [1] as "memory efficient, fast, and leads to simple code"
Nitpick: it's better to do "with open('file') as file: for line in file: ..." instead, but otherwise yes, iterating over file objects is great.
Another option is using mmap[1], particularly when the file is already in memory or you need more random access to it. It worked well when I was trying to parse some lines from the end of an open log file.
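Roughly what that can look like, as a sketch in Python 3 rather than the code I actually used. `last_lines` is a hypothetical helper that walks backwards from the end of the mapping with rfind():

```python
import mmap
import os
import tempfile


def last_lines(path, count=3):
    """Return up to the last `count` lines of a file as a list of bytes."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return []  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            # Ignore one trailing newline so it is not counted as a line.
            if mm[end - 1:end] == b"\n":
                end -= 1
            pos = end
            for _ in range(count):
                pos = mm.rfind(b"\n", 0, pos)
                if pos == -1:
                    break  # reached the start of the file
            return mm[pos + 1:end].split(b"\n")


# Demo on a throwaway "log" file.
with tempfile.TemporaryDirectory() as tmpdir:
    log_path = os.path.join(tmpdir, "app.log")
    with open(log_path, "wb") as out:
        out.write(b"a\nb\nc\nd\n")
    tail = last_lines(log_path, 2)
print(tail)  # [b'c', b'd']
```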
Is there a difference between the two? I assumed that without the "with", the file would still be closed once the loop was exited (and thus the reference to the file is dropped), but I'm open to the possibility that I'm mistaken.
Without the context manager (with statement), the underlying file is closed when the file object is garbage-collected.
In CPython, since reference counting is used for GC, this occurs when the loop exits. However, other implementations (e.g. PyPy) may use different schemes that do not guarantee collection as soon as objects go out of scope. As an extreme, a valid and occasionally useful GC strategy is to never collect anything at all [0]!
Hence, if you want to portably ensure the file is closed, you should either use the context manager or call close() explicitly.
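Both portable options, sketched in Python 3 with made-up function names:

```python
import os
import tempfile


def count_lines(path):
    """Count lines using a context manager; also report whether f was closed."""
    with open(path) as f:
        total = sum(1 for _ in f)
    # The with block has closed f here, regardless of the GC strategy.
    return total, f.closed


def count_lines_explicit(path):
    """Same count with an explicit close() in a finally clause."""
    f = open(path)
    try:
        return sum(1 for _ in f)
    finally:
        f.close()  # runs even if iteration raises


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "lines.txt")
    with open(path, "w") as out:
        out.write("one\ntwo\n")
    with_result = count_lines(path)
    explicit_result = count_lines_explicit(path)
print(with_result, explicit_result)  # (2, True) 2
```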
The example was generated on a Mac, which uses the FreeBSD userland utilities. seq is not included in non-GNU userland utilities (as it is not POSIX); jot is the (more-or-less) equivalent of seq for BSD systems.
The seq command first appeared in Plan 9 from Bell Labs. A seq command appeared in NetBSD 3.0, and was ported to FreeBSD 9.0. This command was based on the command of the same name in Plan 9 from Bell Labs and the GNU core utilities. The GNU seq command first appeared in the 1.13 shell utilities release.
If you read the whole file, then process that file, the program first blocks on disk I/O for the time it takes to read the file, then it blocks the CPU for the time it takes to process it.
By streaming the data, you can do both at the same time. (The OS will already fetch the next line from the disk while you are still processing the previous one.)
If you're processing something one line at a time, and outputting something based on each line, then you don't need to read-and-process the entire file before printing everything out.
As part of a larger system, removing the delay can cause other parts of the system to do their processing in parallel. While this doesn't reduce the amount of time that it takes that piece of the pipeline to run, it will reduce the total runtime of the system.
It can also be faster if you're searching for something in the file, because you can short-circuit reading the rest of the file when you find it.
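A small sketch of the short-circuit case (Python 3). `find_first` is a made-up name:

```python
import os
import tempfile


def find_first(path, needle):
    """Return the 1-based number of the first line containing needle, or None."""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if needle in line:
                return lineno  # stop here; later lines are never iterated
    return None


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.txt")
    with open(path, "w") as out:
        out.write("alpha\nbeta\ngamma\n")
    hit = find_first(path, "beta")
    miss = find_first(path, "zzz")
print(hit, miss)  # 2 None
```

A slurping version has to read the whole file before it can even start searching.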
Also, it seems like the author hasn't heard of enumerate: http://docs.python.org/2/library/functions.html#enumerate