Python vs. C: file reading performance

Posted by spirulence on May 3, 2012, 2:07 a.m.

TL;DR

I am a Python guy. I write Python at work, I write Python for fun, and I've even dabbled with writing Python outside in the fresh air. Someday I hope I can plug a keyboard into a Kindle and actually code outside comfortably.

I've also been reading a textbook called Compiler Design in C lately. I've just gotten to the part where the author describes a relatively complex way of reading files with a minimum of copying.

(Coming from a background where performance rarely needs to be better than "good enough," it's a change to be reading about designing for high performance in the first place.)

WHY??

In the text the author claims that "MS-DOS read and write times improve dramatically when you read 32k bytes at a time." I had to test this, and I figured I could pit C vs. Python in a very shallow, distorted way at the same time.

The Setup

I originally did this test reading the same small file chunk over and over again, but I realized that this probably takes advantage of OS caching and becomes a test of this caching rather than of the speed of the two languages.

So I set up an 8GB file, filled with the string "0123456789ABCDEF" over and over and over. Then, for each buffer size, each language does 2,000 sequential reads from the file.
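For anyone who wants to try something similar, here is a rough sketch of that setup in Python. It is not the code linked at the bottom of this post; it assumes Python 3, and the function names (make_test_file, time_sequential_reads) are made up for illustration.

```python
import os
import time

PATTERN = b"0123456789ABCDEF"  # the 16-byte filler string described above


def make_test_file(path, total_bytes, chunk_bytes=1 << 20):
    """Fill `path` with PATTERN repeated until it is `total_bytes` long."""
    # Write in large chunks so that creating a multi-GB file is not painfully slow.
    chunk = PATTERN * max(1, chunk_bytes // len(PATTERN))
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            piece = chunk[:remaining]
            f.write(piece)
            remaining -= len(piece)


def time_sequential_reads(path, buffer_size, num_reads):
    """Do up to `num_reads` back-to-back reads of `buffer_size` bytes.

    Returns (bytes_read, elapsed_seconds).
    """
    bytes_read = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(num_reads):
            data = f.read(buffer_size)
            if not data:  # ran off the end of the file early
                break
            bytes_read += len(data)
    return bytes_read, time.perf_counter() - start
```

Calling time_sequential_reads once per buffer size and dividing bytes_read by elapsed_seconds gives a throughput number to compare. The file has to be at least buffer_size times num_reads bytes long, or the loop stops early.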

Pitfalls

Sequential and random reads are known to have different performance characteristics. I probably would have gotten better results if I had done a series of random reads instead of sequential ones.
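A random-read version of the test could look something like the sketch below. I did not run this for the numbers in this post; it assumes Python 3, and time_random_reads and its seed parameter are hypothetical names.

```python
import os
import random
import time


def time_random_reads(path, buffer_size, num_reads, seed=0):
    """Seek to a random offset before each read of `buffer_size` bytes.

    Returns (bytes_read, elapsed_seconds).
    """
    file_size = os.path.getsize(path)
    # Highest offset that still leaves a full buffer's worth of data to read.
    max_offset = max(0, file_size - buffer_size)
    rng = random.Random(seed)  # fixed seed so repeated runs hit the same offsets
    bytes_read = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(num_reads):
            f.seek(rng.randint(0, max_offset))
            bytes_read += len(f.read(buffer_size))
    return bytes_read, time.perf_counter() - start
```

This still only means something on a file much larger than RAM. On a small file the OS cache will happily serve the "random" reads from memory.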

2,000 iterations is not really enough to establish behavior solidly, but I didn't actually think of doing random reads until just now, and there was no way I was going to set up a 40GB file so that I could do 10,000 reads of 4MB each.

I didn't do a whole lot of research into the buffering modes that Python offers for doing file reads. Some of those would make a difference. I have a feeling that normal file reads are internally buffered and copied at least once. That's a huge advantage for C, because read() is purported to allow the OS to copy straight from disk into your buffer if the buffer is the right size. At least it was allowed in 1990 when this book came out.
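As one example of what those buffering modes look like: in Python 3, opening a binary file with buffering=0 returns a raw, unbuffered file object, and readinto() fills a buffer you allocate yourself instead of handing back a new bytes object on every call. The sketch below shows the idea. I have not benchmarked it, so treat any speed difference as a guess, and read_unbuffered is a made-up name.

```python
def read_unbuffered(path, buffer_size):
    """Read a whole file through one reusable buffer; return total bytes read."""
    buf = bytearray(buffer_size)  # allocated once, reused for every read
    total = 0
    # buffering=0 is only allowed in binary mode; it skips Python's own buffer layer.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)  # number of bytes placed in buf, 0 at end of file
            if not n:
                break
            total += n
    return total
```

Only the first n bytes of buf are valid after each call, so anything that actually processes the data should slice with buf[:n] or use a memoryview.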

The Results

So vanilla Python reads are half as fast as C's read(). Big whoop. I was expecting much worse, perhaps 5-7x slower.

At least on Windows 7, these limited benchmarks indicate an optimal C buffer size somewhere from 32K to 1M, inclusive. I'm convinced that the high read speeds below 32K for C would disappear entirely if I were doing random reads.

For Python, I'm not sure what to recommend. The highest speeds were with 4K, but that just seems too low to make sense. More research required.

The Stuff

The Excel Spreadsheet

The Code

Comments

firestormx 12 years ago

Quote:
The lack of a clear structure is more like it.
This, and the way it goes against most other programming conventions.

And I wouldn't really hate it as much if it wasn't praised so much by idiots, and if I wasn't forced to use it.

spirulence 12 years ago

See, when I look at those curly braces, I just think "Why? You've got the whitespace in there already!" [:P]

But, seriously, here's some advice which I think you can take about 75% to the bank.

Quote:
Which programming language you learn and use doesn't matter. Do not get sucked into the religion surrounding programming languages as that will only blind you to their true purpose of being your tool for doing interesting things.
http://learnpythonthehardway.org/book/advice.html

Damn me if that isn't hard advice to follow most of the time. :/

spirulence 12 years ago

Quote:
Of course, depending on which C/C++ functions you use for the IO, you will gain different results.

So on MinGW, there are 4 ways to do this in the stdlib, right?

1. open(), close(), read(), which all work on file descriptors

2. the FILE * functions

3. IOStreams

4. inline assembler

svf 12 years ago

C is and always will be the dominate language. ;)

firestormx 12 years ago

Curly braces give it structure, and make it feel less flimsy.

I personally think that quote is about 75% BS. Let me interrupt myself, and point out that I'm not trying to be a dick to you, I just despise python (the fact that I don't smile in my avatar makes my posts look serious and angry). =P

But yeah, there's a lot of reasons why one programming language is a better choice than another for certain situations. That guy is spoutin' hippie crap about all languages being equal! How dare he! *shakes fist*

spirulence 12 years ago

What I choose to think he means about programming languages is more like this:

Measure a programming language not by the people who use it, but by its efficacy at doing the job you need done.

Your post needs more flagrant one-liners to be dickish, IMO. :)

Rob 12 years ago

Quote:
C is and always will be the dominate language. ;)

Go make a website in C.