xenonflash's comments

He also picked one of the most trivial cases - unsigned addition. Signed arithmetic, especially signed multiplication, requires a lot more steps.

I'm working on a set of numerical problems in C right now that involve checking for integer overflow. The best way of doing software overflow checks depends on the larger scope of the problem. You can knit your checks into various places in your code in ways that avoid unnecessary duplication of computational effort. However, that is a lot of work and has the potential for hidden programmer-introduced errors.

The real problem is that languages like C simply don't allow you to take advantage of the overflow-detection hardware that already exists in the CPU without writing inline assembler. A standard set of "checked" math macros which handled the portability issues would probably satisfy most C applications.

Edit: For addition, subtraction, and multiplication, you just take one operand, calculate the largest possible second operand for that data type which won't overflow, and check that the actual second operand doesn't exceed it (remembering to take signs into account). For division and modulus, check for division by zero. For multiplication, division, modulus, negation, and absolute value of signed values, also check that you are not negating the most negative integer, as signed integer ranges are not symmetrical (e.g. a signed byte runs from -128 to +127).
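For example, a minimal sketch of the addition and division/negation checks for a plain int (assuming two's complement; the function names are just for illustration):

    #include <limits.h>
    #include <stdbool.h>

    /* Sketch of the "largest safe second operand" check for signed int
       addition. The comparison is done before the add, so no overflow
       (and no undefined behaviour) ever actually occurs. */
    static bool add_would_overflow(int a, int b)
    {
        if (b > 0)
            return a > INT_MAX - b;   /* largest safe a is INT_MAX - b  */
        else
            return a < INT_MIN - b;   /* smallest safe a is INT_MIN - b */
    }

    /* Division/modulus: a zero divisor, plus the one asymmetric case
       where negating the most negative integer doesn't fit. */
    static bool div_would_overflow(int a, int b)
    {
        return b == 0 || (a == INT_MIN && b == -1);
    }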

If you are looping over arrays and have multiple checks for different cases (e.g. negative, positive, etc.), then you can have different loops for the different cases and so avoid redundant checks for that data. I'm working on this sort of application, so the above works out best for that. If you're doing something a bit different, then different algorithms may make sense. Unfortunately, there's no universal one-size-fits-all solution to this problem in software.


> For addition, subtraction, and multiplication, you just take one operand, calculate the largest possible second operand for that data type which won't overflow, and check that the actual second operand doesn't exceed it

Sure, but that calculation is CPU-dependent, since it depends on how the underlying hardware represents signed integers. By definition, it is impossible to portably check for signed integer overflow in C, as I'm sure you know.

I implemented a simplistic BIGNUM library in C once (that's where I pulled that expanding multiply code in the other comment from). The only truly portable way to do that is to make your bignums sign-magnitude and use exclusively unsigned arithmetic on them. That's what I was envisioning in my original point about performance degradation due to overflow checking.
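Roughly, the representation and the magnitude addition look something like this (a simplified sketch, not my actual code):

    #include <stdint.h>
    #include <stddef.h>

    #define BN_LIMBS 8                /* fixed size just to keep the sketch short */

    /* Sign-magnitude bignum: the sign is a separate flag, and the magnitude
       is an array of unsigned "limbs", so all limb arithmetic is unsigned
       and therefore has fully defined wrap-around behaviour. */
    typedef struct {
        int      negative;            /* 0 = positive, 1 = negative */
        uint32_t limb[BN_LIMBS];      /* little-endian magnitude    */
    } bignum;

    /* Add the magnitudes of a and b into r, returning the final carry.
       Sign handling (and real size management) is layered on top of this. */
    static uint32_t bn_add_mag(bignum *r, const bignum *a, const bignum *b)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < BN_LIMBS; i++) {
            uint64_t sum = (uint64_t)a->limb[i] + b->limb[i] + carry;
            r->limb[i] = (uint32_t)sum;   /* low 32 bits become the limb */
            carry = sum >> 32;            /* high bits become the carry  */
        }
        return (uint32_t)carry;
    }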

Realistically of course, most CPUs these days are two's complement, and you can make signed overflow defined by compiling with "-fwrapv", which I would guess is what you're doing.


Yes, I'm assuming two's complement, but there's not a lot of hardware around these days that isn't two's complement. I'm writing a library for something that already assumes two's complement elsewhere in what it does.

If the code had to be portable to one's complement hardware, then I would create special cases for that type of hardware. Laying my hands on such hardware for testing would be the big problem, and if you haven't tested it, then how do you know that it works?

As for "-fwrapv", it's not portable either, and I need to cover both signed and unsigned math. It's also not compatible with what I need to link to (I've gone down this road already). I also need to cover the largest native word sizes, so the trick of using a larger word size won't work for me.

I'm only dealing with arrays of numbers though, so I can often amortize the checking calculations over many array elements instead of doing them each time. This is an example of knitting the checks into the overall algorithm instead of using a generic approach.
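For example, when adding the same constant to every element of an int array, the safe bound only has to be computed once, outside the loop, and the sign test also gives you the "different loops for different cases" split I mentioned above (a simplified sketch, assuming two's complement int data):

    #include <limits.h>
    #include <stddef.h>

    /* Add the constant c to every element of a[], with overflow checking.
       The safe bound is computed once, outside the loop, so the per-element
       cost is a single comparison. Returns 0 on success, -1 on overflow. */
    static int add_const_checked(int *a, size_t n, int c)
    {
        if (c >= 0) {
            const int hi = INT_MAX - c;       /* largest element that still fits  */
            for (size_t i = 0; i < n; i++) {
                if (a[i] > hi)
                    return -1;
                a[i] += c;
            }
        } else {
            const int lo = INT_MIN - c;       /* smallest element that still fits */
            for (size_t i = 0; i < n; i++) {
                if (a[i] < lo)
                    return -1;
                a[i] += c;
            }
        }
        return 0;
    }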

As things stand, there's currently no universal one-size-fits-all answer to this problem in most languages.

I do like how Python has handled this - integers are arbitrary precision and simply can't overflow. This comes at the expense of performance though. What this type of solution needs is an option for unchecked native arithmetic for cases where you need maximum speed.


> As for "-fwrapv" [...] It's also not compatible with what I need to link to (I've gone down this road already)

I'm actually really curious: exactly what issues did you run into with this? Intuitively I wouldn't think it would be a problem.


It's surprising there aren't at least intrinsics for some of these things in GCC/clang. At least as far as I can tell.


http://clang.llvm.org/docs/LanguageExtensions.html#builtin-f..., then search for "Checked Arithmetic Builtins".
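From my reading of that page, usage looks roughly like this (a quick sketch, so double-check the exact builtin names against the docs):

    #include <stdio.h>

    /* Each builtin performs the operation, stores the (possibly wrapped)
       result through the pointer, and returns true if the mathematical
       result didn't fit in the destination type. */
    int main(void)
    {
        int sum, prod;

        if (__builtin_sadd_overflow(2000000000, 2000000000, &sum))
            printf("signed add overflowed (wrapped to %d)\n", sum);

        if (!__builtin_smul_overflow(46000, 46000, &prod))
            printf("product fits: %d\n", prod);    /* 2116000000 */

        return 0;
    }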


Nice! Thanks for the link, I admit I didn't do as much looking on the clang side as gcc. Doesn't look like these exist in gcc though, unfortunately. :/


The Unladen Swallow project (faster Python, including an LLVM-based JIT) found that LLVM wasn't really designed as a universal JIT compiler, or at least not one that was useful for languages such as Python. That was one of the reasons why they eventually stopped work on the project (although a lot of the non-LLVM work they did was still very good and useful, so the project as a whole was not wasted).

The Pypy developers (another Python-with-JIT project) looked very, very carefully at every available JIT out there, and finally ended up writing their own. It was the only way they could get something that would do what they needed. The result was performance far superior to what the Unladen Swallow project got when using LLVM.

It sounds very simple when you look at it from a distance: "We need a JIT, here's a JIT, let's use it." Then you find out that just because something is called a "JIT" doesn't mean it will be of any use for what you are trying to do. The subject area is very complex and there will probably never be a universal solution.

What I could see the GCC JIT mode being useful for is things like generating certain critical portions of a program under programmer direction. That is, it would make a nice library that you call to generate very optimized code for specific functions or modules. A good example is how GCC is currently called in the background by a number of Python libraries which dynamically generate C code and compile it for faster execution. Being able to do this more directly via a JIT process could be very convenient. This is perhaps the sort of application that the author has in mind.
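As a crude illustration of that pattern (generate source, shell out to the compiler, load the result) - just a toy sketch with made-up file names, not how any particular library actually does it:

    /* Toy version of the "generate C, compile it, load it" pattern.
       POSIX only; build the host program itself with -ldl. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <dlfcn.h>

    int main(void)
    {
        /* 1. Generate some C source at run time. */
        FILE *src = fopen("gen_kernel.c", "w");
        if (!src) return 1;
        fprintf(src, "double kernel(double x) { return x * x + 1.0; }\n");
        fclose(src);

        /* 2. Shell out to the system compiler to build a shared library. */
        if (system("cc -O2 -shared -fPIC gen_kernel.c -o gen_kernel.so") != 0)
            return 1;

        /* 3. Load the library and look up the freshly compiled function. */
        void *lib = dlopen("./gen_kernel.so", RTLD_NOW);
        if (!lib) return 1;
        double (*kernel)(double) = (double (*)(double))dlsym(lib, "kernel");
        if (!kernel) return 1;

        printf("kernel(3.0) = %f\n", kernel(3.0));   /* prints 10.000000 */
        dlclose(lib);
        return 0;
    }

A JIT library would let you skip the temporary files and the external compiler invocation, which is what makes the idea attractive.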


I don't think it's correct to say that CPython is slow. What you can more accurately say about CPython is that its performance is highly variable. Some things are very fast, while others are comparatively slow.

The slow things tend to be the sort of numerical loops that you see in micro-benchmarks. It's no coincidence that the version of Python in the linked article saw its greatest speed up in a numerical loop, but only modest improvement elsewhere. It's exactly this sort of simple repetitive operation where interpreter overhead matters the most.

Language features that encapsulate complex functionality tend to be harder to speed up in CPython because the VM operates at a fairly high level. In effect you're just kicking off a large subroutine that is written in C, and you're really executing native code until that operation is complete. You're not going to improve very much on that no matter how much you try.

What this means is that speed will depend heavily on the type of application program being written, and also on how much the programmer takes advantage of the unique language features. It also makes realistic cross language benchmarks difficult because the right way to do something in Python may not have a direct equivalent in another language. The result tends to be "lowest common denominator" benchmarks, which are exactly the sort of algorithms which CPython does worst at.


It really is a slow interpreter. Python comes with a pystone benchmark, and CPython is invariably the slowest of all the interpreters. That being said, if you take a look at its implementation, it's immediately obvious why. The interpreter is basically a simple C switch, with no optimization whatsoever. Simply threading the interpreter would make it about a factor of 2 faster (at least, that's what the experts claim you gain by threading).


Pystone isn't a performance benchmark, or at least it isn't a useful one. It's more of a regression test to see if anything has changed between versions. It's not useful as a performance benchmark because it doesn't weight the results according to how much the individual features matter in real life. There are three versions of Python besides CPython that are in commercial use. Two are much slower than CPython (up to three times slower), and Pypy is (currently) faster in some applications and slower in others.

The CPython interpreter is not a simple switch. It uses computed gotos if you compile it with gcc. Microsoft VC doesn't have the language support needed for writing fast interpreters, so the Python source is written in a way that will default to using a switch if you compile it with MS VC. So, on every platform except for one, it's a computed goto.

Modern CPU performance is very negatively affected by branch prediction failure and cache effects. A lot of the existing literature that you may see on interpreter performance is obsolete because it doesn't take those factors into account, but rather assumes that all code paths are equal. Threading worked well with older CPUs, not so well with newer ones.

I am currently working on an interpreter that recognises a subset of Python, for use as a library in complex mathematical algorithms. As part of this I have benchmarked multiple different interpreter designs for it and also compared it to native ('C') code. It is possible to get a much faster interpreter, provided you limit it to doing very simple things repetitively. These simple things also happen to be the sorts of things which are popular with benchmark writers (because they're easy to write cross-language benchmarks for), but which CPython does not do well in.

A sub-interpreter which targets these types of problems should give improved performance in this area. Rewriting the entire Python interpreter though would probably have little value, as the characteristics of opening a file or doing set operations, or handling exceptions are entirely different from adding two numbers together.

There is no such thing as a single speed "knob" which you can crank up or down to improve performance. There are many, many, features in modern programming languages, all of which have their own characteristics. Picking out a benchmark which happens to exercise one or a few of them will tell you nothing about how a real world application will perform unless it corresponds to the actual bottlenecks in your application. For that, you need to know the application domain and the language inside and out.

One thing about Python developers is that they tend to be very pragmatic. When someone comes to them with an idea, they say "show me the numbers in a real life situation". More often than not, the theoretical advantage of the approach being espoused evaporates when subjected to that type of analysis.


http://hg.python.org/cpython/file/16fe29689f3f/Python/ceval....

looks like a switch to me.

Anyway, I've been told that the CPython interpreter is kept very simple on purpose, to allow it to function as a standard 'definition' of the language behaviour. A simple JIT does wonders, as does a less brain-dead GC. Superinstructions, threading, ... are all possible. But you're absolutely right: it's really difficult to predict how much each improvement would contribute.


Have a look at the lines starting at line 821 in the very file you referenced. I have quoted a bit of it here:

"Computed GOTOs, or the-optimization-commonly-but-improperly-known-as-"threaded code" using gcc's labels-as-values extension (...) At the time of this writing, the "threaded code" version is up to 15-20% faster than the normal "switch" version, depending on the compiler and the CPU architecture."

They also have an explanation of the branch prediction effect which I mentioned earlier.

They have both methods (switch and computed goto) since some compilers don't support computed gotos, and some people want to use alternative compilers (e.g. Microsoft VC).

In my own interpreter, I tried both switch and computed gotos, as well as another method called "replicated switch". I auto-generate the interpreter source code (using a simple script) so that I can change methods easily for comparison. In my own testing, computed gotos were about 50% faster than a simple switch, but keep in mind that this is strictly numerical-type code. More complex operations would water that down somewhat, as less of the execution time would be due to dispatch overhead.

Computed gotos aren't really any more complex than a switch once you understand the format, and as I said above you can convert between the two with a simple script. What does get complex is doing Python-level static or run-time code optimization to try to predict types or remove redundant operations from loops. CPython doesn't do that, while Pypy does, extensively. It's these types of compile-time and run-time re-compilation optimizations which make the big difference.
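To show what I mean by the format, here's a stripped-down toy example of the two dispatch styles (nothing like the real ceval.c, just an illustration of the shape):

    #include <stdio.h>

    enum { OP_INC, OP_PRINT, OP_HALT };

    /* Central switch: every opcode goes back through one dispatch branch. */
    static void run_switch(const unsigned char *code)
    {
        long acc = 0;
        for (;;) {
            switch (*code++) {
            case OP_INC:   acc++;                break;
            case OP_PRINT: printf("%ld\n", acc); break;
            case OP_HALT:  return;
            }
        }
    }

    /* Computed goto (gcc/clang labels-as-values extension): each handler
       jumps directly to the next handler, giving one indirect branch per
       opcode instead of funnelling everything through a single branch. */
    static void run_goto(const unsigned char *code)
    {
        long acc = 0;
        void *dispatch[] = { &&do_inc, &&do_print, &&do_halt };
    #define NEXT() goto *dispatch[*code++]

        NEXT();
    do_inc:   acc++;                NEXT();
    do_print: printf("%ld\n", acc); NEXT();
    do_halt:  return;
    #undef NEXT
    }

    int main(void)
    {
        const unsigned char prog[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
        run_switch(prog);   /* prints 2 */
        run_goto(prog);     /* prints 2 */
        return 0;
    }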

Overall, my interpreter is currently about 5.5 times faster than CPython with the specific simple benchmark program I tested. However, keep in mind it only does (and only ever will do) a narrow subset of the full Python language. Performance is never the result of a single technique. It's the result of many small improvements each of which address a specific problem.


So the conclusion really is: CPython is way slower than it should be. Question: if the subset is small, isn't it better to use something like Shed Skin? http://code.google.com/p/shedskin/

I once looked at it, and it does a fairly literal translation. The only problem is that it changes the semantics of the primitive types. For example, a Python integer becomes a C++ int (and the overflow semantics change).


It's also used in CPU intensive applications, but the "standard" numerical libraries (Numpy/SciPy) aren't distributed with the language standard libraries. However, they're FOSS, available pretty much anywhere Python itself is, and anyone doing numerical work with Python will almost certainly use those libraries. They're not distributed with CPython itself in order to avoid being tied to a slower release schedule.


If you don't have a lot of experience with Python, you may not know how some of the "magic" parts actually work inside. Some of the things you can do, however, will end up allocating a lot of memory with extra copies of data that you don't need. If you do that a lot, you will chew through lots of memory. There are often two or more ways of accomplishing the same thing, with one way creating duplicates of data and the other not. There are valid applications for both, and if you're only dealing with small amounts of data then the differences don't really matter.

Where a lot of people who are new to Python run into problems is that the language is deceptively easy. They try writing Python code that's simply a direct analogue of how they would write Java or C#. The resulting code will run, but it will often be slow and a lot more verbose than necessary. Very often the way you would do something in Java or C# is the worst possible way to do it in Python. Conversely, the best way to do it in Python often has no direct analogue in Java or C#. With Python, the learning curve is shallow, but it's very long, and there's lots to learn if you want to reap all the benefits.

Without knowing what your data or algorithms are, it's pretty difficult to give any sensible detailed advice. However, if you are dealing with long "lists" of numbers, perhaps what you really want is long "arrays" of numbers. Lists and arrays are not the same thing in Python.


1) The Python version has some basic newbie coding errors. This sort of code is what Python programmers call "Java written in Python". It may be a valid algorithm in Java, but it's the wrong way to do it in Python. Code like this will work, but it will be slow. Depending on the size of "queries", you are potentially allocating gobs of memory in two different places for no reason, and then throwing it away without using it. I wouldn't be surprised if the examples in other languages had similar problems.

2) The JSON serializer in Django 1.4 uses a method which is known to be very slow, but which is easily portable across different platforms and works with older versions of Python. They no doubt included it for easy bundling. In a real application you would probably want to simply use the normal JSON serializer from the standard library (which is many times faster).

3) The examples are little more than "hello world". I did some benchmark tests with several Python async frameworks, Pypy, and Node.js for an application I was working on. With small JSON objects there wasn't much difference in performance. Once you started using large JSON objects the performance lines for all versions were indistinguishable from each other. The performance bottlenecks were in libraries, and those standard libraries were all written in 'C', so interpreter versus compiler versus JIT made little difference.

4) The problem with "toy" examples is that in real life there are two performance factors which must be taken into account. Think of it as y = mx + b. With a toy example you are probably only measuring "b" (the fixed overhead). With most real-life applications it's "m" (the per-item cost) that matters. There are often different optimization approaches that are best for varying ratios of "b" and "m". You have to know your application intimately and benchmark using data which is realistic for that application.

Python has a reputation for being "easy to learn". However, it is "easy" in the sense of being able to hack something together that works without knowing very much. There can be several different ways of doing things and doing it one way versus another way can mean a difference in performance of several orders of magnitude. The same may be true for some of the other languages, but I haven't examined them in enough detail to say.

