In my experience, almost all overhead in Go programs comes from being bitten by memmove, alloc, and GC. Profiling any program that uses interface{} and byte buffers quickly shows how much the runtime loves to reallocate and copy objects. In one instance I got a 10x throughput improvement by switching from a generic interface (take an interface{} and detect the type) to typed methods, hand-rolling a shim that called the appropriate one.
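A minimal sketch of the two shapes being compared (the names here are illustrative, not from the original code): the generic path copies its argument into an interface{} (which can allocate) and pays for a type switch on every call, while the typed path does neither.

```go
package main

import "fmt"

// Generic path: takes interface{} and detects the type at runtime.
// Each call may box its argument, and the runtime copies the value
// into the interface before the type switch even runs.
func addGeneric(v interface{}) int {
	switch x := v.(type) {
	case int:
		return x + 1
	case int64:
		return int(x) + 1
	default:
		return 0
	}
}

// Typed path: concrete argument, no boxing, no type switch,
// and trivially inlinable.
func addTyped(v int) int { return v + 1 }

func main() {
	fmt.Println(addGeneric(41), addTyped(41))
}
```

A shim in the sense described above would just be a thin dispatcher written once at the call-site boundary, so the hot loop only ever calls the typed version.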
Do people actually use `interface{}` so liberally? I've never encountered it in the wild (except by newbies or as an example of what not to do) and I've always been under the impression that it's been heavily discouraged for a very long time.
interface{} is heavily used in many of the Go std libraries. Particularly for what I call "convenience methods" where you want to be able to pass in various types and it will do a type switch for you and then determine what to do.
Part of what I was discovering in this blog post is that if you look under the hood at these convenience methods, and you know in advance what type you are passing (i.e., you don't need interface{}), you can often find the "direct" call that uses a concrete type and likely get better performance out of it.
While very neat, in my experience this has two issues (my experience is with Python's line_profiler, so pprof may not have the same problems):
1. it adds significant overhead to execution speed, which means savings measured under line_profiler and savings without it may only be distantly related; I'm not sure how much overhead pprof adds
2. it requires knowing in advance which functions to line-profile, because when you have thousands or millions of LOC, you've got no idea where to start
I've never found "usual" whole-program profilers to be great at the latter. It may just be that I'm bad at reading them, but they never really click, and when you've got a few "leaf functions" called from basically everywhere, they end up having a low SNR. Recently, however, I've started using sampling profilers and flamegraph representations[0] (or sunbursts, but there's no standard tool for those) and found them a significantly superior way of identifying bottlenecks, with very high SNR.
"While very neat in my experience this has two issues (using Python's line_profiler so it may not have the same problems)... Recently however I've started using sampling profilers"
pprof is a sampling profiler. You can get it from Google for C, too. I'm not sure whether Go "ported" it or just wrote something with the same ideas; I don't know precisely because it hardly matters.
> Recently however I've started using sampling profilers and flamegraph representations[0]
Maybe you should take a closer look at what pprof does and not draw conclusions only from Python's line_profiler. Go's profiler doesn't have flamegraphs, but it has directed graphs. Here's an example:
As you can see, it is very easy to see the bottlenecks. In fact the first thing I always do when profiling is outputting this svg to have a good first view of what's happening.
pprof is a "global" whole-program profiler, so you don't need to care about which line to profile or not. My only limitation for the moment is that you can't dive into cgo code.
> Maybe you should take a closer look at what pprof does and not draw conclusions only from Python's line_profiler. Go's profiler doesn't have flamegraphs, but it has directed graphs.
Python profilers have directed-graph representations; they are little better than the normal textual output.
> As you can see, it is very easy to see the bottlenecks.
With trivial programs, the most terrible representations are no issue. Cycles in callstacks (partial mutual recursion) or "hot nodes" (dispatcher code) completely break directed graph output, but not flamegraphs.
pprof the tool is really just a visualization tool more than a profiler itself[0]. But yes, enabling it will enable it for the entire program, so you will see a massive performance hit when it's enabled.
I'm the blog post author. I haven't disabled comments (I welcome them) and I don't see any pending ones that I need to approve. I'm not super happy with Blogger, so I'm going to blame it on that. There is one other comment on this post, so the problem probably has to do with the "Comment As" choice and whether you are logged in to that service.
I tried posting before authenticating with my Google account; it sent me to Google to log in, I did, I returned, and the comment was nowhere, but I was logged in this time. So I went ahead and commented again; it didn't appear either. Weird.
Btw, I read somewhere that it's best to use N-1 CPUs for actual work and spare the last one for the scheduler, so that goroutines will not be reassigned to different threads (and thus CPUs) all the time, because that means lots of memory movement. I suspect that was your 9th goroutine.