[Stk] problems implementing polyphony...

Tue Sep 18 15:37:15 PDT 2012

This is difficult.  I have been playing with the code and gcc options
and gprof, and it seems there is no specific bottleneck.  HevyMetl is
about 15% the speed of Clarinet.  I managed to get it up to about
the same speed, but I had to do several things:

- moved several functions into their headers for inlining
(FileLoop::setRate, FileWvIn::tick, etc.; anything that is called
from HevyMetl::tick)

- set some gcc options to force inlining of as much as possible,
e.g. -Winline --param inline-unit-growth=65536 -finline-limit=65536

- used link-time optimisation available in gcc 4.5 and up  (-flto on all code)

- set -ffast-math

Even then it is not quite as fast.  I found fmod used in FileLoop was
a bit of a bottleneck.
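
To illustrate the fmod point, here is a rough sketch of one common way to
avoid it when wrapping a loop position.  This is not the actual FileLoop
code; wrapWithFmod and wrapWithLoop are hypothetical names, and the sketch
assumes length > 0 and that the position moves by less than one loop length
per sample, so each while loop runs at most once.

```cpp
#include <cmath>

// Hypothetical helpers, not part of STK.
// Both wrap a read position into the range [0, length), assuming length > 0.

// Version using fmod: one library call per sample.
inline double wrapWithFmod(double time, double length) {
    double t = std::fmod(time, length);   // result has the sign of 'time'
    return (t < 0.0) ? t + length : t;    // shift negative results up
}

// Version using compare-and-add/subtract: cheap when 'time' is only
// slightly out of range, which is the assumed per-sample case.
inline double wrapWithLoop(double time, double length) {
    while (time < 0.0)     time += length;
    while (time >= length) time -= length;
    return time;
}
```

Whether this is actually faster depends on the compiler and the target, so
it would need to be measured rather than assumed.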

In general I find it pretty surprising that gcc doesn't succeed in
speeding this up more, but FileLoop seems to be a bit of a problem for
reasons that aren't clear to me. I sprinkled the code with checks for
denormals and came up empty.  I checked the assembler and used gprof
and -Winline to make sure inlining was working as expected.
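
For anyone wanting to repeat the denormal check, something along these
lines should do.  It is only a sketch: countDenormals is a hypothetical
debugging helper, not an STK function, and it relies on std::fpclassify
from C++11.  It is probably best run in a build without -ffast-math, since
that option can change how denormals are handled.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical debugging helper, not part of STK.
// Counts how many samples in a buffer are subnormal (denormal) values.
inline std::size_t countDenormals(const double* buf, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fpclassify(buf[i]) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

Calling this on the output buffer after each block, and printing when the
count is non-zero, is one way to "sprinkle" the check through the code.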

Oh, I should mention this was on my fast desktop computer, not an ARM
tablet, so proper profiling on the target hardware may be warranted.

In any case, Morgan's right in that the vectorised versions are
probably better to use "in production," since the whole "inline" thing
in C/C++ is not supposed to be fully relied on for efficiency.  (e.g.
the compiler might choose not to inline due to code size rather than
speed.)  The per-sample tick functions are however important for
certain algorithms, and generally useful as a teaching tool, so their
presence in STK is desirable.  That said, usually a vectorised
approach is preferred in application code.
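
To make the distinction concrete, here is a minimal sketch of the two
styles on a toy one-pole filter.  OnePoleSketch is a hypothetical class,
not STK's OnePole, and it uses std::vector where STK would use StkFrames.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical one-pole lowpass, for illustration only (not an STK class).
class OnePoleSketch {
public:
    explicit OnePoleSketch(double coeff) : a_(coeff), y_(0.0) {}

    // Per-sample version: one call per sample.  Only cheap if the
    // compiler inlines it into the caller's loop.
    double tick(double x) {
        y_ = (1.0 - a_) * x + a_ * y_;
        return y_;
    }

    // Block ("vectorised") version: one call per buffer.  The loop lives
    // inside the function, so speed does not depend on inlining.
    void tick(std::vector<double>& buf) {
        double y = y_;  // keep the state in a local during the loop
        for (std::size_t i = 0; i < buf.size(); ++i) {
            y = (1.0 - a_) * buf[i] + a_ * y;
            buf[i] = y;
        }
        y_ = y;
    }

private:
    double a_;  // feedback coefficient
    double y_;  // previous output
};
```

Both versions compute the same thing; the difference is only in how many
function calls the caller pays for if inlining does not happen.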

Just a couple of notes...

On Tue, Sep 18, 2012 at 11:41 AM, Morgan Packard
<morgan at morganpackard.com> wrote:
> I did some of my own experimentation, which seems to point to method calls
> themselves (even with all of the calculation inside them commented out)
> being responsible for much of the cpu use of HevyMetl.
>
> I've been using STK all along with the assumption that all calls to tick()
> without frames, or any other per-sample function call was going to be
> significantly less efficient than operating on buffers. I'm aware of the
> existence of inlining, but not savvy enough to understand if it's happening
> or not, and under what conditions it can happen. It seems suspicious to me
> to think that inlining could happen on pointers. I mean, if you have a
> pointer to an stk::Generator, and you call tick() on it, I don't see how the
> compiler could know ahead of time which subclass of Generator it should be
> inlining.

That's true, but most STK code uses the final child class, not the
superclass, so the compiler should have all the information it needs.
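
A small sketch of the difference, using made-up classes in place of
stk::Generator and a concrete subclass (none of these names are from STK).
When the object is held by value, its exact type is known at compile time,
so the compiler can resolve the call statically and is free to inline it.
Through a base-class pointer it generally has to go through the vtable.

```cpp
// Hypothetical stand-ins for a generator base class and a concrete subclass.
struct GeneratorSketch {
    virtual ~GeneratorSketch() {}
    virtual double tick() = 0;
};

struct RampSketch : GeneratorSketch {
    double value = 0.0;
    double tick() override { value += 0.25; return value; }
};

// Call through a base pointer: the target of tick() is generally not
// known at compile time, so this is a virtual call on every sample.
inline double sumViaBase(GeneratorSketch* g, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += g->tick();
    return sum;
}

// The concrete object is a member held by value, so its dynamic type is
// known exactly and the compiler may devirtualize and inline osc.tick().
struct VoiceSketch {
    RampSketch osc;
    double tick() { return osc.tick(); }
};

inline double sumDirect(VoiceSketch& v, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += v.tick();
    return sum;
}
```

Whether the inlining actually happens is still up to the compiler, which
is why checking with -Winline or the assembler output is worthwhile.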

> I'd love to find out that all my meticulous buffer passing in order to get
> reasonably performant code is unnecessary, but until I understand otherwise,
> or better, I'm working with the assumption that method calls are expensive
> and best to minimize. Another thing I like about calculating samples in
> batches/buffers/stkframes is it allows me to use Apple's accelerate
> framework, which offers some very nice performance boosts.
>
> However, I certainly trust Gary's assertion that this HevyMetl ran just fine
> on 90's machines, and I'm very curious about what has changed. Has the code
> itself changed, breaking inlining? Is there something about method calls on
> the Apple hardware that makes them much more expensive than on Gary's 90's
> hardware?

I think one experiment would be to compare run-times of
HevyMetl::tick() for previous versions of STK.  My meticulous building
of a git archive of the previous tarballs might finally pay off!

https://github.com/radarsat1/stk/commits/upstream

If a really old version does turn out to be faster, I was thinking
maybe a well-crafted "git bisect" command might help get to the bottom
of this.

Steve