Figured out the weird non-linear timing problem at work ... it turns out that the Qualcomm Snapdragon Krait, the ARM CPU in the Z10 and most high end smartphones from 2012/2013, has weird non-linear latencies on arithmetic instructions. The OpenSSL guys found the same thing (see http://www.openssl.org/~appro/Snapdragon-S4.html): sustained ILP throughput is great, hence the chip's dominance in mobile phones, but only if you don't chain register dependencies across sequential instructions, because the Krait has an up-to-two-cycle non-deterministic instruction latency (OpenSSL measured a 1.8 cycle average latency; I am seeing about a 1.5 cycle average in my code, but the latency jumps all over the place randomly). That's plain weird for an out-of-order architecture, and not at all like any other ARM or Intel chip I've programmed against, where adding two numbers is always a one cycle op, or, on in-order architectures like Intel Atom, always a two cycle op. So there you go!
Today in my after hours side project I fired up the unit tests on three weeks of brand new heavily templated code implementing a massively parallel, batch asynchronous file i/o engine (i.e. lots of threads and execution dependency ordering graphs; very complex), fully expecting bugs and segfaults galore. The damn thing ran perfectly first time on Linux with a clean valgrind, with one tiny bug on Windows because it can't open directories as files (well, it can actually, just not via MSVCRT's POSIX open() implementation). I felt quite giddy actually: I can count on two hands the times in my life when twenty hours worth of brand new code has just worked first time. Très cool ... :) ... I can't take the credit though. Writing in C++11 and early C++14 (via Boost) is the cause: the templates get the compiler to trap most of your bugs at compile time, so when it does finally compile, it just works. Hallelujah!