Wednesday 24th April 2013: 11.20pm. Link shared: https://plus.google.com/109885711759115445224/posts/8WLubGBm1ma
Finally got bleeding-edge clang (3.3 trunk) running on ARM hf (it took quite a few rounds of trial and error with the build config). I had hoped that clang 3.3's NEON intrinsics implementation would be much superior and would super-optimise my carefully written 4-SHA256 NEON implementation, but ...
Niall's nasty 256 bit hash does 8.86986 cycles/byte
Reference SHA-256 hash does 35.9639 cycles/byte
Batch SHA-256 hash does 18.9743 cycles/byte
... which is 47.2408% faster than the straight SHA-256.
Compared to GCC 4.8 on ARM hf:
Niall's nasty 256 bit hash does 3.90832 cycles/byte
Reference SHA-256 hash does 23.2428 cycles/byte
Batch SHA-256 hash does 16.3944 cycles/byte
... which is 29.4649% faster than the straight SHA-256.
So clang 3.3 on ARM hf loses far less ground to GCC 4.8 on NEON code (about 16% more cycles/byte for batch SHA-256) than on scalar code :) But as for reference SHA-256 - which is bread and butter optimisation - needing about 55% more cycles/byte than GCC is an awful regression. Oh well.
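For anyone who wants to check the arithmetic, here is a minimal Python sketch using the cycles/byte figures from the two runs above. The function names are mine, made up for illustration, and not taken from the benchmark program. I am assuming "faster" means the relative reduction in cycles/byte, and that the compiler comparison is the extra cycles/byte clang needs over GCC.

```python
# Cycles/byte figures copied from the two benchmark runs in this post.
clang = {"reference": 35.9639, "batch": 18.9743}
gcc = {"reference": 23.2428, "batch": 16.3944}


def percent_faster(reference_cpb, batch_cpb):
    """Relative reduction in cycles/byte of batch versus reference, in percent.

    Assumption: this is how the benchmark's "% faster" line is derived.
    """
    return (reference_cpb - batch_cpb) / reference_cpb * 100.0


def percent_more_cycles(slower_cpb, faster_cpb):
    """Extra cycles/byte the slower build needs over the faster one, in percent."""
    return (slower_cpb / faster_cpb - 1.0) * 100.0


# Batch (NEON) versus reference (scalar) within each compiler.
clang_gain = percent_faster(clang["reference"], clang["batch"])
gcc_gain = percent_faster(gcc["reference"], gcc["batch"])

# clang 3.3 versus GCC 4.8 on the same code.
reference_regression = percent_more_cycles(clang["reference"], gcc["reference"])
batch_regression = percent_more_cycles(clang["batch"], gcc["batch"])

print(f"clang: batch is {clang_gain:.2f}% faster than reference")
print(f"gcc:   batch is {gcc_gain:.2f}% faster than reference")
print(f"reference SHA-256: clang needs {reference_regression:.1f}% more cycles/byte")
print(f"batch SHA-256:     clang needs {batch_regression:.1f}% more cycles/byte")
```

The last digits can differ slightly from the benchmark's own output, because the figures typed in here are already rounded.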