Re: Optimising csum_fold()
From: Willy Tarreau
Date: Tue Nov 22 2022 - 11:25:42 EST
On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> There are currently 20 copies of csum_fold(), some in C some in assembler.
> The default C version (in asm-generic/checksum.h) is pretty horrid.
> Some of the asm versions (including x86 and x86-64) aren't much better.
>
> There are 3 pretty good C versions:
> 1: (~sum - rol32(sum, 16)) >> 16
> 2: ~(sum + rol32(sum, 16)) >> 16
> 3: (u16)~((sum + rol32(sum, 16)) >> 16)
> All three are (usually) 4 arithmetic instructions.
>
> The first two have the advantage that the high bits are zero.
> Relevant when the value is being checked rather than set.
>
> The first one can generate better instruction scheduling (the rotate
> and invert can be executed in the same clock).
>
> The 3rd one saves an instruction on arm, but may need masking.
> (I've not compiled an arm kernel to see how often that happens.)
>
> The only architectures where (I think) the current asm code is better
> than the C above are sparc and sparc64.
> Sparc doesn't have a rotate instruction, but does have a carry flag.
> This makes the current asm version one instruction shorter.
>
> For architectures like mips and risc-v which have neither rotate
> instructions nor carry flags the C is as good as the current asm.
> The rotate is 3 instructions - the same as the extra cmp+add.
>
> Changing everything to use [1] would improve quite a few architectures
> while only adding 1 clock to some paths in arm/arm64 and sparc.
>
> Unfortunately it is all currently a mess.
> Most architectures don't include asm-generic/checksum.h at all.
>
> Thoughts?
Then why not just have one version per arch, the most efficient one,
and use it everywhere ? The simple fact that we're discussing the
tradeoffs means that if we don't want to compromise performance here
(which I assume to be the case), then it needs to be per-arch and
that's all. At least that's the way I understand it.
Regards,
Willy