Optimising core transformation in a block cipher

I'm trying to optimise the core transformation in a block cipher, which looks
as follows:

DWORD K[ 32 ];        /* Treated as K[ 2 ][ 16 ] */
DWORD S1[ 256 ], S2[ 256 ], S3[ 256 ], S4[ 256 ]; /* 4K total */

tmp = K[ n ] + source;
tmp = rol( tmp, K[ n + 16 ] );
dest ^= S1[ tmp >> 24 ] op S2[ tmp >> 16 ] op S3[ tmp >> 8 ] op S4[ tmp ];

rol = rotate left, op = one of (+, -, xor), mask of S[] index to 8 bits not
shown.

People who read my sci.crypt posts will know what the algorithm is - I'll post
the code there when it's finished.

This transformation is somewhat nasty because almost everything depends on the
previous operation.  In particular the S[] series of operations is:

\ /
 \ /
  \ /

where:

\ /  \ /
  \  /
   \/

would be somewhat more RISC CPU-friendly because the first two ops are
independent.  Anyway, assuming eax and ebx are used for source and dest and
that esi and edi are initially 0, the best I could get was the following:

; Basic block 1
mov edx, K[ n ]         ; U (pair the slow data reads)
mov ecx, K[ n + 16 ]    ;  V, read as 32-bit value to avoid partial stall
add edx, source         ; U
rol edx, cl             ; U, NP
; Basic block 2
mov ecx, edx            ; U
mov edi, dh             ;  V (is this pairable with the edx read?)
shr ecx, 16             ; U, PU
and edx, 0FFh           ;  V
mov esi, ch             ; U
and ecx, 0FFh           ;  V
mov ebp, S1[ esi * 4 ]  ; U, P5 = AGI, +1
op1 ebp, S2[ ecx * 4 ]  ; U
op2 ebp, S3[ edi * 4 ]  ; U
op3 ebp, S4[ edx * 4 ]  ; U
xor dest, ebp           ; U

(I hope that follows the C code; I'm just trying out ideas at the moment and
haven't run it yet).  This *looks* like it could execute in 11+5 + 1 AGI = 17
clocks + cache miss time (that is, 11 clocks + 5 extra for memory references +
1 extra for the AGI on a P5 + memory access time).  However I'm not sure
whether simultaneous reading from edx and dh is possible.  It's kind of
critical in this case though, because you need to push the address calculation
up as far as possible to avoid AGIs later on... there are various other ways
of doing this, but all of them have more AGIs or end up with nasty
non-pairable instructions in awkward places.

Peter.

Sat, 28 Aug 1999 03:00:00 GMT

Quote:

> I'm trying to optimise the core transformation in a block cipher, which looks
> as follows:

>   DWORD K[ 32 ];        /* Treated as K[ 2 ][ 16 ] */
>   DWORD S1[ 256 ], S2[ 256 ], S3[ 256 ], S4[ 256 ]; /* 4K total */

>   tmp = K[ n ] + source;
>   tmp = rol( tmp, K[ n + 16 ] );
>   dest ^= S1[ tmp >> 24 ] op S2[ tmp >> 16 ] op S3[ tmp >> 8 ] op S4[ tmp ];

> rol = rotate left, op = one of (+, -, xor), mask of S[] index to 8 bits not
> shown.

Your description is a bit hazy (inconsistent C and ASM snippets &
insufficient information about variables), but I'll give it a shot...

mov     edx,K[ n ]
mov     ecx,[source]    ; Assuming that source is a memory reference!!!
mov     cl,K[ n + 16]   ; No need to avoid partial stalls

rol     edx,cl          ; A bad, bad instruction ;)

xor     eax,eax
xor     ecx,ecx
mov     al,dl           ; EAX = (tmp >> 8) & 0xFF
mov     cl,dh           ; ECX = tmp & 0xFF
shr     edx,16
xor     ebx,ebx
mov     bl,dh           ; EBX = tmp >> 24
and     edx,0FFh        ; EDX = (tmp >> 16) & 0xFF

mov     eax,S3[eax * 4] ; Avoids linear dependency
mov     ecx,S4[ecx * 4]

op      eax,S1[ebx * 4] ; Avoids imperfect pairing
op      ecx,S2[edx * 4]

op      eax,ecx
mov     ecx,[dest]      ; Assuming that dest is a memory reference!!!
xor     eax,ecx

mov     [dest],eax

It should take about 17 clock cycles on Pentium. The clock
cycle calculation on the asm snippet you provided was, in fact,
more than optimistic ;-)

Now, you didn't specify any loop overhead, etc... So this is as
far as I can optimize it without further description of the
algorithm.

--
==> Vesa Karvonen

An optimizing programmer can always beat a C programmer.

Sat, 28 Aug 1999 03:00:00 GMT

Quote:

> > I'm trying to optimise the core transformation in a block cipher, which looks
> > as follows:

> >   DWORD K[ 32 ];        /* Treated as K[ 2 ][ 16 ] */
> >   DWORD S1[ 256 ], S2[ 256 ], S3[ 256 ], S4[ 256 ]; /* 4K total */

> >   tmp = K[ n ] + source;
> >   tmp = rol( tmp, K[ n + 16 ] );
> >   dest ^= S1[ tmp >> 24 ] op S2[ tmp >> 16 ] op S3[ tmp >> 8 ] op S4[ tmp ];

> > rol = rotate left, op = one of (+, -, xor), mask of S[] index to 8 bits not
> > shown.

> Your description is a bit hazy (inconsistent C and ASM snippets &
> insufficient information about variables), but I'll give it a shot...

Vesa, that shot was a bullseye!

Quote:
>         mov     edx,K[ n ]
>         mov     ecx,[source]    ; Assuming that source is a memory reference!!!
>         mov     cl,K[ n + 16]   ; No need to avoid partial stalls

>         rol     edx,cl          ; A bad, bad instruction ;)

If I remember correctly, ROL reg,CL uses 3 non-pairable cycles, which is
indeed _very_ bad, but there's no real alternative for a variable shift.

Quote:

>         xor     eax,eax
>         xor     ecx,ecx
>         mov     al,dl           ; EAX = (tmp >> 8) & 0xFF
>         mov     cl,dh           ; ECX = tmp & 0xFF

Swap the comments: EAX = tmp & 0xff, ECX = (tmp >> 8) & 0xff

Quote:
>         shr     edx,16
>         xor     ebx,ebx
>         mov     bl,dh           ; EBX = tmp >> 24
>         and     edx,0FFh        ; EDX = (tmp >> 16) & 0xFF

On a PPro (or 386!) you can speed this up slightly (1 cycle) by using
MOVZX, but that would be horrible on 486 and Pentium CPUs:

movzx eax,dl
movzx ecx,dh

shr edx,16
; Doing XOR EBX,EBX here is free because it would pair with the SHR.

movzx ebx,dh
and edx,0ffh

Quote:
>         mov     eax,S3[eax * 4] ; Avoids linear dependency
>         mov     ecx,S4[ecx * 4]

>         op      eax,S1[ebx * 4] ; Avoids imperfect pairing
>         op      ecx,S2[edx * 4]

>         op      eax,ecx
>         mov     ecx,[dest]      ; Assuming that dest is a memory reference!!!
>         xor     eax,ecx

>         mov     [dest],eax

The last two instructions can probably be overlapped, either with the
loop control code, or with the first two loads of the second iteration.

Quote:
> It should take about 17 clock cycles on Pentium. The clock
> cycle calculation on the asm snippet you provided was, in fact,
> more than optimistic ;-)

Indeed! :-)

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Sun, 29 Aug 1999 03:00:00 GMT

Quote:

>>I'm trying to optimise the core transformation in a block cipher, which looks
>>as follows:

>>  DWORD K[ 32 ];        /* Treated as K[ 2 ][ 16 ] */
>>  DWORD S1[ 256 ], S2[ 256 ], S3[ 256 ], S4[ 256 ]; /* 4K total */

>>  tmp = K[ n ] + source;
>>  tmp = rol( tmp, K[ n + 16 ] );
>>  dest ^= S1[ tmp >> 24 ] op S2[ tmp >> 16 ] op S3[ tmp >> 8 ] op S4[ tmp ];

>>rol = rotate left, op = one of (+, -, xor), mask of S[] index to 8 bits not
>>shown.

>Your description is a bit hazy (inconsistent C and ASM snippets &
>insufficient information about variables),

Sorry about that, I tried to transcribe the C code (which makes extensive use
of macros) into a minimal representation.  The full version (which is a lot
harder to follow) is included below.  The different 'op' operations are handled
by defining multiple macros (f1(), f2(), f3(), round1(), round2(), round3())
which perform the appropriate operation.

-- Snip --

/* Macros to extract 8-bit values a, b, c, d from a 32-bit value.  The cast
is necessary because some compilers prefer ints as array indices */

#define exta(x)     ( ( int ) ( ( x >> 24 ) & 0xFF ) )
#define extb(x)     ( ( int ) ( ( x >> 16 ) & 0xFF ) )
#define extc(x)     ( ( int ) ( ( x >> 8 ) & 0xFF ) )
#define extd(x)     ( ( int ) ( ( x ) & 0xFF ) )

/* The f-functions */

#define f1( data )  \
    ( ( ( S1[ exta( data ) ] ^ S2[ extb( data ) ] ) - \
        S3[ extc( data ) ] ) + S4[ extd( data ) ] )

#define f2( data )  \
    ( ( ( S1[ exta( data ) ] - S2[ extb( data ) ] ) + \
        S3[ extc( data ) ] ) ^ S4[ extd( data ) ] )

#define f3( data )  \
    ( ( ( S1[ exta( data ) ] + S2[ extb( data ) ] ) ^ \
        S3[ extc( data ) ] ) - S4[ extd( data ) ] )

/* The individual encrypt/decrypt rounds */

#define round1( count, output, input )  \
    tmp = MASK32( K[ count - 1 ] + input ); \
    tmp = ROTL( tmp, K[ count + 15 ] ); \
    output ^= f1( tmp );

#define round2( count, output, input )  \
    tmp = MASK32( K[ count - 1 ] ^ input ); \
    tmp = ROTL( tmp, K[ count + 15 ] ); \
    output ^= f2( tmp );

#define round3( count, output, input )  \
    tmp = MASK32( K[ count - 1 ] - input ); \
    tmp = ROTL( tmp, K[ count + 15 ] ); \
    output ^= f3( tmp );

/* The encrypt function */

[L, R = 32-bit data values]

/* Perform 16 rounds of encryption */
round1(  1, L, R );
round2(  2, R, L );
round3(  3, L, R );
round1(  4, R, L );
round2(  5, L, R );
round3(  6, R, L );
round1(  7, L, R );
round2(  8, R, L );
round3(  9, L, R );
round1( 10, R, L );
round2( 11, L, R );
round3( 12, R, L );
round1( 13, L, R );
round2( 14, R, L );
round3( 15, L, R );
round1( 16, R, L );

-- Snip --

Quote:
>but I'll give it a shot...

>       mov     edx,K[ n ]
>       mov     ecx,[source]    ; Assuming that source is a memory reference!!!

Actually source and dest are registers - I think I mentioned that they were eax
and ebx in the original.  It doesn't matter anyway, you can just use esi and
edi, because your version doesn't use these registers.

Quote:
>       mov     cl,K[ n + 16]   ; No need to avoid partial stalls

OK, I've reread the Intel docs and can see that this wasn't a problem anyway -
it's only if you read a full register after writing a partial register that you
run into problems.

Quote:
>       rol     edx,cl          ; A bad, bad instruction ;)

Data-dependent rotates are gaining popularity in encryption algorithms; they
have some very nice properties which make certain types of cryptanalysis
difficult.  One algorithm, RC5, even makes this a (patented) feature of the
algorithm - it consists of almost nothing but rotates and XORs.  Terje
mentioned that each rotate uses 3 non-pairable cycles, which is probably
something the RC5 designer didn't anticipate.


Quote:
>       xor     eax,eax
>       xor     ecx,ecx
>       mov     al,dl           ; EAX = (tmp >> 8) & 0xFF
>       mov     cl,dh           ; ECX = tmp & 0xFF
>       shr     edx,16
>       xor     ebx,ebx
>       mov     bl,dh           ; EBX = tmp >> 24
>       and     edx,0FFh        ; EDX = (tmp >> 16) & 0xFF

>       mov     eax,S3[eax * 4] ; Avoids linear dependency
>       mov     ecx,S4[ecx * 4]

>       op      eax,S1[ebx * 4] ; Avoids imperfect pairing
>       op      ecx,S2[edx * 4]

>       op      eax,ecx
>       mov     ecx,[dest]      ; Assuming that dest is a memory reference!!!
>       xor     eax,ecx

>       mov     [dest],eax

Neat.  However there's a problem due to the somewhat ambiguous original
explanation: when I said "op = one of (+, -, xor)" what I meant was that one
occurrence of 'op' was '+', one was '-', and one was 'xor', i.e. the expression
is non-associative.  For example in one round you might have 'S1 + S2 xor S3 -
S4', and you can't swap the order of evaluation around.  This is why I had the
succession of operations with nasty linear dependencies at the end of the
original code.

Quote:
>It should take about 17 clock cycles on Pentium. The clock cycle calculation
>on the asm snippet you provided was, in fact, more than optimistic ;-)

Oops.  The last x86 programming I did was on a 386, I assumed that on the P5
pretty much everything (including rotates) executed in one clock.

Quote:
>Now, you didn't specify any loop overhead, etc... So this is as far as I can
>optimize it without further description of the algorithm.

It's traditional to unroll the loops for these things; I don't think there's
much to be gained by trying to do it in a loop, especially since the round
functions vary from round to round.  The code will be executed a great many
times in succession, so it'll all be in the I-cache after the first time.

Quote:
>>        mov     al,dl           ; EAX = (tmp >> 8) & 0xFF
>>        mov     cl,dh           ; ECX = tmp & 0xFF

>Swap the comments: EAX = tmp & 0xff, ECX = (tmp >> 8) & 0xff

You actually need to swap the source registers.  S3 is indexed using
( tmp >> 8 ) & 0xFF, S4 is indexed using tmp & 0xFF, so it should be:

mov al,dh           ; EAX = (tmp >> 8) & 0xFF
mov cl,dl           ; ECX = tmp & 0xFF

because later on you have:

mov eax,S3[eax * 4] ; Avoids linear dependency
mov ecx,S4[ecx * 4]

Peter.

Mon, 30 Aug 1999 03:00:00 GMT

Quote:

[snip]
> >Your description is a bit hazy (inconsistent C and ASM snippets &
> >insufficient information about variables),

> Sorry about that, I tried to transcribe the C code (which makes extensive use
> of macros) into a minimal representation.  The full version (which is a lot
> harder to follow) is included below.  The different 'op' operations are handled
> by defining multiple macros (f1(), f2(), f3(), round1(), round2(), round3())
> which perform the appropriate operation.

You are not the only one to blame. I should have read the
description much more carefully. I have a habit of _trying_
to read the messages as quickly as possible. I seem to have
missed a couple of important bits.

[snip - the C version]

Quote:
> #define round1( count, output, input )  \
>     tmp = MASK32( K[ count - 1 ] + input ); \
>     tmp = ROTL( tmp, K[ count + 15 ] ); \
>     output ^= f1( tmp );

The MASK32() macro(?) is missing? I am assuming
that it basically does nothing on a 32-bit machine.

[snip - the C version]

Quote:
> >       mov     edx,K[ n ]
> >       mov     ecx,[source]    ; Assuming that source is a memory reference!!!

> Actually source and dest are registers - I think I mentioned that they were eax
> and ebx in the original.  It doesn't matter anyway, you can just use esi and
> edi, because your version doesn't use these registers.

Ok, that, however, changes some of the pairing opportunities.

[snip]

Quote:
> >       rol     edx,cl          ; A bad, bad instruction ;)

> Data-dependent rotates are gaining popularity in encryption algorithms; they
> have some very nice properties which make certain types of cryptanalysis
> difficult.  One algorithm, RC5, even makes this a (patented) feature of the
> algorithm - it consists of almost nothing but rotates and XORs.  Terje
> mentioned that each rotate uses 3 non-pairable cycles, which is probably
> something the RC5 designer didn't anticipate.

It's actually 4-5 non-pairable cycles according to the
references that I have, but nevertheless it is a slow
instruction. Unfortunately it is very difficult to
avoid using the instruction, except when the rotate
count is either really constant or changes very
infrequently and is constrained to a small range.

On the PPro, shift and rotate instructions have been
optimized (according to references that I have), so the
code will execute (relatively) slightly faster on a PPro.

[snip]

Quote:
> Neat.  However there's a problem due to the somewhat ambiguous original
> explanation: when I said "op = one of (+, -, xor)" what I meant was that one
> occurrence of 'op' was '+', one was '-', and one was 'xor', i.e. the expression
> is non-associative.  For example in one round you might have 'S1 + S2 xor S3 -
> S4', and you can't swap the order of evaluation around.  This is why I had the
> succession of operations with nasty linear dependencies at the end of the
> original code.

That's a minor problem (for a two pipe machine). Basically
you then just have to interleave as many of the loads with
op's as possible without breaking the dependencies.

Quote:
> >Now, you didn't specify any loop overhead, etc... So this is as far as I can
> >optimize it without further description of the algorithm.

> It's traditional to unroll the loops for these things; I don't think there's
> much to be gained by trying to do it in a loop, especially since the round
> functions vary from round to round.  The code will be executed a great many
> times in succession, so it'll all be in the I-cache after the first time.

You are right. It should fit into the instruction cache
quite happily.

Ok, I'll have another try... This time I'll be using symbols
instead of registers, so that you can use the code more easily,
should you choose to do so. Beware of any bugs, they tend to
appear frequently when optimizing code while online.

; Variables:
src     = esi           ; source
dst     = edi           ; dest

; Counter register for shifts
cnt     = ecx
cntl    = cl
cnth    = ch

; Arbitrary "x" register
rx0     = eax   ; Note: and rx0,0FFh is one byte smaller if
rx0l    = al    ; rx0 = eax. It has no impact on performance.
rx0h    = ah
rx1     = edx
rx1l    = dl
rx1h    = dh
rx2     = ebx
rx2l    = bl
rx2h    = bh

; Other parameters (for instance):
n       = 1
op2     = xor
op3     = sub

; Here we go:

mov     rx0,K[n]        ; Can this be badly aligned?
mov     cntl,K[16+n]

add     cntl,16         ; A quick and dirty trick...

rol     rx0,cntl

xor     rx1,rx1
xor     cnt,cnt         ; ...the trick allows access to:

mov     rx1l,rx0h       ; rx1 = (tmp >> 24) & 0xFF
mov     cntl,rx0l       ; cnt = (tmp >> 16) & 0xFF

shr     rx0,16
xor     rx2,rx2

mov     rx2l,rx0h       ; rx2 = (tmp >> 8) & 0xFF
mov     rx1,S1[rx1*4]

and     rx0,0FFh        ; rx0 = tmp & 0xFF
mov     cnt,S2[cnt*4]

op1     rx1,cnt
mov     rx2,S3[rx2*4]

op2     rx1,rx2
mov     rx0,S4[rx0*4]

op3     rx1,rx0

xor     dst,rx1

[...]

The above snippet takes no less than 15 cycles on the Pentium
in the best case.

As you can see, I used a trick to get the upper 16 bits of
the temporary register (rx0) into more convenient bit positions.
This trick may, in fact, be redundant; it depends on whether
the exact order of operations and LUTs is important.
Because I don't know that, I did not optimize it. Note,
however, that removing the redundant(?) add will not increase
performance.

As we can see, we have an even number of instruction pairs
(the rol instruction is unpairable), so we can improve this
slightly, to a minimum throughput of 14 cycles, when we do
multiple rounds in a row by interleaving the end of the
previous round with the beginning of the new round:

[...]
op2     rx1,rx2         ; Round1        U
mov     rx0,S4[rx0*4]   ; Round1        V

op3     rx1,rx0         ; Round1        U

; Here we remap some of our parameters and variables:
n       = 2             ; Use whatever values you need.
op1     = xor
op2     = sub

mov     rx0,K[n]        ; Round2        V

xor     dst,rx1         ; Round1        U
mov     cntl,K[16+n]    ; Round2        V

; Here we remap src and dst:
TMP     = src           ; Swap src & dst
src     = dst
dst     = TMP

add     rx0,src         ; Round2
add     cntl,16         ; Round2
[...]

If there are no major bugs or misconceptions in the above
code examples, then you should have no problems implementing
the above in assembly language. Note that there is only one
free register (ebp, if you do not change the register mappings).

--
==> Vesa Karvonen

An optimizing programmer can always beat a C programmer.

Mon, 30 Aug 1999 03:00:00 GMT
