alt.lang.asm vs. comp.lang.asm.x86 
Author Message
 alt.lang.asm vs. comp.lang.asm.x86

Could someone please explain the difference between alt.lang.asm and
comp.lang.asm.x86?

Are _all_ of the alt.lang.asm posts being echoed over here?

Just asking, because reading through every post twice is getting rather
monotonous.



Sun, 27 Jul 1997 14:58:54 GMT  
 alt.lang.asm vs. comp.lang.asm.x86

I recently wrote a program which added a constant to an array, storing
the result in another array.  I then decided to optimize it, so to get
timing results I decided to iterate the function 10000 times.  Now, with
the Pentium's 8k data cache, one would expect the cache contents to be
completely replaced by the time the function finished (as you'll see, the
array is 15000 words) and the program got around to accessing the same
memory location on the next iteration.

Thus the cache should be completely useless.  I decided to use it partly
by preloading as much of the array as would fit in the cache, and then
"freezing" it (by setting bits 30 and 29, CD and NW, of cr0).  In theory,
this should speed up at least some memory accesses.  However, when I ran
the following program, performance was over 4 TIMES WORSE.  Does anyone
know why this would occur?

Oh, btw, I would also welcome any optimization of the adding loop...

(compiled using TASM 4, for the pentium -- however I'll bet that the
cache problem persists on the 486.)

        IDEAL
        MODEL LARGE
        STACK 256
        P586

SEGMENT data PAGE
ALIGN 32
Array1  DW   15000 DUP (5)
Array2  DW   15000 DUP (?)
ENDS    data

ADDAMT  EQU 5

        CODESEG
PROC    Main

        mov     ax,data         ; Get my data seg
        mov     ds,ax           ; DS points to start of Array1
        add     ax,30000/16     ; Get data seg for Array2
        mov     es,ax           ; Set es to point to the start of Array2

        mov     ebx,ADDAMT      ; This is the amount to add by.  The idea is
        mov     edx,ebx         ; to add 2 words at once by bringing in a dword
        shl     edx,16          ; and doing 2 adds: add _,bx and add _,edx.

; HERE'S the CACHE STUFF!
        mov     si,OFFSET Array1
        mov     cx,2*1024       ; Pre-load the cache 8k (2048*lodsD = 8k).
        rep     lodsd

        cli              ; Interrupts off while the cache is frozen.
        mov     eax,cr0  ; Freeze cache.
        push    eax
        or      eax,01100000000000000000000000000000b  ;Hex? what's that? :)
        mov     cr0,eax

        mov     bp,2500           ; Do the Array add 2500 times

ArrayLoop:
        mov     si,OFFSET Array1
        mov     di,si
        sub     di,8              ; We'll add it early in the loop...

               ; I do 2 words per 32 bit load, and I use 2 32-bit regs
               ; simultaneously to utilise the V-pipe (otherwise
               ; dependencies would not allow it).  I challenge people
               ; to do it faster!  TIME your code!  Cycle counting is
               ; inaccurate, because we are memory-bound.
AddLoop:                           ; Pentium Clocks:
        mov     eax,[ds:si]        ; 2  :1 for 32 bit prefix + 1 for inst.
        add     di,8               ;     V - pipe
        mov     ecx,[ds:si+4]      ; 4  :1 prefix + 1
        add     ax,bx              ;    Splitting the adds ensures correct
                                   ;    overflow (i.e. acts like adding 2
                                   ;    separate words)
        add     ecx,edx            ; 8
        add     cx,bx
        add     eax,edx            ; 6  this won't occur, we can remove 2
        add     si,8

;        mov     [ds:si+20000],eax ;    This will work too, but results in
;        mov     [ds:si+(20004-8)],ecx  ; longer code.

        mov     [es:di],eax        ; 10
        cmp     si,30000           ;
        mov     [es:di+4],ecx      ; 11
        jb      AddLoop            ; more words left in this pass

        sub     bp,1
        jnz     ArrayLoop          ; next of the 2500 passes

        sti
        wbinvd                  ; Done, Turn the cache back on!
        pop     eax
        mov     cr0,eax

        mov     ah,4ch          ; Fade away...
        int     21h
ENDP Main

END Main

--
                                             Andy Stone      



Mon, 28 Jul 1997 04:25:36 GMT  
 alt.lang.asm vs. comp.lang.asm.x86

: Could someone please explain the difference between alt.lang.asm and
: comp.lang.asm.x86?

: Are _all_ of the alt.lang.asm posts being echoed over here?

: Just asking, because reading through every post twice is getting rather
: monotonous.

c.l.a.x is focused on just the Intel x86 assembly language.  a.l.a is for
all processors.

Later this month, I will post the FAQ again.  The difference is explained there.

Ray



Mon, 28 Jul 1997 10:19:55 GMT  
 alt.lang.asm vs. comp.lang.asm.x86

Quote:

>I recently wrote a program which added a constant to an array, storing
>the result in another array.  I then decided to optimize it, so to get
>timing results I decided to iterate the function 10000 times.  Now, with
>the Pentium's 8k data cache, one would expect the cache to be completely
>replaced by the time the function finished (as you'll see, the array is
>15000 words), and the program got around to accessing the same memory
>location on the next iteration.

>Thus the cache should be completely useless.  I decided to use it partly
>by preloading as much of the array as would fit in the cache, and then
>"freezing" it (by setting bits 30 and 29, CD and NW, of cr0).  In theory,
>this should speed up at least some memory accesses.  However, when I ran
>the following program, performance was over 4 TIMES WORSE.  Does anyone
>know why this would occur?

Assuming that on the Pentium the instruction and data caches are
both controlled by bits 30 and 29 (CD and NW) in cr0, you are locking
down the instruction cache too. You are probably doing that before
executing the code to perform your desired operations, and thus
that code probably never gets into the cache. Or are you doing a
null run through that code to load the instruction cache?

If you are not making sure the instruction cache is pre-loaded with
the instructions of your loop, the chip is forced to go to memory
for each instruction. That's deadly to performance! On a 486 it is
really bad (~ 2.5X slowdown, tested with a spin-loop* with no
memory accesses) to have code not being cached. It might be even
worse on a Pentium, causing pipeline stalls, etc.

If you want your idea to work, get the instructions loaded into
the cache before locking the cache like that. It still might not work,
but this should improve the odds considerably.

* This is the Linux Bogomips delay loop, which is what I used to get
the 2.5X slowdown figure:

/*
 * Copyright (C) 1993 Linus Torvalds
 *
 * Delay routines, using a pre-computed "loops_per_second" value.
 */

extern __inline__ void __delay(int loops)
{
        __asm__(".align 2,0x90\n1:\tdecl %0\n\tjns 1b": :"a" (loops):"ax");
}

...


Tue, 29 Jul 1997 13:49:32 GMT  
 alt.lang.asm vs. comp.lang.asm.x86

Quote:

>  I recently wrote a program which added a constant to an array,
>  storing the result in another array.  I then decided to optimize
>  it, so to get timing results I decided to iterate the function
>  10000 times.  Now, with the Pentium's 8k data cache, one would
>  expect the cache to be completely replaced by the time the
>  function finished (as you'll see, the array is 15000 words), and
>  the program got around to accessing the same memory location on
>  the next iteration.
>  Thus the cache should be completely useless.  I decided to use it
>  partly by preloading as much of the array as would fit in the
>  cache, and then "freezing" it (by setting bits 30 and 29 of
>  cr0).  In theory, this should speed up at least some memory
>  accesses.  However, when I ran the following program, performance
>  was over 4 TIMES WORSE.  Does anyone know why this would occur?

None of the gurus having answered, I'll take a shot:

When you access memory which is not in the cache, the memory subsystem
brings not just the requested memory but the entire cache line (32
bytes on the Pentium) into cache. Therefore, because you are accessing
the memory sequentially, most of your requests are indeed being filled
from cache.

Furthermore, because you are freezing cache before your loop is
resident in it, your running code is also forced to execute from
memory. Given these two slowdowns, I'm surprised your actual timings
were only 4 times slower than with the cache running; I'd have
expected worse.

Quote:
> Oh, btw, I would also welcome any optimization of the adding loop...

OK, I'll add my 2 centimes there, too...

How are you dealing with overflow? This code sort of makes sense if
you've hooked interrupt 4 and enabled the overflow interrupt;
otherwise, I don't understand why you're making it so hard.

If overflow is impossible, you can do the adds in parallel, in 32-bit
mode, eliminating the operand-size overrides and segment overrides
your original code used.

The following code assumes 32-bit mode. If you must work in 16-bit
mode, the improved V-pipe pairing you get by avoiding the 66h prefix
may be more important than the 32-bit width the code below exploits
to run at less than a cycle per word, not counting memory stalls.
That is, in 16-bit mode, just using ax instead of eax, etc., seems best.

Also, the code is big, as I was optimizing for speed, not size.

;  [ Stuff elided for brevity ]

        push    ebp             ; Save this; we'll be using it.
        mov     ebp,ADDAMT      ; This is the amount to add by.
        mov     edx,ebp         ; We'll put it in both halves of ebp.
        shl     ebp,16
        add     ebp,edx

;  [ Cache stuff dropped ]

        mov     esi,(OFFSET EndArray1) - 16
; previous instruction points to last 16 bytes of Array1
        mov     edi,(OFFSET Array1)
        sub     esi,edi         ; get number of bytes to process
        js      Handle_Last_Few_Words

               ; I do 2 words per 32-bit load, and I use 4 32-bit regs
               ; simultaneously to better utilise the V-pipe.
               ; Also, I access slightly out-of-order to avoid
               ; simultaneous access to the same cache bank.

               ; I do not supply fix-up code for the last few
               ; numbers when the vector length is not a multiple
               ; of 32 words.

Vector_Loop:
        mov     eax,[edi+esi]
        mov     ebx,[edi+esi+8]
        add     eax,ebp
        mov     ecx,[edi+esi+4]
        add     ebx,ebp
        mov     edx,[edi+esi+12]
        add     ecx,ebp
        mov     [edi+esi+(OFFSET Array2)-(OFFSET Array1)],eax
        add     edx,ebp
        mov     [edi+esi+8+(OFFSET Array2)-(OFFSET Array1)],ebx
        mov     [edi+esi+4+(OFFSET Array2)-(OFFSET Array1)],ecx
        mov     [edi+esi+12+(OFFSET Array2)-(OFFSET Array1)],edx
        sub     esi,16
        jns     Vector_Loop     ; loop while full 16-byte blocks remain

Handle_Last_Few_Words:

        add     esi,16
        jz      All_Done

; otherwise, we have a few words still to do.
;  [ left as an exercise :-) ]

All_Done:
        pop     ebp             ; Restore this; we're done with it.

-----




Wed, 30 Jul 1997 06:03:12 GMT  
 