Fast 68000 memory copy routine. 

Someone suggested that I post this in alt.lang.asm. A description of where
this could be found by ftp was originally posted to comp.compression.

Since I posted it in comp.compression, someone has pointed out that the
move can be performed faster for relatively word-aligned blocks using
the 68000's MOVEM instruction. I have calculated that this would
yield a 13% increase in speed. I haven't got the time to make this
change now. However, the code is public domain so anyone who wants to
can. If you modify this code, please place a note at the top indicating
that you have done so before passing it on.

Enjoy,

Ross Williams

FAST_COPY.68000
===============
Author       : Ross N. Williams.
Release date : 16-Jun-1991.

1. This file contains two modules:

   1) A fast memory block copy routine written in 68000 machine code.
   2) A test package written in C.

2. The  block copy  routine has  been written  for speed.  The routine
examines the relative  alignment of the source  and destination blocks
(byte,  word,  or  longword)  and   uses  the  widest  of  three  move
instructions  (move.b, move.w,  or move.l)  possible on  a 68000.  The
chosen instruction is then used in a move loop unrolled by 16. As far
as I can see, a faster copy could only be obtained by increasing the
unrolling, which would yield only minor improvements.
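
To make "relative alignment" in point 2 concrete, here is a minimal C sketch. The helper name widest_move is hypothetical and not part of the package. It assumes the copy first advances both pointers together until the source is longword aligned, so only the difference between the two addresses decides which move width is usable.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the original package.
   Returns the widest unit (in bytes) usable for the bulk of the copy,
   judged only by how the two addresses are aligned relative to each other. */
static unsigned widest_move(const void *src, const void *dst)
{
    /* Bits that differ between the two addresses. */
    uintptr_t diff = (uintptr_t)src ^ (uintptr_t)dst;

    if ((diff & 3u) == 0) return 4;  /* relatively longword aligned: move.l */
    if ((diff & 1u) == 0) return 2;  /* relatively word aligned:     move.w */
    return 1;                        /* relatively byte aligned:     move.b */
}
```

For example, a source and destination that are both at odd addresses two bytes apart are relatively word aligned, even though neither address is itself word aligned.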

3. WARNING:  On later  versions of the  680x0 family  that do  lots of
sophisticated  speed  tricks  (such   as  prefetch  and  caching)  the
unrolling used in this package may actually be slower than a straight
tight loop. I don't know. The simplest way to find out is to try it.
Perhaps someone could release a  version that takes the processor type
into account as well as the block alignment.

4.  The code  was written  under Lightspeed  C on  a Macintosh.  Minor
changes may be required to port it to another environment (e.g. expand
macros). In particular, I have omitted  some header files for the test
module (for assertions  and portable type definitions and  so on) that
are probably  more easily  rewritten for  the target  environment than
copied. The code was never  designed for portability. Nevertheless, it
should be fairly easy to port.

5. This code is public domain.

6. If you use this code in anger (e.g. in a product) drop me a note at

invoked if anyone finds a bug in this code.

7.   The  internet   newsgroup  comp.compression   might  also   carry
information on this algorithm from time to time.

8.  This code  was  developed as  part of  the  fast data  compression
algorithm LZRW1.  When the  algorithm finds that  it has  expanded the
data instead of compressing it, it  starts over using a copy operation
instead of a compression one.
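
The fallback described in point 8 can be sketched in C as follows. This is an illustration only, not the LZRW1 source: the compress_fn signature, the one-byte flag, the flag values, and the stand-in compressor always_expands are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical compressor signature, assumed for illustration: writes at most
   dst_cap bytes to dst and returns the compressed size, or 0 if the output
   would not fit (that is, the data expanded). */
typedef size_t (*compress_fn)(const unsigned char *src, size_t n,
                              unsigned char *dst, size_t dst_cap);

enum { FLAG_COMPRESS = 0, FLAG_COPY = 1 };  /* hypothetical flag values */

/* dst must have room for n + 1 bytes. Returns the number of bytes written. */
static size_t compress_or_copy(compress_fn compress,
                               const unsigned char *src, size_t n,
                               unsigned char *dst)
{
    size_t packed = compress(src, n, dst + 1, n);

    if (packed != 0 && packed < n) {
        dst[0] = FLAG_COMPRESS;     /* compression paid off */
        return packed + 1;
    }
    dst[0] = FLAG_COPY;             /* start over with a plain copy */
    memcpy(dst + 1, src, n);        /* where a fast block copy would be used */
    return n + 1;
}

/* Stand-in compressor that always reports expansion, to exercise the fallback. */
static size_t always_expands(const unsigned char *src, size_t n,
                             unsigned char *dst, size_t dst_cap)
{
    (void)src; (void)n; (void)dst; (void)dst_cap;
    return 0;
}
```

Because the copy path runs over the whole input whenever compression fails, its speed sets the worst-case cost of the compressor, which is why a fast block copy matters here.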

/******************************************************************************/
/*                                                                            */
/*                                  FAST_COPY.C                               */
/*                                                                            */
/******************************************************************************/
/* Author : Ross Williams.                                                    */
/* Date   : 13-Apr-1990.                                                      */
/*                                                                            */
/* This module contains a function called fast_copy that copies a block of    */
/* memory extremely quickly using unrolled loops. The actual speed of the     */
/* copy depends on the relative alignment of the source and destination       */
/* blocks of memory. If the source and destination blocks are relatively      */
/* longword aligned, the copy will go faster than if they are merely          */
/* relatively byte aligned.                                                   */
/******************************************************************************/

#include "fast_copy.h"          /* Just the function prototype.               */

void fast_copy(src_adr,dst_adr,src_len)
/* This function copies a block of memory very quickly.                       */
/* The exact speed depends on the relative alignment of the blocks of memory. */
/* PRE  : 0<=src_len<=(2^32)-1 .                                              */
/* PRE  : Source and destination blocks must not overlap.                     */
/* POST : MEM[dst_adr,dst_adr+src_len-1]=MEM[src_adr,src_adr+src_len-1].      */
/* POST : MEM[dst_adr,dst_adr+src_len-1] is the only memory changed.          */
void         *src_adr;
void         *dst_adr;
unsigned long src_len;
{
 asm 68000
   {
    ;Outline of Algorithm
    ;--------------------
    ;1. Copy from 0 to 3 bytes to make A_SRC longword aligned.
    ;2. Choose byte, word, or longword move depending on alignment of A_DST.
    ;3. Execute a high speed unrolled loop to move most of the bytes.
    ;4. Finish off the remainder of the bytes one at a time.

    ;Register Map
    ;------------
    #define D_LEN    d0   ;Number of bytes left to copy.
    #define D_UNROLL d1   ;Counts unrolled loop body executions.
    #define D_T1     d2   ;Temporary register.
    #define A_SRC    a0   ;Points to the next source      byte.
    #define A_DST    a1   ;Points to the next destination byte.

    ;Note: Lightspeed C doesn't mind us using a0-a1 and d0-d2.
    ;      So there is no need to save any registers.

    ;Load key registers with parameters.
    move.l src_adr, A_SRC
    move.l dst_adr, A_DST
    move.l src_len, D_LEN

    ;Skip to the finishing off loop if there are not even enough
    ;bytes to allow a longword alignment (i.e. jump if D_LEN<3).
    cmp.l  #3,      D_LEN


    ;Copy bytes (up to 3) until A_SRC is longword aligned.
    move.l A_SRC, D_T1
    and.l  #3,    D_T1

    move.b (A_SRC)+,(A_DST)+
    sub.l  #1,    D_LEN


    ;Assert: We have to perform a copy (A_SRC,A_DST,D_LEN).
    ;Assert: A_SRC is longword aligned.
    ;Choose the longest unit move loop based on the alignment of A_DST.
    move.l A_DST, D_T1
    and.l  #3,    D_T1

    and.l  #1,    D_T1


 #define COPY_UNROLLED(SIZE,SHIFT,LABEL)                      \
    ;Assert: We have to perform a copy (A_SRC,A_DST,D_LEN).   \
    ;Assert: A_SRC and A_DST are SIZE aligned.                \
    ;Perform D_LEN/(2^SHIFT) (2^SHIFT)-byte copy operations.  \
    ;Each (2^SHIFT) bytes is moved as 16 SIZE.                \
    ;D_UNROLL counts down the number of blocks.               \
    move.l D_LEN ,D_UNROLL             \
    lsr.l  SHIFT ,D_UNROLL             \

    ;Assert: D_UNROLL>0                \
    move.l D_UNROLL,D_T1               \
    lsl.l  SHIFT ,D_T1                 \
    sub.l  D_T1  ,D_LEN                \
 LABEL:                                \
    move.SIZE (A_SRC)+,(A_DST)+   ; 1  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 2  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 3  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 4  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 5  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 6  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 7  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 8  \
    move.SIZE (A_SRC)+,(A_DST)+   ; 9  \
    move.SIZE (A_SRC)+,(A_DST)+   ;10  \
    move.SIZE (A_SRC)+,(A_DST)+   ;11  \
    move.SIZE (A_SRC)+,(A_DST)+   ;12  \
    move.SIZE (A_SRC)+,(A_DST)+   ;13  \
    move.SIZE (A_SRC)+,(A_DST)+   ;14  \
    move.SIZE (A_SRC)+,(A_DST)+   ;15  \
    move.SIZE (A_SRC)+,(A_DST)+   ;16  \
    sub.l     #1, D_UNROLL             \
    bne       LABEL                    \

    ;Note: In an earlier version of this module, the faster DBRA instruction
    ;      was used. However, it imposed a maximum of (2^15-1)*16 on src_len
    ;      so I changed it to the sub/bne combination. I suppose I could have
    ;      nested DBRAs but this is simpler and not much slower.








    ;Assert: We have to perform a copy (A_SRC,A_DST,D_LEN).
    ;Assert: D_LEN<64.
    ;The following rolled, single-byte copy loop finishes off the copy.
    tst.l  D_LEN

    sub.l  #1,   D_LEN

    move.b  (A_SRC)+,(A_DST)+


   } /* End of assembly language code. */

} /* End fast_copy */

/******************************************************************************/
/*                               End of FAST_COPY.C                           */
/******************************************************************************/
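
For readers without a 68000 assembler to hand, the four steps in the "Outline of Algorithm" comment can be sketched in portable C. This is not a translation of the assembly above: the name copy_sketch is hypothetical, each unit is moved with a fixed-size memcpy to stay within C's aliasing rules, and the 16 moves per block are written as a fixed-count inner loop that a compiler is free to unroll.

```c
#include <stdint.h>
#include <string.h>

/* Moves as many whole blocks of 16 units of SIZE bytes as fit in len.
   Uses and updates s, d and len from the enclosing function. */
#define COPY_UNROLLED(SIZE)                                 \
    do {                                                    \
        unsigned long blocks = len / (16ul * (SIZE));       \
        len -= blocks * 16ul * (SIZE);                      \
        while (blocks--) {                                  \
            int i;                                          \
            for (i = 0; i < 16; i++) {                      \
                memcpy(d, s, (SIZE));                       \
                s += (SIZE);                                \
                d += (SIZE);                                \
            }                                               \
        }                                                   \
    } while (0)

/* Hypothetical name. The blocks must not overlap, as in fast_copy. */
static void copy_sketch(const void *src_adr, void *dst_adr, unsigned long len)
{
    const unsigned char *s = src_adr;
    unsigned char *d = dst_adr;

    /* 1. Copy 0 to 3 bytes until the source is longword aligned. */
    while (len > 0 && ((uintptr_t)s & 3u) != 0) {
        *d++ = *s++;
        len--;
    }

    /* 2 and 3. Choose the unit from the destination's alignment, then
       move most of the bytes in blocks of 16 units. */
    if (((uintptr_t)d & 3u) == 0)
        COPY_UNROLLED(4);
    else if (((uintptr_t)d & 1u) == 0)
        COPY_UNROLLED(2);
    else
        COPY_UNROLLED(1);

    /* 4. Finish off the remainder one byte at a time. */
    while (len--)
        *d++ = *s++;
}
```

In the longword case the block loop leaves fewer than 64 bytes, which matches the "D_LEN<64" assertion ahead of the finishing loop in the assembly.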

/******************************************************************************/
/*                                                                            */
/*                               FAST_COPY_TEST.C                             */
/*                                                                            */
/******************************************************************************/
/* Author : Ross Williams.                                                    */
/* Date   : 13-April-1990.                                                    */
/*                                                                            */
/* This file contains a C program (a main() program) whose sole purpose is to */
/* test the function "fast_copy". The program was developed in the THINK C    */
/* environment.                                                               */
/*                                                                            */
/* The technique used is to allocate two slabs of memory. Then a ...
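
The description is cut off in this copy, so the following is not the author's test program. It is a minimal sketch of a two-slab test in the same spirit, checking both postconditions stated for fast_copy. The names check_copy, copy_fn and memcpy_adapter are hypothetical, as are the pad size and fill value.

```c
#include <stdlib.h>
#include <string.h>

/* Same argument order as fast_copy: source, destination, length. */
typedef void (*copy_fn)(void *src_adr, void *dst_adr, unsigned long src_len);

/* Tries every source and destination alignment (0 to 3) and every length up
   to max_len. Returns 1 if all trials pass, 0 on any mismatch or if the
   slabs cannot be allocated. */
static int check_copy(copy_fn copy, unsigned long max_len)
{
    enum { PAD = 8, FILL = 0xA5 };          /* guard size and sentinel byte */
    unsigned long slab = max_len + 2 * PAD + 4;
    unsigned char *src = malloc(slab);
    unsigned char *dst = malloc(slab);
    unsigned long so, doff, len, i;
    int ok = (src != NULL && dst != NULL);

    for (i = 0; ok && i < slab; i++)
        src[i] = (unsigned char)(i * 31u + 7u);   /* recognisable pattern */

    for (so = 0; ok && so < 4; so++)
        for (doff = 0; ok && doff < 4; doff++)
            for (len = 0; ok && len <= max_len; len++) {
                unsigned long lo = PAD + doff;    /* first destination byte */
                memset(dst, FILL, slab);
                copy(src + PAD + so, dst + lo, len);
                for (i = 0; i < slab; i++) {
                    /* Inside the range: the source byte. Outside: untouched. */
                    unsigned char want = (i >= lo && i < lo + len)
                        ? src[i - doff + so] : (unsigned char)FILL;
                    if (dst[i] != want)
                        ok = 0;
                }
            }

    free(src);
    free(dst);
    return ok;
}

/* Stand-in so the harness can be run without the 68000 routine. */
static void memcpy_adapter(void *s, void *d, unsigned long n)
{
    memcpy(d, s, n);
}
```

Passing fast_copy in place of memcpy_adapter would exercise the assembly routine at all sixteen alignment combinations, which is where a routine of this kind is most likely to go wrong.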

Fri, 03 Dec 1993 20:15:41 GMT  

Quote:

> 4.  The code  was written  under Lightspeed  C on  a Macintosh.  Minor
> changes may be required to port it to another environment (e.g. expand
> macros). In particular, I have omitted  some header files for the test
> module (for assertions  and portable type definitions and  so on) that
> are probably  more easily  rewritten for  the target  environment than
> copied. The code was never  designed for portability. Nevertheless, it
> should be fairly easy to port.

If this code is intended to run on a Macintosh, then please consider using
_BlockMove. It should always use the fastest possible method for copying
memory from one point to another (including all the necessary logic to copy
memory in overlapping situations).

If you find this not to be true, then please, tell us about it.
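
The point about overlapping blocks matters because fast_copy states as a precondition that source and destination must not overlap. As a minimal sketch of the extra logic an overlap-safe copy needs (this is an illustration in portable C, not a description of how _BlockMove is implemented, and the name copy_overlap_safe is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n bytes correctly even if the blocks overlap. */
static void copy_overlap_safe(const void *src, void *dst, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uintptr_t sa = (uintptr_t)s;
    uintptr_t da = (uintptr_t)d;

    if (da <= sa || da - sa >= n) {
        /* Destination starts at or before the source, or past its end:
           a forward copy never overwrites bytes still to be read. */
        while (n--)
            *d++ = *s++;
    } else {
        /* Destination starts inside the source block: copy backwards. */
        s += n;
        d += n;
        while (n--)
            *--d = *--s;
    }
}
```

The standard C function memmove gives the same guarantee. A forward-only unrolled loop like the one in fast_copy would corrupt the data in the second case.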

---------------------------------------------------------------------------

-  330 1/2 Waverley St.    - UUCP:ucbvax!apple!alexr        - Propulsion  -
-  Palo Alto, CA 94301     -                                - Systems     -
-  (415) 329-8463          - Nobody is my employer so       - :-)         -
-  (408) 974-3110          - nobody cares what I say.       -             -



Sat, 04 Dec 1993 03:02:41 GMT  
 