(* * LANGUAGE : ANS Forth * PROJECT : Forth Environments * DESCRIPTION : Matrix Multiplication * CATEGORY : Benchmark * AUTHOR : Mark Smotherman (mark@cs.clemson.edu) * LAST CHANGE : June 12, 2000, Marcel Hendrix *) NEEDS -miscutil NEEDS -fsl_util REVISION -smm "ÄÄÄ Matrix Multiply [S] Version 1.02 ÄÄÄ" PRIVATES DOC (* matrix multiply tests -- C language, version 1.0, May 1993 compile with -DN= I usually run a script file time.script 500 >500.times where the script file contains cc -O -DN=$1 mm.c a.out -n (I suggest at least two runs per method to a.out -n alert you to variations. Five or ten runs a.out -t each, giving avg. and std dev. of times is a.out -t best.) ... Contact Mark Smotherman (mark@cs.clemson.edu) for questions, comments, and to report results showing wide variations. E.g., a wide variation appeared on an IBM RS/6000 Model 320 with "cc -O -DN=500 mm.c" (xlc compiler): 500x500 mm - normal algorithm utime 230.81 secs 500x500 mm - normal algorithm utime 230.72 secs 500x500 mm - temporary variable in loop utime 231.00 secs 500x500 mm - temporary variable in loop utime 230.79 secs 500x500 mm - unrolled inner loop, factor of 8 utime 232.09 secs 500x500 mm - unrolled inner loop, factor of 8 utime 231.84 secs 500x500 mm - pointers used to access matrices utime 230.74 secs 500x500 mm - pointers used to access matrices utime 230.45 secs 500x500 mm - blocking, factor of 32 utime 60.40 secs 500x500 mm - blocking, factor of 32 utime 60.57 secs 500x500 mm - interchanged inner loops utime 27.36 secs 500x500 mm - interchanged inner loops utime 27.40 secs 500x500 mm - 20x20 subarray (from D. Warner) utime 9.49 secs 500x500 mm - 20x20 subarray (from D. Warner) utime 9.50 secs 500x500 mm - 20x20 subarray (from T. Maeno) utime 9.10 secs 500x500 mm - 20x20 subarray (from T. Maeno) utime 9.05 secs The algorithms can also be sensitive to TLB thrashing. On a 600x600 test an IBM RS/6000 Model 30 showed variations depending on relative location of the matrices. (The model 30 has 64 TLB entries organized as 2-way set associative.) 600x600 mm - 20x20 subarray (from T. Maeno) utime 19.12 secs 600x600 mm - 20x20 subarray (from T. Maeno) utime 19.23 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 18.87 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 18.64 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 17.70 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 17.76 secs Changing the declaration to include 10000 dummy entries between the b and c matrices (suggested by T. Maeno), i.e., double a[N][N],b[N][N],dummy[10000],c[N][N],d[N][N],bt[N][N]; 600x600 mm - 20x20 subarray (from T. Maeno) utime 16.41 secs 600x600 mm - 20x20 subarray (from T. Maeno) utime 16.40 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 16.68 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 16.67 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 16.97 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 16.98 secs I hope to add other algorithms (e.g., Strassen-Winograd) in the near future. P55-166 MHz, 48 MB RAM, single-precision, Linux 2.0/NT 4.0, iForth 1.11b CLK 165 MHz 500x500 mm - normal algorithm 9.57 MFlops, 17.24 ticks/flop, 26.122 s 500x500 mm - blocking, factor of 20 21.82 MFlops, 7.56 ticks/flop, 11.457 s 500x500 mm - transposed B matrix 35.30 MFlops, 4.67 ticks/flop, 7.081 s 500x500 mm - Robert's algorithm 36.63 MFlops, 4.50 ticks/flop, 6.823 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 6.72 MFlops, 24.55 ticks/flop, 37.197 s 500x500 mm - D. Warner's algorithm, subarray 20x20 8.67 MFlops, 19.02 ticks/flop, 28.827 s 500x500 mm - generic iForth MAT* 77.79 MFlops, 2.12 ticks/flop, 3.213 s CLK 165 MHz 120x120 mm - normal algorithm 50.20 MFlops, 3.28 ticks/flop, 0.068 s 120x120 mm - blocking, factor of 20 21.30 MFlops, 7.74 ticks/flop, 0.162 s 120x120 mm - transposed B matrix 39.02 MFlops, 4.22 ticks/flop, 0.088 s 120x120 mm - Robert's algorithm 46.05 MFlops, 3.58 ticks/flop, 0.075 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 6.67 MFlops, 24.72 ticks/flop, 0.517 s 120x120 mm - D. Warner's algorithm, subarray 20x20 8.54 MFlops, 19.31 ticks/flop, 0.404 s 120x120 mm - generic iForth MAT* 70.74 MFlops, 2.33 ticks/flop, 0.048 s CLK 165 MHz 60x60 mm - normal algorithm 38.03 MFlops, 4.33 ticks/flop, 0.011 s 60x60 mm - blocking, factor of 20 20.48 MFlops, 8.05 ticks/flop, 0.021 s 60x60 mm - transposed B matrix 32.23 MFlops, 5.11 ticks/flop, 0.013 s 60x60 mm - Robert's algorithm 42.90 MFlops, 3.84 ticks/flop, 0.010 s 60x60 mm - T. Maeno's algorithm, subarray 20x20 6.58 MFlops, 25.07 ticks/flop, 0.065 s 60x60 mm - D. Warner's algorithm, subarray 20x20 8.40 MFlops, 19.62 ticks/flop, 0.051 s 60x60 mm - generic iForth MAT* 61.28 MFlops, 2.69 ticks/flop, 0.007 s PII-350 MHz, 128 MB, single precision, Linux CLK 353 MHz 500x500 mm - normal algorithm 59.87 MFlops, 5.89 ticks/flop, 4.175 s 500x500 mm - blocking, factor of 20 72.00 MFlops, 4.90 ticks/flop, 3.471 s 500x500 mm - transposed B matrix 68.28 MFlops, 5.16 ticks/flop, 3.661 s 500x500 mm - Robert's algorithm 67.40 MFlops, 5.23 ticks/flop, 3.708 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 14.18 MFlops, 24.89 ticks/flop, 17.628 s 500x500 mm - D. Warner's algorithm, subarray 20x20 32.28 MFlops, 10.93 ticks/flop, 7.742 s CLK 353 MHz 120x120 mm - normal algorithm 135.92 MFlops, 2.59 ticks/flop, 0.025 s 120x120 mm - blocking, factor of 20 73.26 MFlops, 4.81 ticks/flop, 0.047 s 120x120 mm - transposed B matrix 128.13 MFlops, 2.75 ticks/flop, 0.026 s 120x120 mm - Robert's algorithm 146.76 MFlops, 2.40 ticks/flop, 0.023 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 14.18 MFlops, 24.87 ticks/flop, 0.243 s 120x120 mm - D. Warner's algorithm, subarray 20x20 32.29 MFlops, 10.92 ticks/flop, 0.107 s CLK 351 MHz 60x60 mm - normal algorithm 135.57 MFlops, 2.58 ticks/flop, 0.003 s 60x60 mm - blocking, factor of 20 70.53 MFlops, 4.97 ticks/flop, 0.006 s 60x60 mm - transposed B matrix 112.55 MFlops, 3.11 ticks/flop, 0.003 s 60x60 mm - Robert's algorithm 141.99 MFlops, 2.47 ticks/flop, 0.003 s 60x60 mm - T. Maeno's algorithm, subarray 20x20 14.01 MFlops, 25.04 ticks/flop, 0.030 s 60x60 mm - D. Warner's algorithm, subarray 20x20 31.66 MFlops, 11.08 ticks/flop, 0.013 s *) ENDDOC FALSE VALUE SHHT? PRIVATE 0 VALUE FLOPS PRIVATE 0 VALUE t/flops PRIVATE 0 VALUE msecs PRIVATE #120 VALUE N 1 SFLOATS =: SFLOAT1 4 SFLOATS =: SFLOAT4 8 SFLOATS =: SFLOAT8 SINGLE DMATRIX a{{ PRIVATE SINGLE DMATRIX b{{ PRIVATE SINGLE DMATRIX c{{ PRIVATE SINGLE DMATRIX d{{ PRIVATE SINGLE DMATRIX bt{{ PRIVATE #166 VALUE PROCESSOR-CLOCK 2VARIABLE _ticks_ PRIVATE ( counts clock ticks ) \ uses EDX:EAX : TICKS-GET ( -- d ) [ $52 C, $5B C, ( edx push, ebx pop, ) $0F C, $31 C, $50 C, $52 C, ( RDTSC, eax push, edx push, ) $53 C, $5A C, ( ebx push, edx pop, ) ] ; : TICKS-RESET ( -- ) TICKS-GET _ticks_ 2! ; TICKS-RESET : TICKS>US ( d -- u ) PROCESSOR-CLOCK UM/MOD NIP ; : TICKS? ( -- u ) TICKS-GET _ticks_ 2@ D- ; : US? ( -- us ) TICKS? TICKS>US ; : CALIBRATE ( -- ) TICKS-RESET #1000 MS TICKS? #1000000 UM/MOD NIP TO PROCESSOR-CLOCK ; : ?# ( d -- d ) 2DUP OR 0= IF BL HOLD ELSE # ENDIF ;P : .FLOPS ( n -- ) U>D <# BL HOLD # # '.' HOLD # ?# ?# ?# #> TYPE ." MFlops" ;P : .TICKS ( n -- ) U>D <# BL HOLD # # '.' HOLD # ?# ?# BL HOLD ',' HOLD #> TYPE ." ticks/flop" ;P : .SECS ( n -- ) U>D <# 's' HOLD BL HOLD # # # '.' HOLD # ?# ?# BL HOLD ',' HOLD #> TYPE ;P : INIT-RESULT EVAL" TICKS-RESET 0 >S BEGIN " ;P IMMEDIATE : (.RES) ( n -- ) DUP N * N * N * 2* 1 OR TICKS? ( -- n fl dti ) 3DUP TICKS>US 1 OR DUP >R #100 SWAP */ TO FLOPS #100 1 M*/ ROT UM/MOD NIP TO t/flops R> SWAP #1000 * / TO msecs SHHT? IF EXIT ENDIF FLOPS .FLOPS t/flops .TICKS msecs .SECS ;P : .RESULT EVAL" 1 S+! US? #1000000 > UNTIL S> (.RES) " ;P IMMEDIATE \ Set coefficients so that result matrix should have row entries equal to (1/2)*n*(n-1)*i in row i : SET-COEFFICIENTS ( -- ) N 0 ?DO N 0 ?DO J S>F FDUP b{{ J I }} SF! a{{ J I }} SF! LOOP LOOP ; : FLUSH-CACHE ( -- ) N 0 ?DO N 0 ?DO 0e d{{ J I }} SF! LOOP LOOP ; : CHECK-RESULT ( -- ) FLOPS 0= IF SHHT? 0= IF CR ." algorithm aborted" ENDIF EXIT ENDIF N N 1- * 2/ S>F 0e FLOCALS| row_sum sum | N 0 ?DO I S>F sum F* TO row_sum N 0 ?DO c{{ J I }} SF@ row_sum SF<> IF CR ." error in result entry c{{ " J DEC. I DEC. ." }}: " c{{ J I }} SF@ F. ." <> " row_sum F. UNLOOP UNLOOP EXIT ENDIF a{{ J I }} SF@ J S>F SF<> IF CR ." error in result entry a{{ " J DEC. I DEC. ." }}: " a{{ J I }} SF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT ENDIF b{{ J I }} SF@ J S>F SF<> IF CR ." error in result entry b{{ " J DEC. I DEC. ." }}: " b{{ J I }} SF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT ENDIF LOOP LOOP ;P : GENERIC1() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - generic iForth MAT*" 54 HTAB ENDIF INIT-RESULT a{{ 0 0 }} N b{{ 0 0 }} N c{{ 0 0 }} N N N N SGEMM1 .RESULT ;P : GENERIC() ( n -- ) CASE 1 OF GENERIC1() ENDOF CR ." mm - unknown standard iForth algorithm " . ." not implemented" ENDCASE ;P : NORMAL() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - normal algorithm" 54 HTAB ENDIF INIT-RESULT N 0 ?DO c{{ I 0 }} a{{ I 0 }} >S b{{ 0 0 }} N SFLOATS BOUNDS ?DO S 1 I N N SDOT SF!+ SFLOAT1 +LOOP DROP -S LOOP .RESULT ;P : TRANSPOSE() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - transposed B matrix" 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO b{{ J I }} SF@ bt{{ I J }} SF! LOOP LOOP N 0 ?DO c{{ I 0 }} N 0 ?DO a{{ J 0 }} 1 bt{{ I 0 }} 1 N SDOT SF!+ LOOP DROP LOOP .RESULT ;P \ from Monica Lam ASPLOS-IV paper : TILING() ( step -- ) DUP 4 N 1+ WITHIN 0= IF SHHT? 0= IF CR ." mm - blocking step size of " DUP DEC. ." is unreasonable" ENDIF DROP EXIT ENDIF SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - blocking, factor of " DUP DEC. 54 HTAB ENDIF 0 0 LOCALS| kk jj step | INIT-RESULT N 0 ?DO N 0 ?DO 0e c{{ J I }} SF! LOOP LOOP N 0 ?DO I TO kk N 0 ?DO I TO jj N 0 ?DO a{{ I kk }} kk step + N MIN kk ?DO SF@+ b{{ I jj }} 1 c{{ J jj }} 1 jj step + N MIN jj - SAXPY LOOP DROP LOOP step +LOOP step +LOOP .RESULT ;P \ ******************************************** \ * Contributed by Robert Debath 26 Nov 1995 * \ * rdebath@cix.compulink.co.uk * \ ******************************************** : ROBERT() ( -- ) 0 0 LOCALS| 'a 'b | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - Robert's algorithm" 54 HTAB ENDIF INIT-RESULT N 0 ?DO b{{ I 0 }} 1 bt{{ 0 I }} N N SCOPY LOOP a{{ 0 0 }} TO 'a N 0 ?DO bt{{ 0 0 }} TO 'b c{{ I 0 }} N 0 ?DO 'a 1 'b 1 N SDOT SF!+ N SFLOATS +TO 'b LOOP DROP N SFLOATS +TO 'a LOOP .RESULT ;P DOC (* * Matrix Multiply by Dan Warner, Dept. of Mathematics, Clemson University * * mmbu2.f multiplies matrices a and b * a and b are n by n matrices * nb is the blocking parameter. * the tuning guide indicates nb = 50 is reasonable for the * ibm model 530 hence 25 should be reasonable for the 320 * since the 320 has 32k rather than 64k of cache. * Inner loops unrolled to depth of 2 * The loop functions without clean up code at the end only * if the unrolling occurs to a depth k which divides into n * in this case n must be divisible by 2. * The blocking parameter nb must divide into n if the * multiply is to succeed without clean up code at the end. * * converted to c by Mark Smotherman * note that nb must also be divisible by 2 => cannot use 25, so use 20 *) ENDDOC : WARNER() ( nb -- ) 0 0 0 0 0 LOCALS| 'a 'b ii jj kk nb | 0e 0e 0e 0e FLOCALS| s10 s00 s01 s11 | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N nb MOD N 2 MOD OR IF SHHT? 0= IF ." mm - Warner's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " nb DEC. ." and 2." ENDIF EXIT ENDIF nb 2 MOD IF SHHT? 0= IF ." mm - block size for Warner method must be evenly divisible by 2" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - D. Warner's algorithm, subarray " nb 0 .R 'x' EMIT nb 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO I TO ii N 0 ?DO I TO jj nb ii + ii ?DO nb jj + jj ?DO 0e c{{ J I }} SF! LOOP LOOP N 0 ?DO I TO kk nb ii + ii ?DO nb jj + jj ?DO c{{ J I }} DUP SF@+ TO s00 DUP SF@ TO s01 c{{ J 1+ I }} DUP SF@+ TO s10 DUP SF@ TO s11 a{{ J kk }} TO 'a b{{ kk I }} TO 'b nb kk + kk ?DO 'a DUP SF@ 'b SF@+ F* +TO s00 SWAP SF@ SF@ F* +TO s01 'a N SFLOAT[] DUP SF@ 'b SF@+ F* +TO s10 SWAP SF@ SF@ F* +TO s11 SFLOAT1 +TO 'a N SFLOATS +TO 'b LOOP s11 SF! s10 SF! s01 SF! s00 SF! 2 +LOOP 2 +LOOP nb +LOOP nb +LOOP nb +LOOP .RESULT ;P DOC (* Matrix Multiply tuned for SS-10/30; * Maeno Toshinori * Tokyo Institute of Technology * * Using gcc-2.4.1 (-O2), this program ends in 12 seconds on SS-10/30. * * in original algorithm - sub-area for cache tiling * #define L 20 * #define L2 20 * three 20x20 matrices reside in cache; two may be enough *) ENDDOC : MAENO() ( nb -- ) 0 0 0 0 LOCALS| it kt i2 kk lparm | 0e 0e 0e 0e 0e 0e 0e 0e FLOCALS| t0 t1 t2 t3 t4 t5 t6 t7 | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N lparm MOD N 4 MOD OR IF SHHT? 0= IF ." mm - Maeno's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " lparm DEC. ." and 4." ENDIF EXIT ENDIF lparm 4 MOD IF SHHT? 0= IF ." mm - block size for Maeno's method must be evenly divisible by 4" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - T. Maeno's algorithm, subarray " lparm 0 .R 'x' EMIT lparm 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO 0e c{{ J I }} SF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO it i2 ?DO CLEAR t0 CLEAR t1 CLEAR t2 CLEAR t3 CLEAR t4 CLEAR t5 CLEAR t6 CLEAR t7 kt kk ?DO a{{ J I }} SF@ FDUP b{{ I K }} DUP SF@+ F* +TO t0 FDUP SF@+ F* +TO t1 FDUP SF@+ F* +TO t2 SF@ F* +TO t3 a{{ J 1+ I }} SF@ FDUP SF@+ F* +TO t4 FDUP SF@+ F* +TO t5 FDUP SF@+ F* +TO t6 SF@ F* +TO t7 LOOP t0 c{{ I J }} SF+!+ t1 SF+!+ t2 SF+!+ t3 SF+! t4 c{{ I 1+ J }} SF+!+ t5 SF+!+ t6 SF+!+ t7 SF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP .RESULT ;P : MM ( char n -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 ENDIF LOCAL ur a{{ N N }}malloc malloc-fail? b{{ N N }}malloc malloc-fail? OR bt{{ N N }}malloc malloc-fail? OR c{{ N N }}malloc malloc-fail? OR d{{ N N }}malloc malloc-fail? OR ABORT" MM :: out of core" SET-COEFFICIENTS FLUSH-CACHE CASE 'n' OF NORMAL() ENDOF 't' OF TRANSPOSE() ENDOF 'b' OF ur TILING() ENDOF 'r' OF ROBERT() ENDOF 'm' OF ur MAENO() ENDOF 'w' OF ur WARNER() ENDOF 'g' OF ur GENERIC() ENDOF CR ." `" DUP EMIT ." ' is an invalid algorithm" ENDCASE CHECK-RESULT d{{ }}free c{{ }}free bt{{ }}free b{{ }}free a{{ }}free ; : ALL-TESTS ( -- ) CR ." CLK " CALIBRATE PROCESSOR-CLOCK DEC. ." MHz" 'n' mm EKEY? IF EKEY DROP EXIT ELSE 'b' 20 mm 't' mm ENDIF EKEY? IF EKEY DROP EXIT ELSE 'r' mm ENDIF EKEY? IF EKEY DROP EXIT ELSE 'm' 20 mm 'w' 20 mm ENDIF 'g' 1 mm ; : NEXT-N ( -- ) N #1200 #1000 */ TO N #17 1 DO N I 2^x DUP 9 #10 */ SWAP #11 #10 */ WITHIN IF I 2^x TO N LEAVE ENDIF LOOP ;P : MEGAFLOPS ( sel -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 ENDIF 0 0 SHHT? N LOCALS| old-N silence? flp ix ur algo | TRUE TO SHHT? CR ." Algorithm = '" algo EMIT &' EMIT ur IF ." , parameter is " ur 0 .R ENDIF ." , clock = " CALIBRATE PROCESSOR-CLOCK DEC. ." MHz" #32 TO N #17 0 DO CR ." testing data size " N 3 .R ." x " N 3 .R ':' EMIT algo ur mm FLOPS flp > IF FLOPS TO flp N TO ix ENDIF FLOPS .FLOPS NEXT-N CLEAR FLOPS EKEY? IF EKEY DROP LEAVE ENDIF LOOP ix IF CR CR ." Maximum: " flp .FLOPS ." at N = " ix DEC. ENDIF silence? TO SHHT? old-N TO N ; :ABOUT CR ." -------------------------- Single-precision benchmark --------------------------" CR CR ." Try: 'n' mm -- normal" CR ." 't' mm -- with transposed b matrix" CR ." 'b' n mm -- using blocking by n, 4 < n < " N DEC. CR ." 'r' mm -- using Robert's algorithm" CR ." 'm' n mm -- using Maeno's algorithm with blocking factor n" CR ." 'w' n mm -- using Warner's algorithm with blocking factor n" CR ." 'g' n mm -- use generic iForth method n" CR CR ." ALL-TESTS -- test all algorithms" CR ." ( x ) MEGAFLOPS -- find optimum size for this machine, algorithm 'x'" ; .ABOUT -smm CR DEPRIVE CR .( Compile time = ) US? #1000 / DEC. .( ms, assuming clockspeed = ) PROCESSOR-CLOCK DEC. .( MHz.) (* End of Source *)