(* * LANGUAGE : ANS Forth * PROJECT : Forth Environments * DESCRIPTION : Matrix Multiplication * CATEGORY : Benchmark * AUTHOR : Mark Smotherman (mark@cs.clemson.edu) * LAST CHANGE : July 30, 2000, Marcel Hendrix, added BLAS Level 1 and 3 * LAST CHANGE : June 12, 2000, Marcel Hendrix *) NEEDS -miscutil NEEDS -fsl_util NEEDS -gaussj ( iForth only; performance test generic matrix words ) REVISION -mm "ÄÄÄ Matrix Multiply Version 1.16 ÄÄÄ" PRIVATES DOC (* matrix multiply tests -- C language, version 1.0, May 1993 compile with -DN= I usually run a script file time.script 500 >500.times where the script file contains cc -O -DN=$1 mm.c a.out -n (I suggest at least two runs per method to a.out -n alert you to variations. Five or ten runs a.out -t each, giving avg. and std dev. of times is a.out -t best.) ... Contact Mark Smotherman (mark@cs.clemson.edu) for questions, comments, and to report results showing wide variations. E.g., a wide variation appeared on an IBM RS/6000 Model 320 with "cc -O -DN=500 mm.c" (xlc compiler): 500x500 mm - normal algorithm utime 230.81 secs 500x500 mm - normal algorithm utime 230.72 secs 500x500 mm - temporary variable in loop utime 231.00 secs 500x500 mm - temporary variable in loop utime 230.79 secs 500x500 mm - unrolled inner loop, factor of 8 utime 232.09 secs 500x500 mm - unrolled inner loop, factor of 8 utime 231.84 secs 500x500 mm - pointers used to access matrices utime 230.74 secs 500x500 mm - pointers used to access matrices utime 230.45 secs 500x500 mm - blocking, factor of 32 utime 60.40 secs 500x500 mm - blocking, factor of 32 utime 60.57 secs 500x500 mm - interchanged inner loops utime 27.36 secs 500x500 mm - interchanged inner loops utime 27.40 secs 500x500 mm - 20x20 subarray (from D. Warner) utime 9.49 secs 500x500 mm - 20x20 subarray (from D. Warner) utime 9.50 secs 500x500 mm - 20x20 subarray (from T. Maeno) utime 9.10 secs 500x500 mm - 20x20 subarray (from T. Maeno) utime 9.05 secs The algorithms can also be sensitive to TLB thrashing. On a 600x600 test an IBM RS/6000 Model 30 showed variations depending on relative location of the matrices. (The model 30 has 64 TLB entries organized as 2-way set associative.) 600x600 mm - 20x20 subarray (from T. Maeno) utime 19.12 secs 600x600 mm - 20x20 subarray (from T. Maeno) utime 19.23 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 18.87 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 18.64 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 17.70 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 17.76 secs Changing the declaration to include 10000 dummy entries between the b and c matrices (suggested by T. Maeno), i.e., double a[N][N],b[N][N],dummy[10000],c[N][N],d[N][N],bt[N][N]; 600x600 mm - 20x20 subarray (from T. Maeno) utime 16.41 secs 600x600 mm - 20x20 subarray (from T. Maeno) utime 16.40 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 16.68 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 16.67 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 16.97 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 16.98 secs I hope to add other algorithms (e.g., Strassen-Winograd) in the near future. P55-166, 48 MBytes, NT 4.0, compiled with VC++ 4.0, every trick in the book. 500x500 mm - normal algorithm 5.00 MFlops, utime 49.952 secs 500x500 mm - blocking, factor of 20 14.00 MFlops, utime 17.856 secs 500x500 mm - transposed b matrix 20.26 MFlops, utime 12.338 secs 500x500 mm - Robert's algorithm 20.25 MFlops, utime 12.347 secs 500x500 mm - 20x 20 subarray (from T. Maeno) 30.19 MFlops, utime 8.282 secs 500x500 mm - 20x 20 subarray (from D. Warner) 39.00 MFlops, utime 6.410 secs 120x120 mm - normal algorithm 14.53 MFlops, utime 0.238 secs 120x120 mm - blocking, factor of 20 14.38 MFlops, utime 0.240 secs 120x120 mm - transposed b matrix 32.49 MFlops, utime 0.106 secs 120x120 mm - Robert's algorithm 32.11 MFlops, utime 0.108 secs 120x120 mm - 20x 20 subarray (from T. Maeno) 32.11 MFlops, utime 0.108 secs 120x120 mm - 20x 20 subarray (from D. Warner) 43.13 MFlops, utime 0.080 secs 60x 60 mm - normal algorithm 17.97 MFlops, utime 0.024 secs 60x 60 mm - blocking, factor of 20 14.48 MFlops, utime 0.030 secs 60x 60 mm - transposed b matrix 32.19 MFlops, utime 0.013 secs 60x 60 mm - Robert's algorithm 31.72 MFlops, utime 0.014 secs 60x 60 mm - 20x 20 subarray (from T. Maeno) 32.19 MFlops, utime 0.013 secs 60x 60 mm - 20x 20 subarray (from D. Warner) 43.20 MFlops, utime 0.010 secs P5-166 MHz, 48 MB, iForth 1.11, double-precision, Win NT 4.0 CLK 167 MHz 500x500 mm - normal algorithm 6.32 MFlops, 26.09 ticks/flop, 39.543 s 500x500 mm - blocking, factor of 20 20.52 MFlops, 8.03 ticks/flop, 12.178 s 500x500 mm - transposed B matrix 21.42 MFlops, 7.70 ticks/flop, 11.666 s 500x500 mm - Robert's algorithm 21.89 MFlops, 7.53 ticks/flop, 11.418 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 13.61 MFlops, 12.12 ticks/flop, 18.365 s 500x500 mm - D. Warner's algorithm, subarray 20x20 11.87 MFlops, 13.89 ticks/flop, 21.054 s 500x500 mm - generic mat* 63.49 MFlops, 2.59 ticks/flop, 3.937 s 500x500 mm - iForth MAT* 63.50 MFlops, 2.59 ticks/flop, 3.936 s CLK 167 MHz 120x120 mm - normal algorithm 27.77 MFlops, 5.94 ticks/flop, 0.124 s 120x120 mm - blocking, factor of 20 21.36 MFlops, 7.72 ticks/flop, 0.161 s 120x120 mm - transposed B matrix 27.35 MFlops, 6.03 ticks/flop, 0.126 s 120x120 mm - Robert's algorithm 29.32 MFlops, 5.62 ticks/flop, 0.117 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 13.89 MFlops, 11.87 ticks/flop, 0.248 s 120x120 mm - D. Warner's algorithm, subarray 20x20 12.09 MFlops, 13.64 ticks/flop, 0.285 s 120x120 mm - generic mat* 63.67 MFlops, 2.59 ticks/flop, 0.054 s 120x120 mm - iForth MAT* 63.62 MFlops, 2.59 ticks/flop, 0.054 s CLK 167 MHz 60x60 mm - normal algorithm 38.09 MFlops, 4.30 ticks/flop, 0.011 s 60x60 mm - blocking, factor of 20 20.84 MFlops, 7.86 ticks/flop, 0.020 s 60x60 mm - transposed B matrix 28.17 MFlops, 5.81 ticks/flop, 0.015 s 60x60 mm - Robert's algorithm 32.38 MFlops, 5.06 ticks/flop, 0.013 s 60x60 mm - T. Maeno's algorithm, subarray 20x20 13.71 MFlops, 11.95 ticks/flop, 0.031 s 60x60 mm - D. Warner's algorithm, subarray 20x20 11.91 MFlops, 13.76 ticks/flop, 0.036 s 60x60 mm - generic mat* 56.58 MFlops, 2.89 ticks/flop, 0.007 s 60x60 mm - iForth MAT* 56.74 MFlops, 2.88 ticks/flop, 0.007 s PII-350 MHz, 128 MB, double precision, Linux 500x500 mm - normal algorithm 35.29 MFlops, 9.97 ticks/flop, 7.083 s 500x500 mm - blocking, factor of 20 66.51 MFlops, 5.29 ticks/flop, 3.758 s 500x500 mm - transposed B matrix 39.07 MFlops, 9.00 ticks/flop, 6.398 s 500x500 mm - Robert's algorithm 39.39 MFlops, 8.93 ticks/flop, 6.346 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 16.37 MFlops, 21.49 ticks/flop, 15.267 s 500x500 mm - D. Warner's algorithm, subarray 20x20 34.85 MFlops, 10.09 ticks/flop, 7.127 s 500x500 mm - generic mat* 186.71 MFlops, 1.88 ticks/flop, 1.338 s 500x500 mm - iForth MAT* 186.70 MFlops, 1.88 ticks/flop, 1.338 s CLK 351 MHz 120x120 mm - normal algorithm 86.24 MFlops, 4.06 ticks/flop, 0.040 s 120x120 mm - blocking, factor of 20 68.83 MFlops, 5.09 ticks/flop, 0.050 s 120x120 mm - transposed B matrix 61.87 MFlops, 5.67 ticks/flop, 0.055 s 120x120 mm - Robert's algorithm 63.70 MFlops, 5.50 ticks/flop, 0.054 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 16.49 MFlops, 21.28 ticks/flop, 0.209 s 120x120 mm - D. Warner's algorithm, subarray 20x20 35.97 MFlops, 9.83 ticks/flop, 0.096 s 120x120 mm - generic mat* 190.80 MFlops, 1.83 ticks/flop, 0.018 s 120x120 mm - iForth MAT* 190.84 MFlops, 1.83 ticks/flop, 0.018 s CLK 353 MHz 60x60 mm - normal algorithm 79.11 MFlops, 4.46 ticks/flop, 0.005 s 60x60 mm - blocking, factor of 20 69.01 MFlops, 5.11 ticks/flop, 0.006 s 60x60 mm - transposed B matrix 59.34 MFlops, 5.94 ticks/flop, 0.007 s 60x60 mm - Robert's algorithm 62.44 MFlops, 5.65 ticks/flop, 0.006 s 60x60 mm - T. Maeno's algorithm, subarray 20x20 16.56 MFlops, 21.31 ticks/flop, 0.026 s 60x60 mm - D. Warner's algorithm, subarray 20x20 35.61 MFlops, 9.91 ticks/flop, 0.012 s 60x60 mm - generic mat* 170.97 MFlops, 2.06 ticks/flop, 0.002 s 60x60 mm - iForth MAT* 170.97 MFlops, 2.06 ticks/flop, 0.002 s CLK 902 MHz 500x500 mm - normal algorithm 31.26 MFlops, 28.84 ticks/flop, 7.995 s 500x500 mm - blocking, factor of 20 171.57 MFlops, 5.25 ticks/flop, 1.457 s 500x500 mm - transposed B matrix 128.85 MFlops, 7.00 ticks/flop, 1.940 s 500x500 mm - Robert's algorithm 129.93 MFlops, 6.94 ticks/flop, 1.924 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 104.54 MFlops, 8.62 ticks/flop, 2.391 s 500x500 mm - D. Warner's algorithm, subarray 20x20 96.11 MFlops, 9.38 ticks/flop, 2.601 s 500x500 mm - generic mat* 468.22 MFlops, 1.92 ticks/flop, 0.533 s 500x500 mm - iForth MAT* 469.17 MFlops, 1.92 ticks/flop, 0.532 s CLK 902 MHz 120x120 mm - normal algorithm 265.21 MFlops, 3.38 ticks/flop, 0.013 s 120x120 mm - blocking, factor of 20 198.21 MFlops, 4.53 ticks/flop, 0.017 s 120x120 mm - transposed B matrix 326.42 MFlops, 2.75 ticks/flop, 0.010 s 120x120 mm - Robert's algorithm 356.70 MFlops, 2.51 ticks/flop, 0.009 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 112.83 MFlops, 7.95 ticks/flop, 0.030 s 120x120 mm - D. Warner's algorithm, subarray 20x20 102.49 MFlops, 8.76 ticks/flop, 0.033 s 120x120 mm - generic mat* 682.98 MFlops, 1.31 ticks/flop, 0.005 s 120x120 mm - iForth MAT* 683.12 MFlops, 1.31 ticks/flop, 0.005 s CLK 902 MHz 60x60 mm - normal algorithm 343.83 MFlops, 2.62 ticks/flop, 0.001 s 60x60 mm - blocking, factor of 20 201.55 MFlops, 4.47 ticks/flop, 0.002 s 60x60 mm - transposed B matrix 312.96 MFlops, 2.88 ticks/flop, 0.001 s 60x60 mm - Robert's algorithm 363.42 MFlops, 2.48 ticks/flop, 0.001 s 60x60 mm - T. Maeno's algorithm, subarray 20x20 113.65 MFlops, 7.93 ticks/flop, 0.003 s 60x60 mm - D. Warner's algorithm, subarray 20x20 103.61 MFlops, 8.70 ticks/flop, 0.004 s 60x60 mm - generic mat* 701.42 MFlops, 1.28 ticks/flop, 0.000 s 60x60 mm - iForth MAT* 701.87 MFlops, 1.28 ticks/flop, 0.000 s *) ENDDOC TICKS-RESET ( time compile speed ) ALSO ASSEMBLER \ All assume internal FPU stack is valid : *+DF2@ 1 0 IN/OUT POSTPONE ASM{ [64]fldB, $5B C, ( ebx pop, ) mul64[ebx] fadd, }ASM ADJUST-STACK ; IMMEDIATE : FXCH(1) POSTPONE ASM{ 1 @fexchange }ASM ; IMMEDIATE : FXCH(2) POSTPONE ASM{ 2 @fexchange }ASM ; IMMEDIATE : FXCH(3) POSTPONE ASM{ 3 @fexchange }ASM ; IMMEDIATE PREVIOUS FALSE VALUE SHHT? PRIVATE 0 VALUE FLOPS PRIVATE 0 VALUE t/flops PRIVATE 0 VALUE msecs PRIVATE #120 VALUE N \ #120 =: N \ N DFLOATS =: N_DFLOATS 1 DFLOATS =: DFLOAT1 PRIVATE 4 DFLOATS =: DFLOAT4 PRIVATE 8 DFLOATS =: DFLOAT8 PRIVATE DOUBLE DMATRIX a{{ PRIVATE DOUBLE DMATRIX b{{ PRIVATE DOUBLE DMATRIX c{{ PRIVATE DOUBLE DMATRIX d{{ PRIVATE DOUBLE DMATRIX bt{{ PRIVATE : ?# ( d -- d ) 2DUP OR 0= IF BL HOLD ELSE # ENDIF ;P : .FLOPS ( n -- ) U>D <# BL HOLD # # '.' HOLD # ?# ?# ?# #> TYPE ." MFlops" ;P : .TICKS ( n -- ) U>D <# BL HOLD # # '.' HOLD # ?# ?# BL HOLD ',' HOLD #> TYPE ." ticks/flop" ;P : .SECS ( n -- ) U>D <# 's' HOLD BL HOLD # # # '.' HOLD # ?# ?# BL HOLD ',' HOLD #> TYPE ;P : INIT-RESULT EVAL" TICKS-RESET 0 >S BEGIN " ;P IMMEDIATE : (.RES) ( n -- ) DUP N * N * N * 2* 1 OR TICKS? ( -- n flops dticks ) 3DUP TICKS>US DROP 1 OR DUP >R #100 SWAP */ TO FLOPS #100 1 M*/ ROT UM/MOD NIP TO t/flops R> SWAP #1000 * / TO msecs SHHT? IF EXIT ENDIF FLOPS .FLOPS t/flops .TICKS msecs .SECS ;P : .RESULT EVAL" 1 S+! US? #2000000. D> UNTIL S> (.RES) " ;P IMMEDIATE \ Set coefficients so that result matrix should have row entries equal to (1/2)*n*(n-1)*i in row i : SET-COEFFICIENTS ( -- ) N 0 ?DO N 0 ?DO J S>F FDUP b{{ J I }} DF! a{{ J I }} DF! LOOP LOOP ; : FLUSH-CACHE ( -- ) N 0 ?DO N 0 ?DO 0e d{{ J I }} DF! LOOP LOOP ; : CHECK-RESULT ( -- ) FLOPS 0= IF SHHT? 0= IF CR ." algorithm aborted" ENDIF EXIT ENDIF N N 1- * 2/ S>F 0e FLOCALS| row_sum sum | N 0 ?DO I S>F sum F* TO row_sum N 0 ?DO c{{ J I }} DF@ row_sum F<> IF CR ." error in result entry c{{ " J DEC. I DEC. ." }}: " c{{ J I }} DF@ F. ." <> " row_sum F. UNLOOP UNLOOP EXIT ENDIF a{{ J I }} DF@ J S>F F<> IF CR ." error in result entry a{{ " J DEC. I DEC. ." }}: " a{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT ENDIF b{{ J I }} DF@ J S>F F<> IF CR ." error in result entry b{{ " J DEC. I DEC. ." }}: " b{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT ENDIF LOOP LOOP ;P : GENERIC1() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - generic mat*" 54 HTAB ENDIF INIT-RESULT a{{ b{{ c{{ mat* .RESULT ;P : GENERIC2() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - iForth MAT*" 54 HTAB ENDIF INIT-RESULT a{{ 0 0 }} N b{{ 0 0 }} N c{{ 0 0 }} N N N N DGEMM1 .RESULT ;P : GENERIC() ( n -- ) CASE 1 OF GENERIC1() ENDOF 2 OF GENERIC2() ENDOF CR ." mm - unknown iForth algorithm " . ." not implemented" ENDCASE ;P : NORMAL() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - normal algorithm" 54 HTAB ENDIF INIT-RESULT N 0 ?DO c{{ I 0 }} a{{ I 0 }} >S b{{ 0 0 }} N DFLOATS BOUNDS ?DO S 1 I N N DDOT DF!+ DFLOAT1 +LOOP -S DROP LOOP .RESULT ;P : TRANSPOSE() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - transposed B matrix" 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP N 0 ?DO c{{ I 0 }} N 0 ?DO a{{ J 0 }} 1 bt{{ I 0 }} 1 N DDOT DF!+ LOOP DROP LOOP .RESULT ;P : TRANSPOSE2() ( -- ) 0 0 N DFLOATS LOCALS| N_DFLOATS bt[i,0] a[j,0] | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - transposed B matrix #2" 54 HTAB ENDIF N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP \ out of the loop INIT-RESULT a{{ DADDR TO a[j,0] c{{ DADDR N 0 ?DO bt{{ DADDR TO bt[i,0] N 0 ?DO a[j,0] 1 bt[i,0] 1 N DDOT DF!+ N_DFLOATS +TO bt[i,0] LOOP N_DFLOATS +TO a[j,0] LOOP DROP .RESULT ;P \ from Monica Lam ASPLOS-IV paper : TILING() ( step -- ) DUP 4 N 1+ WITHIN 0= IF SHHT? 0= IF CR ." mm - blocking step size of " DUP DEC. ." is unreasonable" ENDIF DROP EXIT ENDIF SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - blocking, factor of " DUP DEC. 54 HTAB ENDIF 0 0 LOCALS| kk jj step | INIT-RESULT N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk N 0 ?DO I TO jj N 0 ?DO a{{ I kk }} kk step + N MIN kk ?DO DF@+ b{{ I jj }} 1 c{{ J jj }} 1 jj step + N MIN jj - DAXPY LOOP DROP LOOP step +LOOP step +LOOP .RESULT ;P \ ******************************************** \ * Contributed by Robert Debath 26 Nov 1995 * \ * rdebath@cix.compulink.co.uk * \ ******************************************** : ROBERT() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - Robert's algorithm" 54 HTAB ENDIF INIT-RESULT N 0 ?DO b{{ I 0 }} 1 bt{{ 0 I }} N N DCOPY LOOP a{{ 0 0 }} >S N 0 ?DO bt{{ 0 0 }} c{{ I 0 }} N 0 ?DO S 1 3 PICK 1 N DDOT DF!+ SWAP N DFLOAT[] SWAP LOOP 2DROP N DFLOATS S+! LOOP -S .RESULT ;P DOC (* * Matrix Multiply by Dan Warner, Dept. of Mathematics, Clemson University * * mmbu2.f multiplies matrices a and b * a and b are n by n matrices * nb is the blocking parameter. * the tuning guide indicates nb = 50 is reasonable for the * ibm model 530 hence 25 should be reasonable for the 320 * since the 320 has 32k rather than 64k of cache. * Inner loops unrolled to depth of 2 * The loop functions without clean up code at the end only * if the unrolling occurs to a depth k which divides into n * in this case n must be divisible by 2. * The blocking parameter nb must divide into n if the * multiply is to succeed without clean up code at the end. * * converted to c by Mark Smotherman * note that nb must also be divisible by 2 => cannot use 25, so use 20 *) ENDDOC : WARNER() ( nb -- ) 0 0 0 0 0 N DFLOATS LOCALS| N_DFLOATS 'a 'b ii jj kk nb | 0e 0e 0e 0e FLOCALS| s10 s00 s01 s11 | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N nb MOD N 2 MOD OR IF SHHT? 0= IF ." mm - Warner's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " nb DEC. ." and 2." ENDIF EXIT ENDIF nb 2 MOD IF SHHT? 0= IF ." mm - block size for Warner method must be evenly divisible by 2" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - D. Warner's algorithm, subarray " nb 0 .R 'x' EMIT nb 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO I TO ii N 0 ?DO I TO jj nb ii + ii ?DO nb jj + jj ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk nb ii + ii ?DO nb jj + jj ?DO c{{ J I }} DUP DF@+ TO s00 DUP DF@ TO s01 c{{ J 1+ I }} DUP DF@+ TO s10 DUP DF@ TO s11 a{{ J kk }} TO 'a b{{ kk I }} TO 'b nb kk + kk ?DO \ everything on the stack does not pay off 'a DF@ FDUP 'b DF@+ F* +TO s00 DF@ F* +TO s01 'a N_DFLOATS + DF@ FDUP 'b DF@+ F* +TO s10 DF@ F* +TO s11 DFLOAT1 +TO 'a N_DFLOATS +TO 'b LOOP s11 DF! s10 DF! s01 DF! s00 DF! 2 +LOOP 2 +LOOP nb +LOOP nb +LOOP nb +LOOP .RESULT ;P \ Using a little assembler magic... DFLOAT1 =: COL : AWARNER() ( nb -- ) 0 0 0 0 0 0 N DFLOATS LOCALS| ROW a[i][k] b[k][j] c[j][i] ii jj kk nb | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N nb MOD N 2 MOD OR IF SHHT? 0= IF ." mm - Warner's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " nb DEC. ." and 2." ENDIF EXIT ENDIF nb 2 MOD IF SHHT? 0= IF ." mm - block size for Warner method must be evenly divisible by 2" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - D. Warner's algorithm, subarray " nb 0 .R 'x' EMIT nb 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO I TO ii N 0 ?DO I TO jj nb ii + ii ?DO nb jj + jj ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk nb ii + ii ?DO c{{ I jj }} TO c[j][i] nb jj + jj ?DO c[j][i] DUP DF@+ ( F: -- s00 ) DUP DF@ ( F: -- s00 s01 ) c[j][i] ROW + DUP DF@+ ( F: -- s00 s01 s10 ) DUP DF@ ( F: -- s00 s01 s10 s11 ) a{{ J kk }} TO a[i][k] b{{ kk I }} TO b[k][j] ASM{ fpop, fpop, fpop, fpop, }ASM nb kk + kk ?DO a[i][k] b[k][j] *+DF2@ ( +TO s00 ) a[i][k] FXCH(1) b[k][j] COL + *+DF2@ FXCH(1) ( +TO s01 ) a[i][k] ROW + FXCH(2) b[k][j] *+DF2@ FXCH(2) ( +TO s10 ) a[i][k] ROW + FXCH(3) b[k][j] COL + *+DF2@ FXCH(3) ( +TO s11 ) COL +TO a[i][k] ROW +TO b[k][j] LOOP ASM{ fpush, fpush, fpush, fpush, }ASM ( s11 ) DF! ( s10 ) DF! ( s01 ) DF! ( s00 ) DF! COL 2* +TO c[j][i] 2 +LOOP 2 +LOOP nb +LOOP nb +LOOP nb +LOOP .RESULT ;P DOC (* Matrix Multiply tuned for SS-10/30; * Maeno Toshinori * Tokyo Institute of Technology * * Using gcc-2.4.1 (-O2), this program ends in 12 seconds on SS-10/30. * * in original algorithm - sub-area for cache tiling * #define L 20 * #define L2 20 * three 20x20 matrices reside in cache; two may be enough *) ENDDOC : MAENO() ( nb -- ) 0 0 0 0 0 0 N 1- DFLOATS DUP DFLOAT+ LOCALS| (N)DF (N-1)DF a[j,i] b[i,k] it kt i2 kk lparm | 0e 0e 0e 0e 0e 0e 0e 0e FLOCALS| t0 t1 t2 t3 t4 t5 t6 t7 | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N lparm MOD N 4 MOD OR IF SHHT? 0= IF ." mm - Maeno's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " lparm DEC. ." and 4." ENDIF EXIT ENDIF lparm 4 MOD IF SHHT? 0= IF ." mm - block size for Maeno's method must be evenly divisible by 4" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - T. Maeno's algorithm, subarray " lparm 0 .R 'x' EMIT lparm 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO it i2 ?DO CLEAR t0 CLEAR t1 CLEAR t2 CLEAR t3 CLEAR t4 CLEAR t5 CLEAR t6 CLEAR t7 a{{ I kk }} TO a[j,i] b{{ kk J }} TO b[i,k] kt kk ?DO a[j,i] DF@+ TO a[j,i] FDUP b[i,k] DUP DF@+ F* +TO t0 FDUP DF@+ F* +TO t1 FDUP DF@+ F* +TO t2 DF@ F* +TO t3 a[j,i] (N-1)DF + DF@ FDUP DF@+ F* +TO t4 FDUP DF@+ F* +TO t5 FDUP DF@+ F* +TO t6 DF@ F* +TO t7 (N)DF +TO b[i,k] LOOP t0 c{{ I J }} DF+!+ t1 DF+!+ t2 DF+!+ t3 DF+! t4 c{{ I 1+ J }} DF+!+ t5 DF+!+ t6 DF+!+ t7 DF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP .RESULT ;P : GET-MEM a{{ N N }}malloc malloc-fail? b{{ N N }}malloc malloc-fail? OR bt{{ N N }}malloc malloc-fail? OR c{{ N N }}malloc malloc-fail? OR d{{ N N }}malloc malloc-fail? OR ABORT" MM :: out of core" ;P : RELEASE-MEM d{{ }}free c{{ }}free bt{{ }}free b{{ }}free a{{ }}free ;P : MM ( char n -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 ENDIF LOCAL ur GET-MEM SET-COEFFICIENTS FLUSH-CACHE CASE 'n' OF NORMAL() ENDOF 'g' OF ur GENERIC() ENDOF 't' OF TRANSPOSE() ENDOF 'T' OF TRANSPOSE2() ENDOF 'b' OF ur TILING() ENDOF 'r' OF ROBERT() ENDOF 'm' OF ur MAENO() ENDOF 'w' OF ur WARNER() ENDOF 'a' OF ur AWARNER() ENDOF CR ." `" DUP EMIT ." ' is an invalid algorithm" ENDCASE CHECK-RESULT RELEASE-MEM ; : ALL-TESTS ( -- ) CR ." CLK " CALIBRATE PROCESSOR-CLOCK DEC. ." MHz" 'n' mm EKEY? IF EKEY DROP EXIT ELSE 'b' 20 mm 't' mm ENDIF EKEY? IF EKEY DROP EXIT ELSE 'T' mm 'r' mm ENDIF EKEY? IF EKEY DROP EXIT ELSE 'm' 20 mm 'w' 20 mm 'a' 20 mm ENDIF 'g' 1 mm 'g' 2 mm ; : NEXT-N ( -- ) N #1200 #1000 */ TO N #17 1 DO N I 2^x DUP 9 #10 */ SWAP #11 #10 */ WITHIN IF I 2^x TO N LEAVE ENDIF LOOP ;P : MEGAFLOPS ( sel -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 ENDIF 0 0 SHHT? N LOCALS| old-N silence? flp ix ur algo | TRUE TO SHHT? CR ." Algorithm = '" algo EMIT &' EMIT ur IF ." , parameter is " ur 0 .R ENDIF ." , clock = " CALIBRATE PROCESSOR-CLOCK DEC. ." MHz" #32 TO N #17 0 DO CR ." testing data size " N 3 .R ." x " N 3 .R ':' EMIT algo ur mm FLOPS flp > IF FLOPS TO flp N TO ix ENDIF FLOPS .FLOPS NEXT-N CLEAR FLOPS EKEY? IF EKEY DROP LEAVE ENDIF LOOP ix IF CR CR ." Maximum: " flp .FLOPS ." at N = " ix DEC. ENDIF silence? TO SHHT? old-N TO N ; :ABOUT CR ." -------------------------- Double-precision benchmark --------------------------" CR CR ." Try: 'n' mm -- normal" CR ." 't' mm -- with transposed b matrix" CR ." 'T' mm -- with transposed b matrix, different cosmetics" CR ." 'b' n mm -- using blocking by n, 4 < n < " N DEC. CR ." 'r' mm -- using Robert's algorithm" CR ." 'm' n mm -- using Maeno's algorithm with blocking factor n" CR ." 'w' n mm -- using Warner's algorithm with blocking factor n" CR ." 'a' n mm -- using aWarner's algorithm with blocking factor n" CR ." 'g' n mm -- generic iForth algorithm, n = {1,2}" CR CR ." ALL-TESTS -- test all algorithms" CR ." ( x ) MEGAFLOPS -- find optimum size for this machine, algorithm 'x'" ; .ABOUT -mm CR DEPRIVE CR .( Compile time = ) US? #1000 UM/MOD NIP DEC. .( ms, assuming clockspeed = ) PROCESSOR-CLOCK DEC. .( MHz.) (* End of Source *)