The MM benchmark code for different systems

The matrix multiply floating-point benchmark in Forth

Benchmark results

MM is a simple benchmark that multiplies two large FP matrices using several different approaches. Seemingly unimportant code changes result in large differences in run-time. However, these variations only show up with an optimizing Forth compiler that schedules FP instructions correctly. Once scheduling is in effect, the next problem is the large performance hit current CPUs take when their memory accesses are insufficiently cached. The Forth programmer solves this by using a high-level algorithm that accesses memory in an "orderly" manner. "Orderly" here means accessing the matrices block-wise, in such a way that they stay in the data cache as long as possible. The simplest possible rule is: NEVER access a matrix by iterating over columns, always use row-wise access (using FSL conventions). Ignoring this effect can incur a performance hit of over 500% for large datasets (N larger than, say, 100 with MM).
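To make the row-wise rule concrete, here is a minimal sketch in C (not the benchmark's own code). It assumes square matrices stored row-major in flat arrays of doubles; the names mm_normal and mm_transposed are made up for this illustration. The first routine walks down a column of b in its inner loop, the second copies b into a transposed scratch matrix once so that both operands are read row-wise.

```c
#include <stddef.h>
#include <stdlib.h>

/* Naive i-j-k order: the inner loop walks down a column of b,
   so consecutive reads of b are n elements apart in memory. */
void mm_normal(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];   /* stride n in b */
            c[i * n + j] = sum;
        }
}

/* Transposed-B variant: b is copied into bt once, after which the
   inner loop reads both a and bt row-wise (stride 1).
   Returns 0 on success, -1 if the scratch matrix cannot be allocated. */
int mm_transposed(size_t n, const double *a, const double *b, double *c)
{
    double *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];  /* stride 1 in both */
            c[i * n + j] = sum;
        }

    free(bt);
    return 0;
}
```

Both routines compute the same product; only the order in which memory is touched differs. The cost of the extra copy is of order N*N, against N*N*N for the multiply itself.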

The presented code is not a verbatim translation of Mark Smotherman's original "C" code. In the conversion process I have favored constructs that fit the Forth language better. However, I have tried not to change the core ideas of the original algorithms too much. I stress that I actually use the given algorithms for my numerical work; this is not an artificial benchmark.

Shown below is the result of the ALL-TESTS word. This word prints the number of floating-point operations executed per second. As this depends on the CPU speed, I also added the average number of clock ticks needed per FP operation (F* and F+ in this case). On a current x86 CPU with an exceptionally good compiler the best result you may find is 0.5 ticks/flop, and 2 to 6 ticks/flop is already very good. The tick count is a useful figure for comparing (with some care) wild mixes of languages, compilers, CPUs and platforms.
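The two figures can be related with a few lines of arithmetic. The sketch below is in C and is not taken from the benchmark source. It assumes that an N x N multiply is counted as 2*N*N*N operations (one F* and one F+ per inner-loop step), which is consistent with the times and MFlops values in the tables; the function names are invented for this example.

```c
/* Assumed operation count for an n x n matrix multiply:
   n*n*n multiplications (F*) plus n*n*n additions (F+). */
double mm_flops(double n)
{
    return 2.0 * n * n * n;
}

/* Millions of floating-point operations per second. */
double mflops(double n, double seconds)
{
    return mm_flops(n) / seconds / 1.0e6;
}

/* Average clock ticks per floating-point operation:
   (ticks per second) / (flops per second), both in millions. */
double ticks_per_flop(double clock_mhz, double n, double seconds)
{
    return clock_mhz / mflops(n, seconds);
}
```

For example, 500x500 in 1.127 s on a 2998 MHz clock gives roughly 222 MFlops and 13.5 ticks/flop, close to the 221.68 MFlops and 13.52 ticks/flop printed for that row; the small difference presumably comes from the rounding of the printed time.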

The results of the several Forth compilers can be compared with the results for optimized "C" code generated with Microsoft's C++ 4.0. It was not trivial to generate correct code with all the various optimization switches set to their optimal positions (mm.c was actually incorrect in a subtle way). You will note that the iForth code for MAT* easily beats C for most algorithms. The exceptions, however, are the MAENO and WARNER algorithms. These are ideally suited to the register allocation strategies of a good C compiler, and it shows! Not all is well, though. In the mm.c file I've added two alternative algorithms that make 'minor' modifications to WARNER. These 'minor' changes (removal of local variables) make the code about twice as slow. Also note that only the NORMAL and BLOCKING algorithms are hassle-free in use: both WARNER and MAENO have restrictions on the array size. The iForth MAT* operator has no known restrictions and doesn't need a transposed copy of matrix B. One other remark: the original C code had all arrays allocated statically. This is, of course, unusable in practice, so I've modified the code to allocate dynamically. This increases run-time by about 10% (a previous version of these pages erroneously showed the results of the unrealistic "C" implementation).
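As an illustration of the static-versus-dynamic remark, a dynamically allocated matrix might be obtained as sketched below. This is an assumption about what such a change could look like, not the code actually used in the modified mm.c; matrix_alloc is a hypothetical helper.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a zero-filled n x n matrix of doubles in one row-major block.
   Returns NULL for n == 0, on size overflow, or when memory runs out.
   The caller releases the matrix with free(). */
double *matrix_alloc(size_t n)
{
    if (n == 0 || n > SIZE_MAX / n)
        return NULL;
    return calloc(n * n, sizeof(double));
}
```

With the size chosen at run time, the compiler no longer knows the row length when it compiles the loops, which is one plausible reason for dynamic allocation costing some speed compared with fixed-size static arrays.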

With 'r' MEGAFLOPS the software runs the specified algorithm (here 'r') for a number of steadily increasing matrix sizes. On my hardware a very severe performance breakdown is noticeable above a certain size. With n TO N ALL-TESTS you run the tests with a specified size n. I found that very slight changes in n (e.g. from 74 to 75) can cause a speed difference as large as a factor of 2. This has to do with the discrete unroll factors in the BLAS level I routines (no code shown).

My conclusion is that at the moment most of the Forth compilers tested have no good FP support. The generated code varies from abysmal to mediocre compared to what a good "C" compiler offers, or what can be done by adding native BLAS support (i.e., the iForth MAT* results). In all cases you should opt for the hardware FPU stack; it is between 2 and 3 times faster than software emulation. (IMHO the hardware stack option offered by SwiftForth is not usable because it is too easy to overflow the FPU stack in high-level Forth. E.g., there are FSL algorithms that use more than 8 FP stack positions.)

Unless you are using iForth, it doesn't matter much which algorithm is used. In iForth the generic MAT* algorithm (which uses Pentium-optimized BLAS) cannot be beaten. When commercial compilers get better, the 'blocking' algorithm with size 20 will be the best choice when N goes above about 100. For smaller matrices the 'normal' algorithm can perform spectacularly better.
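For readers unfamiliar with the 'blocking' idea, the following C sketch shows generic loop blocking (tiling) for a matrix multiply. It is an illustration under stated assumptions, not the benchmark's Forth or C source: matrices are square, row-major, flat arrays of doubles, and mm_blocked with its block parameter is a name chosen here. A call with block = 20 corresponds to the "factor of 20" rows in the tables.

```c
#include <stddef.h>

/* c = a * b for n x n row-major matrices. The j and k loops are cut
   into tiles of at most block x block elements, so that a tile of b
   is reused for every row i while it is still in the data cache. */
void mm_blocked(size_t n, size_t block,
                const double *a, const double *b, double *c)
{
    if (block == 0)
        block = 1;                      /* guard against a zero step */

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;                     /* partial sums accumulate in c */

    for (size_t jj = 0; jj < n; jj += block) {
        size_t jmax = (jj + block < n) ? jj + block : n;
        for (size_t kk = 0; kk < n; kk += block) {
            size_t kmax = (kk + block < n) ? kk + block : n;
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < jmax; j++) {
                    double sum = c[i * n + j];
                    for (size_t k = kk; k < kmax; k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = sum;
                }
        }
    }
}
```

The jmax and kmax clamps let this version accept any N, not only multiples of the block size. The price is extra loop overhead and repeated loads and stores of c, which fits the observation that blocking only pays off once the matrices no longer fit in the cache.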

Please send (updated) results to mhx@iae.nl. Be sure to mention Forth version, CPU, clockspeed and OS.

Notes


P55-166 MHz, 48 MB, "C" code compiled with MS-C++ 4.0

	500x500 mm - normal algorithm                      5.00 MFlops, utime 49.952 secs
	500x500 mm - blocking, factor of  20              14.00 MFlops, utime 17.856 secs
	500x500 mm - transposed b matrix                  20.26 MFlops, utime 12.338 secs
	500x500 mm - Robert's algorithm                   20.25 MFlops, utime 12.347 secs
	500x500 mm -  20x 20 subarray (from T. Maeno)     30.19 MFlops, utime 8.282 secs
	500x500 mm -  20x 20 subarray (from D. Warner)    39.00 MFlops, utime 6.410 secs

	120x120 mm - normal algorithm                     14.53 MFlops, utime 0.238 secs
	120x120 mm - blocking, factor of  20              14.38 MFlops, utime 0.240 secs
	120x120 mm - transposed b matrix                  32.49 MFlops, utime 0.106 secs
	120x120 mm - Robert's algorithm                   32.11 MFlops, utime 0.108 secs
	120x120 mm -  20x 20 subarray (from T. Maeno)     32.11 MFlops, utime 0.108 secs
	120x120 mm -  20x 20 subarray (from D. Warner)    43.13 MFlops, utime 0.080 secs

	 60x 60 mm - normal algorithm                     17.97 MFlops, utime 0.024 secs
	 60x 60 mm - blocking, factor of  20              14.48 MFlops, utime 0.030 secs
	 60x 60 mm - transposed b matrix                  32.19 MFlops, utime 0.013 secs
	 60x 60 mm - Robert's algorithm                   31.72 MFlops, utime 0.014 secs
	 60x 60 mm -  20x 20 subarray (from T. Maeno)     32.19 MFlops, utime 0.013 secs
	 60x 60 mm -  20x 20 subarray (from D. Warner)    43.20 MFlops, utime 0.010 secs

P55-166 MHz, 48 MB, iForth 1.11, single-precision

	500x500 mm - normal algorithm                            9.66 MFlops,  17.06 ticks/flop,  25.861 s
	500x500 mm - blocking, factor of 20                     24.21 MFlops,   6.81 ticks/flop,  10.323 s
	500x500 mm - transposed B matrix                        36.92 MFlops,   4.46 ticks/flop,   6.770 s
	500x500 mm - Robert's algorithm                         37.65 MFlops,   4.38 ticks/flop,   6.639 s
	500x500 mm - T. Maeno's algorithm, subarray 20x20       10.13 MFlops,  16.27 ticks/flop,  24.655 s
	500x500 mm - D. Warner's algorithm, subarray 20x20      10.44 MFlops,  15.79 ticks/flop,  23.929 s
	500x500 mm - generic iForth MAT*                        78.23 MFlops,   2.10 ticks/flop,   3.195 s

	120x120 mm - normal algorithm                           53.58 MFlops,   3.07 ticks/flop,   0.064 s
	120x120 mm - blocking, factor of 20                     24.13 MFlops,   6.83 ticks/flop,   0.143 s
	120x120 mm - transposed B matrix                        42.58 MFlops,   3.87 ticks/flop,   0.081 s
	120x120 mm - Robert's algorithm                         46.47 MFlops,   3.55 ticks/flop,   0.074 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20       10.11 MFlops,  16.31 ticks/flop,   0.341 s
	120x120 mm - D. Warner's algorithm, subarray 20x20      10.40 MFlops,  15.86 ticks/flop,   0.332 s
	120x120 mm - generic iForth MAT*                        72.49 MFlops,   2.27 ticks/flop,   0.047 s

	60x60 mm - normal algorithm                             38.98 MFlops,   4.23 ticks/flop,   0.011 s
	60x60 mm - blocking, factor of 20                       23.69 MFlops,   6.96 ticks/flop,   0.018 s
	60x60 mm - transposed B matrix                          39.14 MFlops,   4.21 ticks/flop,   0.011 s
	60x60 mm - Robert's algorithm                           45.84 MFlops,   3.59 ticks/flop,   0.009 s
	60x60 mm - T. Maeno's algorithm, subarray 20x20          9.94 MFlops,  16.59 ticks/flop,   0.043 s
	60x60 mm - D. Warner's algorithm, subarray 20x20        10.32 MFlops,  15.97 ticks/flop,   0.041 s
	60x60 mm - generic iForth MAT*                          62.86 MFlops,   2.62 ticks/flop,   0.006 s

P55-166 MHz, 48 MB, iForth 1.11, double-precision, NT 4.0

        500x500 mm - normal algorithm                            6.32 MFlops,  26.09 ticks/flop,  39.543 s
        500x500 mm - blocking, factor of 20                     20.52 MFlops,   8.03 ticks/flop,  12.178 s
        500x500 mm - transposed B matrix                        21.42 MFlops,   7.70 ticks/flop,  11.666 s
        500x500 mm - Robert's algorithm                         21.89 MFlops,   7.53 ticks/flop,  11.418 s
        500x500 mm - T. Maeno's algorithm, subarray 20x20       13.61 MFlops,  12.12 ticks/flop,  18.365 s
        500x500 mm - D. Warner's algorithm, subarray 20x20      11.87 MFlops,  13.89 ticks/flop,  21.054 s
        500x500 mm - generic mat*                               63.49 MFlops,   2.59 ticks/flop,   3.937 s
        500x500 mm - iForth MAT*                                63.50 MFlops,   2.59 ticks/flop,   3.936 s

        120x120 mm - normal algorithm                           27.77 MFlops,   5.94 ticks/flop,   0.124 s
        120x120 mm - blocking, factor of 20                     21.36 MFlops,   7.72 ticks/flop,   0.161 s
        120x120 mm - transposed B matrix                        27.35 MFlops,   6.03 ticks/flop,   0.126 s
        120x120 mm - Robert's algorithm                         29.32 MFlops,   5.62 ticks/flop,   0.117 s
        120x120 mm - T. Maeno's algorithm, subarray 20x20       13.89 MFlops,  11.87 ticks/flop,   0.248 s
        120x120 mm - D. Warner's algorithm, subarray 20x20      12.09 MFlops,  13.64 ticks/flop,   0.285 s
        120x120 mm - generic mat*                               63.67 MFlops,   2.59 ticks/flop,   0.054 s
        120x120 mm - iForth MAT*                                63.62 MFlops,   2.59 ticks/flop,   0.054 s

        60x60 mm - normal algorithm                             38.09 MFlops,   4.30 ticks/flop,   0.011 s
        60x60 mm - blocking, factor of 20                       20.84 MFlops,   7.86 ticks/flop,   0.020 s
        60x60 mm - transposed B matrix                          28.17 MFlops,   5.81 ticks/flop,   0.015 s
        60x60 mm - Robert's algorithm                           32.38 MFlops,   5.06 ticks/flop,   0.013 s
        60x60 mm - T. Maeno's algorithm, subarray 20x20         13.71 MFlops,  11.95 ticks/flop,   0.031 s
        60x60 mm - D. Warner's algorithm, subarray 20x20        11.91 MFlops,  13.76 ticks/flop,   0.036 s
        60x60 mm - generic mat*                                 56.58 MFlops,   2.89 ticks/flop,   0.007 s
        60x60 mm - iForth MAT*                                  56.74 MFlops,   2.88 ticks/flop,   0.007 s

P55-166 MHz, 48 MB, ProForth VFX for WIN32, Version: 3.000.RC5.0018, Build date: 17 December 1999, double-precision, absurd inlining

	500x500 mm - normal algorithm                            4.37 MFlops,  37.49 ticks/flop,  57.157 s
	500x500 mm - blocking, factor of 20                      7.61 MFlops,  21.54 ticks/flop,  32.838 s
	500x500 mm - transposed B matrix                         9.22 MFlops,  17.77 ticks/flop,  27.098 s
	500x500 mm - Robert's algorithm                          9.32 MFlops,  17.58 ticks/flop,  26.802 s
	500x500 mm - T. Maeno's algorithm, subarray 20x20        5.71 MFlops,  28.70 ticks/flop,  43.751 s
	500x500 mm - D. Warner's algorithm, subarray 20x20       7.69 MFlops,  21.30 ticks/flop,  32.480 s 

	120x120 mm - normal algorithm                            8.97 MFlops,  18.39 ticks/flop,   0.385 s
	120x120 mm - blocking, factor of 20                      8.10 MFlops,  20.35 ticks/flop,   0.426 s
	120x120 mm - transposed B matrix                         9.85 MFlops,  16.74 ticks/flop,   0.350 s
	120x120 mm - Robert's algorithm                         10.09 MFlops,  16.33 ticks/flop,   0.342 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20        6.32 MFlops,  26.08 ticks/flop,   0.546 s
	120x120 mm - D. Warner's algorithm, subarray 20x20       8.06 MFlops,  20.45 ticks/flop,   0.428 s 

P55-166 MHz, 48 MB, SwiftForth 2.00.2, NT 4.0, double-precision, software stack

	120x120 mm - normal algorithm                            1.75 MFlops,  94.07 ticks/flop,   1.970 s
	120x120 mm - blocking, factor of 20                      1.63 MFlops, 100.78 ticks/flop,   2.110 s
	120x120 mm - transposed B matrix                         1.74 MFlops,  94.40 ticks/flop,   1.977 s
	120x120 mm - Robert's algorithm                          1.80 MFlops,  91.56 ticks/flop,   1.917 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20        1.42 MFlops, 115.49 ticks/flop,   2.419 s
	120x120 mm - D. Warner's algorithm, subarray 20x20       1.80 MFlops,  91.65 ticks/flop,   1.919 s 

P55-166 MHz, 48 MB, SwiftForth 2.00.2, NT 4.0, double-precision, hardware stack

	120x120 mm - normal algorithm                            5.19 MFlops,  31.77 ticks/flop,   0.665 s
	120x120 mm - blocking, factor of 20                      4.67 MFlops,  35.32 ticks/flop,   0.739 s
	120x120 mm - transposed B matrix                         5.21 MFlops,  31.66 ticks/flop,   0.663 s
	120x120 mm - Robert's algorithm                          5.71 MFlops,  28.89 ticks/flop,   0.605 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20        2.96 MFlops,  55.70 ticks/flop,   1.166 s
	120x120 mm - D. Warner's algorithm, subarray 20x20       3.50 MFlops,  47.02 ticks/flop,   0.985 s 

P55-166 MHz, 48 MB, gforth-f.exe 0.4.9, NT 4.0

	120x120 mm - normal algorithm                            2.20 MFlops,  77.23 ticks/flop,   1.570 s
	120x120 mm - blocking, factor of 20                      1.34 MFlops, 126.61 ticks/flop,   2.573 s
	120x120 mm - transposed B matrix                         2.20 MFlops,  77.04 ticks/flop,   1.566 s
	120x120 mm - Robert's algorithm                          2.24 MFlops,  75.61 ticks/flop,   1.537 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20        1.24 MFlops, 136.26 ticks/flop,   2.770 s
	120x120 mm - D. Warner's algorithm, subarray 20x20       1.43 MFlops, 118.68 ticks/flop,   2.412 s

	  60x60 mm - normal algorithm                            2.21 MFlops,  74.05 ticks/flop,   0.195 s
	  60x60 mm - blocking, factor of 20                      1.28 MFlops, 127.77 ticks/flop,   0.336 s
	  60x60 mm - transposed B matrix                         1.98 MFlops,  82.52 ticks/flop,   0.217 s
	  60x60 mm - Robert's algorithm                          2.05 MFlops,  79.94 ticks/flop,   0.210 s
	  60x60 mm - T. Maeno's algorithm, subarray 20x20        1.18 MFlops, 138.16 ticks/flop,   0.363 s
	  60x60 mm - D. Warner's algorithm, subarray 20x20       1.36 MFlops, 120.44 ticks/flop,   0.317 s

P55-166 MHz, 48 MB, Win32Forth 3.5, NT 4.0

	500x500 mm (aborted - over 6 minutes per benchmark)

	120x120 mm - normal algorithm                            0.93 MFlops, 176.76 ticks/flop,   3.702 s
	120x120 mm - blocking, factor of 20                      0.55 MFlops, 295.18 ticks/flop,   6.182 s
	120x120 mm - transposed B matrix                         0.87 MFlops, 189.46 ticks/flop,   3.968 s
	120x120 mm - Robert's algorithm                          0.90 MFlops, 181.76 ticks/flop,   3.807 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20        0.45 MFlops, 366.00 ticks/flop,   7.666 s
	120x120 mm - D. Warner's algorithm, subarray 20x20       0.57 MFlops, 288.33 ticks/flop,   6.039 s

	  60x60 mm - normal algorithm                            0.92 MFlops, 178.71 ticks/flop,   0.467 s
	  60x60 mm - blocking, factor of 20                      0.56 MFlops, 290.60 ticks/flop,   0.760 s
	  60x60 mm - transposed B matrix                         0.80 MFlops, 203.84 ticks/flop,   0.533 s
	  60x60 mm - Robert's algorithm                          0.86 MFlops, 191.84 ticks/flop,   0.502 s
	  60x60 mm - T. Maeno's algorithm, subarray 20x20        0.44 MFlops, 372.79 ticks/flop,   0.976 s
	  60x60 mm - D. Warner's algorithm, subarray 20x20       0.53 MFlops, 307.52 ticks/flop,   0.805 s 

PII-350 MHz, 128 MB, iForth 1.11, Linux 2.0, single-precision

        500x500 mm - normal algorithm                           61.25 MFlops,   5.77 ticks/flop,   4.081 s
        500x500 mm - blocking, factor of 20                     74.84 MFlops,   4.72 ticks/flop,   3.340 s
        500x500 mm - transposed B matrix                        67.44 MFlops,   5.24 ticks/flop,   3.706 s
        500x500 mm - Robert's algorithm                         67.76 MFlops,   5.22 ticks/flop,   3.689 s
        500x500 mm - T. Maeno's algorithm, subarray 20x20       16.35 MFlops,  21.64 ticks/flop,  15.285 s
        500x500 mm - D. Warner's algorithm, subarray 20x20      31.50 MFlops,  11.23 ticks/flop,   7.934 s
        500x500 mm - generic iForth MAT*                       242.22 MFlops,   1.46 ticks/flop,   1.032 s 

        120x120 mm - normal algorithm                          136.37 MFlops,   2.58 ticks/flop,   0.025 s
        120x120 mm - blocking, factor of 20                     76.26 MFlops,   4.62 ticks/flop,   0.045 s
        120x120 mm - transposed B matrix                       138.66 MFlops,   2.54 ticks/flop,   0.024 s
        120x120 mm - Robert's algorithm                        144.34 MFlops,   2.44 ticks/flop,   0.023 s
        120x120 mm - T. Maeno's algorithm, subarray 20x20       16.35 MFlops,  21.58 ticks/flop,   0.211 s
        120x120 mm - D. Warner's algorithm, subarray 20x20      31.64 MFlops,  11.15 ticks/flop,   0.109 s
        120x120 mm - generic iForth MAT*                       201.97 MFlops,   1.74 ticks/flop,   0.017 s 

        60x60 mm - normal algorithm                            136.41 MFlops,   2.58 ticks/flop,   0.003 s
        60x60 mm - blocking, factor of 20                       75.14 MFlops,   4.69 ticks/flop,   0.005 s
        60x60 mm - transposed B matrix                         125.37 MFlops,   2.81 ticks/flop,   0.003 s
        60x60 mm - Robert's algorithm                          144.44 MFlops,   2.44 ticks/flop,   0.002 s
        60x60 mm - T. Maeno's algorithm, subarray 20x20         16.29 MFlops,  21.66 ticks/flop,   0.026 s
        60x60 mm - D. Warner's algorithm, subarray 20x20        31.45 MFlops,  11.22 ticks/flop,   0.013 s
        60x60 mm - generic iForth MAT*                         193.59 MFlops,   1.82 ticks/flop,   0.002 s 

PII-350 MHz, 128 MB, iForth 1.11, Linux 2.0, double-precision

        500x500 mm - normal algorithm                           35.41 MFlops,   9.99 ticks/flop,   7.059 s
        500x500 mm - blocking, factor of 20                     65.90 MFlops,   5.37 ticks/flop,   3.793 s
        500x500 mm - transposed B matrix                        39.35 MFlops,   8.99 ticks/flop,   6.351 s
        500x500 mm - Robert's algorithm                         39.57 MFlops,   8.94 ticks/flop,   6.317 s
        500x500 mm - T. Maeno's algorithm, subarray 20x20       15.51 MFlops,  22.81 ticks/flop,  16.115 s
        500x500 mm - D. Warner's algorithm, subarray 20x20      27.38 MFlops,  12.92 ticks/flop,   9.127 s
        500x500 mm - generic mat*                              187.74 MFlops,   1.88 ticks/flop,   1.331 s
        500x500 mm - iForth MAT*                               187.77 MFlops,   1.88 ticks/flop,   1.331 s

        120x120 mm - normal algorithm                           86.43 MFlops,   4.08 ticks/flop,   0.039 s
        120x120 mm - blocking, factor of 20                     67.71 MFlops,   5.21 ticks/flop,   0.051 s
        120x120 mm - transposed B matrix                        61.86 MFlops,   5.70 ticks/flop,   0.055 s
        120x120 mm - Robert's algorithm                         63.82 MFlops,   5.53 ticks/flop,   0.054 s
        120x120 mm - T. Maeno's algorithm, subarray 20x20       15.60 MFlops,  22.61 ticks/flop,   0.221 s
        120x120 mm - D. Warner's algorithm, subarray 20x20      27.69 MFlops,  12.74 ticks/flop,   0.124 s
        120x120 mm - generic mat*                              191.96 MFlops,   1.83 ticks/flop,   0.018 s
        120x120 mm - iForth MAT*                               192.00 MFlops,   1.83 ticks/flop,   0.017 s

        60x60 mm - normal algorithm                             78.82 MFlops,   4.45 ticks/flop,   0.005 s
        60x60 mm - blocking, factor of 20                       66.81 MFlops,   5.25 ticks/flop,   0.006 s
        60x60 mm - transposed B matrix                          58.40 MFlops,   6.00 ticks/flop,   0.007 s
        60x60 mm - Robert's algorithm                           61.70 MFlops,   5.68 ticks/flop,   0.007 s
        60x60 mm - T. Maeno's algorithm, subarray 20x20         15.49 MFlops,  22.65 ticks/flop,   0.027 s
        60x60 mm - D. Warner's algorithm, subarray 20x20        27.47 MFlops,  12.77 ticks/flop,   0.015 s
        60x60 mm - generic mat*                                169.99 MFlops,   2.06 ticks/flop,   0.002 s
        60x60 mm - iForth MAT*                                 170.02 MFlops,   2.06 ticks/flop,   0.002 s 

PII-350 MHz, 128 MB, MPE ProForth VFX for Windows, Version: 3.10, WinNT

	ProForth VFX for i386+, Version: 3.10.0007, Build date: 23 June 2000
	PII-350, 128Mb RAM, WinNT 4.0
	80-bit extended floats for F@ and friends
	Absolutely no FP optimisation at all!

	Using NDP stack
	===============
	500x500 mm - normal algorithm                           19.84 MFlops,  17.48 ticks/flop,  12.599 s
	500x500 mm - blocking, factor of 20                     28.61 MFlops,  12.12 ticks/flop,   8.736 s
	500x500 mm - transposed B matrix                        23.28 MFlops,  14.90 ticks/flop,  10.737 s
	500x500 mm - Robert's algorithm                         22.93 MFlops,  15.13 ticks/flop,  10.900 s
	500x500 mm - T. Maeno's algorithm, subarray 20x20       29.13 MFlops,  11.90 ticks/flop,   8.579 s
	500x500 mm - D. Warner's algorithm, subarray 20x20      31.93 MFlops,  10.86 ticks/flop,   7.828 s 

	120x120 mm - normal algorithm                           38.85 MFlops,   8.93 ticks/flop,   0.088 s
	120x120 mm - blocking, factor of 20                     29.56 MFlops,  11.73 ticks/flop,   0.116 s
	120x120 mm - transposed B matrix                        34.13 MFlops,  10.16 ticks/flop,   0.101 s
	120x120 mm - Robert's algorithm                         34.90 MFlops,   9.94 ticks/flop,   0.099 s
	120x120 mm - T. Maeno's algorithm, subarray 20x20       29.77 MFlops,  11.65 ticks/flop,   0.116 s
	120x120 mm - D. Warner's algorithm, subarray 20x20      33.55 MFlops,  10.34 ticks/flop,   0.102 s 

	60x60 mm - normal algorithm                             37.90 MFlops,   9.20 ticks/flop,   0.011 s
	60x60 mm - blocking, factor of 20                       29.36 MFlops,  11.88 ticks/flop,   0.014 s
	60x60 mm - transposed B matrix                          32.55 MFlops,  10.71 ticks/flop,   0.013 s
	60x60 mm - Robert's algorithm                           33.63 MFlops,  10.37 ticks/flop,   0.012 s
	60x60 mm - T. Maeno's algorithm, subarray 20x20         29.54 MFlops,  11.81 ticks/flop,   0.014 s
	60x60 mm - D. Warner's algorithm, subarray 20x20        33.36 MFlops,  10.46 ticks/flop,   0.012 s 

(submitted by)
	Stephen Pelc, sfp@mpeltd.demon.co.uk
	MicroProcessor Engineering Ltd - More Real, Less Time
	133 Hill Lane, Southampton SO15 5AF, England
	tel: +44 23 80 631441, fax: +44 23 80 339691
	web: http://www.mpeltd.demon.co.uk
(original code, no MFlops, for ProForth VFX for Windows)

AMD Athlon 900 MHz, 128 MB, iForth 1.12.1174, Windows 2000, double-precision

	500x500 mm - normal algorithm                           31.31 MFlops,  28.55 ticks/flop,   7.983 s
	500x500 mm - blocking, factor of 20                    168.74 MFlops,   5.29 ticks/flop,   1.481 s
	500x500 mm - transposed B matrix                       119.44 MFlops,   7.48 ticks/flop,   2.093 s
	500x500 mm - transposed B matrix #2                    121.34 MFlops,   7.36 ticks/flop,   2.060 s
	500x500 mm - Robert's algorithm                        119.88 MFlops,   7.45 ticks/flop,   2.085 s
	500x500 mm - T. Maeno's algorithm, subarray 20x20      199.49 MFlops,   4.48 ticks/flop,   1.253 s
	500x500 mm - D. Warner's algorithm, subarray 20x20     149.88 MFlops,   5.96 ticks/flop,   1.667 s
	500x500 mm - generic mat*                              453.47 MFlops,   1.97 ticks/flop,   0.551 s
	500x500 mm - iForth DGEMM1                             453.30 MFlops,   1.97 ticks/flop,   0.551 s

	120x120 mm - normal algorithm                          265.16 MFlops,   3.37 ticks/flop,  13.033 ms
	120x120 mm - blocking, factor of 20                    193.48 MFlops,   4.62 ticks/flop,  17.861 ms
	120x120 mm - transposed B matrix                       332.64 MFlops,   2.69 ticks/flop,  10.389 ms
	120x120 mm - transposed B matrix #2                    379.05 MFlops,   2.36 ticks/flop,   9.117 ms
	120x120 mm - Robert's algorithm                        363.68 MFlops,   2.46 ticks/flop,   9.502 ms
	120x120 mm - T. Maeno's algorithm, subarray 20x20      243.99 MFlops,   3.66 ticks/flop,  14.164 ms
	120x120 mm - D. Warner's algorithm, subarray 20x20     178.15 MFlops,   5.02 ticks/flop,  19.399 ms
	120x120 mm - generic mat*                              675.42 MFlops,   1.32 ticks/flop,   5.116 ms
	120x120 mm - iForth DGEMM1                             678.68 MFlops,   1.31 ticks/flop,   5.092 ms

	60x60 mm - normal algorithm                            344.42 MFlops,   2.60 ticks/flop,   1.254 ms
	60x60 mm - blocking, factor of 20                      198.41 MFlops,   4.52 ticks/flop,   2.177 ms
	60x60 mm - transposed B matrix                         320.04 MFlops,   2.80 ticks/flop,   1.349 ms
	60x60 mm - transposed B matrix #2                      387.19 MFlops,   2.31 ticks/flop,   1.115 ms
	60x60 mm - Robert's algorithm                          374.21 MFlops,   2.39 ticks/flop,   1.154 ms
	60x60 mm - T. Maeno's algorithm, subarray 20x20        244.99 MFlops,   3.66 ticks/flop,   1.763 ms
	60x60 mm - D. Warner's algorithm, subarray 20x20       180.36 MFlops,   4.97 ticks/flop,   2.395 ms
	60x60 mm - generic mat*                                693.99 MFlops,   1.29 ticks/flop,   0.622 ms
	60x60 mm - iForth DGEMM1                               693.25 MFlops,   1.29 ticks/flop,   0.623 ms

Note: With a 26x26 matrix the Athlon reaches 1113.11 MFlops peak.

Intel PIV 3 GHz, 512 MB, iForth 2.0.x, Windows XP.

	CLK 2998 MHz
	500x500 mm - normal algorithm                       221.68 MFlops,  13.52 ticks/flop,   1.127 s
	500x500 mm - blocking, factor of 20                 395.53 MFlops,   7.57 ticks/flop,   0.632 s
	500x500 mm - transposed B matrix                    643.28 MFlops,   4.66 ticks/flop,   0.388 s
	500x500 mm - transposed B matrix #2                 640.79 MFlops,   4.67 ticks/flop,   0.390 s
	500x500 mm - Robert's algorithm                     637.72 MFlops,   4.70 ticks/flop,   0.392 s
	500x500 mm - T. Maeno's algorithm, subarray 20x20   800.03 MFlops,   3.74 ticks/flop,   0.312 s
	500x500 mm - D. Warner's algorithm, subarray 20x20  527.72 MFlops,   5.68 ticks/flop,   0.473 s
	500x500 mm - generic mat*                          1492.59 MFlops,   2.00 ticks/flop,   0.167 s
	500x500 mm - iForth DGEMM1                         1470.64 MFlops,   2.03 ticks/flop,   0.169 s

	120x120 mm - normal algorithm                      1025.67 MFlops,   2.92 ticks/flop,   3.369 ms
	120x120 mm - blocking, factor of 20                 419.26 MFlops,   7.15 ticks/flop,   8.242 ms
	120x120 mm - transposed B matrix                    975.71 MFlops,   3.07 ticks/flop,   3.542 ms
	120x120 mm - transposed B matrix #2                1044.70 MFlops,   2.86 ticks/flop,   3.308 ms
	120x120 mm - Robert's algorithm                    1016.85 MFlops,   2.94 ticks/flop,   3.398 ms
	120x120 mm - T. Maeno's algorithm, subarray 20x20   916.87 MFlops,   3.26 ticks/flop,   3.769 ms
	120x120 mm - D. Warner's algorithm, subarray 20x20  576.57 MFlops,   5.19 ticks/flop,   5.994 ms
	120x120 mm - generic mat*                          1940.21 MFlops,   1.54 ticks/flop,   1.781 ms
	120x120 mm - iForth DGEMM1                         1939.37 MFlops,   1.54 ticks/flop,   1.782 ms

	60x60 mm - normal algorithm                         956.97 MFlops,   3.13 ticks/flop,   0.451 ms
	60x60 mm - blocking, factor of 20                   414.05 MFlops,   7.24 ticks/flop,   1.043 ms
	60x60 mm - transposed B matrix                      849.18 MFlops,   3.53 ticks/flop,   0.508 ms
	60x60 mm - transposed B matrix #2                   965.94 MFlops,   3.10 ticks/flop,   0.447 ms
	60x60 mm - Robert's algorithm                       881.04 MFlops,   3.40 ticks/flop,   0.490 ms
	60x60 mm - T. Maeno's algorithm, subarray 20x20     868.44 MFlops,   3.45 ticks/flop,   0.497 ms
	60x60 mm - D. Warner's algorithm, subarray 20x20    554.86 MFlops,   5.40 ticks/flop,   0.778 ms
	60x60 mm - generic mat*                            1715.25 MFlops,   1.74 ticks/flop,   0.251 ms
	60x60 mm - iForth DGEMM1                           1699.92 MFlops,   1.76 ticks/flop,   0.254 ms

AMD Athlon 900 MHz, 128 MB, MS-C++ 4.0, Windows 2000, every trick used.

	500x500 mm - normal algorithm                      24.73 MFlops, utime 10.108 secs
	500x500 mm - blocking, factor of  20              124.61 MFlops, utime 2.006 secs
	500x500 mm - transposed b matrix                   78.83 MFlops, utime 3.171 secs
	500x500 mm - Robert's algorithm                   130.25 MFlops, utime 1.919 secs
	500x500 mm -  20x 20 subarray (from T. Maeno)     292.51 MFlops, utime 0.855 secs
	500x500 mm -  20x 20 subarray (from D. Warner)    218.34 MFlops, utime 1.145 secs

	120x120 mm - normal algorithm                     130.85 MFlops, utime 0.026 secs
	120x120 mm - blocking, factor of  20              145.29 MFlops, utime 0.024 secs
	120x120 mm - transposed b matrix                  354.01 MFlops, utime 0.010 secs
	120x120 mm - Robert's algorithm                   378.22 MFlops, utime 0.009 secs
	120x120 mm -  20x 20 subarray (from T. Maeno)     418.28 MFlops, utime 0.008 secs
	120x120 mm -  20x 20 subarray (from D. Warner)    300.20 MFlops, utime 0.012 secs

	 60x 60 mm - normal algorithm                     141.92 MFlops, utime 0.003 secs
	 60x 60 mm - blocking, factor of  20              148.76 MFlops, utime 0.003 secs
	 60x 60 mm - transposed b matrix                  392.73 MFlops, utime 0.001 secs
	 60x 60 mm - Robert's algorithm                   406.78 MFlops, utime 0.001 secs
	 60x 60 mm -  20x 20 subarray (from T. Maeno)     431.14 MFlops, utime 0.001 secs
	 60x 60 mm -  20x 20 subarray (from D. Warner)    312.59 MFlops, utime 0.001 secs

Alternative bench supplied by MPE, Intel PIV 3 GHz, 512 MB, iForth 2.0.x, Windows XP.

	500x500 mm - normal algorithm                       1.121 secs.
	500x500 mm - temporary variable in loop             1.631 secs.
	500x500 mm - unrolled inner loop, factor of 4       1.260 secs.
	500x500 mm - unrolled inner loop, factor of 8       1.147 secs.
	500x500 mm - unrolled inner loop, factor of 16      1.144 secs.
	500x500 mm - pointers used to access matrices       1.371 secs.
	500x500 mm - pointers used, unrolled by 4           1.103 secs.
	500x500 mm - transposed B matrix                    0.694 secs.
	500x500 mm - interchanged inner loops               1.073 secs.
	500x500 mm - blocking, step size of 20              1.249 secs.
	500x500 mm - Robert's algorithm                     0.388 secs.
	500x500 mm - T. Maeno's algorithm, subarray 20x20   0.397 secs.
	500x500 mm - Generic Maeno, subarray 20x20          0.668 secs.
	500x500 mm - D. Warner's algorithm, subarray 20x20  0.747 secs.
	========================================================= =====
	Total using no extensions and using no hackery     14.000 secs. 

	240x240 mm - normal algorithm                       0.616 secs.
	240x240 mm - temporary variable in loop             1.086 secs.
	240x240 mm - unrolled inner loop, factor of 4       0.762 secs.
	240x240 mm - unrolled inner loop, factor of 8       0.659 secs.
	240x240 mm - unrolled inner loop, factor of 16      0.628 secs.
	240x240 mm - pointers used to access matrices       0.810 secs.
	240x240 mm - pointers used, unrolled by 4           0.636 secs.
	240x240 mm - transposed B matrix                    0.732 secs.
	240x240 mm - interchanged inner loops               1.149 secs.
	240x240 mm - blocking, step size of 20              1.343 secs.
	240x240 mm - Robert's algorithm                     0.277 secs.
	240x240 mm - T. Maeno's algorithm, subarray 20x20   0.409 secs.
	240x240 mm - Generic Maeno, subarray 20x20          0.711 secs.
	240x240 mm - D. Warner's algorithm, subarray 20x20  0.799 secs.
	========================================================= =====
	Total using no extensions and using no hackery     10.624 secs.

	60x60 mm - normal algorithm                         0.490 secs.
	60x60 mm - temporary variable in loop               1.104 secs.
	60x60 mm - unrolled inner loop, factor of 4         0.698 secs.
	60x60 mm - unrolled inner loop, factor of 8         0.656 secs.
	60x60 mm - unrolled inner loop, factor of 16        0.738 secs.
	60x60 mm - pointers used to access matrices         0.702 secs.
	60x60 mm - pointers used, unrolled by 4             0.467 secs.
	60x60 mm - transposed B matrix                      1.143 secs.
	60x60 mm - interchanged inner loops                 1.842 secs.
	60x60 mm - blocking, step size of 20                2.016 secs.
	60x60 mm - Robert's algorithm                       0.487 secs.
	60x60 mm - T. Maeno's algorithm, subarray 20x20     0.616 secs.
	60x60 mm - Generic Maeno, subarray 20x20            1.113 secs.
	60x60 mm - D. Warner's algorithm, subarray 20x20    1.233 secs.
	========================================================= =====
	Total using no extensions and using no hackery     13.313 secs.

This page last modified, on: Thursday, August 03, 2006, 23:47 PM

	