First off, I’m a newbie here, so perhaps I’m making a gargantuan mistake. But SGEMM from the Apple vecLib seems to be only slightly slower than the CUBLAS SGEMM (thunking). (As noted in a follow-up post, the thunking overhead does not appear to affect the measured SGEMM speed for large matrices.)
This seems like profoundly poor performance: the 8600M GT is only achieving about a quarter of its theoretical capacity. Why? It can’t be algorithmic overhead since, in comparison, the Core 2 Duo seems to be achieving 75% of its theoretical limit, overhead included.
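For reference, here is the back-of-the-envelope arithmetic behind “a quarter of theoretical capacity.” This is only a sketch: the stream-processor count, shader clock, and flops-per-clock figures are my assumptions about the MacBook Pro’s part, not verified specs.

```python
# Rough peak-FLOPS estimate for the 8600M GT (assumed specs:
# 32 stream processors, ~0.95 GHz shader clock, one single-precision
# MAD = 2 flops per SP per clock; 3 if you also count the dual-issue MUL).
sps = 32
shader_ghz = 0.95
flops_per_clock = 2

peak_gflops = sps * shader_ghz * flops_per_clock  # ~60.8 GFLOPS
measured_gflops = 16.0                            # best pinned SGEMM result

efficiency = measured_gflops / peak_gflops
print(f"peak ~{peak_gflops:.1f} GFLOPS, efficiency ~{efficiency:.0%}")
```

Under those assumptions the observed ~16 GFLOPS is indeed roughly a quarter of peak.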
To time this I’m using the Fortran_Cuda_Blas example from NVIDIA, running on a MacBook Pro (Intel Core 2 Duo, 2.4 GHz, PCIe x16) with a 256 MB 8600M GT.
As you can see below, the native BLAS is never more than about 25% slower than the 8600M GT CUBLAS. That is to say, essentially no acceleration at all, unless I’m making some mistake.
Now here’s the output, summarized (matrix dimension N, then MFLOPS for each backend):

N      vecLib BLAS   CUBLAS SGEMM   CUBLAS SGEMM (pinned)
160    3538          4093           4371
256    5673          8202           9185
512    10368         12529          12100
1024   11382         15235          15876
1344   11513         15278          15804
1600   11867         15524          16011
1632   11899         15588          16033
1664   11830         15474          15902
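To put the numbers in perspective, here is a quick sketch that computes the pinned-CUBLAS speedup over vecLib at a few sizes, and the wall time each MFLOPS figure implies (the table values are copied from above; 2·N³ is the standard SGEMM flop count):

```python
# (vecLib MFLOPS, pinned CUBLAS MFLOPS) for selected N, from the table above
table = {
    160:  (3538, 4371),
    1024: (11382, 15876),
    1664: (11830, 15902),
}

for n, (veclib, cublas) in table.items():
    speedup = cublas / veclib
    seconds = 2 * n**3 / (cublas * 1e6)  # implied time per SGEMM call
    print(f"N={n:4d}: speedup {speedup:.2f}x, ~{seconds:.4f} s per call")
```

Even at the largest size the GPU is only about 1.34x faster than the CPU library.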
To get the NVIDIA Fortran_Cuda_Blas SGEMM example to compile on a Mac I had to make the following edits:
Add a line in fortran.c:
#define CUBLAS_FORTRAN_COMPILER CUBLAS_G77
Add lines in the Makefile:
NAMEBLAS = "FOOBAR"
LIBBLAS = -L/Developer/SDKs/MacOSX10.5.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A -lblas
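As an aside, on OS X the same vecLib BLAS can usually be linked via the framework flag instead of the full SDK path. I haven’t verified this shorter form against the NVIDIA Makefile, but it may work as a drop-in replacement:

```
LIBBLAS = -framework Accelerate
```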