CUDA SGEMM same speed as Apple vecLib?

First off, I'm a newbie here, so perhaps I'm making a gargantuan mistake. But SGEMM in the Apple vecLib seems to be only slightly slower than the CUBLAS SGEMM (thunking). (As noted in a follow-up post, the thunking overhead does not appear to affect the calculation speed of SGEMM for large matrices.)

This seems like profoundly poor performance: the 8600M GT is only achieving a quarter of its theoretical capacity. Why?? It can't be algorithmic overhead since, in comparison, the Core 2 Duo seems to be achieving roughly 75% of its theoretical limit, including the overhead.

To time this I'm using the Fortran_Cuda_Blas example from NVIDIA, running on a MacBook Pro (Intel Core 2 Duo, 2.4 GHz, PCIe x16) with a 256 MB 8600M GT.

As you can see below, the native BLAS is never more than 25% slower than the 8600M GT CUBLAS. That is to say, essentially no acceleration at all, if I'm not making some mistake.

Now here’s the output summarized.

Column 1 is the matrix dimension; columns 2-4 are MFLOPS for the native Apple vecLib BLAS, CUBLAS SGEMM, and CUBLAS SGEMM with pinned memory.

Size   vecLib   CUBLAS   CUBLAS pinned
 160     3538     4093            4371
 256     5673     8202            9185
 512    10368    12529           12100
1024    11382    15235           15876
1344    11513    15278           15804
1600    11867    15524           16011
1632    11899    15588           16033
1664    11830    15474           15902

ADDENDUM:
To get the NVIDIA Fortran_Cuda_Blas SGEMM example to compile on a Mac, I had to make the following edits:

Add this line in fortran.c:
#define CUBLAS_FORTRAN_COMPILER CUBLAS_G77

Add these lines in the Makefile:
NAMEBLAS = "FOOBAR"
LIBBLAS = -L/Developer/SDKs/MacOSX10.5.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A -lblas

Some more info.

The Apple Accelerate framework BLAS, while computing, seems to utilize about 170-180% CPU on my dual-core 2.4 GHz Core 2 Duo. Thus, if I assume that it can do 4 SIMD multiply-adds per clock cycle per processor, that's a theoretical peak of about 17 GFLOPS (2.4 GHz x 4 flops/cycle x 1.8 cores of availability ≈ 17.3 GFLOPS).

The 8600M GT is supposed to have a peak capacity of 61 GFLOPS.
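
For what it's worth, here is a minimal back-of-the-envelope check of those two peaks and the fractions achieved, in C. The per-cycle flop count on the CPU side and the ~0.95 GHz shader clock on the GPU side (32 SPs x 0.95 GHz x 2 flops per multiply-add ≈ 61 GFLOPS) are my assumptions, not measured figures:

/* peak_check.c -- rough peak-FLOPS arithmetic for the numbers above.
   Assumptions: 4 SP flops per cycle per core on the CPU side, and one
   multiply-add (2 flops) per SP per ~0.95 GHz shader clock on the GPU. */
#include <stdio.h>

int main(void)
{
    double cpu_ghz  = 2.4;  /* Core 2 Duo clock */
    double cpu_fpc  = 4.0;  /* assumed SP flops per cycle per core */
    double cpu_util = 1.8;  /* the ~180% CPU observed across two cores */
    double gpu_sps  = 32.0; /* stream processors on the 8600M GT */
    double gpu_ghz  = 0.95; /* assumed shader clock */

    double cpu_peak = cpu_ghz * cpu_fpc * cpu_util; /* ~17.3 GFLOPS */
    double gpu_peak = gpu_sps * gpu_ghz * 2.0;      /* ~60.8 GFLOPS */

    printf("CPU peak ~%.1f GFLOPS -> 12 observed = %.0f%%\n",
           cpu_peak, 100.0 * 12.0 / cpu_peak);
    printf("GPU peak ~%.1f GFLOPS -> 16 observed = %.0f%%\n",
           gpu_peak, 100.0 * 16.0 / gpu_peak);
    return 0;
}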

What I observe is that as the matrix size grows (square matrices), the Apple BLAS tops out at 12 GFLOPS and CUBLAS tops out at 16 GFLOPS.

It would appear that SGEMM, including the memory transfers to main memory, is quite a bit slower than its theoretical maximum. However, the thunking calls do not appear to affect the apparent speed.

The MacBook Pro has a PCIe x16 bus and 667 MHz DDR2 memory. The 8600M GT card is rated for 22.4 GB/sec memory bandwidth. However, transfer rates do not appear to be the slow step.

Since matrix multiply is an N^3 operation while the memory size grows only as N^2, I should think the calculation would quickly be dominated by the N^3 flops and not the transfer rate. The fact that the speed plateaus while the time grows cubically suggests that it is indeed the flops, not the transfer, that is the limit.
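
To put rough numbers on that argument: SGEMM does about 2N^3 flops, while the thunking transfers move about 16N^2 bytes (three N x N float matrices up, one back down). A sketch, where the 1.5 GB/s effective PCIe rate is an assumed figure rather than a measurement:

/* scaling.c -- compute time vs transfer time for square SGEMM.
   flops = 2*N^3, bytes moved = 16*N^2 (A, B, C up; C back down). */
#include <stdio.h>

int main(void)
{
    double gpu_gflops = 16.0; /* the plateau observed above */
    double pcie_gbs   = 1.5;  /* assumed effective PCIe x16 rate */

    for (int n = 256; n <= 2048; n *= 2) {
        double t_comp = 2.0 * n * n * (double)n / (gpu_gflops * 1e9);
        double t_xfer = 16.0 * n * (double)n / (pcie_gbs * 1e9);
        printf("N=%5d  compute %.4f s  transfer %.4f s  ratio %.1f\n",
               n, t_comp, t_xfer, t_comp / t_xfer);
    }
    return 0;
}

The compute/transfer ratio grows linearly with N, which is why the transfer term stops mattering for large matrices.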

An 8800-series card has (IIRC) a flop rate in the range of 350-500 GFLOPS, with the top end comparable to a Tesla.

So does that mean things should just scale? That is, if the 8600M GT only achieves 26% of its capacity in a matrix multiply, should a Tesla "only" get 140 GFLOPS, or roughly 11x faster than the CPU?

You mention you are using the thunking CUBLAS calls. These calls allocate GPU memory, copy the array to the GPU, perform the calculation, copy the answer back, and then deallocate memory. All of the memory handling operations are pretty slow, so it is not surprising the CPU does so well comparatively. This overhead is going to skew your timing results.

In typical usage, you usually allocate GPU memory at the beginning of the program, copy data to the GPU, and then do as many calculations as possible directly on the GPU, only copying the results back when you need them. With multiple operations in a row, CUDA should do much better than the CPU. (Note these are the non-thunking CUBLAS calls.)
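
In the C API, that resident-data pattern looks roughly like the sketch below (against the legacy CUBLAS interface; error checking omitted for brevity):

#include <cublas.h> /* legacy CUBLAS API, CUDA 1.x/2.x */

/* Upload A and B once, run SGEMM as many times as needed on the GPU,
   and only then copy the result back -- the transfer cost is paid
   once instead of on every call, unlike the thunking wrappers. */
void resident_sgemm(const float *A, const float *B, float *C, int n, int reps)
{
    float *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    for (int i = 0; i < reps; i++) /* many operations, one upload */
        cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
}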

As I noted in a secondary post above, the thunking calls appear to have zero impact on the speed for large matrices. The time scales as N^3, as expected for just the multiplies involved, not as N^2 like the data transfer size.

So while I realize that for small matrices thunking will impose a stiff penalty, I'm of the mind that this does not apply to the large matrices here. (You can see this in the table in the original post: small matrices achieve a quarter of the FLOPS of the larger ones.)

Perhaps, however, you know something I do not; I'm new at this. My belief that the thunking calls are irrelevant rests on the scaling argument. Is there something not right about that?

Ah, OK, somehow the N^3 argument didn't register in my head. You are probably right then, although it would be nice to check by using the non-thunking calls and timing the allocate, upload, compute, download, and deallocate steps separately. Not sure how easy that is to do in Fortran, though.

I also don’t know if the CUBLAS calls are synchronous, or asynchronous like they are in the C API. You might find the compute step seems to happen instantly, in which case the compute time will appear to be spent in the download step, which will block until the computation is finished before copying memory.
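
If the Fortran wrappers do launch asynchronously, one way to time the compute step honestly is to force a synchronize before reading the clock. A sketch in C (dA, dB, dC are assumed to be device pointers already allocated and filled):

#include <sys/time.h>
#include <cublas.h>
#include <cuda_runtime.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* The SGEMM launch returns immediately; without the synchronize, the
   elapsed time would show up in the later download step instead. */
double timed_sgemm(int n, const float *dA, const float *dB, float *dC)
{
    double t0 = wall_seconds();
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cudaThreadSynchronize(); /* block until the GPU has finished */
    return wall_seconds() - t0;
}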

If you keep increasing the size, you will see that the GPU is still improving while the CPU has reached a plateau.

Size   Accelerate   CUBLAS   CUBLAS pinned
2048        11536    18918           19655
2400        11665    19658           20172

Now, the laptop GPU is close to double the speed of a dual-core Intel CPU (2.4 GHz on my laptop), with all the overhead of allocating memory and copying data to/from the card (this is a real SGEMM, with 3 matrices sent and 1 retrieved).

If you look at the number of SPs on the 8600M GT (32) and the clock (750 MHz), there is almost an 8x factor compared to a workstation GPU (e.g., an 8800 GTX's 128 SPs at 1.35 GHz versus 32 at 0.75 GHz is roughly a factor of 7).
Plus, the memory clock is substantially lower and the memory interface narrower.

CUBLAS 1.1 is also missing the fast SGEMM implementation by Volkov (integrated into CUBLAS 2.0), which will double the GPU performance.

So, you are comparing against the one piece of software that reaches 90% of peak performance on an x86 CPU, and you are complaining that a laptop GPU is only twice as fast as a vendor-tuned CPU implementation running on a dual core at 2.4 GHz? :-)

I am missing your point.

Which card is this? On my 256 MB 8600M GT it plateaus at 16 GFLOPS. With an 8800 I'd expect it to plateau at a proportionally higher value, given the larger number of processors.

Additionally, although it plateaus by around 1024, my 256 MB 8600M also seems to give me a segfault before it reaches size 2400 (2400 x 2400 x 4 bytes x 3 matrices ≈ 69 MB). I'm assuming this is because a modest amount of memory is in use for the screen.

Which raises another issue: is there some way to free up memory on a card that is in use for the screen?

I realize that. I was concerned about scaling efficiency. Supposedly the 8600 and 8800 are functionally identical aside from the number of processing units, so a 20% efficiency on an 8600, I am guessing, will translate to about the same efficiency (on a proportionally larger array) on the 8800.

Groovy! I can't wait. I also noticed that the thunking CUBLAS SGEMM has the sub-optimal behaviour of transferring the output array from the CPU to the GPU even when beta = 0.0.
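
For illustration, a hand-rolled thunking wrapper could skip that upload whenever beta is zero, something like this sketch (no-transpose case only, error checking omitted; not the actual NVIDIA thunking code):

#include <cublas.h>

/* Thunking-style SGEMM that uploads C only when beta != 0, since with
   beta == 0 the old contents of C never contribute to the result. */
void sgemm_nn_thunk(int m, int n, int k, float alpha, const float *A,
                    const float *B, float beta, float *C)
{
    float *dA, *dB, *dC;

    cublasAlloc(m * k, sizeof(float), (void **)&dA);
    cublasAlloc(k * n, sizeof(float), (void **)&dB);
    cublasAlloc(m * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(m, k, sizeof(float), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(float), B, k, dB, k);
    if (beta != 0.0f) /* the stock wrapper copies C unconditionally */
        cublasSetMatrix(m, n, sizeof(float), C, m, dC, m);

    cublasSgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);

    cublasGetMatrix(m, n, sizeof(float), dC, m, C, m);
    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
}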

Well, I'm not sure I was complaining; as I said, I'm a newbie at this. So I was surprised that the canonical linear algebra operation, SGEMM, was only 25% efficient. Having tried to find out more, I've now learned that this is actually almost double the SGEMM efficiency obtained in 2004, and you now tell me there's another doubling coming soon. In 2004 it was believed that GPUs might never have great linear algebra efficiency, since linear algebra happens to lie in the two blind spots of the GPU versus the CPU:

  1. operations that re-use the same input memory for multiple fetches (e.g., matrix multiply)

  2. operations that have a high ratio of memory accesses to operations (dot product, summing a vector, argmin)

The ideal GPU op, I suspect in my meager experience, is one where each memory item is fetched once and a large number of operations are performed on it. So GPUs should have a relatively hard time with linear algebra, even though they can beat a CPU by brute force.
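
To put a number on those two blind spots, a quick flops-per-element count (standard operation counts, nothing measured):

/* intensity.c -- flops per input element for the two blind spots. */
#include <stdio.h>

int main(void)
{
    double n = 1024.0;

    /* Blind spot 1: SGEMM does 2n^3 flops on only 3n^2 distinct
       elements, so each one must be re-fetched (or cached) hundreds
       of times. */
    printf("sgemm: %.0f flops per distinct element\n",
           2.0 * n * n * n / (3.0 * n * n));

    /* Blind spot 2: a dot product does 2n flops on 2n elements --
       one fetch per operation, nothing to re-use. */
    printf("dot:   %.1f flops per element\n", 2.0 * n / (2.0 * n));
    return 0;
}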

My card is an 8600M GT, the standard one in the MacBook Pro.
I am using CUDA 1.1 beta for Mac OS X (the same one available on the web).

Can you tell me the info from the System Profiler?
There should be a revision number and a device ID.

GeForce 8600M GT:
  Chipset Model: GeForce 8600M GT
  Type: Display
  Bus: PCIe
  PCIe Lane Width: x16
  VRAM (Total): 256 MB
  Vendor: NVIDIA (0x10de)
  Device ID: 0x0407
  Revision ID: 0x00a1
  ROM Revision: 3212
  Displays:
    Display Connector:
      Status: No display connected
    Apple Cinema Display:
      Display Type: LCD
      Resolution: 1600 x 1024
      Depth: 32-bit Color
      Core Image: Hardware Accelerated
      Main Display: Yes
      Mirror: Off
      Online: Yes
      Quartz Extreme: Supported
      Rotation: Supported

Hardware Overview:
  Model Name: MacBook Pro
  Model Identifier: MacBookPro4,1
  Processor Name: Intel Core 2 Duo
  Processor Speed: 2.4 GHz
  Number Of Processors: 1
  Total Number Of Cores: 2
  L2 Cache: 3 MB
  Memory: 2 GB
  Bus Speed: 800 MHz
  Boot ROM Version: MBP41.00C1.B03
  SMC Version: 1.27f1
  Serial Number: W88133BVYJX
  Sudden Motion Sensor:
    State: Enabled