simpleCUBLAS vs matmul Implementation

I have evaluated the performance of the graphics card by comparing the computation time of the CPU and of the NVIDIA GPU using matrixMul and simpleCUBLAS. The matrix size is variable, so that one can clearly see the speedup from the graphics card for bigger matrices.
However, the computation time of matrixMul is always greater than that of simpleCUBLAS. Why is that? (One possible answer might be that the CUBLAS implementations are optimized, but I am not sure.)

  1. If one takes a closer look at the occupancy calculator for matrixMul with threads/block = 256, registers/thread = 14, shared mem/block = 2048, the occupancy of each multiprocessor equals 67%.
    As I understand it, there are mainly three options for increasing the performance: changing the thread block dimensions, the shared memory size, or the number of registers. In the case of matrixMul, the number of registers used per thread is the bottleneck.
    What options do I have for increasing the occupancy, given that I don't have any influence on the number of registers (as the compiler tries to increase the number of threads while using fewer registers)?

  2. Although I have not changed a single line of matrixMul, the program crashes when the matrix dimension increases to 5120x5120. Memory allocation works fine. I don't have a clue why the program is crashing.
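
For the occupancy question in point 1, the 67% figure can be reproduced with a little arithmetic. The sketch below is a simplified model, not the occupancy calculator itself: the per-multiprocessor limits (8192 registers, 16 KB shared memory, 768 resident threads, 8 resident blocks) are assumed values for a G80-class GPU, the function names are made up for illustration, and allocation granularity is ignored.

```c
#include <assert.h>

/* Assumed per-multiprocessor limits for a G80-class GPU (compute capability 1.0). */
enum {
    SM_REGISTERS   = 8192,   /* 32-bit registers per multiprocessor */
    SM_SHARED_MEM  = 16384,  /* bytes of shared memory per multiprocessor */
    SM_MAX_THREADS = 768,    /* resident threads per multiprocessor */
    SM_MAX_BLOCKS  = 8       /* resident blocks per multiprocessor */
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocks that fit on one multiprocessor at the same time.
 * Simplified: each resource is divided evenly, no allocation granularity. */
int resident_blocks(int threads_per_block, int regs_per_thread, int smem_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && smem_per_block >= 0);

    int by_threads = SM_MAX_THREADS / threads_per_block;
    int by_regs    = SM_REGISTERS / (threads_per_block * regs_per_thread);
    int by_smem    = smem_per_block > 0 ? SM_SHARED_MEM / smem_per_block
                                        : SM_MAX_BLOCKS;

    return min_int(min_int(by_threads, by_regs), min_int(by_smem, SM_MAX_BLOCKS));
}

/* Occupancy as resident threads over the maximum, rounded to a whole percent. */
int occupancy_percent(int threads_per_block, int regs_per_thread, int smem_per_block)
{
    int blocks  = resident_blocks(threads_per_block, regs_per_thread, smem_per_block);
    int threads = blocks * threads_per_block;
    return (100 * threads + SM_MAX_THREADS / 2) / SM_MAX_THREADS;
}
```

Under these assumptions, `occupancy_percent(256, 14, 2048)` gives 67: registers allow only two blocks of 256 threads, so 512 of 768 threads are resident. With 10 registers per thread, three blocks fit and the same call gives 100. One way to cap register use is nvcc's `-maxrregcount` option, though forcing a lower count can push values into slower local memory, so higher occupancy is not guaranteed to run faster.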

Thx in advance for your help.

The matrixMul sample is intended as a simple example of using CUDA. It’s not intended as high-performance code, but as clear code. CUBLAS is much more optimized so it should definitely perform better.

There have been other threads discussing occupancy, and while there are multiple ways you could try to increase it, doing so may or may not improve performance. (Sorry, I know that’s not an answer, but there are other threads that do answer this question.)

Thanks, we’ll have a look.


Could you be more specific about the sort of crash you get?

Also, just checking: if you’re on Windows, with such matrix dimensions the kernel will exceed the run time allowed by the watchdog timer, so you need to run on a G80 that is not attached to a display and does not have the Windows desktop extended onto it, as mentioned in the release notes. Is this how you’re set up?
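
To get a feel for how quickly the kernel run time grows with matrix size, here is a rough sketch that counts floating-point operations for a dense square multiply. It assumes run time scales with the operation count, which is only an approximation (memory behaviour matters too), and the function names are made up for illustration.

```c
#include <assert.h>

/* Floating-point operations for multiplying two dense n x n matrices:
 * n^3 multiply-adds, counted as two operations each. */
double matmul_flops(int n)
{
    assert(n > 0);
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Factor by which the work grows when going from n_small to n_large. */
double work_ratio(int n_small, int n_large)
{
    return matmul_flops(n_large) / matmul_flops(n_small);
}
```

For example, `work_ratio(4096, 5120)` is about 1.95, so a 5120x5120 multiply does nearly twice the work of a 4096x4096 one in a single kernel launch.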


Thanks a lot. I have just tried to use the G80 as the second “non-extended” graphics card, in order to compute for longer than just 5 seconds, i.e. with matrix dimensions greater than 4096x4096. For this I have enabled the onboard graphics card of my mainboard. However, if the onboard graphics card is configured as the primary graphics card, deviceCount is 0. If the G80 is selected as the primary graphics card, the program seems to work sometimes, but also hangs in some cases.

Is your onboard graphics chip an NVidia chip?


Nope :'(