cuBLAS problem with very large matrices, and cublasDgemm is slow

Hello,

Continuing from my last topic where I managed to launch Cuda and CublasC from Fortran: https://devtalk.nvidia.com/default/topic/995389/cuda-programming-and-performance/problem-using-cuda-as-a-static-library-with-c-and-fortran-on-vs2012/

I wanted to measure how fast GPU computing with cuBLAS is compared with Intel MKL, so I tested Fortran sgemm, Fortran dgemm, cuBLAS sgemm and cuBLAS dgemm. I noticed two main problems:

  1. Beyond a certain matrix size, cuBLAS dgemm and cuBLAS sgemm stop working: cuBLAS dgemm fails at 5000 x 5000 and cuBLAS sgemm fails at 9000 x 9000, while Fortran sgemm and dgemm still compute. The error I get is CUBLAS_STATUS_MAPPING_ERROR in cublasGetVector, when I copy the result from device to host. At least that’s where the error shows up; the actual failure could happen earlier. I suspected a problem with the stack or heap and tried increasing their sizes, but it didn’t work.

I have an Nvidia GT 7500M, so maybe the problem comes from the limitations of my graphics card. I am only using it to test CUDA; the final program will run on a remote server which has a Tesla graphics card.

  2. cuBLAS dgemm is very slow: it’s 10 times slower than cuBLAS sgemm, and even slower than the CPU. For 3000 x 3000:

sgemm CPU: 721 ms
dgemm CPU: 1419 ms

cublas sgemm GPU: 153 ms
cublas dgemm GPU: 1878 ms

I uploaded the vs project in a zip file: https://ufile.io/1c7b2

A GT 750M (you said GT 7500M; I have no idea what that is) has a throughput ratio of 1:24 for FP64 operations (FMA) vs. FP32 operations, so you are going to see a large difference in performance for Sgemm vs. Dgemm. Your result there is not surprising.

There are Tesla GPUs that have a 1:3 ratio (e.g. K80) and a 1:2 ratio (e.g. P100). Your CPU appears to have a 1:2 ratio which I think is expected.
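To put the 3000 x 3000 timings from the question in throughput terms, here is a minimal C sketch. It assumes the usual operation count of 2*n^3 floating-point operations for an n x n matrix product; the helper name gemm_gflops is made up for illustration and is not part of cuBLAS or MKL.

```c
#include <assert.h>

/* Achieved throughput in GFLOP/s for an n x n by n x n matrix product
   that took `ms` milliseconds, assuming 2*n^3 floating-point operations
   (one multiply and one add per inner-product term). */
static double gemm_gflops(double n, double ms)
{
    assert(n > 0.0 && ms > 0.0);
    double flops   = 2.0 * n * n * n;   /* total operation count */
    double seconds = ms * 1.0e-3;       /* milliseconds to seconds */
    return flops / seconds / 1.0e9;     /* FLOP/s to GFLOP/s */
}
```

With the timings posted in the question this gives roughly 75 GFLOP/s (CPU sgemm), 38 GFLOP/s (CPU dgemm), 353 GFLOP/s (cuBLAS sgemm) and 29 GFLOP/s (cuBLAS dgemm). The measured GPU gap is therefore about 12x, which is smaller than the 1:24 hardware ratio, presumably because sgemm does not reach peak FP32 throughput on this GPU.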

Regarding the failure, my guess would be that you are running into a Windows TDR (Timeout Detection and Recovery) event. It may also be a memory size issue, but that seems less likely. The Sgemm test (9000x9000) should require approximately 1GB and the Dgemm test (5000x5000) approximately 600MB. In any event, if it were a memory size issue I would expect an allocation error, but it may still be possible.
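The size estimates above can be reproduced with a short C sketch. It assumes three square n x n matrices (A, B and C) resident on the device and no extra workspace; gemm_footprint_bytes is a hypothetical helper name, not part of cuBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate device memory, in bytes, needed for C = A * B with three
   square n x n matrices whose elements are elem_size bytes each.
   Assumption: only A, B and C are allocated; library workspace and
   driver overhead are ignored. */
static double gemm_footprint_bytes(size_t n, size_t elem_size)
{
    assert(n > 0 && elem_size > 0);
    return 3.0 * (double)n * (double)n * (double)elem_size;
}
```

Calling gemm_footprint_bytes(9000, sizeof(float)) gives 972,000,000 bytes (about 1 GB), and gemm_footprint_bytes(5000, sizeof(double)) gives 600,000,000 bytes (about 600 MB).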

Yeah, I meant GT 750M; I mistyped.

Thanks, it was actually the Windows TDR (I didn’t know what it was before you told me). I searched and found out how to disable it through Nsight.

It works now, thanks.