Slow execution of simpleCublas


I am just beginning with CUDA programming and am currently trying out some of the examples shipped with the CUDA SDK. At the moment I am working with the cuBLAS library, but I am experiencing severe speed issues. I only changed the line

#define N (275)

to

#define N (4096)

so that 4096-by-4096 matrices are multiplied with each other, and commented out the call to simple_sgemm and the subsequent part in which the results are compared. Unfortunately, this runs incredibly slowly: it had not finished even after I waited for one minute. The native CBLAS version finishes after 34 s.

Does anybody have an idea what might cause this issue?

Thanks a lot for your help!