cudaThreadSynchronize() with cublas: figuring out the bottleneck of cublas matrix multipl.

Hi all!

I am using cublas for some matrix multiplication. The implementation is not that complicated, but unfortunately it is much slower than the equivalent Java code. Most of you would probably agree that something is going wrong, so I am trying to figure out where the bottleneck is. My first measurements suggest that getMatrix takes most of the time. But since cublas calls work asynchronously, it seems I need something like cudaThreadSynchronize() to measure correctly. Unfortunately I didn't find anything about this in the cublas documentation. Does anyone know how I can perform a correct measurement without cudaThreadSynchronize()? Or is there a workaround I can use to get the same behavior as a cudaThreadSynchronize() call?
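To illustrate why an unsynchronized measurement can make getMatrix look like the bottleneck, here is a small host-only C++ sketch. It does not call CUDA or cublas at all: `fake_gemm_async`, `measure` and `Timing` are hypothetical names, and a background thread stands in for the GPU. The assumption is that the real GEMM call returns before the work is done and that the read-back blocks until the result is ready, which is how I understand cublas to behave.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for an asynchronous GPU call such as a cublas GEMM:
// it returns at once while the "device" work runs in the background.
std::future<void> fake_gemm_async(std::chrono::milliseconds work) {
    return std::async(std::launch::async,
                      [work] { std::this_thread::sleep_for(work); });
}

// Milliseconds elapsed since `start` on the host clock.
double ms_since(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

struct Timing {
    double compute_ms;   // time charged to the "GEMM"
    double readback_ms;  // time charged to the "getMatrix"
};

// Times the "GEMM" and the "read-back" separately.
// sync == false: the compute timer stops right after the launch, so the
//                waiting time is charged to the read-back.
// sync == true:  we block before stopping the compute timer, which is the
//                role cudaThreadSynchronize() would play in real code.
Timing measure(bool sync, std::chrono::milliseconds work) {
    auto t0 = std::chrono::steady_clock::now();
    std::future<void> pending = fake_gemm_async(work);
    if (sync) pending.wait();
    double compute = ms_since(t0);

    auto t1 = std::chrono::steady_clock::now();
    pending.wait();  // stand-in for the blocking read-back
    double readback = ms_since(t1);
    return {compute, readback};
}
```

Calling `measure(false, std::chrono::milliseconds(100))` should report almost no compute time and roughly 100 ms of read-back, while `measure(true, ...)` should report the opposite. In real CUDA code the blocking step would be cudaThreadSynchronize() (cudaDeviceSynchronize() in newer toolkits), or the timing could be taken with CUDA events (cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime) instead of a host clock.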

Thanks in advance!

  • bjoern

Found it!

I have now found that my cublas cgemm still needs about 46 ms to multiply a 2800/8 matrix with a 2800/2700 complex matrix. That seems much too high to me. Can someone confirm that?
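As a rough sanity check on the 46 ms figure, the arithmetic can be sketched as follows. This assumes the product is an 8 x 2800 matrix times a 2800 x 2700 matrix (my reading of the sizes above, it may be wrong), single-precision complex elements of 8 bytes each, and the usual convention of counting one complex multiply-add as 8 real operations. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Real floating-point operations in a complex GEMM (m x k) * (k x n),
// counting each complex multiply-add as 8 real operations.
std::int64_t cgemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return 8 * m * n * k;
}

// Achieved rate in GFLOP/s for a given run time in milliseconds.
double gflops(std::int64_t flops, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return static_cast<double>(flops) / (elapsed_ms * 1e-3) / 1e9;
}

// Bytes occupied by a rows x cols single-precision complex matrix
// (assumed 8 bytes per element).
std::int64_t cmatrix_bytes(std::int64_t rows, std::int64_t cols) {
    assert(rows > 0 && cols > 0);
    return 8 * rows * cols;
}
```

With m = 8, n = 2700, k = 2800, `cgemm_flops` gives 483,840,000 operations, which at 46 ms works out to roughly 10.5 GFLOP/s. The 2800 x 2700 matrix alone is 60,480,000 bytes (about 60 MB), so it may be worth checking whether the 46 ms covers only the cgemm call or also the transfers to and from the device.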