I am using CUBLAS for some matrix multiplication. The implementation is not that complicated, but unfortunately it is much slower than the Java implementation. Most of you would probably agree that something is going wrong, so I am trying to figure out where the bottleneck is. My first measurements suggest that getMatrix takes most of the time. But since CUBLAS calls are asynchronous, it seems I need something like cudaThreadSynchronize() to measure correctly. Unfortunately I didn't find anything about this in the CUBLAS documentation. Does anyone know how I can perform a correct measurement without cudaThreadSynchronize()? Or is there a workaround I can use to get the same behavior as calling cudaThreadSynchronize()?
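To make the measurement problem concrete, here is a minimal sketch of one way the timing could be split up, assuming the CUDA runtime event API (cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime) can be used next to CUBLAS. The matrix size, the variable names, and the use of the cublas_v2.h interface are assumptions made for illustration and are not taken from my actual code. Events are recorded on the GPU's own timeline, so the multiplication and the copy back are timed separately, and the wait for the asynchronous multiplication is not silently charged to getMatrix.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;  // hypothetical square matrix size
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Upload the inputs (column-major, leading dimension n).
    cublasSetMatrix(n, n, sizeof(float), hA.data(), n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB.data(), n, dB, n);

    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time initialization is not part of the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaEvent_t start, afterGemm, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterGemm);
    cudaEventCreate(&afterCopy);

    // All events go into the default stream, which the handle uses by default.
    cudaEventRecord(start, 0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);  // returns before the GPU is done
    cudaEventRecord(afterGemm, 0);
    cublasGetMatrix(n, n, sizeof(float), dC, n, hC.data(), n);
    cudaEventRecord(afterCopy, 0);

    // Block the host only until the last event has actually occurred.
    cudaEventSynchronize(afterCopy);

    float gemmMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&gemmMs, start, afterGemm);
    cudaEventElapsedTime(&copyMs, afterGemm, afterCopy);
    printf("sgemm: %.3f ms, getMatrix: %.3f ms\n", gemmMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterGemm);
    cudaEventDestroy(afterCopy);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

With a plain host-side timer around each call, the multiplication would look almost free and getMatrix would look expensive, because the copy has to wait for the multiplication to finish before it can start. Error checking is left out to keep the sketch short.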
Thanks in advance!