CUBLAS Pointer Mode

Reading section 2.5 of the CUBLAS documentation, specifically:

“We recommend using the new CUBLAS API with scalar parameters and results passed by reference in the device memory to achieve maximum overlap of the computation when using streams.”

Ok, got it. I’m on the new API and using CUDA 4.1, but I have always used host pointer mode. So I decided to switch over and give it a whirl. The pointer mode for the scalars is all I changed in my existing code, and I was surprised to see a significant performance decrease (~7-10%) with device mode versus host mode. What am I missing? This is using cublasCgemm calls, so is device mode only helpful when there is a scalar result to pass back as well? See the results below (MHz is my metric for complex samples processed per second), each an average over 100 iterations; a sketch of the change follows the results. I’m just cloning the same work into multiple streams, which is why performance drops off as the stream count grows.

Host Pointer Mode:
1 Stream, 682.12 MHz
5 Streams, 147.302 MHz
10 Streams, 74.998 MHz
25 Streams, 29.9334 MHz
50 Streams, 15.0168 MHz
75 Streams, 10.0082 MHz
150 Streams, 5.00514 MHz

Device Pointer Mode:
1 Stream, 622.17 MHz
5 Streams, 136.295 MHz
10 Streams, 69.8719 MHz
25 Streams, 27.8062 MHz
50 Streams, 13.9832 MHz
75 Streams, 9.32487 MHz
150 Streams, 4.66303 MHz
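
For completeness, this is roughly the change I made (a minimal sketch, not my exact code; handle, d_A, d_B, d_C, and the matrix dimensions m, n, k are assumed to be set up already):

// Before: host pointer mode, scalars read from host memory
cuComplex alpha = make_cuComplex(1.0f, 0.0f);
cuComplex beta  = make_cuComplex(0.0f, 0.0f);
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, d_A, m, d_B, k, &beta, d_C, m);

// After: device pointer mode, scalars copied to the device once, up front
cuComplex *d_alpha, *d_beta;
cudaMalloc((void**)&d_alpha, sizeof(cuComplex));
cudaMalloc((void**)&d_beta,  sizeof(cuComplex));
cudaMemcpy(d_alpha, &alpha, sizeof(cuComplex), cudaMemcpyHostToDevice);
cudaMemcpy(d_beta,  &beta,  sizeof(cuComplex), cudaMemcpyHostToDevice);
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            d_alpha, d_A, m, d_B, k, d_beta, d_C, m);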

Nobody else has tried anything like this?

I just ran cublasCgemm on my C2050 in device pointer mode and host pointer mode on 2K x 2K matrices, and I got 810.4 Gflops (host) versus 809.6 Gflops (device), so less than 1 Gflops slower. So I do not really understand how you can get a 7-10% decrease.
Can you explain a bit more what you are doing, and show how you implemented it?

On a more general note:

Device pointer mode (alpha, beta located on the device) does not speed up the kernel per se: it actually slows it down slightly, because the kernel needs to read the values of alpha and beta from global memory instead of constant memory.
But it allows you to launch multiple BLAS routines asynchronously on the GPU, and thus get better occupancy.

Typical sequence: alpha is computed by a kernel and then used as an input for a subsequent kernel.
Example:
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
cublasDnrm2(handle, n, v, incx, d_alpha);  // d_alpha is a device pointer; the norm lands on the device
cublasDscal(handle, m, d_alpha, w, incx);  // consumes d_alpha directly, no host round trip

In this case, with device pointer mode you can launch the two kernels in one shot. With host pointer mode, the host needs to wait for the result of cublasDnrm2 before it can even launch cublasDscal. So overall you get better occupancy of the GPU (and thus better overall performance) and also better usage of the CPU, even though each individual kernel is slightly slower.
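For contrast, here is a sketch of what the same sequence forces in host pointer mode (the host-side alpha variable is just illustrative):

cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
double alpha;                              // scalar result lives on the host
cublasDnrm2(handle, n, v, incx, &alpha);   // blocks: the norm must be copied back to the host
cublasDscal(handle, m, &alpha, w, incx);   // can only be launched once that copy has completed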

Thanks for the response. I’m using the same values of alpha and beta over and over for every kernel; there is no dependency from one kernel launch to the next. I’ve got a doubly nested for loop kicking off calls as follows:

Take Host Start Time
For Each Iteration (to smooth out results and make sure they are consistent)
    For Each Stream Used
        cublasSetStream
        cublasCgemm
    End For
End For
cuCtxSynchronize
Take Host End Time

I took the difference of the times (I’m not performing any memory transfers, since I wanted to ignore that cost), then simply swapped from host to device pointer mode, changed alpha and beta to device pointers, and ran it again. And yes, all memory transactions, including setting up alpha and beta on the device, are outside the timing loop. That is where I saw the consistent drop-off. I can understand seeing no difference, but a notable decrease??
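
In case it helps, here is the timing loop spelled out as code (a sketch; wallclock(), NUM_ITERS, and NUM_STREAMS are placeholders of mine, and the handle, streams, matrices, and device-side scalars are all set up beforehand):

double t0 = wallclock();                      // take host start time
for (int it = 0; it < NUM_ITERS; ++it) {      // repeat to smooth out the results
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cublasSetStream(handle, streams[s]);  // clone the same work into each stream
        cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    d_alpha, d_A, m, d_B, k, d_beta, d_C, m);
    }
}
cuCtxSynchronize();                           // wait for all streams to drain
double t1 = wallclock();                      // take host end time; rate computed from t1 - t0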