CUBLAS_V2 - Keep results on the GPU or return them to the CPU? cublasIdamin function in CUBLAS_V2

Hi everyone,

I’m trying to use a reduction-like function implemented in CUBLAS, cublasIdamin. In previous versions of CUBLAS (cublas.h), the function call was

...

int id=cublasIdamin(nparedinclulle,d_vDt,1)-1;

...

where id was allocated on the CPU. With the CUBLAS_V2 interface, you can theoretically store the result on the CPU,

...

int *id;

cublasIdamin(handle,nparedinclulle,d_vDt,1,id);

...

or on the GPU

....

int *d_id;

cublasHandle_t handle;

stat = cublasCreate (&handle);

cudaMalloc((void**)&d_id,sizeof(int));

cublasIdamin(handle,nparedinclulle,d_vDt,1,d_id);

...

But the call fails in the second case. The CUBLAS library manual (v4.0), p. 22, says that the result can be returned to host or device memory, but the program crashes when I pass a device pointer.

In my code, it is necessary to keep the result on the GPU in order to reduce the time spent in this call. Profiling the code gives the following for the cublasIdamin call:

                        GPU time   CPU time   Occ.   GPU%    CPU%
memcpyHtoD                  1920       5000    250   0,02    0,06
memset32_aligned1D          1280       3000    125   0,02    0,04
memset32_aligned1D         19360       3000    667   0,24    0,04
_Z12iamin_kernelI…          4992       4000     83   0,06    0,05
_Z12iamin_kernelI…          2368       4000     21   0,03    0,05
_Z44copy_deref…(*)          1600    6713000          0,02   84,07
memcpyDtoH                  2304       8000          0,03    0,10

(*) _Z44copy_dereferenced_incremented_element_kernel

Can anybody explain how to store the result on the GPU when calling a cuBLAS function?

Thanks!

I am glad that you find the _v2 interface useful to avoid synchronization.

To specify that your result pointer is on the device, you need to change the pointer mode:

cublasStatus_t cublasSetPointerMode(cublasHandle_t handle, cublasPointerMode_t mode);

like this:
cublasSetPointerMode_v2(handle, CUBLAS_POINTER_MODE_DEVICE);
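
For reference, here is a minimal sketch of the whole sequence, assuming a small hard-coded test vector in place of d_vDt. The kernel pick_min and the variables d_x, d_out, h_id and h_out are hypothetical names used only for illustration; the copy back to the host at the end is there just to check the result and would be dropped in code that keeps everything on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical consumer kernel: reads the index written by cublasIdamin
// straight from device memory, so no host round trip is needed.
__global__ void pick_min(const double *x, const int *d_id, double *d_out)
{
    *d_out = x[*d_id - 1];  // cuBLAS returns a 1-based index
}

int main()
{
    const int n = 5;
    // Illustrative data: the smallest absolute value is at 1-based position 2.
    const double h_x[n] = {3.0, -0.5, 2.0, 4.0, -1.0};

    double *d_x = NULL, *d_out = NULL;
    int *d_id = NULL;
    cudaMalloc((void**)&d_x, n * sizeof(double));
    cudaMalloc((void**)&d_out, sizeof(double));
    cudaMalloc((void**)&d_id, sizeof(int));
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // Tell cuBLAS that scalar arguments and results are device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    // The result is written to d_id on the GPU.
    cublasStatus_t stat = cublasIdamin(handle, n, d_x, 1, d_id);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasIdamin failed\n");
        return 1;
    }

    // Use the index on the device (same default stream, so it runs after cuBLAS).
    pick_min<<<1, 1>>>(d_x, d_id, d_out);

    // Copy back only to verify the sketch.
    int h_id = 0;
    double h_out = 0.0;
    cudaMemcpy(&h_id, d_id, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("index (1-based) = %d, value = %f\n", h_id, h_out);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_out);
    cudaFree(d_id);
    return 0;
}
```

Note that the pointer mode is a property of the handle: while it is set to CUBLAS_POINTER_MODE_DEVICE, other scalar arguments such as alpha and beta in later calls on the same handle must also be device pointers. Switch back with CUBLAS_POINTER_MODE_HOST if you need host scalars again.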

Thank you philippev!!

It works fine. I didn’t know about cublasSetPointerMode, and I recommend it to everyone who doesn’t need to copy the results back from the GPU. It speeds up the code a lot!

Thank you again!