Hi everyone,
I’m trying to use a reduction-like function implemented in CUBLAS: cublasIdamin. In previous versions of CUBLAS (cublas.h), the function call was
...
int id = cublasIdamin(nparedinclulle, d_vDt, 1) - 1;  // legacy API returns the 1-based index directly
...
where id lives in CPU memory. With the CUBLAS_V2 implementation the index is returned through a result pointer, so theoretically you can store the result on the CPU,
...
int id;
cublasIdamin(handle, nparedinclulle, d_vDt, 1, &id);  // result written to a host variable
...
or on the GPU
...
int *d_id;
cublasHandle_t handle;
cublasStatus_t stat = cublasCreate(&handle);
cudaMalloc((void **)&d_id, sizeof(int));
cublasIdamin(handle, nparedinclulle, d_vDt, 1, d_id);  // result should be written to device memory
...
But this fails when the function is called. The CUBLAS library manual (v4.0, p. 22) says that the result can be retrieved to host or device memory, but it crashes in the device case.
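For reference, below is the shortest self-contained version of what I am trying to do. The cublasSetPointerMode call with CUBLAS_POINTER_MODE_DEVICE is only my guess at how the manual’s “host or device memory” choice is supposed to be made, and the array contents and size are placeholders rather than my real nparedinclulle/d_vDt data, so please correct me if this is not the intended usage:
[font="Courier New"]
// Minimal sketch (my guess, not verified): ask cuBLAS to write the result
// of cublasIdamin to device memory instead of to a host variable.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    // Toy data standing in for my real d_vDt array of size nparedinclulle.
    const int n = 4;
    double h_v[4] = { 3.0, -1.0, 0.5, 2.0 };

    double *d_v  = 0;
    int    *d_id = 0;
    cudaMalloc((void **)&d_v, n * sizeof(double));
    cudaMalloc((void **)&d_id, sizeof(int));
    cudaMemcpy(d_v, h_v, n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // My assumption: this is what tells the library that the result pointer
    // (d_id) refers to device memory.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    cublasStatus_t stat = cublasIdamin(handle, n, d_v, 1, d_id);
    if (stat != CUBLAS_STATUS_SUCCESS)
        printf("cublasIdamin failed: %d\n", (int)stat);

    // Copy back only to inspect the value; in my real code the index must
    // stay on the GPU to avoid the device-to-host transfer.
    int id = 0;
    cudaMemcpy(&id, d_id, sizeof(int), cudaMemcpyDeviceToHost);
    printf("index of minimum |x|: %d (1-based)\n", id);

    cublasDestroy(handle);
    cudaFree(d_v);
    cudaFree(d_id);
    return 0;
}
[/font]
If cublasSetPointerMode is not the right mechanism here, any pointer to the relevant section of the manual would be appreciated.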
In my code it is necessary to keep the result on the GPU in order to reduce the time it spends. Profiling the code, the cublasIdamin call shows up as follows:
[font="Courier New"]
(time)                        GPU       CPU   Occ.   GPU%    CPU%
memcpyHtoD                   1920      5000    250   0,02    0,06
memset32_aligned1D           1280      3000    125   0,02    0,04
memset32_aligned1D          19360      3000    667   0,24    0,04
_Z12iamin_kernelI…           4992      4000     83   0,06    0,05
_Z12iamin_kernelI…           2368      4000     21   0,03    0,05
_Z44copy_deref…(*)           1600   6713000          0,02   84,07
memcpyDtoH                   2304      8000          0,03    0,10
(*) _Z44copy_dereferenced_incremented_element_kernel
[/font]
Can anybody explain to me how to store the result on the GPU when calling a cuBLAS function?
Thanks!