Save index of maximum value with cuBLAS

Hi. I’m using cublasIsamax to find the index of the maximum of the vector vectorDev, which is produced by my_func, inside a for loop:

int imax;
for (int e = 0; e < npixel; e++) {
       my_func<<<block, thread>>>(vectorDev);
       cublasIsamax(handle, par.nclassi, vectorDev, 1, &imax);
}

I would like to save the results of cublasIsamax in a device vector. Do you know how to achieve this? I guess it should be similar to:

int imax;
max.M = (int*)malloc(sizeof(int) * max.dim);
cudaMalloc(&maxDev, max.dim * sizeof(int));
for (int e = 0; e < npixel; e++) {
       my_func<<<block, thread>>>(vectorDev);
       cublasIsamax(handle, par.nclassi, vectorDev, 1, &imax);
       maxDev[e] = imax - 1;
}
cudaMemcpy(max.M, maxDev, max.dim * sizeof(int), cudaMemcpyDeviceToHost);

but when I try to print the results it does not work.

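IMO, the easiest options are:

  1. Just use managed memory.
  2. Allocate the result in device memory (for example a __device__ variable). A cuBLAS Level-1 function can return its result to global memory when the handle's pointer mode is CUBLAS_POINTER_MODE_DEVICE. See the function's parameters.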
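To make option 2 concrete, here is a minimal sketch, assuming the default stream is used throughout. The kernels fill_scores and to_zero_based and the helper argmax_per_pixel are hypothetical names invented for this illustration; fill_scores only stands in for my_func. With the pointer mode set to device, cublasIsamax writes each 1-based index directly into maxDev[e], so the host never dereferences a device pointer.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for my_func: fills the vector for pixel e so that
// the largest entry sits at index e % n.
__global__ void fill_scores(float* v, int n, int e) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = (i == e % n) ? 2.0f : 1.0f;
}

// cuBLAS returns 1-based indices; shift them all to 0-based in one launch.
__global__ void to_zero_based(int* idx, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) idx[i] -= 1;
}

// Returns, for each pixel, the 0-based index of the entry with the largest
// absolute value. Error checking is omitted for brevity.
std::vector<int> argmax_per_pixel(int npixel, int nclassi) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Ask cuBLAS to write scalar results through device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    float* vectorDev = nullptr;
    int* maxDev = nullptr;
    cudaMalloc(&vectorDev, nclassi * sizeof(float));
    cudaMalloc(&maxDev, npixel * sizeof(int));

    const int threads = 128;
    for (int e = 0; e < npixel; e++) {
        fill_scores<<<(nclassi + threads - 1) / threads, threads>>>(vectorDev, nclassi, e);
        // The result lands in maxDev[e] on the device, in stream order.
        cublasIsamax(handle, nclassi, vectorDev, 1, maxDev + e);
    }
    to_zero_based<<<(npixel + threads - 1) / threads, threads>>>(maxDev, npixel);

    // One copy at the end brings all indices back to the host.
    std::vector<int> maxHost(npixel);
    cudaMemcpy(maxHost.data(), maxDev, npixel * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(vectorDev);
    cudaFree(maxDev);
    cublasDestroy(handle);
    return maxHost;
}
```

With this stand-in kernel, argmax_per_pixel(5, 4) should give 0, 1, 2, 3, 0. Note that cublasIsamax picks the entry with the largest absolute value, which matters if the vector can contain negative numbers. Link against cuBLAS (for example with -lcublas).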

Depending on what you’re trying to accomplish, you can try using CUB’s block-wide operations to do everything in the first CUDA kernel. Using cub::BlockScan (or, more directly, cub::BlockReduce) with cub::Max, you can find the maximum value in each block. Then, using atomics, find the max across all blocks and store it in global memory (GMEM).
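A rough sketch of that idea, under some assumptions: it uses cub::BlockReduce with a hand-written max functor (MaxOp), it reduces absolute values so every candidate is non-negative, and it returns only the maximum value, not its index. The names max_abs_kernel, max_abs, MaxOp and kThreads are made up for this example. The non-negativity is what makes the atomicMax-on-int trick valid, because non-negative IEEE-754 floats order the same way as their bit patterns read as signed ints.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // block size, chosen for illustration

// Plain max functor (hypothetical helper).
struct MaxOp {
    __device__ float operator()(float a, float b) const { return a > b ? a : b; }
};

// Each block reduces its slice to one maximum, then thread 0 merges that
// block maximum into a single global result with an atomic.
__global__ void max_abs_kernel(const float* in, int n, int* resultBits) {
    using BlockReduce = cub::BlockReduce<float, kThreads>;
    __shared__ typename BlockReduce::TempStorage temp;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads contribute 0, the smallest possible |x|.
    float v = (i < n) ? fabsf(in[i]) : 0.0f;

    // The reduced value is only valid in thread 0 of the block.
    float blockMax = BlockReduce(temp).Reduce(v, MaxOp());

    // Assumption: values are non-negative (no NaNs), so comparing their bit
    // patterns as ints gives the same order as comparing the floats.
    if (threadIdx.x == 0) atomicMax(resultBits, __float_as_int(blockMax));
}

// Host wrapper; error checking is omitted for brevity.
float max_abs(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());
    float* inDev = nullptr;
    int* resDev = nullptr;
    cudaMalloc(&inDev, n * sizeof(float));
    cudaMalloc(&resDev, sizeof(int));
    cudaMemcpy(inDev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(resDev, 0, sizeof(int));  // all-zero bits are 0.0f

    max_abs_kernel<<<(n + kThreads - 1) / kThreads, kThreads>>>(inDev, n, resDev);

    // Copy the winning bit pattern straight into a float.
    float result = 0.0f;
    cudaMemcpy(&result, resDev, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(inDev);
    cudaFree(resDev);
    return result;
}
```

Getting the index as well would need the value and index to be reduced together, which this sketch does not attempt.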

For further reading, see the Stack Overflow question "thrust::max_element slow in comparison cublasIsamax - More efficient implementation?"

Thank you for the help! I also have another question: my_func is a kernel that is launched with N blocks of 1 thread each and that internally calls a device function. When N is small (1, 2, 3, 4) the results are correct, but when it is bigger they are not. Do you know a possible reason and maybe how to solve it?

__global__ void my_func(){
     int i;
     i = blockIdx.x;
     results[i] = device_func(i);
}

There’s not enough information to debug, but you really should be using blocks with at least 64 threads. I highly suggest you take NVIDIA's CUDA C++ DLI courses.
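As an illustration of the multi-thread launch, here is a small sketch. It is not a diagnosis of the wrong results; device_func below is a hypothetical stand-in that returns 2 * i, results is passed in as a parameter rather than being a global, and run_my_func is an invented host helper. Each thread derives a global index from its block and thread IDs and checks it against n, so the last, partially filled block does not write out of bounds.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the device function in the question.
__device__ float device_func(int i) { return 2.0f * i; }

// One element per thread instead of one element per block.
__global__ void my_func(float* results, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) results[i] = device_func(i);  // guard the partial last block
}

// Launches my_func over n elements and returns the results on the host.
// Returns an empty vector if the launch or the kernel reports an error.
std::vector<float> run_my_func(int n) {
    float* resultsDev = nullptr;
    cudaMalloc(&resultsDev, n * sizeof(float));

    const int threads = 128;                         // a multiple of the warp size
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n
    my_func<<<blocks, threads>>>(resultsDev, n);

    // Checking both calls helps separate launch problems from kernel faults.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        cudaFree(resultsDev);
        return {};
    }

    std::vector<float> results(n);
    cudaMemcpy(results.data(), resultsDev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(resultsDev);
    return results;
}
```

For example, run_my_func(1000) launches 8 blocks of 128 threads, and the 24 surplus threads in the last block simply skip the write.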