Calling the cuBLAS API from a kernel

Hi, I want to call the cuBLAS API from inside a kernel. My current configuration launches 2 blocks (this is only an example) and runs a kernel like the one below. It computes a matrix-vector multiplication, with each thread computing one dot product:

__global__ void product(double *dev_a0, double *dev_a1, double *dev_A0, double *dev_A1, double *result, int max, int n){
    int i;
    double prod = 0.0;

    for (i = 0; i < max; i++) {
        if (blockIdx.x == 0) {
            // I want to call the cuBLAS dot product here
            prod = prod + dev_a0[i] * dev_A0[i + n * threadIdx.x];
        }
        else if (blockIdx.x == 1) {
            // I want to call the cuBLAS dot product here
            prod = prod + dev_a1[i] * dev_A1[i + n * threadIdx.x];
        }
    }
    __syncthreads();
    // each block writes its results into one column
    result[threadIdx.x + n * blockIdx.x] = prod;
}

Is it possible to call the cuBLAS API for the dot product?

Yes, you can use the cuBLAS API from kernel code if you are running on a device of compute capability 3.5 or higher, as mentioned in the documentation:

http://docs.nvidia.com/cuda/cublas/index.html#device-api

The simpleDevLibCUBLAS CUDA sample project should be instructive:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simpledevlibcublas-gpu-device-api-library-functions–cuda-dynamic-parallelism-

OK txbob, thanks for the reply. I am concerned that placing the cuBLAS API call inside the if statement could cause divergence in the kernel execution. My question, then, is: is it correct to control execution on the multiprocessors this way (that is, by checking the block IDs)?

Any if statement could cause divergence. That is true with or without CUBLAS, with or without dynamic parallelism.
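
To make the divergence point concrete, here is a minimal sketch in plain CUDA, without cuBLAS. It assumes the kernel is launched with exactly two blocks, as in the question. The condition in the posted kernel depends only on blockIdx.x, so every thread in a block evaluates it the same way and the threads of a warp should not take different paths. The sketch makes that explicit by choosing the input pointers once per block, before the loop. The names product_uniform, upload and run_product_uniform are hypothetical and are not part of the original post.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumes a launch with exactly two blocks. The selection below depends only
// on blockIdx.x, so all threads of a block pick the same pointers.
__global__ void product_uniform(const double *a0, const double *a1,
                                const double *A0, const double *A1,
                                double *result, int max, int n)
{
    const double *a = (blockIdx.x == 0) ? a0 : a1;
    const double *A = (blockIdx.x == 0) ? A0 : A1;

    // Each thread computes one dot product against its own column of A.
    double prod = 0.0;
    for (int i = 0; i < max; ++i)
        prod += a[i] * A[i + n * threadIdx.x];

    // Each block writes its results into one column of the output.
    result[threadIdx.x + n * blockIdx.x] = prod;
}

// Hypothetical helper: copies a host vector to the device, nullptr on failure.
static double *upload(const std::vector<double> &v)
{
    double *d = nullptr;
    const size_t bytes = v.size() * sizeof(double);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return nullptr;
    if (cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return nullptr;
    }
    return d;
}

// Hypothetical host wrapper: launches 2 blocks of `threads` threads
// (threads <= n) and returns the 2 * n results, or an empty vector on error.
std::vector<double> run_product_uniform(const std::vector<double> &a0,
                                        const std::vector<double> &a1,
                                        const std::vector<double> &A0,
                                        const std::vector<double> &A1,
                                        int max, int n, int threads)
{
    std::vector<double> result(2 * static_cast<size_t>(n), 0.0);
    const size_t res_bytes = result.size() * sizeof(double);

    double *d_a0 = upload(a0), *d_a1 = upload(a1);
    double *d_A0 = upload(A0), *d_A1 = upload(A1);
    double *d_res = nullptr;
    bool ok = d_a0 && d_a1 && d_A0 && d_A1
           && cudaMalloc(&d_res, res_bytes) == cudaSuccess
           && cudaMemset(d_res, 0, res_bytes) == cudaSuccess;

    if (ok) {
        product_uniform<<<2, threads>>>(d_a0, d_a1, d_A0, d_A1, d_res, max, n);
        // The blocking copy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result.data(), d_res, res_bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts null pointers, so failed uploads are safe here.
    cudaFree(d_a0); cudaFree(d_a1); cudaFree(d_A0); cudaFree(d_A1); cudaFree(d_res);
    if (!ok) result.clear();
    return result;
}
```

If the condition depended on threadIdx.x instead, threads within one warp could take different paths, and that is the case where divergence becomes a concern.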

I don’t understand your question: