cublas called from the device runs only with one thread!

Hello,
I have some data with a certain number of pixels, for example 200000.
I am running the code (cublas calls from the device) with only 1 thread and 1 block, and I am getting the right results!

How can this happen?

If I try to increase the number of threads/blocks to match the number of pixels I want to process, the code runs very slowly and may hang.

Or if I use, for example, 16 threads and 16 blocks (remember, 200000 pixels), the code runs much slower!

I am providing a simple example (not the actual code I am working on) to make my point.

Here, I am just calling cublas to make a copy from the input array devA to the output devB.
And it works with just 1 thread, where the number of elements is 32!

#include <assert.h>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

using namespace std;

// error-checking macro for CUDA runtime calls (was missing from the listing)
#define CudaErrChk( call )                                                 \
    do {                                                                   \
        cudaError_t theErr = ( call );                                     \
        if ( theErr != cudaSuccess ) {                                     \
            fprintf( stderr, "CUDA error %s at %s:%d\n",                   \
                     cudaGetErrorString( theErr ), __FILE__, __LINE__ );   \
            exit( EXIT_FAILURE );                                          \
        }                                                                  \
    } while ( 0 )
__global__ void myfunc( cublasStatus_t * returnStatus, int N, float * const devA, float * const devV, float * const devB )
{
    // each thread that reaches this point creates its own cublas handle
    cublasHandle_t theCublasHandle;
    cublasStatus_t theCublasStatus = cublasCreate( &theCublasHandle );

    if ( theCublasStatus != CUBLAS_STATUS_SUCCESS )
    {
        *returnStatus = theCublasStatus;
        return;
    }

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if ( i < N )
    {
        theCublasStatus = cublasScopy( theCublasHandle, N, devA + i, 1, devV + i, 1 );
        theCublasStatus = cublasScopy( theCublasHandle, N, devV + i, 1, devB + i, 1 );
    }

    cublasDestroy( theCublasHandle );
    *returnStatus = theCublasStatus;
}

int main(){

	int N = 32; 

	// allocate host memory
	float * inA;
    inA = (float*) malloc ( N * sizeof(float));
    assert( NULL != inA );

    //allocate device memory
	float *devB , * devA , *devV;
	CudaErrChk( cudaMalloc( (void **) &devB, N * sizeof(float) ) );
	CudaErrChk( cudaMalloc( (void **) &devA, N * sizeof(float) ) );
	CudaErrChk( cudaMalloc( (void **) &devV, N * sizeof(float) ) );

	for ( int i = 0; i < N; i++ )
	    inA[ i ] = i;

    /* Get handle to the CUBLAS context */
    cublasStatus_t * devCublasStatus;
	cublasStatus_t theCublasStatus;

	CudaErrChk( cudaMalloc( (void**) &devCublasStatus, sizeof(cublasStatus_t) ) );

// define threads and blocks
	int Blocks = 1;
	int Threads = 1;

	CudaErrChk( cudaMemcpy( devA, inA , N * sizeof(float), cudaMemcpyHostToDevice ) );

	myfunc<<< Blocks , Threads>>>( devCublasStatus , N , devA , devV, devB );

	//copy back to host
	CudaErrChk( cudaMemcpy( inA , devB , N * sizeof(float), cudaMemcpyDeviceToHost ) );
	CudaErrChk( cudaMemcpy( &theCublasStatus , devCublasStatus , sizeof(cublasStatus_t), cudaMemcpyDeviceToHost ) );

	if (theCublasStatus != CUBLAS_STATUS_SUCCESS)
	{
		fprintf(stderr,"!!!! CUBLAS Device API call failed with code %d\n",theCublasStatus);
		exit(EXIT_FAILURE);
	}

	cout << "\nAfter : "<<endl;
	for (int i = 0; i < N; i++)
	    cout << inA[ i ]<<endl;
	cout <<endl;

	//clean host memory
	free ( inA );
	
	// clean device memory
	CudaErrChk( cudaFree( devB ) );
	CudaErrChk( cudaFree( devA ) );
	CudaErrChk( cudaFree( devV ) );
	CudaErrChk( cudaFree( devCublasStatus ) );

	return 0;
}

I also found this https://devtalk.nvidia.com/default/topic/378247/?comment=2699621 but I am not sure I understand it.

When you call cublas from device code, it spawns its own child kernel(s) using dynamic parallelism. These kernels are not limited to the number of blocks/threads you specify for the parent kernel. In fact, the cublas call(s) will be called from each thread you launch, which explains the increasing execution time as you add more threads/blocks to the parent kernel.

OK about the timing, but since I am using only 1 thread, how are the cublas calls executed for all the elements?

Does it automatically check the number of elements ("N" in this case) and use threads accordingly?

Thanks!

Yes, it works the same way as it does when you call cublas from the host. When you do a cublas call from the host, you don’t tell it how many threads to use. Same when you call from the device.

Hmm…Thanks!

So, in my example above, the:

if ( i < N )

is useless.

And when I use pointer arithmetic as in my example above:

devA + i

, does it automatically increase the number of threads from i to N?
Right?

…But since I am using only 1 thread and 1 block to launch the kernel,

i will always be 0.

So, if I want to use

devA + i

or

devA + i * something

, how is this going to work?

If you only use 1 thread and 1 block to launch the kernel, then construct your cublas call exactly as you would if calling it from the host.

If you want to copy a vector of length N from devA to devV,

theCublasStatus = cublasScopy( theCublasHandle , N , devA, 1 , devV, 1 );

OK, I understand that part.
But you said

"If you only use 1 thread and 1 block to launch the kernel".

From what you said in the post above, I understand that it is necessary to launch only 1 thread for the cublas calls.
Isn't this right?

But if I have something more complicated ,like:

int i = threadIdx.x + blockIdx.x * blockDim.x;


if ( i < TotalNbOfElmts )
{

    CublasStatus = cublasCcopy( CublasHandle , NbOfC  , inSampleData + i  , TotalNbOfElmts , ( inV  + i * NbOfC ) , 1 );

    ...
}

As you can see, I am jumping by TotalNbOfElmts through inSampleData and writing to inV using i * NbOfC.

I can't make it work, because I am stuck with the cublas calls.
If I use only 1 thread, i will only ever have the value 0.
So , the

+i

or

i * NbOfC

is meaningless.

If I use more threads (the same as the number of elements), which is wrong as I understood it, because each thread will launch another kernel, the code runs very slowly and the results are messed up!

How can I handle this?

Thank you!

Another option…

What if I don't use thread indexing at all? Then the above example becomes:

CublasStatus = cublasCcopy( CublasHandle , NbOfC , inSampleData , TotalNbOfElmts , ( inV + NbOfC ) , 1 );

I removed the "i" index.

  1. Where I had

i * NbOfC

, I now have just

NbOfC

.
But how can I express this correctly?
Because now, is the

inV + NbOfC

processed as:

inV + NbOfC

for every thread, or as

inV + i * NbOfC

for every thread?

I think it's the first, but how can I accomplish the second?

  2. If I use

TotalNbOfElmts

as the number of elements, with a value of 4587520, it gives me:

an illegal memory access was encountered

Why? This number could be handled using, for example, 512 threads and 8960 blocks (I don't know how cublas schedules these; I am just launching a 1-thread, 1-block kernel).

Thanks!