New to CUDA: beginner question

I’m new to CUDA. Can someone help me answer the following question? – Paul

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

int main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

I have a question about CUDA parallel programming. In the example above, does “increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);” execute on all N threads in parallel in terms of CPU execution time? And do I have to load the kernel function increment_gpu before running the main program, or does the CUDA compiler do the background work so that we only need to run main as a regular program?

Thank you.

It runs all N threads in parallel on the GPU. The function call is asynchronous, so the CPU continues executing after the kernel call.

Since you are using the CUDA runtime API, all of that loading is done for you in the background; it is very convenient.
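To make that concrete, here is a minimal sketch of what a complete runtime API program around that kernel could look like. The values of N, blocksize, and b, and the names h_a and d_a, are illustrative assumptions; error checking is omitted. You just compile it with nvcc and run it like any other program:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

int main()
{
    const int N = 1024;          // assumed problem size
    const int blocksize = 256;   // assumed block size
    float b = 1.0f;

    // allocate and fill a host array
    float *h_a = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h_a[i] = (float)i;

    // allocate device memory and copy the input over
    float *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize);  // round up
    increment_gpu<<<dimGrid, dimBlock>>>(d_a, b, N);

    // copying back blocks until the kernel has finished
    cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("a[0] = %f\n", h_a[0]);
    cudaFree(d_a);
    free(h_a);
    return 0;
}

Note that there is no explicit loading step anywhere: the runtime API takes care of it when the kernel is first launched.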

When using the driver API, you have to load the compiled kernel code explicitly and do a lot more setup. There is no reason to use the driver API unless you know for certain that it is what you want and have a good reason for doing so.
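For comparison, here is a rough sketch of the extra setup the driver API requires. The module file name increment.ptx is a placeholder, and the argument setup and launch are elided; this is not a complete program:

#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // the kernel must be compiled separately (e.g. to PTX)
    // and loaded by hand at run time
    CUmodule mod;
    cuModuleLoad(&mod, "increment.ptx");   // placeholder file name
    CUfunction func;
    cuModuleGetFunction(&func, mod, "increment_gpu");

    // ... set up kernel arguments and launch func ...

    cuCtxDestroy(ctx);
    return 0;
}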

Thanks for your response.

What do you mean by “after the kernel call”? Do you mean the kernel call has completed and returned from the GPU, or just that it has been submitted to the GPU? If the latter, and the CPU continues executing, when does the kernel call return, and how does the CPU get notified?

After the kernel call is submitted, the CPU continues executing.

In 99% of cases, you don’t need to be notified when the call completes. Just keep issuing kernel calls and/or device-to-device memcpy calls; they will queue up and execute in order. If you issue a host <-> device memory call, there is an implicit synchronization: the CPU stalls in that call until the GPU has completed all previously submitted tasks.
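A hypothetical sequence illustrating that behavior, reusing the names from the sketch above (d_tmp is an additional assumed device buffer of the same size):

// these calls return immediately; the GPU executes them in order
increment_gpu<<<dimGrid, dimBlock>>>(d_a, b, N);
cudaMemcpy(d_tmp, d_a, N * sizeof(float), cudaMemcpyDeviceToDevice);  // queued

// a host <-> device copy synchronizes implicitly: the CPU blocks here
// until the kernel and the device-to-device copy above have finished
cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);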

If you need to synchronize explicitly (e.g. for benchmarking purposes), see cudaThreadSynchronize() or the event API in the programming guide.
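For example, a timing sketch using the event API, with the pointers and launch configuration assumed from the earlier sketch:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
increment_gpu<<<dimGrid, dimBlock>>>(d_a, b, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block the CPU until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

// alternatively, a simple barrier that waits for all previously issued work:
cudaThreadSynchronize();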