Hi All!
I'm new to the forum, and relatively new to programming NVIDIA GPUs with CUDA. I am using ‘CUDA By Example’ as a guide. I have downloaded the SDK for the CUDA C runtime, and also NPP.
I have some general questions that I would appreciate any help on:
First one: In the best practices document for programming with CUDA, they talk about how one can use asynchronous copy commands such as cudaMemcpyAsync. This doesn't block the CPU thread, and allows one to transfer data to the device (GPU) RAM from the host (CPU) RAM. So far so good. You use a non-default stream for that, say, ‘stream1’. Suppose you also want to process data on the device at the same time, so you launch your kernel right afterwards, using stream2. So in this simple example:
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
The above code will start transferring data from the host to the device via stream1 WITHOUT blocking the CPU, so right afterwards the device starts running the kernel via stream2 (while the transfer is still in progress?). Is my understanding here correct?
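For anyone who wants to try this, here is a minimal self-contained sketch of the two-stream snippet above. The kernel body, the buffer size n, and the use of cudaMallocHost for a_h are assumptions that are not in the original snippet; the copy is generally only asynchronous with respect to the host when the host buffer is pinned. Note that the kernel works on otherData_d rather than a_d, so nothing orders the two streams relative to each other, and whether they actually overlap depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal error check, usable inside main() only (it returns 1 on failure).
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));  \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Hypothetical kernel: adds 1 to every element of a buffer unrelated to a_d.
__global__ void kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

int main()
{
    const int n = 1 << 20;                 // assumed element count
    const size_t size = n * sizeof(float);

    float *a_h, *a_d, *otherData_d;
    CHECK(cudaMallocHost((void **)&a_h, size));   // pinned host memory (assumption)
    CHECK(cudaMalloc((void **)&a_d, size));
    CHECK(cudaMalloc((void **)&otherData_d, size));
    for (int i = 0; i < n; ++i) a_h[i] = 1.0f;
    CHECK(cudaMemset(otherData_d, 0, size));

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Both calls return control to the host right away.
    CHECK(cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1));
    kernel<<<grid, block, 0, stream2>>>(otherData_d, n);
    CHECK(cudaGetLastError());

    // The host has to wait on both streams before relying on either result.
    CHECK(cudaStreamSynchronize(stream1));
    CHECK(cudaStreamSynchronize(stream2));

    float first = -1.0f;
    CHECK(cudaMemcpy(&first, otherData_d, sizeof(float), cudaMemcpyDeviceToHost));
    printf("otherData_d[0] = %f\n", first);

    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(a_d));
    CHECK(cudaFree(otherData_d));
    CHECK(cudaFreeHost(a_h));
    return 0;
}
```

If the kernel had to read a_d instead, launching it in stream1 after the copy would give the required ordering.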
Next, and this is the part that really confuses me, they show the following code:
size = N*sizeof(float)/nStreams;
for (i=0; i<nStreams; i++)
{
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, dir, stream[i]);
}
for (i=0; i<nStreams; i++)
{
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}
Apparently this starts the copy on the first stream and, as soon as that is done, starts the kernel on that same stream. Then, while the first stream's kernel is running, the copy on the second stream starts, and so on. How can this possibly be? Can someone walk me through the exact order this is going to be executed in? I really appreciate it!
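A runnable sketch of the staged snippet may make the ordering easier to experiment with. The rule it relies on is that operations issued to the same stream run in issue order, while operations in different streams have no ordering between them. So the kernel in stream[i] waits only for the copy in stream[i], and the copy in stream[i+1] is free to overlap with it if the hardware supports concurrent copy and execution. The kernel body, the values of N, nStreams and nThreads, and the pinned allocation are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal error check, usable inside main() only (it returns 1 on failure).
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));  \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Hypothetical kernel: doubles each element of the chunk it is given.
// No bounds check: the launch below covers each chunk exactly.
__global__ void kernel(float *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] *= 2.0f;
}

int main()
{
    const int nStreams = 4;                           // assumed
    const int nThreads = 256;                         // assumed
    const int N = nStreams * nThreads * 1024;         // divisible by nThreads*nStreams
    const size_t size = N * sizeof(float) / nStreams; // bytes per chunk

    float *a_h, *a_d;
    CHECK(cudaMallocHost((void **)&a_h, N * sizeof(float))); // pinned (assumption)
    CHECK(cudaMalloc((void **)&a_d, N * sizeof(float)));
    for (int i = 0; i < N; ++i) a_h[i] = 1.0f;

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; ++i) CHECK(cudaStreamCreate(&stream[i]));

    // Issue every chunk copy first; each call returns immediately.
    for (int i = 0; i < nStreams; ++i) {
        int offset = i * N / nStreams;
        CHECK(cudaMemcpyAsync(a_d + offset, a_h + offset, size,
                              cudaMemcpyHostToDevice, stream[i]));
    }
    // Then issue one kernel per chunk, in the same stream as that chunk's copy.
    for (int i = 0; i < nStreams; ++i) {
        int offset = i * N / nStreams;
        kernel<<<N / (nThreads * nStreams), nThreads, 0, stream[i]>>>(a_d + offset);
        CHECK(cudaGetLastError());
    }

    CHECK(cudaDeviceSynchronize());   // wait for all streams to drain

    // Copy back and verify that every element was doubled exactly once.
    CHECK(cudaMemcpy(a_h, a_d, N * sizeof(float), cudaMemcpyDeviceToHost));
    int errors = 0;
    for (int i = 0; i < N; ++i) if (a_h[i] != 2.0f) ++errors;
    printf("%d mismatches\n", errors);

    for (int i = 0; i < nStreams; ++i) CHECK(cudaStreamDestroy(stream[i]));
    CHECK(cudaFree(a_d));
    CHECK(cudaFreeHost(a_h));
    return errors != 0;
}
```

The result is correct whether or not any overlap happens; overlap only affects timing, which a profiler or cudaEvent timers can show.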
My second question is regarding NPP. Quite simply, I was looking at its documentation, and I couldn't for the life of me find a command for multiplying two vectors together… they have one for add and one for subtract, but none for multiply… I also checked the BLAS1, BLAS2 and BLAS3 level function calls. Anyone know anything about this?
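To pin down what is being asked for, here is a sketch of the operation written as a plain CUDA kernel. It assumes "multiplying two vectors" means element-wise multiplication (out[i] = x[i] * y[i]) rather than a dot product. It is not an NPP or BLAS call; the kernel name multiplyVectors and the sizes are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error check, usable inside main() only (it returns 1 on failure).
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));  \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Hypothetical kernel: element-wise product of two vectors of length n.
__global__ void multiplyVectors(const float *x, const float *y, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = x[idx] * y[idx];
}

int main()
{
    const int n = 1 << 16;                 // assumed vector length
    const size_t size = n * sizeof(float);

    float *x_h = (float *)malloc(size);
    float *y_h = (float *)malloc(size);
    float *out_h = (float *)malloc(size);
    if (!x_h || !y_h || !out_h) return 1;
    for (int i = 0; i < n; ++i) { x_h[i] = 3.0f; y_h[i] = 2.0f; }

    float *x_d, *y_d, *out_d;
    CHECK(cudaMalloc((void **)&x_d, size));
    CHECK(cudaMalloc((void **)&y_d, size));
    CHECK(cudaMalloc((void **)&out_d, size));
    CHECK(cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(y_d, y_h, size, cudaMemcpyHostToDevice));

    const int block = 256;
    const int grid = (n + block - 1) / block;   // round up to cover all n elements
    multiplyVectors<<<grid, block>>>(x_d, y_d, out_d, n);
    CHECK(cudaGetLastError());

    // The blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(out_h, out_d, size, cudaMemcpyDeviceToHost));

    int errors = 0;
    for (int i = 0; i < n; ++i) if (out_h[i] != 6.0f) ++errors;
    printf("%d mismatches\n", errors);

    CHECK(cudaFree(x_d));
    CHECK(cudaFree(y_d));
    CHECK(cudaFree(out_d));
    free(x_h); free(y_h); free(out_h);
    return errors != 0;
}
```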
Thanks in advance everyone!
-TCubed