CUDA and NPP Misc Issues

Hi All!

I'm new to the forum, and relatively new to programming NVIDIA GPUs with CUDA. I am using 'CUDA by Example' as a guide. I have downloaded the SDK for the CUDA C runtime, and also NPP.

I have some general questions that I would appreciate any help on:

First one: In the best practices document for programming with CUDA, they talk about how one can use asynchronous copy commands such as cudaMemcpyAsync. This doesn't block the CPU thread, and allows one to transfer data to the device (GPU) RAM from the host (CPU) RAM. So far so good. You use a non-default stream number for that, say 'stream1'. Suppose you also wanted to process other data on the device at the same time, so you call your kernel right afterwards, using stream2. So in this simple example:

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);

The above code will start transferring data from the host to the device via stream1 WITHOUT blocking the CPU, so right after, the device starts to process the other data (while the first data is still being transferred?) via stream2. Is my understanding here correct?

Next, and this is the part that really confuses me, they show the following code:

for (i = 0; i < nStreams; i++) {
    offset = i * N / nStreams;
    cudaMemcpyAsync(a_d + offset, a_h + offset, size, dir, stream[i]);
}

for (i = 0; i < nStreams; i++) {
    offset = i * N / nStreams;
    kernel<<<N / (nThreads * nStreams), nThreads, 0, stream[i]>>>(a_d + offset);
}

Apparently this will start stream1 copying, and then as soon as it is done, start processing on stream1 as well. Then while stream1 is processing, copying on stream2 starts, and so on. How can this possibly be? Can someone walk me through the exact order this is going to be executed in? I really appreciate it!

My second question is in regards to NPP - quite simply, I was looking at its documentation, and I couldn't for the life of me find a command for multiplying two vectors together… they have one for add, and one for subtract, but none for multiply… I checked the BLAS1-, BLAS2-, and BLAS3-level function calls. Anyone know anything about this?
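(If it turns out NPP really doesn't have one, I suppose an element-wise multiply is trivial to write as a kernel myself. Something along these lines, where vecMul and the launch sizes are just my own names, not anything from NPP:

__global__ void vecMul(const float *a, const float *b, float *c, int n)
{
    // each thread multiplies one pair of elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// launched as, e.g.: vecMul<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n);

But I'd rather use a library call if one exists.)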

Thanks in advance everyone!

What exactly confuses you? The device has a dedicated unit (a DMA engine) whose purpose is to transfer data to and from video memory (on the Fermi architecture there are actually two of them). Once it is done copying the first block, it informs the part responsible for scheduling kernels; the kernel from stream 1 begins execution, and the transfer unit moves on to the next block.


Thanks for your reply, my confusion is the following: First off, do all the streams in the first for loop that copy data to the device run in parallel (at the same time)? It seems like they do not, so the second part of the confusion is: once stream1 runs in the first loop, how can the kernel on stream1 run from the second loop if the first loop hasn't iterated through all its copy streams yet?…


No, streams in the first for loop don’t run in parallel. They run sequentially. As to the second part, all that cudaMemcpyAsync does is queue a request for a memory transfer, it does not wait for the transfer to occur (that’s the whole purpose of async). So, it’s quite fast, and your code immediately moves on to the second loop.

Similarly, the act of launching the kernel (the line with <<< >>> ) is also quite fast, because it also queues the request, but does not wait for its completion. So if you put that whole fragment into an actual program, you’ll find out that the whole fragment only takes something like 0.1 seconds to execute, even though the list of requests it generated may take much longer for the GPU to fulfill.
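You can verify this yourself by timing the launch loops separately from the completion. Something like the sketch below (error checking omitted; kernel, N, size, dir, etc. are the same placeholders as in your fragment):

clock_t t0 = clock();
for (i = 0; i < nStreams; i++) {
    offset = i * N / nStreams;
    cudaMemcpyAsync(a_d + offset, a_h + offset, size, dir, stream[i]);
    kernel<<<N / (nThreads * nStreams), nThreads, 0, stream[i]>>>(a_d + offset);
}
clock_t t1 = clock();      // both loops return almost immediately
cudaThreadSynchronize();   // blocks until all queued work has finished
clock_t t2 = clock();

The difference between t0 and t1 will be tiny, while t2 - t1 reflects the time the actual transfers and kernel executions take.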


Thanks again - so let me get this straight:

The streams in the for loop do not run in parallel - the first for loop simply goes through them one by one, and queues up the 4 streams to copy data from host to device. It then moves on to the second for loop and queues up the 4 kernel executions to process the data. Since stream1 is busy copying, the kernel queued on stream1 doesn't start until stream1 has finished its copy. However, once stream1 finishes copying and becomes busy executing its kernel, stream2 can begin to copy data, etc. So streams can run in parallel (stream1 running a kernel while stream2 copies new data), as the figure in the document is showing…

Is this correct?


That is correct. Just one minor clarification. Kernels from different streams can run concurrently too, so, if the second stream is done copying before the first stream is done executing its kernel, the second kernel may be launched in parallel with the first. (Of course, you still have the same amount of compute units that will need to be shared between the two kernels now, so that may not give you any performance advantage.)
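Concretely, all that takes is queuing the two kernels on different streams. A sketch, assuming kernelA and kernelB are independent kernels of your own and the device supports concurrent kernel execution:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernelA<<<grid, block, 0, s1>>>(a_d);   // both launches return immediately;
kernelB<<<grid, block, 0, s2>>>(b_d);   // on capable hardware they may overlap

cudaThreadSynchronize();
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);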


Thanks once again. When you say the second kernel may be launched in parallel with the first, is this something that is done by the device itself, or something the user must program in for it to happen? I was not aware that that was possible…