CUDA and NPP Misc Issues

Hi All!

I'm new to the forum, and relatively new to programming NVIDIA GPUs with CUDA. I am using ‘CUDA by Example’ as a guide. I have downloaded the SDK for the CUDA C runtime, and also NPP.

I have some general questions that I would appreciate any help on:

First one: in the best practices document for programming with CUDA, they talk about how one can use asynchronous copy commands such as cudaMemcpyAsync. This doesn't block the CPU thread, and allows one to transfer data to the device (GPU) RAM from the host (CPU) RAM. So far so good. You use a non-default stream for that, say ‘stream1’. Suppose you also wanted to process data once it's copied to the device, so you call your kernel right afterwards, using stream2. So in this simple example:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// queue the copy on stream1; returns without blocking the CPU
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
// queue the kernel on stream2
kernel<<<grid, block, 0, stream2>>>(otherData_d);

The above code will start transferring data from the host to the device via stream1 WITHOUT blocking the CPU, so right afterwards the device starts to process data (while the transfer is still in flight?) via stream2. Is my understanding here correct?

Next, and this is the part that really confuses me, they show the following code:

size = N*sizeof(float)/nStreams;
for (i=0; i<nStreams; i++)
{
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, dir, stream[i]);
}

for (i=0; i<nStreams; i++)
{
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}

Apparently this will start stream1 copying, and then, as soon as it is done, start processing on stream1 as well. Then, while stream1 is processing, copying on stream2 starts, etc. How can this possibly be? Can someone walk me through the exact order in which this is going to be executed? I really appreciate it!

My second question is in regards to NPP: quite simply, I was looking at its documentation, and I couldn't for the life of me find a command for multiplying two vectors together… they have one for add and one for subtract, but none for multiplies… I checked the BLAS-1, BLAS-2, and BLAS-3 level function calls. Does anyone know anything about this?
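(For what it's worth, I know I could hand-roll a trivial kernel for this myself; a minimal sketch, with made-up names, of the element-wise product I mean:)

// hypothetical element-wise product: c[i] = a[i] * b[i]
__global__ void vecMul(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// launched as, e.g.: vecMul<<<(n+255)/256, 256>>>(a_d, b_d, c_d, n);

(…but I was hoping NPP had a built-in for it.)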

Thanks in advance everyone!
-TCubed

What exactly confuses you? The device has a dedicated unit (a DMA engine; on the Fermi architecture, actually, two of them) whose purpose is to transfer data to and from video memory. Once it is done copying the first block, it informs the part responsible for scheduling kernels, the kernel from stream 1 begins execution, and the transfer unit moves on to the next block.
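(If you want to check this on your own card, you can query the device properties; a quick sketch using the runtime API:)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    // asyncEngineCount is the number of copy (DMA) engines:
    // 1 allows copy/kernel overlap, 2 also allows overlapping both copy directions
    printf("%s: %d copy engine(s)\n", prop.name, prop.asyncEngineCount);
    return 0;
}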

Hamster,

Thanks for your reply. My confusion is the following: first off, do all the streams in the first for loop that copy data to the device run in parallel (at the same time)? It seems like they do not, so the second part of the confusion is: once stream1 runs in the first loop, how can the kernel on stream1 from the second loop run if the first loop hasn't iterated through all its copies yet?…

Thanks,

No, the streams in the first for loop don't run in parallel; they run sequentially. As to the second part, all that cudaMemcpyAsync does is queue a request for a memory transfer; it does not wait for the transfer to occur (that's the whole purpose of async). So it's quite fast, and your code immediately moves on to the second loop.

Similarly, the act of launching the kernel (the line with <<< >>>) is also quite fast, because it also just queues the request, but does not wait for its completion. So if you put that whole fragment into an actual program, you'll find that the whole fragment only takes something like 0.1 seconds to execute, even though the list of requests it generated may take much longer for the GPU to fulfill.
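(One easy way to see this for yourself is to time the fragment on the host; a rough sketch, assuming the streams and buffers are set up as in your snippet:)

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

clock_t t0 = clock();
// ... the two loops that queue the copies and kernel launches go here ...
clock_t t1 = clock();     // returns almost immediately: work is only queued

cudaDeviceSynchronize();  // now actually wait for the GPU to finish everything
clock_t t2 = clock();     // t2 - t1 is where the real work shows up

printf("queueing: %.3f s, execution: %.3f s\n",
       (double)(t1 - t0) / CLOCKS_PER_SEC,
       (double)(t2 - t1) / CLOCKS_PER_SEC);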

Hamster,

Thanks again - so let me get this straight:

The streams in the for loop do not run in parallel: the first for loop simply goes through them one by one and queues up the 4 copies from host to device. It then moves on to the second for loop and queues up the 4 kernel launches to process the data. Since stream1 is busy copying, the kernel queued on stream1 doesn't start until stream1's copy has finished. However, once stream1 has finished copying and is busy executing its kernel, stream2 can begin to copy its data, etc. So streams can run in parallel (stream1 running a kernel while stream2 copies new data), as the figure from the document is showing…

Is this correct?

Thanks,

That is correct. Just one minor clarification: kernels from different streams can run concurrently too, so if the second stream is done copying before the first stream is done executing its kernel, the second kernel may be launched in parallel with the first. (Of course, you still have the same number of compute units, which will now need to be shared between the two kernels, so that may not give you any performance advantage.)
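(Whether your device supports this is also a queryable property; a small sketch, with a dummy kernel, of two launches queued on different streams:)

#include <cuda_runtime.h>

__global__ void busyKernel(float *x) { /* dummy work */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // concurrentKernels is nonzero if the device can run kernels side by side
    if (prop.concurrentKernels) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        float *x_d, *y_d;
        cudaMalloc(&x_d, 1024 * sizeof(float));
        cudaMalloc(&y_d, 1024 * sizeof(float));
        // the hardware scheduler may overlap these two launches
        busyKernel<<<4, 256, 0, s1>>>(x_d);
        busyKernel<<<4, 256, 0, s2>>>(y_d);
        cudaDeviceSynchronize();
    }
    return 0;
}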

Hamster,

Thanks once again. When you say the second kernel may be launched in parallel with the first, is this something done by the device itself, or something the user must program in? I was not aware that that was possible…

-TCubed