Some CUDA/GPU implementation related questions

Aditi · May 28, 2009, 10:42pm

Hi,

I have the following questions, which occurred to me while trying to optimize my CUDA code:

1. If the number of threads in a block is an exact multiple of the warp size, but the number of threads performing the task (a set of operations) is less than a warp size (and the rest are idle, with no task for the “else” condition), I believe the execution of threads is essentially serial. But what is the impact if the number of threads in a block is not an exact multiple of the warp size (32)? Does it also lead to serial execution of all the threads?
2. Streams of GPU kernels overlap execution in the sense that while one stream is executing its kernel, the next one is copying data to global memory. What if there is not enough memory left for the second stream (since the first one is also occupying some memory) to transfer its data?
3. I believe that global memory resides in the SDRAM on the GPU chip, and that the SDRAM allows the GPU and CPU to read/write only one location at a time (based on the address-bus signal). So, if overlap between multiple streams called for simultaneous access to the on-card SDRAM by the CPU and the GPU, would one of them actually stall? If one stalls, what happens when one stream is memcpy-ing data into global memory while the other is performing computations that access global memory? If global memory is accessed frequently, will the CPU or the GPU slow down due to interruptions (if simultaneous access is not allowed), i.e. which one gets preference?
4. Is it possible to have different kernels executed by different streams? If not, should I use asynchronous memcpys to overlap memcpy with the execution of different kernels? Or is there another way?

Would be thankful if someone could reply.

Thanks & regards,

Aditi

Jamie_K · May 29, 2009, 12:29am

Based on my understanding,

1. I don’t think having partially-filled warps causes serialization. If you have 2 threads, or 31 threads in a warp, they will still run in parallel, and the other unused threads will do nothing. The ones that do work will still run parallel, not serial.
2. By not enough memory do you mean cudaMalloc failed? Or do you mean the asynchronous copy fails somehow for lack of memory? I don’t know that the latter is possible. The copy may fail if it is given an invalid pointer or perhaps for other reasons, but why would it need device memory other than the destination buffer?
3. Bandwidth to the memory is limited. If multiple multiprocessors attempt to use the same portion of device memory, I would expect their throughput to decrease. I would expect the same with asynchronous transfers from the host, but the bandwidth from host to device is much lower, so I doubt asynchronous host transfers could hurt kernel SM-to-memory bandwidth very much.
4. My understanding is that no two kernels can execute at the same time, though they could execute in both streams, just one will block waiting for the other to complete. You may want to use asynchronous copies if you spend a significant fraction of the time transferring back and forth to the host. This depends on your particular problem.

Sarnath · May 29, 2009, 5:13am

It’s a bad idea to use “non-multiples” of 32 as the block size. As for the serialization question, Jamie has answered it correctly.

But here is why non-multiples of 32 are a bad idea.

A multiprocessor has 8 processors. To execute a warp, each processor executes an instruction for 4 threads. Thus 8 processors execute 1 instruction for 32 threads, and that instruction is the same for all of them.

This remains the same whether or not all threads participate in the instruction. Even if only 1 thread participates, all 8 processors execute that instruction for all 32 threads in the warp, except that it is a no-op for the threads that don’t participate.

So,

1. If you have a block size that is not a multiple of 32, you have at least 1 warp that does not have all 32 threads active at all times. So you are just wasting clock cycles.

2. I think you would first call cudaMallocHost() before initiating stream copies. So if there is no pinned memory available, you will know before the copies start.

3. No idea. Good question. In any case, copying data while the kernel is executing is generally considered a good idea for performance.

But in your case, you should first work on getting your kernels running faster. With 240 cores or so, you need to be at least 50x to 100x faster than a single-core CPU (like a 2 GHz or 3 GHz CPU). Then one can think of optimizing via overlapped memcopies. If you use double precision, 8x to 40x would be great, I think (depending on your arithmetic intensity).

4. Yes, I think so. Different streams can execute different kernels. I think streams were introduced to support this overlapped memcpy, I guess, because within a single stream individual operations always happen one after another. If you use “cudaMemcpy” with streams, you get control back only after cudaMemcpy finishes, which spoils the whole idea of having streams.

Also, async memcopy only works with “pinned” memory, so you will get good performance as well. It is super-fast and does not eat CPU cycles. The card pulls the data from RAM.
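To put numbers on the point about block sizes that are not multiples of 32, here is a small host-side sketch. It is plain C++ that also compiles unchanged inside a .cu file. The helper names (`warps_per_block`, `idle_lanes`, `round_up_to_warp`) are made up for illustration and are not part of the CUDA API, and the warp size of 32 is assumed from the discussion above.

```cpp
// Host-side helpers; the names are illustrative, not part of the CUDA API.
constexpr int kWarpSize = 32;  // warp size assumed from the discussion above

// Number of warps the hardware has to schedule for a block of `threads` threads.
constexpr int warps_per_block(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Lanes of the last warp that have no thread assigned and therefore sit idle.
constexpr int idle_lanes(int threads) {
    return warps_per_block(threads) * kWarpSize - threads;
}

// Round a requested block size up to the next multiple of the warp size.
constexpr int round_up_to_warp(int threads) {
    return warps_per_block(threads) * kWarpSize;
}
```

For a block of 100 threads this gives 4 warps with 28 idle lanes in the last warp, while a block of 128 threads fills the same 4 warps completely.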

Aditi · May 29, 2009, 11:01pm

Thanks, everyone, for the replies. I got good answers to most of my questions. However, a few doubts still remain:

About my question 2: cudaMallocHost() allocates pinned memory on the CPU, while I am talking about global memory, which is device memory. So my question is: suppose stream[0] is already occupying 70% of the global memory. Now stream[1] tries to do an asynchronous memcpy into global memory while stream[0] executes, but there is not enough global memory available. What then? Do I need to ensure that I use streams only if there is enough memory available for at least two of them? This question does not apply to stream programming alone; the same situation could arise if I use an asynchronous memcpy instead of a plain memcpy.
About my question 4: I think I wasn’t able to put my question clearly. Consider the following code snippet:

I was doubtful whether I could do something like this, because I did not come across any example:

From Sarnath’s reply, it seems I can.

Aditi · May 29, 2009, 11:23pm

I had one more important question: When I compile my .cu code in the execution mode with -ptx option, I see something like this:

Please note that I am not using any shared memory or constant variable in my code. So where do “44+40 bytes smem” and “4 bytes cmem[1], 44 bytes cmem[14]” come from?

Shared memory is already a limited resource, and the amount of this smem varies between kernels. Should I account for these bytes when allocating shared memory?

Thanks,

Aditi

Sarnath · May 30, 2009, 2:49am

Copying memory requires two ends. If you have called cudaMallocHost() on the CPU, you should also call cudaMalloc() on the GPU and then initiate the copy. If you don’t have space, that cudaMalloc() will already have failed before the copy. The copy operation itself does not involve any allocation, so I don’t understand what you are worried about.
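The point that a lack of device memory shows up at allocation time, not at copy time, can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the helper name `allocDeviceBuffer` and the 64 MB buffer size are assumptions made for illustration, while `cudaMemGetInfo`, `cudaMalloc`, `cudaGetErrorString`, and `cudaFree` are CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocate `bytes` of device memory and report a failure
// here, so that no later copy is issued against an invalid pointer.
static float* allocDeviceBuffer(size_t bytes)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);  // informational only
    printf("device memory: %zu of %zu bytes free, requesting %zu\n",
           freeBytes, totalBytes, bytes);

    float* devPtr = NULL;
    cudaError_t err = cudaMalloc((void**)&devPtr, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return devPtr;
}

int main()
{
    const size_t bytes = 64 * 1024 * 1024;  // assumed per-stream buffer size

    // One buffer per stream, allocated up front. If the second allocation
    // does not fit, we find out here, before any asynchronous copy starts.
    float* bufA = allocDeviceBuffer(bytes);
    float* bufB = allocDeviceBuffer(bytes);
    if (bufA == NULL || bufB == NULL) {
        cudaFree(bufA);  // cudaFree(NULL) is a harmless no-op
        cudaFree(bufB);
        return 1;
    }

    // ... issue cudaMemcpyAsync calls and kernel launches here ...

    cudaFree(bufA);
    cudaFree(bufB);
    return 0;
}
```

With both buffers allocated before the streams are started, the copies themselves only move data into memory that is already reserved.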

Kernel launches take a stream argument as well. cudaMemcpyAsync also takes a stream argument.

See the example below carefully.

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
{
    cudaStreamCreate(&stream[i]);
}
float* hostPtr;
cudaMallocHost((void**)&hostPtr, 2 * size);

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:

for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < 2; ++i)
{
    myKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
}
for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaThreadSynchronize();
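The example above launches the same kernel in both streams. For the question about running different kernels in different streams, a self-contained variant might look like the sketch below. It is only a sketch under stated assumptions: the kernels `scaleKernel` and `offsetKernel`, the element count, and the launch configuration are made up for illustration, and error checking is left out for brevity. It calls `cudaDeviceSynchronize()`, which later CUDA releases provide as the replacement for the now-deprecated `cudaThreadSynchronize()`. Whether the copies and kernels actually overlap depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two deliberately different (hypothetical) kernels, one per stream.
__global__ void scaleKernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

__global__ void offsetKernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 3.0f;
}

int main()
{
    const int n = 1 << 20;                   // elements per stream (assumed)
    const size_t bytes = n * sizeof(float);  // bytes per stream

    float *hostPtr, *inDev, *outDev;
    cudaMallocHost((void**)&hostPtr, 2 * bytes);  // pinned host memory
    cudaMalloc((void**)&inDev, 2 * bytes);
    cudaMalloc((void**)&outDev, 2 * bytes);
    for (int k = 0; k < 2 * n; ++k) hostPtr[k] = 1.0f;

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    const int threads = 256;  // a multiple of the warp size
    const int blocks = (n + threads - 1) / threads;

    for (int s = 0; s < 2; ++s) {
        float* h   = hostPtr + s * n;  // offsets counted in elements
        float* in  = inDev + s * n;
        float* out = outDev + s * n;

        cudaMemcpyAsync(in, h, bytes, cudaMemcpyHostToDevice, stream[s]);
        if (s == 0)
            scaleKernel<<<blocks, threads, 0, stream[s]>>>(out, in, n);
        else
            offsetKernel<<<blocks, threads, 0, stream[s]>>>(out, in, n);
        cudaMemcpyAsync(h, out, bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();  // wait for both streams to finish

    printf("stream 0: %f, stream 1: %f\n", hostPtr[0], hostPtr[n]);

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
    cudaFree(inDev);
    cudaFree(outDev);
    cudaFreeHost(hostPtr);
    return 0;
}
```

Within each stream the copy in, the kernel, and the copy out still run in order; only work from different streams is allowed to overlap.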

Sarnath · May 30, 2009, 2:52am

Parameters occupy shared memory space, so it is probably just your arguments. Also, some 20 or 24 bytes of shared memory or so are always reserved; you can’t use the entire 16K. There was a thread recently talking about this.

Just ignore those extra bytes. You should not bother about them.
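If you would rather check those numbers from inside the program than from the compiler output, the runtime can report them per kernel. The following is a rough sketch, assuming a placeholder kernel named `myKernel`; `cudaFuncGetAttributes` and `cudaGetDeviceProperties` are CUDA runtime calls. The values reported depend on the GPU architecture and compiler version; on later architectures, kernel arguments are passed through constant memory and no longer count toward shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel that declares no shared memory of its own.
__global__ void myKernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    // Per-kernel resource usage as seen by the runtime.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("constant memory:                %zu bytes\n", attr.constSizeBytes);
    printf("registers per thread:           %d\n", attr.numRegs);

    // Total shared memory available to one block on device 0, for comparison.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory per block limit:  %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

Comparing the kernel's static usage against the per-block limit shows how much room is left for your own shared memory.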
