Some CUDA/GPU implementation related questions


I have the following questions, which struck me while trying to optimize my CUDA code:

  1. If the number of threads in the block is an exact multiple of warp-size but the number of threads performing the task (set of operations) is less than a warp-size (and the rest are idle…no task for the “else” condition), I believe the execution of threads is essentially serial then. But what is the impact if the number of threads in any block is not an exact multiple of the warp-size (32)? Does it also lead to serial execution of all the threads?

  2. Streams of GPU kernels overlap execution in the sense that when one stream is performing the kernel execution, the next would be performing data copying to global memory. What if there is not enough memory left for the second stream (since the first one is also occupying some memory) to transfer data?

  3. I believe that the global memory resides in the SDRAM on the GPU card, and the SDRAM would allow simultaneous read/write of only one location (based on the address-bus signal) by the GPU and CPU. So, if overlap by multiple streams called for simultaneous access of the on-card SDRAM by the CPU and the GPU, would one of them actually stall? If one stalls, how does that play out when one stream is memcpy-ing data onto the global memory while the other is performing computations that access the global memory? If there is frequent access to the global memory, will the CPU slow down or the GPU, due to interruptions (if simultaneous access is not allowed), i.e. who gets preference?

  4. Is it possible to have different kernels being executed by different streams? If not, should I use asynchronous memcpy-ies to overlap memcpy and execution of different kernels? Or is there another way?

Would be thankful if someone could reply.

Thanks & regards,


Based on my understanding,

  1. I don’t think having partially-filled warps causes serialization. If only 2, or 31, of the threads in a warp do work, they still run in parallel; the unused threads simply do nothing. The threads that do work execute in parallel, not serially.

  2. By not enough memory do you mean cudaMalloc failed? Or do you mean the asynchronous copy fails somehow for lack of memory? I don’t know that the latter is possible. The copy may fail if it is given an invalid pointer or perhaps for other reasons, but why would it need device memory other than the destination buffer?

  3. Bandwidth to the memory is limited. If multiple multiprocessors attempt to use the same portion of device memory, I would expect their throughput to decrease. I would expect the same with asynchronous transfers from the host, but the bandwidth from host to device is much lower, so I doubt asynchronous host transfers could hurt kernel SM-to-memory bandwidth very much.

  4. My understanding is that no two kernels can execute at the same time. You can issue them in different streams, but one will block, waiting for the other to complete. You may want to use asynchronous copies if you spend a significant fraction of the time transferring back and forth to the host. This depends on your particular problem.
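On point 1, here is a minimal sketch of a partially active warp (the kernel name and the cutoff of 16 are invented for illustration):

```cuda
// Only half of the warp takes the branch. The 16 active threads still execute
// in lockstep, in parallel; threads 16..31 are simply masked off (no-ops) for
// these instructions -- they do not force the active threads to run serially.
__global__ void partialWarp(float *out)
{
    int tid = threadIdx.x;
    if (tid < 16) {
        out[tid] = 2.0f * tid;  // executed in parallel by the 16 active threads
    }
    // no else branch: the remaining threads of the warp are idle here
}
```

Divergence only costs extra passes when both sides of a branch have active threads; a branch where the inactive threads do nothing just wastes those lanes.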

It’s a bad idea to use non-multiples of 32 as the block size. For the serial question, Jamie has answered it correctly.

But here is why non-multiples of 32 are a bad idea.

A multiprocessor has 8 processors. To execute a warp, each processor executes an instruction for 4 threads. Thus 8 processors execute 1 instruction for 32 threads, and this instruction is the same for all of them.

This remains the same whether all threads participate in the instruction or not. Even if only 1 thread participates, all 8 processors would execute that instruction for all 32 threads in that warp, except that it will be a no-op for the threads that don’t participate.


If you use a block size which is not a multiple of 32, you have at least 1 warp which does NOT have all 32 threads active. So you are just wasting clock cycles.

I think you would first do a cudaMallocHost() before initiating the stream copies. So, if there is no pinned memory available, you will know before the copies start.

No idea. Good question. In any case, copying data while the kernel is executing is generally considered a good idea for performance.

But in your case, you should first work on getting your kernels running faster. With 240 cores or so, you need to be at least 50x to 100x faster than a single-core CPU (at 2 GHz or 3 GHz). Then one can think of optimizing via overlapped memcopies. But if you use double precision, 8x to 40x would be great, I think (depending on your arithmetic intensity).

Yes, I think so. Different streams can execute different kernels. I believe streams were introduced precisely to support this overlapped-memcpy pattern, because within a single stream individual operations always happen one after another. If you use cudaMemcpy inside a stream, you only get control back after cudaMemcpy finishes, which spoils the whole idea of having streams.

Also, async memcpy only works with pinned memory, so you will get good performance as well. It is very fast and does NOT eat CPU cycles: the card pulls the data from RAM itself.
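A sketch of that pattern (error checking omitted; the buffer size is arbitrary):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_buf, N * sizeof(float)); // pinned (page-locked) host memory
    cudaMalloc((void **)&d_buf, N * sizeof(float));     // device (global) memory

    // Returns immediately; the DMA engine pulls the data without burning CPU cycles.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream); // wait only when you actually need the data on the device

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```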

Thanks everyone for the replies. I got a good answer for most of my questions. However, a few doubts still remain:

  1. About my question 2: cudaMallocHost() is for allocating pinned memory on the CPU, while I am talking about global memory, which is device memory. So my question is: suppose stream[0] is already occupying 70% of the global memory. Now stream[1] tries to do an asynchronous memcpy onto the global memory while stream[0] executes, but there is not enough global memory available. What then? Do I need to ensure that I use streams only when there is enough memory available for at least two of them? This question does not apply to stream programming alone; the same situation could be encountered if I am using an async memcpy instead of a plain memcpy.

  2. About my question4: I think I haven’t been able to put forward my question clearly. Consider the following code snippet:

I was doubtful whether I could do something like this, because I did not come across any example:

From Sarnath’s reply, seems I can.

I had one more important question: When I compile my .cu code in the execution mode with -ptx option, I see something like this:

Please note that I am not using any shared memory or constant variable in my code. So where do “44+40 bytes smem” and “4 bytes cmem[1], 44 bytes cmem[14]” come from?

We already have limited shared memory resources, and these smem figures vary for different kernels/codes. Should I account for these bytes when trying to allocate shared memory?



Copying memory requires 2 ends. If you have done a cudaMallocHost() on the CPU, you should also be doing a cudaMalloc() on the GPU, and only then initiate the copy. If you don’t have space, that cudaMalloc() would have failed already, before any copying. The copy operation itself does NOT involve any allocation, so I don’t understand what you are worrying about.
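If you do want to check up front rather than waiting for a cudaMalloc() to fail, you can query the free global memory first and fall back to a single stream (or smaller chunks) when two streams’ worth of buffers won’t fit. A sketch; the 64 MB per-stream figure is invented:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes); // free/total global memory on the device

    const size_t per_stream = 64 * 1024 * 1024; // whatever one stream's buffers need
    if (free_bytes < 2 * per_stream) {
        // Not enough global memory for two streams' buffers:
        // use a single stream, or process the data in smaller chunks.
        fprintf(stderr, "only %zu bytes free; falling back to one stream\n", free_bytes);
    }
    return 0;
}
```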

Kernel launches take a stream argument as well, and cudaMemcpyAsync also takes a stream argument.

See below the example carefully.
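A sketch of the pattern (kernelA/kernelB and the sizes are placeholders, not the original poster’s code):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *p) { p[threadIdx.x] += 1.0f; }  // placeholder work
__global__ void kernelB(float *p) { p[threadIdx.x] *= 2.0f; }  // placeholder work

int main()
{
    const int n = 256;
    float *d_a, *d_b;
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));

    // The fourth launch-configuration parameter is the stream. Work queued in
    // s1 (copies or kernels) can overlap with kernelA running in s0.
    kernelA<<<1, n, 0, s0>>>(d_a);
    kernelB<<<1, n, 0, s1>>>(d_b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```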

Parameters occupy shared memory space; probably it is just your arguments. Also, some 20 or 24 bytes or so of shared memory is always reserved, so you can’t use the entire 16 KB. There was a thread recently talking about this.

Just ignore those extra things. You should not bother about it.