Question regarding warp efficiency...

I’m using CUDA to do 2D convolution on an RGBA image (MxM, each channel a 32-bit float) with a separate kernel (say NxN). The convolution basically does the following: for each pixel, it takes the NxN neighborhood (with the pixel at the center of this neighborhood), performs the NxN multiplications, and adds up the results.

In order to take advantage of the shared memory, I first fetch the relevant data into it and then perform the necessary multiplications and additions. In my experiments I use N=15 and a block of 16x16 threads. So I fetch a 30x30 tile for this 16x16 block of pixels, perform the MADDs, and store the result. The shared memory used for the tile is therefore 30x30x4x4 bytes ~ 14.4KB. The kernel takes up about 15x15x4 bytes ~ 1KB. So only one block can reside in shared memory per multiprocessor.
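For anyone checking the arithmetic, here is a minimal host-side sketch of the footprint calculation. The function names (tileSide, tileBytes, filterBytes) are made up for illustration and are not part of CUDA; the inputs are just the numbers quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Side length of the input tile a square thread block has to stage so that
// every pixel in the block sees its full N x N neighborhood.
std::size_t tileSide(std::size_t blockSide, std::size_t filterSide) {
    assert(blockSide >= 1 && filterSide >= 1);
    return blockSide + filterSide - 1;
}

// Shared-memory bytes for the staged image tile.
std::size_t tileBytes(std::size_t blockSide, std::size_t filterSide,
                      std::size_t channels, std::size_t bytesPerChannel) {
    std::size_t side = tileSide(blockSide, filterSide);
    return side * side * channels * bytesPerChannel;
}

// Bytes for the N x N filter itself, one value per tap.
std::size_t filterBytes(std::size_t filterSide, std::size_t bytesPerTap) {
    return filterSide * filterSide * bytesPerTap;
}
```

With the values above, tileSide(16, 15) is 30, tileBytes(16, 15, 4, 4) is 14400 bytes, and filterBytes(15, 4) is 900 bytes, so one block needs roughly 15.3KB in total.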

Now while profiling, if I exclude this copying into the shared memory (so basically, I read junk values for my convolution) and profile the rest of the code, I get a warp occupancy of 0.333. I can’t understand the reason… All my threads do the same amount of work, and there are 256 of them in a block. Also, there are no bank conflicts. Can someone please suggest what’s going on?

Are you sure the compiler didn’t optimize away something here because it sees that you never touch the shared mem array? Check the .ptx assembly.


I checked the .ptx, and the compiler does not prune away any of the computation…

Occupancy as reported by the profiler is a bit under-documented right now. Basically, each multiprocessor (on G80) can support 24 32-thread warps at a time. If you have 256 threads per block, and you are using 14K of the 16K shared memory per block, then you will have only 8 warps active per multiprocessor. Thus you are at 33% of the total the machine can support. (This doesn’t take into account the number of blocks – if you have fewer thread blocks than there are multiprocessors, your actual occupancy is even lower.)
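As a rough sketch of that calculation (not the profiler's actual formula), the following host-side C++ uses only the G80 limits quoted above: 16KB of shared memory and 24 warps of 32 threads per multiprocessor. The function names are hypothetical, and register pressure and any cap on blocks per multiprocessor are deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>

// G80 limits as quoted in this thread; other GPUs differ.
const int kSharedBytesPerSM = 16 * 1024;
const int kMaxWarpsPerSM = 24;
const int kWarpSize = 32;

// Warps needed by one block, rounding a partial warp up to a whole one.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Blocks that fit on one multiprocessor, limited by shared memory and by
// the warp budget only (registers are ignored in this sketch).
int residentBlocks(int threadsPerBlock, int sharedBytesPerBlock) {
    assert(sharedBytesPerBlock > 0);
    int bySharedMem = kSharedBytesPerSM / sharedBytesPerBlock;
    int byWarps = kMaxWarpsPerSM / warpsPerBlock(threadsPerBlock);
    return std::min(bySharedMem, byWarps);
}

// Warps resident on one multiprocessor at a time.
int activeWarps(int threadsPerBlock, int sharedBytesPerBlock) {
    return residentBlocks(threadsPerBlock, sharedBytesPerBlock) *
           warpsPerBlock(threadsPerBlock);
}

// Occupancy as a fraction of the warp budget.
double occupancy(int threadsPerBlock, int sharedBytesPerBlock) {
    return static_cast<double>(activeWarps(threadsPerBlock, sharedBytesPerBlock)) /
           kMaxWarpsPerSM;
}
```

For 256 threads and about 15300 bytes of shared memory per block, residentBlocks returns 1, activeWarps returns 8, and occupancy comes out at 8/24, which is the 0.333 the profiler reported.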

Higher occupancy means you can more easily hide latency when one block is stalled by a global load, because when one block is stalled, another block can be computing. So you might try breaking your computation into smaller chunks.

If, however, you are not bottlenecked by global loads, then it may not help to increase your occupancy. The best practice is to experiment, and to make your app parameterized so you can adjust (possibly automatically) for different GPUs.



I am trying to do the same convolution operation with 256 threads per block but using 8 * 1024 bytes of shared memory, which should allow 66% occupancy, but I got 33%. When I use only 8 * 1024 - 24 bytes, I get the expected 66%.

Do you have any idea what’s wrong?

The manual says:

What’s this “statically allocated memory”? How do I know how much memory is actually allocated?

Thank you for your help.

– Ben

Quick clarification. The manual states that the maximum number of threads per block is 512, but doesn’t state the maximum number of threads per multiprocessor. I guess what you mean is that this number is 768.

Also, according to the manual, the number of registers per thread is a multiple of 64. So the register file is something like 64 * 768 * 4 bytes ~ 192KB?

According to David Kirk’s PPT slides from the ECE498 course at UI Urbana-Champaign (lectures 8-9), the register file is 32KB.


I suspect that something you changed to increase shared memory by 24 bytes also caused register usage per thread to increase (you can confirm by running nvcc with the -cubin option and looking in the .cubin for the number of registers for your kernel before and after the change). Otherwise occupancy shouldn’t have dropped for that increase.

Since the max threads per multiprocessor is 768, with 256-thread blocks your occupancy will always be one of 100%, 67%, 33%, or 0% (the last only if you get a launch failure due to running out of registers).
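To illustrate how register usage alone could move a kernel between those levels, here is a small sketch. It assumes the 32KB register file mentioned earlier in this thread holds 4-byte registers (8192 in total) and that registers are handed out per thread with no rounding. Real hardware allocates in coarser units, so treat the result as an optimistic estimate; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed G80 figures: 32KB register file of 4-byte registers, and the
// 768-thread limit per multiprocessor quoted in this thread.
const int kRegistersPerSM = 32 * 1024 / 4;
const int kMaxThreadsPerSM = 768;

// Blocks resident on one multiprocessor when only registers and the
// thread limit matter (shared memory is ignored in this sketch).
int blocksByRegisters(int threadsPerBlock, int regsPerThread) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    int byRegs = kRegistersPerSM / (threadsPerBlock * regsPerThread);
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    return std::min(byRegs, byThreads);
}

// Occupancy in percent, rounded to the nearest integer.
int occupancyPercent(int threadsPerBlock, int regsPerThread) {
    int threads = blocksByRegisters(threadsPerBlock, regsPerThread) * threadsPerBlock;
    return (100 * threads + kMaxThreadsPerSM / 2) / kMaxThreadsPerSM;
}
```

Under these assumptions, a 256-thread block gets 100% at up to 10 registers per thread, 67% from 11 to 16, 33% from 17 to 32, and cannot launch at all above 32.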


Apparently the input params are put into shared memory.

Mark, could you confirm this?

Also, about the registers: are they shared among the different blocks in the same way as shared memory? That could be another reason why I don’t get 100% occupancy.

Yes, registers are a shared resource, and affect occupancy. More on that to come.

Yes, parameters are passed via shared memory.
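That would explain the 8 * 1024 byte case above: a block that asks for exactly half of the 16K also carries a few bytes of parameters, so two blocks no longer fit. A minimal sketch of the arithmetic follows. The 24-byte overhead is only an illustrative value taken from the observation earlier in the thread; the real amount depends on the kernel's parameter list, and the function name is hypothetical.

```cpp
#include <cassert>

// Shared memory per multiprocessor on G80, as quoted in this thread.
const int kSharedBytesPerMultiprocessor = 16 * 1024;

// Blocks that fit in shared memory once the per-block overhead (for
// example kernel parameters) is added to what the kernel itself requests.
int blocksFittingInSharedMem(int userSharedBytes, int overheadBytes) {
    assert(userSharedBytes >= 0 && overheadBytes >= 0);
    assert(userSharedBytes + overheadBytes > 0);
    return kSharedBytesPerMultiprocessor / (userSharedBytes + overheadBytes);
}
```

With no overhead, blocksFittingInSharedMem(8 * 1024, 0) is 2. With the assumed 24 bytes of overhead it drops to 1, and trimming the request to 8 * 1024 - 24 bytes brings it back to 2.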