Question regarding warp efficiency...

I’m using CUDA to do a 2D convolution on an RGBA image (MxM, each channel a 32-bit float) with a separate NxN kernel. The convolution does the following: for each pixel, it gathers the NxN neighborhood centered on that pixel, performs the NxN multiplications, and adds up the results.

In order to take advantage of shared memory, I fetch the relevant data into it and perform the necessary multiplications and additions there. In my experiments I use N=15 and a block of 16x16 threads, so I fetch a 30x30 tile for each 16x16 block of pixels, perform the MADDs, and store the result. The shared memory used is therefore 30x30x4x4 bytes ~ 14.4KB, and the convolution kernel takes another 15x15x4 ~ 1KB. So only one block can reside in shared memory per multiprocessor.

Now, while profiling, if I exclude the copy into shared memory (so I effectively read junk values for the convolution) and profile the rest of the code, I get a warp occupancy of 0.333. I can’t understand why: all my threads do the same amount of work, there are 256 of them per block, and there are no bank conflicts. Can someone please suggest what’s going on?

Are you sure the compiler didn’t optimize away something here because it sees that you never touch the shared mem array? Check the .ptx assembly.

Peter

I checked the .ptx, and the compiler does not prune away any of the computation…

Occupancy as reported by the profiler is a bit under-documented right now. Basically, each multiprocessor (on G80) can support 24 32-thread warps at a time. If you have 256 threads per block and you are using 14K of the 16K of shared memory per block, then you will have only 8 warps active per multiprocessor. Thus you are at 33% of the total the machine can support. (This doesn’t take into account the number of blocks: if you have fewer thread blocks than there are multiprocessors, your actual occupancy is even lower.)

Higher occupancy means you can more easily hide latency: when one block is stalled by a global load, another block can be computing. So you might try breaking your computation into smaller chunks.

If, however, you are not bottlenecked by global loads, then it may not help to increase your occupancy. The best practice is to experiment, and to make your app parameterized so you can adjust (possibly automatically) for different GPUs.

Mark

Hello,

I am trying to do the same convolution operation with 256 threads per block, but using 8*1024 bytes of shared memory, which should allow 66% occupancy; instead I get 33%. When I use only 8*1024-24 bytes, I get the expected 66%.

Do you have any idea what’s wrong ?

The manual says:

What’s this “statically allocated memory” ? How do I know how much memory is actually allocated ?

Thank you for your help.

– Ben

Quick clarification: the manual states the maximum number of threads per block is 512, but doesn’t state the maximum number of threads per multiprocessor. I guess what you mean is that that number is 768.

Also, according to the manual, the number of registers per thread is a multiple of 64. So the register file is something like 64*768*4 bytes ~ 192KB?

According to David Kirk’s PPT slides from the ECE498 course at UI Urbana-Champaign (lectures 8-9), the register file is 32KB.

Paulius

I suspect that whatever you changed to increase shared memory by 24 bytes also caused register usage per thread to increase (you can confirm by running nvcc with the -cubin option and comparing the register count for your kernel in the .cubin before and after the change). Otherwise, occupancy shouldn’t have dropped for that increase.

Since the max threads per multiprocessor is 768, with 256-thread blocks your occupancy will always be one of 100%, 67%, 33%, or 0% (the last only if you get a launch failure due to running out of registers).

Mark

It seems that the input params are put into shared memory.

Mark, could you confirm this?

Also, about the registers: are they shared among the different blocks in the same way as shared memory? Meaning it could be another reason that I don’t get 100% occupancy?

Yes, registers are a shared resource, and affect occupancy. More on that to come.

Yes, parameters are passed via shared memory.

Mark