I’m using CUDA to do 2D convolution on an RGBA image (MxM, each channel a 32-bit float) with a separate convolution kernel (say NxN). The convolution works as follows: for each pixel, it takes the NxN neighborhood centered on that pixel, performs the NxN multiplications, and adds up the results.
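For concreteness, here is a simplified sketch of what the plain (non-shared-memory) version computes; the identifiers (convolveNaive, d_in, d_out, d_kern) and the clamp-to-border policy are illustrative, not my exact code:

```
// Naive 2D convolution: one thread per output pixel, one float4 (RGBA) per pixel.
// d_in and d_out are MxM images, d_kern is the NxN filter (N odd).
__global__ void convolveNaive(const float4 *d_in, float4 *d_out,
                              const float *d_kern, int M, int N)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= M || y >= M) return;

    int r = N / 2;  // filter radius, so the pixel sits at the center
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);

    for (int ky = -r; ky <= r; ++ky) {
        for (int kx = -r; kx <= r; ++kx) {
            // Clamp reads at the image border (illustrative boundary policy).
            int sx = min(max(x + kx, 0), M - 1);
            int sy = min(max(y + ky, 0), M - 1);
            float4 p = d_in[sy * M + sx];
            float  w = d_kern[(ky + r) * N + (kx + r)];
            acc.x += w * p.x;  acc.y += w * p.y;
            acc.z += w * p.z;  acc.w += w * p.w;
        }
    }
    d_out[y * M + x] = acc;
}
```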
To take advantage of shared memory, I fetch the relevant data into it and perform the multiplications and additions from there. In my experiments I use N=15 and a block of 16x16 threads, so I fetch a 30x30 tile of pixels for each 16x16 block of output pixels, perform the MADDs, and store the results. The image tile uses 30x30x4x4 bytes ~ 14.4KB of shared memory, and the kernel takes up another 15x15x4 bytes ~ 1KB, so only one block’s worth of data fits in shared memory at a time (i.e., one resident block per multiprocessor).
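And here is roughly what the tiled kernel looks like (again a simplified sketch; the names, the cooperative-load pattern, and the border handling are illustrative):

```
#define TILE   16                     // threads per block side
#define RADIUS 7                      // N/2 for N = 15
#define SMEM   (TILE + 2 * RADIUS)    // 30: tile plus apron

__global__ void convolveShared(const float4 *d_in, float4 *d_out,
                               const float *d_kern, int M)
{
    __shared__ float4 tile[SMEM][SMEM];                           // 30x30x16B = 14400B
    __shared__ float  kern[(2 * RADIUS + 1) * (2 * RADIUS + 1)];  // 15x15x4B  = 900B

    int tx = threadIdx.x, ty = threadIdx.y;
    int outX = blockIdx.x * TILE + tx;
    int outY = blockIdx.y * TILE + ty;

    // Cooperative load: 256 threads fill the 30x30 tile in strided passes.
    for (int sy = ty; sy < SMEM; sy += TILE)
        for (int sx = tx; sx < SMEM; sx += TILE) {
            int gx = min(max(blockIdx.x * TILE + sx - RADIUS, 0), M - 1);
            int gy = min(max(blockIdx.y * TILE + sy - RADIUS, 0), M - 1);
            tile[sy][sx] = d_in[gy * M + gx];
        }

    // Stage the filter in shared memory as well.
    for (int i = ty * TILE + tx; i < (2 * RADIUS + 1) * (2 * RADIUS + 1); i += TILE * TILE)
        kern[i] = d_kern[i];
    __syncthreads();

    if (outX >= M || outY >= M) return;

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int ky = 0; ky < 2 * RADIUS + 1; ++ky)
        for (int kx = 0; kx < 2 * RADIUS + 1; ++kx) {
            float4 p = tile[ty + ky][tx + kx];
            float  w = kern[ky * (2 * RADIUS + 1) + kx];
            acc.x += w * p.x;  acc.y += w * p.y;
            acc.z += w * p.z;  acc.w += w * p.w;
        }
    d_out[outY * M + outX] = acc;
}
```

Each thread loads up to four tile elements in the strided passes, since the apron-extended tile has 30x30 = 900 elements but the block only has 16x16 = 256 threads.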
Now, while profiling, if I exclude the copy into shared memory (so the convolution just reads junk values) and profile the rest of the code, I get a warp occupancy of 0.333. I can’t understand the reason: all my threads do the same amount of work, there are 256 of them per block, and there are no bank conflicts. Can someone please suggest what’s going on?