I have written a fairly straightforward kernel to do Bayer pattern interpolation (see http://en.wikipedia.org/wiki/Bayer_pattern). Accesses to global memory have high locality, which is why I chose the following design pattern (a rough sketch follows the list):
Allocate data with cudaMallocPitch
Upload data with cudaMemcpy2D
Bind pitched memory to a texture
Run kernel on the texture using tex2D to fetch data
Write results directly to another piece of allocated pitched global memory
Download results with cudaMemcpy2D
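Roughly, the host side looks like this. This is only a sketch of the steps above, not my actual code; `bayerTex`, `debayer` and `run` are hypothetical names, and a 640x480 8-bit frame is assumed (matching the 40x30 grid of 16x16 blocks in the profile):

```
// Rough sketch of the pipeline above; names and sizes are placeholders.
#include <cuda_runtime.h>

texture<unsigned char, 2, cudaReadModeNormalizedFloat> bayerTex;

__global__ void debayer(float *out, size_t outPitch, int width, int height);

void run(const unsigned char *hostIn, float *hostOut, int width, int height)
{
    // Allocate pitched memory and upload the raw Bayer frame.
    unsigned char *dIn; size_t inPitch;
    cudaMallocPitch((void **)&dIn, &inPitch, width, height);
    cudaMemcpy2D(dIn, inPitch, hostIn, width, width, height,
                 cudaMemcpyHostToDevice);

    // Bind the pitched buffer to the texture reference.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaBindTexture2D(NULL, bayerTex, dIn, desc, width, height, inPitch);

    // Run the kernel; it reads via tex2D and writes pitched output.
    float *dOut; size_t outPitch;
    cudaMallocPitch((void **)&dOut, &outPitch, width * sizeof(float), height);
    dim3 block(16, 16), grid(width / 16, height / 16);
    debayer<<<grid, block>>>(dOut, outPitch, width, height);

    // Download the result.
    cudaMemcpy2D(hostOut, width * sizeof(float), dOut, outPitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);

    cudaUnbindTexture(bayerTex);
    cudaFree(dIn);
    cudaFree(dOut);
}
```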
When profiling the kernel, I get the following results:
GPU Time: 4166.11,
CPU Time: 4299.73,
Occupancy: 0.667,
Grid Size X: 40,
Grid Size Y: 30,
Block Size X: 16,
Block Size Y: 16,
Block Size Z: 1,
dyn Shared Memory per block: 0,
static Shared Memory per block: 32,
registers per thread: 14,
StreamID: 0,
mem Transfer Size:,
mem Transfer host Mem Type: 0,
branch: 81600,
divergent branch: 24000,
instructions: 1392032,
warp serialize: 0,
cta_launched: 1200,
gld_coalesced: 0,
gst_coalesced: 460800,
tex_cache hit: 1065579,
tex_cache miss: 9620
I have some trouble interpreting the profiling results and drawing the correct conclusions, which is why I am asking for help.
Occupancy: Occupancy is quite low, but I don't understand why. Using the occupancy calculator for devices of compute capability 1.2, 256 threads per block at 14 registers per thread gives me 100% occupancy. What can I do to increase occupancy?
Branch and divergent branch: I think there is room to optimize instruction throughput here by avoiding branches or aligning them with warp boundaries, but what exactly is the difference between branch and divergent branch?
What does cta_launched stand for?
My accesses to global memory seem to be coalesced quite well, but what is the difference between gld_coalesced and gst_coalesced?
The approach using the texture reference seems to work out quite well. I have read about another method where the data is ordered in a space-filling curve (see http://forums.nvidia.com/lofiversion/index.php?t77482.html) to make texture fetches faster. Do you think this would help to further decrease execution time?
I don't think you're running on a compute 1.2 device. It looks like you have a compute 1.0 or 1.1 device, which has 8192 registers per SM and at most 768 resident threads. At 14 registers per thread, 8192 registers would in principle feed 585 threads, but registers are allocated per block: only 2 of your 256-thread blocks fit, i.e. 512 threads, and 512 of 768 gives an occupancy of 0.667.
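Spelled out with the compute 1.0/1.1 limits:

```
registers per block: 256 threads * 14 registers = 3584
blocks per SM:       floor(8192 / 3584)         = 2
resident threads:    2 * 256                    = 512
occupancy:           512 / 768                  = 0.667
```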
I'm not sure, but I think branch counts the number of times a branch instruction was encountered, whereas divergent branch counts the number of times threads within the same warp took different paths at a branch. A contrived illustration is below.
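Something like this (not from your kernel, just to show the two counters):

```
__global__ void branchDemo(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Counts as a branch but NOT as divergent: the condition is
    // uniform across each warp (blockIdx is the same for all of it).
    if (blockIdx.x == 0)
        data[i] += 1.0f;

    // Counts as a branch AND as divergent: even and odd lanes of the
    // same warp take different paths, so both paths are serialized.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
}
```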
It's gld and gst: global load and global store.
My guess is that your memory access patterns are actually sufficiently well defined that you would get higher performance from shared memory than from textures. The texture units can only sample at a certain rate, and in many cases you will hit that limit before you hit the memory bandwidth limit. On the other hand, the extra complexity of a shared memory approach can sometimes increase the instruction count or register usage and bring performance back down.
I think you can get rid of all the divergent branches if you reorganize your kernel. I think you are currently selecting among the four RGB cases via the global pixel index and a % operator. It would be better to let one thread handle a complete 2x2 "Bayer" quad: you then have a quarter of the threads per block, but each thread does four times the work, so all threads execute the same instructions instead of branching into the four divergent cases. A rough sketch is below.
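Something along these lines, assuming an RGGB layout and the texture setup from the question; the names and the trivial reconstruction are placeholders, not your actual interpolation:

```
texture<unsigned char, 2, cudaReadModeNormalizedFloat> bayerTex;

__global__ void debayerQuad(float4 *out, size_t outPitch,
                            int width, int height)
{
    // Each thread owns the 2x2 quad whose top-left corner is (x, y).
    int x = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    int y = 2 * (blockIdx.y * blockDim.y + threadIdx.y);
    if (x >= width || y >= height)
        return;

    // All four color sites of the quad; overlapping fetches between
    // neighboring threads are served cheaply by the texture cache.
    float r  = tex2D(bayerTex, x,     y);      // R site
    float g1 = tex2D(bayerTex, x + 1, y);      // G site, red row
    float g2 = tex2D(bayerTex, x,     y + 1);  // G site, blue row
    float b  = tex2D(bayerTex, x + 1, y + 1);  // B site

    // Nearest-within-quad reconstruction, just to show the
    // branch-free structure; a real kernel would average neighbors.
    float4 rgb = make_float4(r, 0.5f * (g1 + g2), b, 0.0f);

    float4 *row0 = (float4 *)((char *)out + y * outPitch);
    float4 *row1 = (float4 *)((char *)out + (y + 1) * outPitch);
    row0[x] = rgb; row0[x + 1] = rgb;
    row1[x] = rgb; row1[x + 1] = rgb;
}
```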
Besides that, think about using shared memory and textures together for gathering the pixel data.
Edit: OK, I just read in your 2nd post that you already got rid of the divergence. So forget about the first lines ;)
How do you think I should combine shared memory and textures? Can I bind a texture to shared memory? Currently I read the pixels through the texture reference and write the interpolated result directly to global memory.
No, you can't bind shared memory to a texture. If I remember correctly, you get some redundant texture fetches if you don't keep some of the fetched values in registers, which could raise your register count, and that is bad for occupancy. I don't know whether this will get you to 100% occupancy; it was just an idea. I am also not sure the texture cache isn't already doing very well in this example, since all of your texture fetches have good 2D spatial coherence. If you want to use shared memory, you'd have to fetch all the values needed within a block into it first, including the border pixels, which means some threads have to fetch twice. A rough sketch of such a halo load is below.
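Illustrative only: a 16x16 tile with a 1-pixel halo, assuming image dimensions that are multiples of the tile size and the texture reference from before (corner texels of the halo are omitted for brevity):

```
#define TILE 16

texture<unsigned char, 2, cudaReadModeNormalizedFloat> bayerTex;

__global__ void debayerShared(float *out, size_t outPitch)
{
    __shared__ float tile[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Every thread loads its own pixel, offset by (1,1) in the tile
    // to leave room for the halo.
    tile[threadIdx.y + 1][threadIdx.x + 1] = tex2D(bayerTex, x, y);

    // Edge threads fetch a second time to fill the halo.
    if (threadIdx.x == 0)        tile[threadIdx.y + 1][0]        = tex2D(bayerTex, x - 1, y);
    if (threadIdx.x == TILE - 1) tile[threadIdx.y + 1][TILE + 1] = tex2D(bayerTex, x + 1, y);
    if (threadIdx.y == 0)        tile[0][threadIdx.x + 1]        = tex2D(bayerTex, x, y - 1);
    if (threadIdx.y == TILE - 1) tile[TILE + 1][threadIdx.x + 1] = tex2D(bayerTex, x, y + 1);
    __syncthreads();

    // From here on, interpolate from tile[][] instead of repeated
    // tex2D calls; a placeholder write keeps the sketch compilable.
    float *row = (float *)((char *)out + y * outPitch);
    row[x] = tile[threadIdx.y + 1][threadIdx.x + 1];
}
```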
I had to use this in a more sophisticated, adaptive version of the debayering algorithm.
Edit: Shared memory was useful for me there because I didn't write floats back, but chars, so without it I had four separate store calls to global memory instead of the one int-aligned store that was possible back then on the G80. Since your profiling shows the stores as coalesced, I don't think you will benefit from it.
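For reference, the packing trick itself is just this kind of thing (hypothetical names, stand-in values):

```
// Combine four 8-bit results into one aligned 32-bit store so the
// G80 can coalesce the write.
__global__ void packedStore(uchar4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stand-in values; in a real kernel these would be the
    // interpolated R, G and B results converted to 8 bits.
    unsigned char r = (unsigned char)(i & 0xFF);
    unsigned char g = (unsigned char)((i >> 8) & 0xFF);
    unsigned char b = 0, a = 0;

    // One 4-byte store instead of four 1-byte stores.
    out[i] = make_uchar4(r, g, b, a);
}
```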