I have written a fairly straightforward kernel to do Bayer pattern interpolation (see http://en.wikipedia.org/wiki/Bayer_pattern). Accesses to global memory have high locality, which is why I chose the following design pattern (a rough sketch follows the list):
Allocate data with cudaMallocPitch
Upload data with cudaMemcpy2D
Bind pitched memory to a texture
Run kernel on the texture using tex2D to fetch data
Write results directly to another piece of allocated pitched global memory
Download results with cudaMemcpy2D
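Roughly, the host side looks like this. This is only a sketch of the steps above, not my actual code; `bayerTex`, `debayer` and `run` are hypothetical names, and a 640x480 8-bit frame is assumed (matching the 40x30 grid of 16x16 blocks in the profile):

```
// Rough sketch of the pipeline above; names and sizes are placeholders.
#include <cuda_runtime.h>

texture<unsigned char, 2, cudaReadModeNormalizedFloat> bayerTex;

__global__ void debayer(float *out, size_t outPitch, int width, int height);

void run(const unsigned char *hostIn, float *hostOut, int width, int height)
{
    // Allocate pitched memory and upload the raw Bayer frame.
    unsigned char *dIn; size_t inPitch;
    cudaMallocPitch((void **)&dIn, &inPitch, width, height);
    cudaMemcpy2D(dIn, inPitch, hostIn, width, width, height,
                 cudaMemcpyHostToDevice);

    // Bind the pitched buffer to the texture reference.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaBindTexture2D(NULL, bayerTex, dIn, desc, width, height, inPitch);

    // Run the kernel; it reads via tex2D and writes pitched output.
    float *dOut; size_t outPitch;
    cudaMallocPitch((void **)&dOut, &outPitch, width * sizeof(float), height);
    dim3 block(16, 16), grid(width / 16, height / 16);
    debayer<<<grid, block>>>(dOut, outPitch, width, height);

    // Download the result.
    cudaMemcpy2D(hostOut, width * sizeof(float), dOut, outPitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);

    cudaUnbindTexture(bayerTex);
    cudaFree(dIn);
    cudaFree(dOut);
}
```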
When profiling the kernel, I get the following results:
GPU Time: 4166.11,
CPU Time: 4299.73,
Occupancy: 0.667,
Grid Size X: 40,
Grid Size Y: 30,
Block Size X: 16,
Block Size Y: 16,
Block Size Z: 1,
dyn Shared Memory per block: 0,
static Shared Memory per block: 32,
registers per thread: 14,
StreamID: 0,
mem Transfer Size:,
mem Transfer host Mem Type: 0,
branch: 81600,
divergent branch: 24000,
instructions: 1392032,
warp serialize: 0,
cta_launched: 1200,
gld_coalesced: 0,
gst_coalesced: 460800,
tex_cache hit: 1065579,
tex_cache miss: 9620
I have some trouble interpreting the profiling results and drawing the correct conclusions, which is why I am asking for help.
Occupancy: Occupancy is quite low, but I don't understand why. Using the occupancy calculator for devices of compute capability 1.2, 256 threads per block at 14 registers per thread gives me 100% occupancy. What can I do to increase occupancy?
Branch and divergent branch: I think there is room to optimize instruction throughput here by avoiding branches or aligning them with warp boundaries, but what exactly is the difference between branch and divergent branch?
What does cta_launched stand for?
My accesses to global memory seem to be coalesced quite well, but what is the difference between gld_coalesced and gst_coalesced?
The approach using the texture reference seems to work out quite well. I have read about another method where the data is ordered in a space-filling curve (see http://forums.nvidia.com/lofiversion/index.php?t77482.html) to make texture fetches faster. Do you think this would help to further decrease execution time?
I don't think you're running on a compute 1.2 device. It looks like you have a compute 1.0 or 1.1 device, which has 8192 registers per SM and at most 768 resident threads. At 14 registers per thread, 8192 registers would in principle feed 585 threads, but registers are allocated per block: only 2 of your 256-thread blocks fit, i.e. 512 threads, and 512 of 768 gives an occupancy of 0.667.
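Spelled out with the compute 1.0/1.1 limits:

```
registers per block: 256 threads * 14 registers = 3584
blocks per SM:       floor(8192 / 3584)         = 2
resident threads:    2 * 256                    = 512
occupancy:           512 / 768                  = 0.667
```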
I'm not sure, but I think branch counts the number of times a branch instruction was encountered, whereas divergent branch counts the number of times threads within the same warp took different paths at a branch. A contrived illustration is below.
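Something like this (not from your kernel, just to show the two counters):

```
__global__ void branchDemo(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Counts as a branch but NOT as divergent: the condition is
    // uniform across each warp (blockIdx is the same for all of it).
    if (blockIdx.x == 0)
        data[i] += 1.0f;

    // Counts as a branch AND as divergent: even and odd lanes of the
    // same warp take different paths, so both paths are serialized.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
}
```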
It's gld and gst: global load and global store.
My guess is that your memory access patterns are actually sufficiently well defined that you would get higher performance from shared memory than from textures. The texture units can only sample at a certain rate, and in many cases you will hit that limit before you hit the memory bandwidth limit. On the other hand, the extra complexity of a shared memory approach can sometimes increase the instruction count or register usage and bring performance back down.
I think you can get rid of all the divergent branches if you reorganize your kernel. I think you are currently selecting among the four RGB cases via the global pixel index and a % operator. It would be better to let one thread handle a complete 2x2 "Bayer" quad: you then have a quarter of the threads per block, but each thread does four times the work, so all threads execute the same instructions instead of branching into the four divergent cases. A rough sketch is below.
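Something along these lines, assuming an RGGB layout and the texture setup from the question; the names and the trivial reconstruction are placeholders, not your actual interpolation:

```
texture<unsigned char, 2, cudaReadModeNormalizedFloat> bayerTex;

__global__ void debayerQuad(float4 *out, size_t outPitch,
                            int width, int height)
{
    // Each thread owns the 2x2 quad whose top-left corner is (x, y).
    int x = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    int y = 2 * (blockIdx.y * blockDim.y + threadIdx.y);
    if (x >= width || y >= height)
        return;

    // All four color sites of the quad; overlapping fetches between
    // neighboring threads are served cheaply by the texture cache.
    float r  = tex2D(bayerTex, x,     y);      // R site
    float g1 = tex2D(bayerTex, x + 1, y);      // G site, red row
    float g2 = tex2D(bayerTex, x,     y + 1);  // G site, blue row
    float b  = tex2D(bayerTex, x + 1, y + 1);  // B site

    // Nearest-within-quad reconstruction, just to show the
    // branch-free structure; a real kernel would average neighbors.
    float4 rgb = make_float4(r, 0.5f * (g1 + g2), b, 0.0f);

    float4 *row0 = (float4 *)((char *)out + y * outPitch);
    float4 *row1 = (float4 *)((char *)out + (y + 1) * outPitch);
    row0[x] = rgb; row0[x + 1] = rgb;
    row1[x] = rgb; row1[x + 1] = rgb;
}
```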
Besides that, think about using shared memory and textures together for gathering the pixel data.
Edit: OK, I just read in your 2nd post that you already got rid of the divergence. So forget about the first lines ;)
How do you think I should combine shared memory and textures? Can I bind a texture to shared memory? Currently I read the pixels through the texture reference and write the interpolated result directly to global memory.
No, you can't bind shared memory to a texture. If I remember correctly, you get some redundant texture fetches if you don't keep some of the fetched values in registers, which could raise your register count, and that is bad for occupancy. I don't know whether this will get you to 100% occupancy; it was just an idea. I am also not sure the texture cache isn't already doing very well in this example, since all of your texture fetches have good 2D spatial coherence. If you want to use shared memory, you'd have to fetch all the values needed within a block into it first, including the border pixels, which means some threads have to fetch twice. A rough sketch of such a halo load is below.
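Illustrative only: a 16x16 tile with a 1-pixel halo, assuming image dimensions that are multiples of the tile size and the texture reference from before (corner texels of the halo are omitted for brevity):

```
#define TILE 16

texture<unsigned char, 2, cudaReadModeNormalizedFloat> bayerTex;

__global__ void debayerShared(float *out, size_t outPitch)
{
    __shared__ float tile[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Every thread loads its own pixel, offset by (1,1) in the tile
    // to leave room for the halo.
    tile[threadIdx.y + 1][threadIdx.x + 1] = tex2D(bayerTex, x, y);

    // Edge threads fetch a second time to fill the halo.
    if (threadIdx.x == 0)        tile[threadIdx.y + 1][0]        = tex2D(bayerTex, x - 1, y);
    if (threadIdx.x == TILE - 1) tile[threadIdx.y + 1][TILE + 1] = tex2D(bayerTex, x + 1, y);
    if (threadIdx.y == 0)        tile[0][threadIdx.x + 1]        = tex2D(bayerTex, x, y - 1);
    if (threadIdx.y == TILE - 1) tile[TILE + 1][threadIdx.x + 1] = tex2D(bayerTex, x, y + 1);
    __syncthreads();

    // From here on, interpolate from tile[][] instead of repeated
    // tex2D calls; a placeholder write keeps the sketch compilable.
    float *row = (float *)((char *)out + y * outPitch);
    row[x] = tile[threadIdx.y + 1][threadIdx.x + 1];
}
```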
I had to use this in a more sophisticated, adaptive version of the debayering algorithm.
Edit: Shared memory was useful for me there because I didn't write floats back, but chars, so without it I had four separate store calls to global memory instead of the one int-aligned store that was possible back then on the G80. Since your profiling shows the stores as coalesced, I don't think you will benefit from it.
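For reference, the packing trick itself is just this kind of thing (hypothetical names, stand-in values):

```
// Combine four 8-bit results into one aligned 32-bit store so the
// G80 can coalesce the write.
__global__ void packedStore(uchar4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stand-in values; in a real kernel these would be the
    // interpolated R, G and B results converted to 8 bits.
    unsigned char r = (unsigned char)(i & 0xFF);
    unsigned char g = (unsigned char)((i >> 8) & 0xFF);
    unsigned char b = 0, a = 0;

    // One 4-byte store instead of four 1-byte stores.
    out[i] = make_uchar4(r, g, b, a);
}
```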