Profiling my code: I need some help to understand the output of the Visual Profiler

pasoleatis · December 15, 2011, 1:25pm

I am trying to profile my code. My program does an MC simulation of particles with long-range interactions. Every time a particle is moved, N interactions are computed. There are 2 kernels which do the job: one which calculates the N interactions and one which computes the sum and decides acceptance or rejection. Because my problem is not large enough to fill the GPU, I use 4 streams with almost no performance loss.

I need some help to understand the information given by the visual profiler.

First I have to mention that one of my kernels uses 38 registers, which for 256 threads/block gives an occupancy of 50%. I tried to play around with the number of registers to increase the occupancy (mainly by trying mixed precision instead of all double precision), but there was no performance improvement.

The 2 kernels take 12% of the total GPU time, and the global memory output is reported as 1.29. Are these numbers good or bad?

I attached a print of the main window from the profiler.

If I double-click on the kernel which does the most work, I get this message:

Memory Throughput Analysis for kernel newMCenergyarray on device Tesla C2070

Kernel requested global memory read throughput(GB/s): 6.16

Kernel requested global memory write throughput(GB/s): 1.03

Kernel requested global memory throughput(GB/s): 7.18

L1 cache read throughput(GB/s): 35.28

L1 cache global hit ratio (%): 74.56

Texture cache memory throughput(GB/s): 0.00

Texture cache hit rate(%): 0.00

L2 cache texture memory read throughput(GB/s): 0.00

L2 cache global memory read throughput(GB/s): 3.13

L2 cache global memory write throughput(GB/s): 1.03

L2 cache global memory throughput(GB/s): 4.17

Local memory bus traffic(%): 0.00

Global memory excess load(%): -96.62

Global memory excess store(%): 0.70

Achieved global memory read throughput(GB/s): 0.00

Achieved global memory write throughput(GB/s): 1.30

Achieved global memory throughput(GB/s): 1.30

Peak global memory throughput(GB/s): 143.42

Hint(s)

Consider using shared memory as a user managed cache for frequently accessed global memory resources.

Refer to the “Shared Memory” section in the “CUDA C Runtime” chapter of the CUDA C Programming Guide for more details.

The achieved global memory throughput is low compared to the peak global memory throughput. To achieve closer to peak global memory throughput, try to

Launch enough threads to hide memory latency (check occupancy analysis);

Process more data per thread to hide memory latency;

Consider using texture memory for read only global memory, texture memory has its own cache so it does not pollute L1 cache, this cache is also optimized for 2D spatial locality.

Refer to the “Texture Memory” section in the “CUDA C Runtime” chapter of the CUDA C Programming Guide for more details.

Factors that may affect analysis

If display is attached to the GPU that is being profiled, the DRAM reads, DRAM writes, l2 read hit ratio and l2 write hit ratio may include data for display in addition to the data for kernel that is being profiled.

The thresholds that are used to provide the hints may not be accurate in all cases. It is recommended to analyze all derived statistics and signals and correlate them with your algorithm before arriving to any conclusion.

The value of a particular derived statistic provided in the analysis window is the average value of the derived statistic for all calls of that kernel. To know the value of the derived statistic corresponding to a particular call please refer to the kernel profiler table.

The counters of type SM are collected only for 1 multiprocessor in the chip and the values are extrapolated to get the behavior of entire GPU assuming equal work distribution. This may result in some inaccuracy in the analysis in some cases.

The counters for some derived stats are collected in different runs of application. This may cause some inaccuracy in the derived statistics as the blocks scheduled on each multiprocessor may be different for each run and for some applications the behavior changes for each run.

MattWarmuth · December 15, 2011, 2:54pm

As the report (and the User’s Guide for the Profiler, by the way) clearly states, global memory throughput (not output) for both read and write is in GB/sec.

As to what’s ‘good’, I shoot for 50-80% of a card’s theoretical maximum. Getting more than that means many more hours of work for ever smaller performance gains (unless you’re a grad student…). For your card (C2070), the theoretical max is 144 GB/sec. (Thank you Wiki). So your ‘Achieved’ number of 1.3 is a little slow compared to 144 (especially the write). Did you notice the report telling you your Achieved is low compared to the peak of 143.42 GB/sec, and the recommendations it makes?
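To put the numbers from the report side by side, here is a minimal sketch of the arithmetic. The function name `fraction_of_peak` is made up for illustration; the values are the ones printed by the profiler above, and the 50-80% band is only the rule of thumb from this reply.

```python
def fraction_of_peak(achieved_gbs, peak_gbs):
    """Return achieved throughput as a fraction of peak throughput."""
    if peak_gbs <= 0:
        raise ValueError("peak throughput must be positive")
    return achieved_gbs / peak_gbs

# Values copied from the profiler report in this thread (GB/s).
achieved = 1.30
peak = 143.42

share = fraction_of_peak(achieved, peak)
print(f"achieved is {share:.2%} of peak")

# Rule-of-thumb target band of 50-80% of peak, expressed in GB/s.
target_low, target_high = 0.5 * peak, 0.8 * peak
print(f"target band: {target_low:.1f} to {target_high:.1f} GB/s")
```

A low fraction here does not by itself say the kernel is badly written; it only says memory bandwidth is not what the kernel is saturating.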

You should also download the Excel Occupancy Calculator, which will indicate what is limiting your occupancy after you enter things like the compute capability of your card, the number of threads, shared memory, registers used, etc.

pasoleatis · December 15, 2011, 3:05pm

Hello,

Thank you for your reply. My occupancy is 50 %, because the kernel is using 38 registers. I tried to reduce the number below 32 by using a mixed float-double precision, but the performance was lower. I am a little confused about this.

For the memory it gives these hints:

Consider using shared memory as a user managed cache for frequently accessed global memory resources.

Refer to the "Shared Memory" section in the "CUDA C Runtime" chapter of the CUDA C Programming Guide for more details. 

The achieved global memory throughput is low compared to the peak global memory throughput. To achieve closer to peak global memory throughput, try to 

Launch enough threads to hide memory latency (check occupancy analysis);

Process more data per thread to hide memory latency;

Consider using texture memory for read only global memory, texture memory has its own cache so it does not pollute L1 cache, this cache is also optimized for 2D spatial locality.

Refer to the "Texture Memory" section in the "CUDA C Runtime" chapter of the CUDA C Programming Guide for more details.

For the limiting factor I got this:

Summary profiling information for the kernel: 

Number of calls:  409600

Minimum GPU time(us):  0.90

Maximum GPU time(us):  10.24

Average GPU time(us):  8.01

GPU time (%):  6.48

Grid size:  [4  1  1]

Block size:  [256  1  1]

Limiting Factor

Achieved Instruction Per Byte Ratio:  39.88 ( Balanced Instruction Per Byte Ratio:  3.58 )

Achieved Occupancy:  0.16 ( Theoretical Occupancy:  0.50 )

IPC:  0.81 ( Maximum IPC:  2 )

Achieved global memory throughput:  1.30 ( Peak global memory throughput(GB/s):  143.42 )

Hint(s) 

The achieved instructions per byte ratio for the kernel is greater than the balanced instruction per byte ratio for the device. Hence, the kernel is likely compute bound. For details, click on Instruction Throughput Analysis.

The kernel occupancy is low. For details, click on Occupancy Analysis.

For the occupancy:

Occupancy Analysis for kernel newMCenergyarray on device Tesla C2070

Kernel details: Grid size: [4 1 1], Block size: [256 1 1]

Register Ratio: 0.9375 ( 30720 / 32768 ) [38 registers per thread]

Shared Memory Ratio: 0 ( 0 / 49152 ) [0 bytes per Block]

Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)

Active threads per SM: 768 (Maximum Active threads per SM: 1536)

Potential Occupancy: 0.5 ( 24 / 48 )

Occupancy limiting factor: Registers
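The occupancy figures above can be reproduced with a rough sketch like the following. It is a simplification, not the profiler's actual method: the function name is hypothetical, the per-SM limits are the ones printed in the report (32768 registers, 1536 threads, 8 blocks), and register allocation granularity is ignored. The report's 30720 registers for 3 blocks would correspond to 40 registers per thread, so the real allocator presumably rounds up; that rounding is not modelled here.

```python
def occupancy_from_registers(regs_per_thread, threads_per_block,
                             regs_per_sm=32768, max_threads_per_sm=1536,
                             max_blocks_per_sm=8, warp_size=32):
    """Estimate resident blocks per SM and the potential occupancy.

    Assumption: registers are counted exactly, with no rounding to an
    allocation unit, so results can differ slightly from the profiler.
    """
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_threads = max_threads_per_sm // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_threads, max_blocks_per_sm)
    active_warps = blocks * threads_per_block // warp_size
    max_warps = max_threads_per_sm // warp_size
    return blocks, active_warps / max_warps

# 38 registers per thread, 256 threads per block (values from the report).
blocks, occ = occupancy_from_registers(38, 256)
print(blocks, occ)  # 3 0.5

# Same block size at 32 registers per thread, for comparison.
blocks32, occ32 = occupancy_from_registers(32, 256)
print(blocks32, round(occ32, 2))  # 4 0.67
```

Under these assumptions, getting from 3 to 4 resident blocks requires dropping to 32 registers per thread or fewer, which is why the 38-register kernel sits at 0.5.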

What does “compute bound” mean?

I will try textures, though I am not sure how much they will help because I am using double precision.

Sanjiv.Satoor · February 2, 2012, 6:26am

You can try the compiler option -maxrregcount to reduce the register count. This may cause register spilling, but it might still help because the occupancy will increase. Since you are not using shared memory, you can change the cache configuration to prefer L1 so that register spilling does not cause L1 thrashing and is not expensive.

Your application is compute bound. The occupancy is limited by registers, but you say that when you tried to lower the register count using mixed precision, the performance dropped. This is unexpected, but reducing instructions is a step in the right direction. Although the code is compute bound, the IPC still seems low (0.8). Check the instruction mix to see whether you are using a lot of fp64 or transcendental instructions, which have lower throughput.
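As a rough illustration of what "compute bound" means in the profiler's hint, the sketch below applies the same comparison the hint describes: achieved instructions per byte against the device's balanced ratio. The function name `likely_bound` is invented for this example, and the values are the ones from the "Limiting Factor" section quoted earlier in the thread.

```python
def likely_bound(achieved_ipb, balanced_ipb):
    """Mirror the profiler hint: more instructions per byte than the
    device's balanced ratio suggests the kernel is likely compute bound,
    otherwise likely memory bound."""
    return "compute" if achieved_ipb > balanced_ipb else "memory"

# Values from the Limiting Factor section of the profiler output.
achieved_ipb, balanced_ipb = 39.88, 3.58
ipc, max_ipc = 0.81, 2.0

print(likely_bound(achieved_ipb, balanced_ipb))  # compute
print(f"instructions per byte: {achieved_ipb / balanced_ipb:.1f}x balanced")
print(f"IPC: {ipc / max_ipc:.1%} of maximum")
```

The kernel issues roughly eleven times more instructions per byte than the balanced point, yet reaches well under half of the maximum IPC, which is why the instruction mix is worth checking.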

tera · February 2, 2012, 1:26pm

I don’t know if pasoleatis is still working on this. If he is, the most striking inefficiency is the grid size of only four blocks, which is far too small to load a Tesla C2070 (or almost any other GPU).
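A back-of-the-envelope sketch of why four blocks are too few follows. It assumes 14 multiprocessors on the Tesla C2070 (from the device specifications, not from this thread) and assumes blocks are spread evenly across multiprocessors; the helper name is hypothetical, and the other values come from the profiler output above.

```python
def warps_on_busiest_sm(grid_blocks, threads_per_block, num_sms,
                        blocks_per_sm_limit, warp_size=32):
    """Warps resident on the most loaded SM, assuming blocks are spread
    evenly over the SMs (an assumption about the scheduler)."""
    blocks_on_busiest = min(blocks_per_sm_limit, -(-grid_blocks // num_sms))
    return blocks_on_busiest * threads_per_block // warp_size

NUM_SMS = 14           # assumption: Tesla C2070 multiprocessor count
MAX_WARPS_PER_SM = 48  # from the occupancy report (24 / 48)
BLOCKS_PER_SM = 3      # register-limited value from the occupancy report

warps = warps_on_busiest_sm(4, 256, NUM_SMS, BLOCKS_PER_SM)
print(warps, round(warps / MAX_WARPS_PER_SM, 2))  # 8 0.17

# Blocks needed per launch to reach the potential occupancy on every SM.
blocks_to_fill = NUM_SMS * BLOCKS_PER_SM
print(blocks_to_fill)  # 42
```

With one block per busy multiprocessor, the estimate of about 0.17 is close to the achieved occupancy of 0.16 that the profiler reports, and under these assumptions 10 of the 14 multiprocessors receive no block at all from a single launch.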

pasoleatis · February 3, 2012, 9:09am

Hello,

Thank you for the reply. It is true that the grid size is small, but my system is small as well. I can compensate by using streams and running 2 (or more) independent systems at the same time, which gives me more measurements in the same amount of time. My program is inefficient because the number of particles is low, but when I use streams I still get 50 times more measurements compared to a single-core run.
