For nearly six months now, I have been developing a special-purpose ray caster based on CUDA. It works quite well, except for its GFLOPS utilization of the GPU. In the NSIGHT 2.2 VS2010 profiler, all the usually critical values - such as branch efficiency, warp issue efficiency, or replay overheads - look fine (see below).
According to the profiler's measurements, the ray caster achieves only 17% of the GPU's peak performance (216 vs. 1267 GFLOPS).
My thought was that the relatively low executed IPC [1.92] (the maximum for compute capability 2.1 is 4) is responsible for this result. As far as I know, this number depends on:
the occupancy [0.77] (and with it the number of eligible warps [3.79])
the instruction serialization [0.04] as well as
the execution dependency cycles [16.42] and
the instruction mix [0.65 FMA].
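As a rough sanity check (assuming the reference ~1645 MHz shader clock of the 560 Ti and counting one FMA as two FLOPs):

    FP peak   : 384 cores * 1.645 GHz * 2 FLOP (FMA)  ≈ 1263 GFLOPS (close to the reported 1267)
    achieved  : 216.63 / 1267                         ≈ 0.17
    IPC ratio : 1.92 / 4.00                           = 0.48

If I understand the sm_2.1 layout correctly, the 48 cores per SM can start at most 1.5 warp-wide FP instructions per hot clock, so even at the measured IPC a large share of the issue slots must go to non-FP instructions (loads, address arithmetic, control flow) - which would explain why the achieved GFLOPS fraction is lower than the IPC fraction.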
Given these values, I cannot see any factor that could limit the ray caster this much other than the high average number of execution dependency cycles. Since execution dependency [0.49] is also the main stall reason, my questions are:
How can I decrease these execution dependency cycles? (A toy sketch of what I mean follows the questions.)
Is there any other profiler value that could explain the weak GFLOPS performance?
What fraction of peak performance is realistically achievable, anyway?
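To make question 1 concrete, here is a toy sketch of what I mean by an execution dependency chain versus independent work per thread (made-up code, not my actual kernel):

    // Dependent chain: every FMA consumes the previous result, so the
    // warp must wait out the full ALU latency between instructions.
    __device__ float chained(float x)
    {
        float a = x * 1.1f + 2.0f;
        a = a * 1.2f + 3.0f;   // depends on a
        a = a * 1.3f + 4.0f;   // depends on a again
        return a;
    }

    // Independent chains: b and c can issue while a's result is still
    // in flight, hiding part of the latency within a single thread.
    __device__ float interleaved(float x, float y, float z)
    {
        float a = x * 1.1f + 2.0f;
        float b = y * 1.2f + 3.0f;   // independent of a
        float c = z * 1.3f + 4.0f;   // independent of a and b
        return (a + b) + c;
    }

If my inner loop behaves more like chained() than interleaved(), that would fit the measured 16.42 average dependency cycles.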
For completeness:
I use only single-precision float operations.
The global load cache hit rates are not very good because the ray caster copies the triangles from the acceleration structure into a special shared-memory cache only once, and afterwards leaves them and global memory untouched (except for local memory accesses, of course); a simplified sketch of this copy follows below.
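The one-time copy works roughly like this (Triangle, loadTriangles, and all other names here are placeholders, not my real code):

    struct Triangle { float3 v0, v1, v2; };

    extern __shared__ Triangle triCache[];   // dynamic shared memory

    __device__ void loadTriangles(const Triangle* gTris, int triCount)
    {
        int tid = threadIdx.x
                + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
        int nThreads = blockDim.x * blockDim.y * blockDim.z;

        // Cooperative block-stride copy from global into shared memory.
        for (int i = tid; i < triCount; i += nThreads)
            triCache[i] = gTris[i];

        __syncthreads();   // all triangles cached before any test uses them
    }

After this point, the intersection tests read triangles only from triCache, so the global L1 hit rate hardly matters for the traversal itself.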
Thank you for your time,
Tratos
Profiler Measurements
General
GPU: Geforce 560 Ti
compute capability: 2.1
code generation: compute_20,sm_21
number of CUDA cores: 384
peak performance: 1267 GFLOPS
Setting
registers per thread: 20
dynamic shared memory: 6,748 bytes
local memory: 25,952,256 bytes
grid dimension: {68,68,1}
block dimension: {32,2,4}
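(For reference, this corresponds to a launch of the form below; the kernel name and arguments are placeholders, while the dimensions and the 6,748-byte dynamic shared memory size are taken from the values above.)

    dim3 block(32, 2, 4);               // 256 threads per block
    dim3 grid(68, 68, 1);               // 4,624 blocks
    rayCastKernel<<<grid, block, 6748>>>(/* ... */);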
Occupancy
achieved occupancy: 0.77
theoretical occupancy: 1.00
Instruction Statistic
GPU issued IPC: 2.00
GPU executed IPC: 1.92
GPU SM activity: 1.00
GPU serialization: 0.04
Branch Statistic
branch efficiency: 0.92
divergence: 0.08
control flow efficiency: 0.92
Issue Efficiency
active warps per active cycle: 36.73
eligible warps per active cycle: 3.79
execution dependency cycles (short): 16.42
execution dependency cycles (long): 5.31
max dependency utilization: 0.92
warp issue efficiency (no eligible): 0.07
warp issue efficiency (one eligible): 0.15
instruction fetch stall reason: 0.35
execution dependency stall reason: 0.49
data request stall reason: 0.08
synchronization stall reason: 0.02
other stall reason: 0.06
Memory Statistic
global replay overhead: 0.01
local replay overhead: 0.06
shared replay overhead: 0.01
bank conflicts/shared requests: 0.03
global transactions/requests (load): 2.44
global transactions/requests (store): 2.00
local transactions/requests (load): 1.04
local transactions/requests (store): 1.00
shared transactions/requests (load): 1.03
shared transactions/requests (store): 2.83
global L1 cache hit rate (load): 0.42
local L1 cache hit rate (load): 0.25
local L1 cache hit rate (store): 0.36
L2 cache hit rate (load): 0.67
FLOPS and Operations
FMA operations percentage: 0.65
MUL operations percentage: 0.24
ADD operations percentage: 0.10
special operations percentage: 0.01
single-precision FLOP count: 6,873,872,383
runtime: 32 ms
single GFLOPS: 216.63
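(Sanity check: 6,873,872,383 FLOP / 216.63 GFLOPS ≈ 31.7 ms, so the 32 ms runtime is consistent with the GFLOPS figure, just rounded.)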