When I run the following code on a GTX 480 through Visual Profiler, the profiler claims that achieved occupancy is 2.21. However, it is supposed to be less than or equal to 1.0: “achieved occupancy” is defined as the “Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor”. You can’t get more average active warps than the maximum number of warps, right?
#include <cuda.h>
#include <stdio.h>

// Each thread writes x into all four components of one int4.
__global__ void memset( int4 *p, int x )
{
    p[threadIdx.x + blockDim.x * blockIdx.x] = make_int4( x, x, x, x );
}

int main( int argc, char **argv )
{
    int grid = 10000, blocksize = 768;
    int4 *p;
    cudaMalloc( (void**)&p, sizeof(int4) * blocksize * grid );
    memset<<<grid, blocksize>>>( p, 0 );
    puts( cudaGetLastError() == cudaSuccess ? "success" : "error" );
    cudaFree( p );
    cudaDeviceReset( ); // profiler fails otherwise
}
I use the most recent version of CUDA, downloaded today.
Can you please provide the operating system and driver version?
The kernel should have a theoretical SM occupancy of 100% (48/48 warps). Unless there is a bug, the metric should never exceed 100%. I’ll try to reproduce the issue tomorrow.
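For reference, here is roughly how the 48/48 figure falls out of the launch configuration. This is only a back-of-the-envelope sketch using the compute capability 2.0 per-SM limits (1536 threads, 48 warps) and ignoring register and shared memory constraints:

#include <stdio.h>
int main( void )
{
    // GTX 480 (compute capability 2.0) per-SM limits; register and
    // shared memory constraints are ignored for this sketch.
    const int max_threads_per_sm = 1536;
    const int max_warps_per_sm   = 48;
    const int warp_size          = 32;
    const int blocksize          = 768;
    const int warps_per_block    = blocksize / warp_size;           // 24
    const int blocks_per_sm      = max_threads_per_sm / blocksize;  // 2
    const int resident_warps     = warps_per_block * blocks_per_sm; // 48
    printf( "theoretical occupancy = %d/%d = %.0f%%\n",
            resident_warps, max_warps_per_sm,
            100.0 * resident_warps / max_warps_per_sm );
    return 0;
}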
Achieved occupancy percentage is defined as active_warps / active_cycles / MAX_WARPS_ON_SM * 100.
On the Fermi architecture the PM signal active_cycles does not increment on every cycle in which active_warps > 0. Specifically, active_cycles will not increment while warps are still allocated on the SM but all of their threads have exited and the warps are only waiting for outstanding instructions to complete.
The kernel in the question does very little work per warp: it executes a global store and then exits. The warp is considered active until the global store is accepted by the L1. Because the warps are so short-lived, the SM may not have enough warps with an instruction left to execute to keep active_cycles asserted for the full period, so active_cycles undercounts relative to active_warps.
If the achieved occupancy metric is greater than the theoretical maximum for the launch, then the value should be clamped to the theoretical maximum. A future version of the profiler will clamp the value.
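To make the effect concrete, here is a small sketch of that formula with made-up counter values (the numbers are purely illustrative, not measured on any GPU):

#include <stdio.h>
int main( void )
{
    const double MAX_WARPS_ON_SM = 48.0;
    // Illustrative values only: if active_cycles fails to increment on
    // some cycles where warps are still allocated, the ratio can end up
    // above the theoretical maximum.
    double active_warps  = 2400000.0;  // sum of active warps over the profiled period
    double active_cycles = 45000.0;    // undercounted, per the Fermi behavior above
    double achieved = active_warps / active_cycles / MAX_WARPS_ON_SM * 100.0;
    printf( "raw achieved occupancy     = %.1f%%\n", achieved );  // 111.1%
    // The clamping described above:
    double theoretical = 100.0;
    printf( "clamped achieved occupancy = %.1f%%\n",
            achieved < theoretical ? achieved : theoretical );
    return 0;
}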
Similarly, I get a “Multiprocessor Efficiency” of 108.7% on a GTX Titan under Linux with cuda_5.0.35_linux_64_suse12.1-1.run and the driver from NVIDIA-Linux-x86_64-313.30.run.
According to the tooltip “Multiprocessor Efficiency” is “the ratio of the time at least one warp is active on a multiprocessor to the total time”. This would be a highly interesting metric for optimization. If it gave meaningful results, that is.
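For what it’s worth, here is my reading of that definition as a small sketch, with made-up cycle counts (none of these numbers come from the profiler):

#include <stdio.h>
int main( void )
{
    // Made-up values for illustration only.
    double cycles_with_active_warp = 95000.0;   // cycles with at least one warp active on the SM
    double total_cycles            = 100000.0;  // total cycles in the measured interval
    // By the tooltip's definition this ratio can never exceed 100%,
    // which is why the reported 108.7% looks like the same kind of
    // counter mismatch as with achieved occupancy.
    printf( "multiprocessor efficiency = %.1f%%\n",
            100.0 * cycles_with_active_warp / total_cycles );
    return 0;
}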