Achieved Occupancy vs Theoretical

Hello All. Is it normal to get achieved occupancy more than theoretical?

Achieved Occupancy: 0.77 ( Theoretical Occupancy: 0.73 )

Screenshot from Profiler is added.

Allo!

The occupancy of EACH SM is calculated as the number of active warps divided by the maximum of active warps for the compute capability X.
In your case, the 470 GTX is 2.0, has 48 active warps as maximum. This is, you can have occupancy values between [1/48, 2/48, …, 48/48]

The theoretical occupancy value (0,77) indicates that you have 37 active warps by SM (obviously less than the 48 as maximum). The other occupancy (achieved), i think that can be the occupancy for one SM, the one what the profiler is analyzing. In other words, the theoretical can be the mean of the occupancy of all SMs and the achieved is for one concrete SM.

As similar case occurs with the cta_launched counter: This counter returns, for the same kernel, different values depending the TPC (3 SMs) analyzed by the profiler.

Regards!!

if theoretical occupancy is 0.77, it means i can run 37 active warps on ANY sm (not more because of resource usage). i understand when achieved is less than theoretical: it is because less warps are resident on sm (not enough input data). So still do not understand.

That is indeed odd. I have seen theoretical>achieved many times but I have never seen achieved>theoretical It may be a bug in the visual profiler. Would you be able to provide me with a simple reproducer?

Justin

I can not post project (system is not allowing). Any suggestions how to overcome it?

Here is kernel source:

[codebox]

extern “C” global void Sum1D(float *d_Input, float *d_Output, int input_size)

{

float sum = 0;



int xIndex = blockIdx.x * 160 * 256 + threadIdx.x;

#pragma unroll

for (int k = 0; k < 256; k++)

	sum += d_Input[xIndex + k * blockDim.x];

d_Output[blockIdx.x * blockDim.x + threadIdx.x] = sum;

}

[/codebox]

kernel run parameters:

input array is 8192 * 8192 size.

block size is 160

grid size is 1639

this kernel uses 22 registers per thread and then occupancy is 83%, not 0.77 or 0.73

Profiler tells that kernel uses 21 registers and occupancy is 0.73. But by calculations we get 0.83 (8 * 160 / 1536). Bug?