Occupancy Always Appears to be a Multiple of 3

I’m working on validating some of Vasily Volkov’s dissertation work so that I can extend it. He describes a kernel that is essentially a pointer chase with variable arithmetic intensity, and he creates rooflines with occupancy on the x-axis, which he controls by allocating shared memory.

I have coded this up in PyCUDA, but I’m always getting occupancies that are multiples of 3. For instance, I run the kernel with BLOCK_SIZE = (32,1,1) and allocate SHARED_MEM/32 bytes of shared memory per block (for a goal occupancy of 32 threads/warp), but when I plot the thread blocks that run on SM0, the occupancy appears to be 48. Whatever amount of shared memory I allocate, I always end up with an occupancy that is a multiple of 3. Any tips for how to understand this?

__device__ uint get_smid(void) {
    // Read the id of the SM this thread runs on from the %smid special
    // register (note the %% escape required in inline PTX).
    uint ret;
    asm("mov.u32 %0, %%smid;" : "=r"(ret));
    return ret;
}

__global__ void copy(float *a, float b, long long int *data, ulong spacing) {

    // Statically sized shared memory allocation, templated from the host
    // side; its only purpose is to throttle occupancy.
    __shared__ float dummy[{{size}} / sizeof(float)];

    long long int start, end;
    if (threadIdx.x == 0) {
        start = clock64();
        dummy[0] = 0;  // touch the array so it isn't optimized away
    }

    float idx = blockIdx.x * spacing + threadIdx.x;

    // Pointer chase with variable arithmetic intensity: each dependent
    // load is followed by four dependent adds.
    for (int i = 0; i < spacing / 512; i++) {
        for (int j = 0; j < 512; j++) {
            idx = a[(int)idx];
            idx += b;
            idx += b;
            idx += b;
            idx += b;
        }
    }

    __syncthreads();
    if (threadIdx.x == 0) {
        uint smid = get_smid();
        end = clock64();
        // One (smid, start, end) triple per block.
        data[blockIdx.x*3  ] = smid;
        data[blockIdx.x*3+1] = start;
        data[blockIdx.x*3+2] = end;
    }
}
  """

very confusing

Are you aiming for an occupancy of one warp per SM? (i.e., 32 threads per SM)

what is SHARED_MEM?

what GPU are you running on?

Also, I personally wouldn’t recommend PyCUDA for this sort of work. I’m sure it can be done, but PyCUDA mostly insulates you from nvcc, and you’d really rather have direct access to the nvcc compile command line for the most control over observational parameters (e.g. -Xptxas -v, which prints register and shared memory usage) as well as control parameters (e.g. -maxrregcount).

Three what? Percent?

Insofar as occupancy is limited by register usage, it would make sense for it to increase in discrete steps, since the hardware allocates registers in blocks, not one by one. The details of the allocation scheme differ by GPU architecture; you may be able to (partially) reverse-engineer them from NVIDIA’s occupancy calculator.
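To make that concrete, here is a rough model of register-limited residency. The constants are the ones NVIDIA’s occupancy calculator lists for compute capability 3.5 (the K40); treat them as assumptions to verify rather than authoritative.

REGS_PER_SM    = 65536  # 32-bit registers per SM
REG_ALLOC_UNIT = 256    # registers are granted per warp in chunks of this size
WARP_SIZE      = 32
MAX_WARPS      = 64     # hardware cap on resident warps per SM

def resident_warps(regs_per_thread):
    # Round one warp's register demand up to the allocation unit.
    per_warp = -(-regs_per_thread * WARP_SIZE // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    return min(REGS_PER_SM // per_warp, MAX_WARPS)

for r in (32, 37, 40, 48, 64):
    print(f"{r:3d} regs/thread -> {resident_warps(r)} warps/SM")

Note that 37 and 40 registers per thread land on the same step (both round up to 1280 registers per warp), which is the kind of discreteness described above.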

Thanks for the responses. I should not have typed that post in such a hurry. I had hoped the answer would be some simple thing but it seems that’s not the case.

To answer your questions:

  • This is on a K40c.
  • I'm referring to occupancy in terms of thread blocks per SM, rather than a percentage. The reason is that I would like to constrain the occupancy to a specific number of warps, which in turn restricts the amount of concurrency a single SM can provide. It makes more sense to use warps/SM here than percent of warps active (there’s a sketch of the intended throttle just after this list).
  • SHARED_MEM is the amount of shared memory available per SM, which is 49152 bytes in this case.
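To spell out the intended throttle (a sketch: it ignores allocation granularity, and if I read the specs right, Kepler also caps residency at 16 blocks per SM, which clips the smallest allocations):

SHARED_MEM = 49152        # bytes per SM
MAX_BLOCKS_PER_SM = 16    # hardware limit on cc 3.5

for n in (1, 2, 4, 8, 16, 32):
    per_block = SHARED_MEM // n           # what each block allocates
    by_smem = SHARED_MEM // per_block     # blocks that fit by shared memory alone
    print(f"{per_block:5d} B/block -> {min(by_smem, MAX_BLOCKS_PER_SM)} blocks/SM")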

And to be clear, the way I’m looking at latency is by plotting the start and end times of each thread block, sorted by start time, so that I can get a visual sense of how many blocks are running concurrently. I suspect this is not the most robust way to measure (max) occupancy. This is an example from Volkov’s work: https://imgur.com/a/ZzoKoGA. I’ll see what I can get from nvcc and the profiler.
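The post-processing looks roughly like this (again a sketch of the idea, not my exact script), where data is the host copy of the buffer the kernel fills with one (smid, start, end) triple per block:

import numpy as np

triples = data.reshape(-1, 3)      # columns: smid, start, end
sm0 = triples[triples[:, 0] == 0]  # keep blocks that ran on SM 0
sm0 = sm0[np.argsort(sm0[:, 1])]   # sort by start time

# At each block's start, count how many SM 0 blocks are in flight.
# clock64() is a per-SM counter, so timestamps within one SM are comparable.
for smid, start, end in sm0:
    in_flight = np.sum((sm0[:, 1] <= start) & (sm0[:, 2] > start))
    print(f"t={start}: {in_flight} blocks in flight on SM 0")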

I’ll get back to you after I rewrite it in C++.

Edit: Oh, and txbob, I meant to write that I wanted to constrain occupancy in threads/SM, not threads/warp, which of course makes no sense. Sorry about that.