I have a question about the CUDA Occupancy Calculator by NVIDIA

Hi, I study CUDA architecture.

I wrote some parallel processing code in the environment below.

  • GPU : GTX580
  • Threads Per Block : 16x16 = 256
  • Registers Per Thread : 16
  • Shared Memory Per Block : 48 bytes

I know the number of registers and the shared memory size from the compile option --ptxas-options=-v.
In addition, the grid size is 32x32 = 1024 blocks and there is no extra shared memory.
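For reference, I compile with a command like the following (the file name is just an example):

nvcc --ptxas-options=-v -arch=sm_20 -c kernel.cu

ptxas then reports the registers and shared memory used by each kernel.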

So, I tried NVIDIA's CUDA Occupancy Calculator. It said:

3.) GPU Occupancy Data is displayed here and in the graphs:

  • Active Threads per Multiprocessor : 1536
  • Active Warps per Multiprocessor : 48
  • Active Thread Blocks per Multiprocessor : 6
  • Occupancy of each Multiprocessor : 100%

So, I ran the application.
But the results showed that an 8x8 block size is faster than 16x16.
I don't know why. Please help me.

Hello,

The occupancy calculator shows the maximum theoretical occupancy. In practice you get less because of limited bandwidth when accessing memory, non-contiguous memory accesses, and latencies. In order to optimize the code you need to make use of both thread-level parallelism and instruction-level parallelism. In the case of the 2.1 architecture you need to place independent instructions one after another; this makes the code faster because those instructions can be executed in parallel. Also, with a lower number of threads per block there can be more blocks/warps/threads active on an SM, which can hide latencies. More can be said once you post the code. So far I only know of one application which achieves 100% occupancy: the addition C[i]=A[i]+B[i] with i<N.
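To illustrate with a sketch of my own (not the OP's code): on a CC 2.1 device, having each thread process two elements places independent instructions one after another:

__global__ void saxpy2(float a, const float *x, const float *y,
                       float *out, int n)
{
        int i = 2 * (blockIdx.x*blockDim.x + threadIdx.x);
        if (i + 1 < n) {                        // n assumed even for brevity
                float r0 = a*x[i]   + y[i];     // r0 and r1 do not depend on
                float r1 = a*x[i+1] + y[i+1];   // each other, so the two FMAs
                out[i]   = r0;                  // can be issued back to back
                out[i+1] = r1;
        }
}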

In my experience, occupancy is only loosely correlated to performance, and it is certainly possible to come across cases where code with lower occupancy performs better than code with somewhat higher occupancy. On the other hand, there are also cases where performance increases with occupancy until occupancy reaches 1.0. There is a lower limit on the number of threads necessary to cover the basic latencies inside the GPU, so I usually try to achieve at least an occupancy of 0.33 or thereabouts.

The issue of instruction level parallelism (ILP) mentioned by pasoleatis is separate from occupancy in my thinking, but increasing ILP can be beneficial on Kepler-family parts, although my experience with K20c is still too limited at this point to be able to quantify potential gains. My impression so far is that this is likely a minor concern for most codes, and that getting enough threads going to maximize data parallelism remains one of the major levers for CUDA performance.

I view instruction-level parallelism as mostly a compiler problem. I don’t know any strategies to increase ILP when writing CUDA C.

How much the compiler can increase ILP very much depends. It may be relatively easy for integer arithmetic, but hard for floating-point arithmetic. Two classical examples of increasing ILP at source level are the use of Estrin's method instead of Horner's method to evaluate polynomials, and the splitting of a lengthy dot product into multiple partial dot products with a quick reduction at the end. In both cases the numerical results will generally differ between the low-ILP and high-ILP versions, and only the programmer can decide whether that's OK in the given context.
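For instance, here is a sketch of both evaluation schemes for a degree-3 polynomial (coefficients c0..c3 and the argument x assumed given):

// Horner: one serial chain of dependent FMAs, minimal ILP
float p_horner = fmaf(fmaf(fmaf(c3, x, c2), x, c1), x, c0);

// Estrin: the two inner FMAs are independent of each other and can
// execute in parallel; the rounding generally differs from Horner's
float x2       = x * x;
float p_estrin = fmaf(x2, fmaf(c3, x, c2), fmaf(c1, x, c0));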

Due to tight restrictions on re-association in the evaluation of floating-point expressions, C compilers are unlikely to make those transformations (maybe some do when one turns on a “fast and furious” or “anything goes” mode), so programmers would typically apply the transformation at source level. This process should be familiar to anyone who has optimized for out-of-order CPUs that thrive on ILP. To my knowledge, Fortran gives compilers more leeway in re-arranging floating-point computations, so the likelihood of the above transformations happening automatically is higher.
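As a sketch of the dot-product example from above (array names are mine, and n is assumed even):

float sum0 = 0.0f, sum1 = 0.0f;
for (int k = 0; k < n; k += 2) {
    sum0 = fmaf(a[k],   b[k],   sum0);   // two independent accumulation
    sum1 = fmaf(a[k+1], b[k+1], sum1);   // chains instead of one
}
float dot = sum0 + sum1;   // quick reduction; the result may differ
                           // slightly from a serial summation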

As I said, I have seen no evidence so far that watching out for ILP should be more than a very low priority for GPU programmers. But where it is trivially possible to increase ILP, why not do it? Here is one such trivial example I encountered recently, in the context of FMA-based interpolation between floating-point values stored in a 1D texture. The incoming data must be split into an integer portion for indexing into the texture, and a fractional part used in the interpolation:

float x, f;             // x holds the incoming coordinate
int i;
#if LESS_ILP
i = (int)x;
f = x - (float)i;       // the subtraction must wait for the int-to-float
                        // conversion of i: a serial dependency chain
#else // MORE_ILP
i = (int)x;             // these two instructions are independent of
f = x - truncf(x);      // each other and can execute in parallel
#endif

Ah, I see! I had forgotten how constrained the compiler is when dealing with floating-point expressions…

I was just thinking about what seibert said in a previous post about how the warps are scheduled, especially the last sentence.

For some reason I assumed the OP had a 2.1 device.

The compute capability of the GTX580 is 2.0, so I'm not concerned about problems specific to CC 2.1 hardware.

Here is a part of my kernel code.

__global__ void kernel_source( unsigned char *IMAGEin, int *MEMin, int dev_ID, int a, int b, int c){
        int i = blockIdx.x*TILE_WIDTH + threadIdx.x;
        int j = blockIdx.y*TILE_HEIGHT + threadIdx.y;

        float result_a, result_b;
        int temp[15];
        int k;
        
        for(k = 0; k < 5; k++){
                temp[k] = a*5+k;
                temp[k+5] = b*5+k;
                temp[k+10] = c*5+k;
        }       
        
        int result_a_up = ((MEMin[temp[11]]-MEMin[temp[1]])*(i-MEMin[temp[0]]))-((MEMin[temp[10]]-MEMin[temp[0]])*(j-MEMin[temp[1]]));
        int result_a_down = ((MEMin[temp[11]]-MEMin[temp[1]])*(MEMin[temp[5]]-MEMin[temp[0]]))-((MEMin[temp[6]]-MEMin[temp[1]])*(MEMin[temp[10]]-MEMin[temp[0]]));
        
        int result_b_up = ((MEMin[temp[6]] -MEMin[temp[1]])*(MEMin[temp[0]]-i))-((MEMin[temp[5]] -MEMin[temp[0]])*(MEMin[temp[1]]-j));
        int result_b_down = ((MEMin[temp[11]]-MEMin[temp[1]])*(MEMin[temp[5]]-MEMin[temp[0]]))-((MEMin[temp[6]]-MEMin[temp[1]])*(MEMin[temp[10]]-MEMin[temp[0]]));
        
        result_a = float(result_a_up) / float(result_a_down);
        result_b = float(result_b_up) / float(result_b_down);
        
        int isIn = (0 <= result_a && result_a <=1) && ((0 <= result_b && result_b <= 1)) && ((0 <= (result_a+result_b) && (result_a+result_b) <= 1));
        
        IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-0)] += (int)((float(MEMin[temp[2]]) + (float(MEMin[temp[7]])-float(MEMin[temp[2]]))*result_a + (float(MEMin[temp[12]])-float(MEMin[temp[2]]))*result_b)) * isIn;
        IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-1)] += (int)((float(MEMin[temp[3]]) + (float(MEMin[temp[8]])-float(MEMin[temp[3]]))*result_a + (float(MEMin[temp[13]])-float(MEMin[temp[3]]))*result_b)) * isIn;
        IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-2)] += (int)((float(MEMin[temp[4]]) + (float(MEMin[temp[9]])-float(MEMin[temp[4]]))*result_a + (float(MEMin[temp[14]])-float(MEMin[temp[4]]))*result_b)) * isIn;
       
}

I tried running the code with the last 3 lines removed.
Then 16x16 was faster than 8x8.

I tried running the code with those lines replaced by the following:

k = IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-2)] + i;
l = IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-1)] + i;
m = IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-0)] + i;

Then 16x16 was faster than 8x8.

However, the following code shows a different result:

IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-2)] = i;
IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-1)] = i;
IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-0)] = i;

This shows 8x8 is faster than 16x16.

I guess that the theoretical result differs from the experimental result when there are write operations to global memory (the IMAGEin array); otherwise they are the same. Is this right?

The conclusion is that in some cases performance is directly proportional to occupancy, but in other cases low occupancy can lead to better performance. Examples of this are FFT and BLAS routines; you can find more about this in the work of Volkov.

For your kernel, the reduced performance at higher occupancy is easily explained by the cache overflowing once occupancy increases.

The local array temp[] at full occupancy requires 1536×15×4 = 92160 bytes of cache, while at 33% occupancy (for the smaller 8×8 block size) only 512×15×4 = 30720 bytes per SM are required. With the larger 48 kB cache/SM setting, the latter could be fully cached, eliminating off-chip memory accesses for temp[] almost completely; but even with the default 16 kB cache/SM setting, the cache hit probability is substantially higher.

As the temp[] array is not needed anyway, the fastest option (at either occupancy) would be to eliminate it completely. The compiler might already be able to achieve this if you just insert a #pragma unroll before the initialization loop. Otherwise, replace all uses of temp[] with a little macro or inline function, or even just substitute the result into the code (which in this case I would find even more readable).
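For example, a sketch of the first variant applied to the initialization loop from your kernel:

#pragma unroll
for(k = 0; k < 5; k++){          // with the trip count known at compile time,
        temp[k]    = a*5+k;      // the compiler can fully unroll the loop and
        temp[k+5]  = b*5+k;      // keep temp[] in registers instead of
        temp[k+10] = c*5+k;      // spilling it to local memory
}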

That makes sense, tera. But I couldn't solve the problem with your suggested approach.

The key point is the access to global memory. I'll study it more.

Everybody, thank you very much!!

Where did you encounter problems implementing my suggestion?
Is your actual code more complicated than what you showed here?

Actually, the kernel is just that one, that's all. But I'll give more detail.

void* routine( void *pvoidData ) {
        DataStruct  *data = (DataStruct*)pvoidData;
        unsigned char *dev_IMAGE;
        int *dev_MEM;
        unsigned char *IMAGE_SEG = data->IMAGE_SEG;

        HANDLE_ERROR(cudaSetDevice(data->deviceID));

        //initialize array
        memset(IMAGE_SEG, 0, WIDTH*HEIGHTs*CHANNELS);

        printf("Device %d Starting..\n", data->deviceID);

        //Evaluate Time
        cudaEvent_t start, stop;
        cudaEventCreate( &start );
        cudaEventCreate( &stop );

        HANDLE_ERROR( cudaMalloc( (void **)&dev_MEM, sizeof(int)*256) );                                //Creating int array each Block for same SIMD environment
        HANDLE_ERROR( cudaMalloc( (void **)&dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS) ); //output array

        cudaMemcpy(dev_MEM, MEM, sizeof(int)*256, cudaMemcpyHostToDevice);
        cudaMemset(dev_IMAGE, 0, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS);

        dim3    grid(WIDTH/TILE_WIDTH, HEIGHTs/TILE_HEIGHT);            //blocks in a grid
        dim3    block(TILE_WIDTH, TILE_HEIGHT);                         //threads in a block

        cudaEventRecord(start, 0);

        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 0, 1, 2);                    //Start the Kernel
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 0, 2, 3);                    //Start the Kernel
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 0, 3, 4);                    //Start the Kernel
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 0, 4, 5);                    //Start the Kernel
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 3, 2, 4);                    //Start the Kernel
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 2, 6, 4);                    //Start the Kernel

        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        HANDLE_ERROR( cudaMemcpy( IMAGE_SEG, dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS, cudaMemcpyDeviceToHost ) );
        HANDLE_ERROR(cudaFree(dev_MEM));
        HANDLE_ERROR( cudaFree( dev_IMAGE ) );

        cudaEventElapsedTime( &elapsed_time_ms[data->deviceID], start, stop );          //Calculate elapsed time
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        printf("Algorithm Elapsed Time : %f ms(Device %d)\n", elapsed_time_ms[data->deviceID], data->deviceID);
        printf("Device %d Complete!\n", data->deviceID);
        
        return 0;
}

__global__ void PRINT_POLYGON( unsigned char *IMAGEin, int *MEMin, int dev_ID, int a, int b, int c)
{
        int i = blockIdx.x*TILE_WIDTH + threadIdx.x;
        int j = blockIdx.y*TILE_HEIGHT + threadIdx.y;

        float result_a, result_b;
//      int k;
//#pragma unroll 5
//      for(k = 0; k < 5; k++){
//              temp[k] = a*5+k;
//              temp[k+5] = b*5+k;
//              temp[k+10] = c*5+k;
//      }

        int result_a_up = ((MEMin[c*5+1]-MEMin[a*5+1])*(i-MEMin[a*5]))-((MEMin[c*5]-MEMin[a*5])*(j-MEMin[a*5+1]));
        int result_a_down = ((MEMin[c*5+1]-MEMin[a*5+1])*(MEMin[b*5]-MEMin[a*5]))-((MEMin[b*5+1]-MEMin[a*5+1])*(MEMin[c*5]-MEMin[a*5]));

        int result_b_up = ((MEMin[b*5+1] -MEMin[a*5+1])*(MEMin[a*5]-i))-((MEMin[b*5] -MEMin[a*5])*(MEMin[a*5+1]-j));
        int result_b_down = ((MEMin[c*5+1]-MEMin[a*5+1])*(MEMin[b*5]-MEMin[a*5]))-((MEMin[b*5+1]-MEMin[a*5+1])*(MEMin[c*5]-MEMin[a*5]));

        result_a = float(result_a_up) / float(result_a_down);
        result_b = float(result_b_up) / float(result_b_down);

        int isIn = (0 <= result_a && result_a <=1) && ((0 <= result_b && result_b <= 1)) && ((0 <= (result_a+result_b) && (result_a+result_b) <= 1));

        IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-0)] += (int)(float(MEMin[a*5+2]) + (float(MEMin[b*5+2])-float(MEMin[a*5+2]))*result_a + (float(MEMin[c*5+2])-float(MEMin[a*5+2]))*result_b)*isIn;     //Red Channel
        IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-1)] += (int)(float(MEMin[a*5+3]) + (float(MEMin[b*5+3])-float(MEMin[a*5+3]))*result_a + (float(MEMin[c*5+3])-float(MEMin[a*5+3]))*result_b)*isIn;     //Green Channel
        IMAGEin[(i*HEIGHTs+j)*CHANNELS+(2-2)] += (int)(float(MEMin[a*5+4]) + (float(MEMin[b*5+4])-float(MEMin[a*5+4]))*result_a + (float(MEMin[c*5+4])-float(MEMin[a*5+4]))*result_b)*isIn;     //Blue Channel maybe...

}

Oh, there are some mistakes.

The kernel function name is PRINT_POLYGON.
The "void* routine(…)" function is the host function that sets up and launches the kernels.
What I had tried earlier was the commented-out loop above, together with "int temp[15];".