Effect of OpenCL kernel size on performance

I have two almost identical OpenCL kernels running on a Tesla K40m. The only difference is the loop body: one kernel performs 256 FMA operations per iteration, the other performs 512.

Here is the general structure of the kernel:

__kernel void WGSXMAPIXLLXOPS8(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
        const int XGL = get_global_id(0);
        const int XGRid = get_group_id(0);
        const int XGRnum = get_num_groups(0);
        const int XLSize = get_local_size(0);
        const int XLid = get_local_id(0);
        // Just a private variable
        float MF = (float) XGL;
        float NF = (float) N;
        float PF = (float) P;
        float tempOutTotal = 0;

        // Start of a new level of for loop
        for (int lcdd = 0; lcdd < 2; lcdd++) {
                float temp1 = 1.0;
                temp1 = temp1 * MF + temp1;
                temp1 = temp1 * MF + temp1;
                // ... (this FMA line is repeated, 256 or 512 times in total depending on the kernel)
                temp1 = temp1 * MF + temp1;
                temp1 = temp1 * MF + temp1;
                GOut[XGL] = temp1;
        }

}

Now, I calculate the GFLOPS of these kernels by dividing the total number of floating-point operations by the kernel execution time. I deploy 141076 work-items onto the GPU. For the kernel with 256 FMA operations I get around 1696.5 GFLOPS, and for the kernel with 512 FMA operations I get around 2043.74 GFLOPS.
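
For reference, this is roughly the kind of host-side measurement I mean (a minimal sketch using OpenCL event profiling, not my exact harness; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and counts each FMA as 2 FLOPs, which is my own convention):

#include <CL/cl.h>
#include <stdio.h>

/* Time one kernel launch with OpenCL profiling events and derive GFLOPS.
 * Assumes 'queue' was created with CL_QUEUE_PROFILING_ENABLE and 'kernel'
 * already has its arguments set. Error checking omitted for brevity. */
double measure_gflops(cl_command_queue queue, cl_kernel kernel,
                      size_t global_size, int fma_per_work_item)
{
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(evt);

        double seconds = (end - start) * 1e-9;   /* timestamps are in ns */
        /* 2 loop iterations per work-item, each FMA counted as 2 FLOPs
         * (multiply + add) -- this weighting is an assumption on my part. */
        double flops = (double) global_size * 2.0 * fma_per_work_item * 2.0;
        double gflops = flops / seconds * 1e-9;
        printf("time = %f s, GFLOPS = %f\n", seconds, gflops);
        return gflops;
}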

From my point of view, I have enough parallelism and enough operations in each kernel, unless that assumption is wrong.

Now the question is: why does having more operations in the kernel improve performance? I understand that parallelism matters and that each kernel should do enough work. So what exactly causes this specific gap in performance?

To be more specific, is there any relation between occupancy and the number and type of operations a kernel performs?

The reason is simple.
Your loop is almost entirely arithmetic on private memory (registers), with a single global memory write of the output.

Your 256-FMA kernel performs 256 FMA operations + 1 global write.
Your 512-FMA kernel performs 512 FMA operations + 1 global write.

The higher the ratio of FMA operations to memory accesses, the higher the GFLOPS you will get.
In some GPGPU books this ratio is referred to as the “compute intensity” of your kernel.
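
To make this concrete, here is a back-of-the-envelope sketch for your two kernels. I am assuming each FMA counts as 2 FLOPs and that the only global traffic per loop iteration is the 4-byte store to GOut; both are assumptions on my part, not something taken from your timing setup:

#include <stdio.h>

/* Back-of-the-envelope compute intensity per work-item, per loop iteration.
 * Assumptions: each FMA is counted as 2 FLOPs (multiply + add), and the
 * only global traffic is the single 4-byte float store. */
int main(void)
{
        const double bytes_per_iteration = 4.0;   /* one float store to GOut */
        int fmas[] = { 256, 512 };

        for (int i = 0; i < 2; i++) {
                double flops = 2.0 * fmas[i];
                printf("%d FMAs: %.0f FLOPs / %.0f bytes = %.0f FLOPs per byte\n",
                       fmas[i], flops, bytes_per_iteration,
                       flops / bytes_per_iteration);
        }
        return 0;
}
/* Prints 128 FLOPs/byte for the 256-FMA kernel and 256 FLOPs/byte for the
 * 512-FMA kernel: the second kernel amortizes the same store over twice as
 * much arithmetic. */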

Regards,
Tomer Gal, CTO at OpTeamizer

Thanks Tomer,

How expensive is each global write? I was thinking that the total number of arithmetic operations is much, much higher than the number of memory operations, which should make the memory overhead really small. Also, the occupancy of my kernel is high enough to hide the memory latency for a large number of work-items. If I follow what you said, even a really small number of memory operations can affect the final performance I observe. Is there anything wrong with this reasoning?

Thanks,
Saman

Hi Saman,
As an example of this intensity, let's look at the 1080 Ti (NVIDIA GeForce 10 Series Graphics Cards):

The memory bandwidth is 484 GB/s.
The arithmetic performance is 11.3 TFLOPS.

So the arithmetic capability is about 23x higher than the memory bandwidth: 11,300 GFLOPS / 484 GB/s ≈ 23 floating-point operations per byte transferred.
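
A quick sketch of where that factor comes from (using only the two numbers above):

#include <stdio.h>

/* Ratio check for the 1080 Ti figures quoted above. Roughly speaking, a
 * kernel needs at least this many FLOPs per byte of global traffic to keep
 * the ALUs busier than the memory bus. */
int main(void)
{
        double tflops = 11.3;            /* peak arithmetic, TFLOPS  */
        double bandwidth_gbs = 484.0;    /* memory bandwidth, GB/s   */

        double flops_per_byte = (tflops * 1e12) / (bandwidth_gbs * 1e9);
        printf("~%.1f FLOPs per byte\n", flops_per_byte);   /* ~23.3 */
        return 0;
}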

By the way… why do you write to the same index inside your for loop?
GOut[XGL] = temp1;
If the code you pasted is correct, you can move this write to after the loop ends.
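
Something along these lines, reusing your variable names (just a sketch; the elided FMA lines stay exactly as they are):

        // Sketch of the same loop with the global store hoisted out of the
        // loop body, so each work-item does a single write at the very end.
        float temp1;
        for (int lcdd = 0; lcdd < 2; lcdd++) {
                temp1 = 1.0f;
                temp1 = temp1 * MF + temp1;
                // ... remaining FMA lines, unchanged ...
                temp1 = temp1 * MF + temp1;
        }
        GOut[XGL] = temp1;   // one global write per work-item, after the loop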

Regards,
Tomer Gal, CTO at OpTeamizer

Alright, I see.

The code I have is just a synthetic piece of code for evaluating some performance metrics.
My first reasoning was that this small write to memory would be negligible compared with the total amount of arithmetic. But it seems it still matters a lot, even for a large number of work-items.

I was not able to remove the final write to memory, since without it the whole kernel gets optimized away by the compiler. Even disabling the optimizer doesn't help, so I need some really small write to memory to keep the computation alive.
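
One workaround I could try (a common trick, sketched here by me rather than taken from this thread) is to guard the store with a condition the compiler generally cannot prove false, so the arithmetic stays live but the write almost never executes at run time. Inside the loop, in place of the unconditional store:

        // Sketch: keep temp1 "observable" so the compiler cannot delete the
        // arithmetic, while making the actual global write extremely unlikely.
        // The comparison value -1.0f is arbitrary; temp1 never reaches it here.
        if (temp1 == -1.0f) {
                GOut[XGL] = temp1;
        }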