OpenCL kernel does not execute when it does not write to memory

Hi, I’m new to OpenCL programming, and my goal is to execute a GPU kernel without any memory access (if possible).
I wrote a GPU kernel that does some calculation and doesn’t read from or write to main memory. (The GPU shares main memory with the CPU.)

The code is below.
The line group_result[global_addr] = f1; writes the result to main memory.
If I execute the code as it is, the GPU kernel runs fine (the utilization peaks).
However, if I comment out that line, the GPU utilization is 0.
I checked that the kernel is queued and submitted, but its execution time is 0.

Is there a reason why this happens?
Is it not possible for the GPU to only compute, without reading or writing memory?

__kernel void gpu_workload_low(__global float8* data,
        __local float8* local_result, __global float8* group_result,
        int highBandwidth, int compute_step) {
    float8 f1, f2, f3, f4;
    float8 tmp;
    float divider;
    uint global_addr, local_addr;
    global_addr = get_global_id(0);
    local_addr = get_local_id(0);

    if(get_local_id(0) == 0) {
        f1 = (float8)(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f) * global_addr/(global_addr + 10);
        f2 = (float8)(2.0f, 3.0f, 1.0f, 4.0f, 8.0f, 10.0f, 2.0f, 7.0f) * global_addr/(global_addr + 10);
        f3 = (float8)(9.0f, 6.0f, 3.0f, 4.0f, 5.0f, 2.0f, 9.0f, 8.0f) * global_addr/(global_addr + 10);
        f4 = (float8)(3.0f, 2.0f, 4.0f, 5.0f, 8.0f, 1.0f, 2.0f, 5.0f) * global_addr/(global_addr + 10);

        for(int i = 1; i <= compute_step; i++) {
            tmp = f1 * f2;
            tmp /= (f3);
            //tmp = f4;
            f1 += tmp / f1;
            f2 += tmp / f2;
            f3 += tmp / f3;
            f4 += tmp / f4;

            if(i % 100 == 0) {
                divider = 10;
                f1 /= divider;
                f2 /= divider;
                f3 /= divider;
                f4 /= divider;
            }
        }

        // The only global-memory write in the kernel.
        group_result[global_addr] = f1;
    }
}


Hi @babjinny,

Could you please share how you are calling this kernel function and which flags you are using for the cl_mem objects?

Best regards,
Robert Gutierrez,
Embedded Software Engineer

It’s possible that when you comment out the last line, which uses f1, the compiler optimizes away the rest of the code through dead-code elimination, so the kernel actually does nothing. This is just a guess.
Have you tried commenting out any other part of the kernel code to see whether you still get the expected GPU utilization?
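
If dead-code elimination is indeed the cause, one common workaround is to keep a store in the kernel but guard it with a condition that depends on the computed value and is never true in practice. The sketch below is only an illustration under assumptions, not the original kernel: it uses scalar floats instead of float8, and the names workload, gpu_workload_guarded, sink, sentinel and fake_gid are hypothetical. The block at the top is a host-side shim so the same source can be compiled as plain C for a quick check; an OpenCL C compiler defines __OPENCL_VERSION__ and skips it.

```c
/* Host-side shims: let this sketch also compile as plain C for quick checks.
 * A real OpenCL C compiler defines __OPENCL_VERSION__ and skips this part.
 * fake_gid is a hypothetical stand-in for the work-item's global ID. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid = 0;
static size_t get_global_id(unsigned int dim)
{
    assert(dim == 0); /* the shim models a 1-D range only */
    return fake_gid;
}
#endif

/* Scalar stand-in (an assumption) for the float8 arithmetic in the
 * original kernel. All values start positive and stay positive. */
float workload(size_t gid, int compute_step)
{
    float f1 = (gid + 1.0f) / (gid + 10.0f);
    float f2 = 2.0f * f1;
    float f3 = 9.0f * f1;

    for (int i = 1; i <= compute_step; i++) {
        float tmp = f1 * f2 / f3;
        f1 += tmp / f1;
        f2 += tmp / f2;
        f3 += tmp / f3;
        if (i % 100 == 0) { /* rescale periodically, as the original does */
            f1 /= 10.0f;
            f2 /= 10.0f;
            f3 /= 10.0f;
        }
    }
    return f1;
}

/* Stores only if the result equals a host-chosen sentinel. The comparison
 * needs f1, so the loop is no longer dead code, yet no store happens as
 * long as the sentinel is a value the workload never produces. */
__kernel void gpu_workload_guarded(__global float *sink, int compute_step,
                                   float sentinel)
{
    size_t gid = get_global_id(0);
    float f1 = workload(gid, compute_step);

    if (f1 == sentinel)
        sink[gid] = f1;
}
```

On the host, sentinel would be set with clSetKernelArg to a value the loop cannot reach; since this workload only ever yields positive numbers, -1.0f would do. Because the sentinel is a run-time argument, the compiler should not be able to prove the store is unreachable, so the arithmetic should stay in the generated code while the kernel performs no global-memory traffic. Whether a particular driver behaves this way is worth confirming by measuring the kernel execution time again.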