clSetEventCallback stops working after 4 callbacks

When I call clSetEventCallback with CL_COMPLETE on an event returned by clEnqueueNDRangeKernel, the runtime stops delivering callbacks after four iterations.

The code example below illustrates the problem. It works as expected on AMD and Intel GPUs, but on Nvidia GPUs the program freezes after four iterations (no more callbacks are received). If I call clEnqueueReadBuffer instead of clEnqueueNDRangeKernel, it works as expected on Nvidia GPUs as well.

If I flush the command queue just before or just after I set the callback, all callbacks are received; however, the runtime is approximately 10x slower on Nvidia GPUs than on GPUs from AMD or Intel.

Does anybody have an idea why I see this behavior? Any help is appreciated.

I have seen this problem on all Nvidia GPUs I have tested on Windows 10. Currently I am using:

  Name:                                          NVIDIA RTX A4000
  Vendor:                                        NVIDIA Corporation
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                516.25
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 3.0 CUDA
#include <algorithm>
#include <atomic>
#include <iostream>
#include <vector>

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>

static constexpr auto clSource{ R"CLC(
    __kernel void copy(__global uchar *i, __global uchar *o)
    {
        const int gid = get_global_id(0);
        o[gid] = i[gid];
    }
    )CLC" };

int main()
{
    constexpr auto vendor = "NVIDIA";
    //constexpr auto vendor = "AMD";
    //constexpr auto vendor = "Intel";

    std::cout << "OpenCL Callback Test\n\n";

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    auto platform = find_if(begin(platforms), end(platforms), [](const auto& platform) { return platform.getInfo<CL_PLATFORM_NAME>().starts_with(vendor); });
    std::cout << "Selected platform: " << platform->getInfo<CL_PLATFORM_NAME>() << "\n";

    std::vector<cl::Device> devices;
    platform->getDevices(CL_DEVICE_TYPE_GPU, &devices);

    auto device = devices.front();
    std::cout << "Selected device: " << device.getInfo<CL_DEVICE_NAME>() << "\n\n";

    std::vector<cl::Device> contextDevices{ device };
    cl_context_properties props[3]{ CL_CONTEXT_PLATFORM, (cl_context_properties)(*platform)(), 0 };
    cl::Context context(contextDevices, props);

    cl::CommandQueue queue(context, device/*, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE*/);

    std::vector<std::byte> input(10);
    auto output = input;

    cl::Buffer inputBuffer{ context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, input.size(), nullptr };
    cl::Buffer outputBuffer{ context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, output.size(), nullptr };

    cl::Program program(context, clSource, true);
    cl::Kernel kernel(program, "copy");

    kernel.setArg(0, inputBuffer);
    kernel.setArg(1, outputBuffer);

    for (int i = 0; i < 10; i++)
    {
        cl::Event event;
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, { input.size() }, cl::NullRange, nullptr, &event);
        //queue.enqueueReadBuffer(outputBuffer, CL_FALSE, 0, output.size(), output.data(), nullptr, &event); 

        std::atomic_bool flag{ false };

        event.setCallback(CL_COMPLETE, [](cl_event, cl_int, void* user_data)
            {
                auto& flag = *((std::atomic_bool*)user_data);
                flag = true;
                flag.notify_one();
            }, &flag);

        flag.wait(false);

        std::cout << "Completed loop " << i << std::endl;
    }

    std::cout << "The End!\n";
}

I’m curious what the benefit is of posting a bunch of code while omitting a couple of includes that are necessary. It seems odd to me. Maybe you’re on Windows and the Microsoft compiler doesn’t need them; I don’t know.

I don’t have a C++20 machine handy, and some of this stuff (the atomics) doesn’t seem necessary to focus on the issue. Perhaps I am wrong.

At any rate, I cooked up this test, loosely based on your example, and it seems to work for me:

$ cat t10.cpp
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>
#include <iostream>
#include <cstddef>
#include <cstdlib>

static constexpr auto clSource{ R"CLC(
    __kernel void copy(__global uchar *i, __global uchar *o)
    {
        const int gid = get_global_id(0);
        o[gid] = i[gid];
    }
    )CLC" };

int main(int argc, char *argv[])
{
    constexpr auto vendor = "NVIDIA";
    //constexpr auto vendor = "AMD";
    //constexpr auto vendor = "Intel";

    std::cout << "OpenCL Callback Test\n\n";

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    auto platform = platforms.begin();
//= find_if(begin(platforms), end(platforms), [](const auto& platform) { return platform.getInfo<CL_PLATFORM_NAME>().starts_with(vendor); });
    std::cout << "Selected platform: " << platform->getInfo<CL_PLATFORM_NAME>() << "\n";

    std::vector<cl::Device> devices;
    platform->getDevices(CL_DEVICE_TYPE_GPU, &devices);

    auto device = devices.front();
    std::cout << "Selected device: " << device.getInfo<CL_DEVICE_NAME>() << "\n\n";

    std::vector<cl::Device> contextDevices{ device };
    cl_context_properties props[3]{ CL_CONTEXT_PLATFORM, (cl_context_properties)(*platform)(), 0 };
    cl::Context context(contextDevices, props);

    cl::CommandQueue queue(context, device/*, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE*/);

    std::vector<std::byte> input(10);
    auto output = input;

    cl::Buffer inputBuffer{ context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, input.size(), nullptr };
    cl::Buffer outputBuffer{ context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, output.size(), nullptr };

    cl::Program program(context, clSource, true);
    cl::Kernel kernel(program, "copy");

    kernel.setArg(0, inputBuffer);
    kernel.setArg(1, outputBuffer);
    int count = 0;
    int loops = 10;
    if (argc > 1) loops = atoi(argv[1]);
    for (int i = 0; i < loops; i++)
    {
        cl::Event event;
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, { input.size() }, cl::NullRange, nullptr, &event);
        //queue.enqueueReadBuffer(outputBuffer, CL_FALSE, 0, output.size(), output.data(), nullptr, &event);

        event.setCallback(CL_COMPLETE, [](cl_event, cl_int, void* user_data)
            {
               ((volatile int *)user_data)[0]++;
               std::cout << "callback: " << ((volatile int *)user_data)[0] << std::endl;
            }, &count);


        std::cout << "Completed loop " << i << std::endl;
    }
    queue.flush();
    queue.finish();
    std::cout << "The End!\n";
}
$ g++ t10.cpp -I/usr/local/cuda/include -I. -lOpenCL --std=c++17 -o t10
$ ./t10 20
OpenCL Callback Test

Selected platform: NVIDIA CUDA
Selected device: Tesla V100-PCIE-32GB

Completed loop 0
callback: 1
Completed loop 1
callback: 2
Completed loop 2
callback: 3
Completed loop 3
Completed loop 4
callback: 4
callback: 5
Completed loop 5
Completed loop 6
callback: 6
callback: 7
Completed loop 7
Completed loop 8
callback: 8
callback: 9
Completed loop 9
Completed loop 10
callback: 10
callback: 12
callback: 12
Completed loop 11
Completed loop 12
callback: 13
Completed loop 13
Completed loop 14
callback: 14
callback: 15
Completed loop 15
callback: 16
Completed loop 16
callback: 17
Completed loop 17
callback: 18
Completed loop 18
Completed loop 19
callback: 19
callback: 20
The End!
$

You’ll note one of the things I added at the end is explicit synchronization. I haven’t studied what you were doing with the atomics closely, but it’s not obvious to me why you would expect the asynchronously issued OpenCL work to complete without an explicit sync on the command queue in question. I suspect that may be a key difference between my code and yours.

If I run it multiple times, I occasionally see the “callback: 20” printout occur after the “The End!” message, which I find a bit strange. However, the callback printout occurs on a different thread than the main thread, and the ostream mechanics may be tripping me up here, so I might be making a mistake in that respect.

I don’t see anything stranger than that.

Perhaps another important difference is that my callback does no waiting of any kind. Again, I haven’t studied your atomic usage closely. But I will point out that explicitly or implicitly waiting on OpenCL activity in a callback is frowned upon:

Callbacks must return promptly. The behavior of calling expensive system routines, OpenCL API calls to create contexts or command-queues, or blocking OpenCL operations from the following list below, in a callback is undefined.

Thanks for your feedback; it was a copy-paste error that caused me to omit the #includes. The code sample is now updated in the original post. More importantly, I also forgot to mention that I have only tested on Windows 10.

Adding a flush and a finish after the loop doesn’t change the behavior on my end.

The code example is a distillation of a far more complex program; the loop reflects the overall structure of the original program.

I use the atomic variable for synchronization: I wait for the callback (i.e., for the kernel execution to complete) before I enqueue a new kernel. It is on this wait that the example code stalls after four iterations. I can flush the command queue before I start waiting, but this seriously damages my performance.

All the callback does is flip an atomic variable; it could hardly do less work (when synchronization with another thread is needed), so I would expect it to comply with the no-work-in-the-callback rule.

Sorry about the C++20 stuff. I have made a new version that compiles with C++14 (on Windows, at least).

#include <algorithm>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <vector>

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>

static constexpr auto clSource{ R"CLC(
    __kernel void copy(__global uchar *i, __global uchar *o)
    {
        const int gid = get_global_id(0);
        o[gid] = i[gid];
    }
    )CLC" };

struct sync_object
{
    std::condition_variable c;
    std::mutex m;
    bool signal;
};

int main()
{
    constexpr auto vendor = "NVIDIA";
    //constexpr auto vendor = "AMD";
    //constexpr auto vendor = "Intel";

    std::cout << "OpenCL Callback Test\n\n";

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    auto platform = find_if(begin(platforms), end(platforms), [](const auto& platform) { return platform.getInfo<CL_PLATFORM_NAME>().substr(0,3) == std::string(vendor).substr(0,3); });
    std::cout << "Selected platform: " << platform->getInfo<CL_PLATFORM_NAME>() << "\n";

    std::vector<cl::Device> devices;
    platform->getDevices(CL_DEVICE_TYPE_GPU, &devices);

    auto device = devices.front();
    std::cout << "Selected device: " << device.getInfo<CL_DEVICE_NAME>() << "\n\n";

    std::vector<cl::Device> contextDevices{ device };
    cl_context_properties props[3]{ CL_CONTEXT_PLATFORM, (cl_context_properties)(*platform)(), 0 };
    cl::Context context(contextDevices, props);

    cl::CommandQueue queue(context, device/*, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE*/);

    std::vector<unsigned char> input(10);
    auto output = input;

    cl::Buffer inputBuffer{ context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, input.size(), nullptr };
    cl::Buffer outputBuffer{ context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, output.size(), nullptr };

    cl::Program program(context, clSource, true);
    cl::Kernel kernel(program, "copy");

    kernel.setArg(0, inputBuffer);
    kernel.setArg(1, outputBuffer);

    sync_object flag;
    flag.signal = false;

    for (int i = 0; i < 10; i++)
    {
        cl::Event event;
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, { input.size() }, cl::NullRange, nullptr, &event);
        //queue.enqueueReadBuffer(outputBuffer, CL_FALSE, 0, output.size(), output.data(), nullptr, &event); 

        event.setCallback(CL_COMPLETE, [](cl_event, cl_int, void* user_data)
            {
                auto& flag = *((sync_object*)user_data);
                auto lock = std::unique_lock<std::mutex>(flag.m);
                flag.signal = true;
                flag.c.notify_one();
            }, &flag);

        auto lock = std::unique_lock<std::mutex>(flag.m);
        flag.c.wait(lock, [&flag]() {return flag.signal; });
        flag.signal = false;

        std::cout << "Completed loop " << i << std::endl;
    }

    std::cout << "The End!\n";
}

I still can’t compile your code on Linux with g++.

I think one possibility is that you may be running into WDDM command batching. I would suggest finding a way to keep pushing the work queue forward, rather than implicitly waiting on each event before issuing more work. This is also a difference between your version and mine.

For example, rather than testing for event completion in this implicit fashion, use an explicit event test (e.g. clGetEventInfo, or whatever the equivalent is in the C++ wrapper: event.getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>()).