I’m using the CUDA 3 SDK. In an otherwise kosher situation, the call “cl_int status = clCreateKernelsInProgram(program, 0, NULL, &numKernels)” returns a status of -47 (CL_INVALID_KERNEL_DEFINITION).
The program/kernel in question is reduce0 from the SDK examples, with “#define T float” on top; it compiles with no problems.
Note that if I use clCreateKernel(program, "reduce0", &status) instead of the other call, the kernel is created correctly.
The OpenCL doc does not even specify that clCreateKernelsInProgram can return this error code.
Any thoughts on what might be happening, or on how to debug this problem, would be appreciated.
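As a first debugging step, it can help to print the symbolic name next to the raw status. Below is a minimal sketch in plain C; the helper names cl_status_name and report_cl_status are hypothetical, the numeric values are the ones defined in CL/cl.h, and the list is deliberately incomplete.

```c
#include <stdio.h>

/* Map a few OpenCL status codes to their names. The numeric values match
 * CL/cl.h; only codes relevant to program and kernel creation are listed. */
static const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -33: return "CL_INVALID_DEVICE";
    case -44: return "CL_INVALID_PROGRAM";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    case -46: return "CL_INVALID_KERNEL_NAME";
    case -47: return "CL_INVALID_KERNEL_DEFINITION";
    default:  return "unlisted status code";
    }
}

/* Hypothetical helper: report a failed call with its symbolic status. */
static void report_cl_status(const char *call, int status)
{
    if (status != 0)
        fprintf(stderr, "%s failed: %d (%s)\n", call, status,
                cl_status_name(status));
}
```

Calling report_cl_status("clCreateKernelsInProgram", status) right after the failing call makes the -47 easier to spot in a log.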
I figured out the problem, and will share it in case someone else encounters it.
The issue was that the program was within a context created for two GPU devices using clCreateContextFromType, but compiled for only one of the devices using clBuildProgram. That later resulted in the weird error when using clCreateKernelsInProgram.
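A check along these lines would have caught the mismatch early. It is only a sketch in plain C so that it compiles without an OpenCL runtime: device_handle is a hypothetical stand-in for cl_device_id, and first_unbuilt_device is a made-up helper name. In real host code the first array would come from clGetContextInfo with CL_CONTEXT_DEVICES, and the second would be the device list passed to clBuildProgram.

```c
#include <stddef.h>

/* Hypothetical stand-in for cl_device_id, so the sketch needs no CL/cl.h. */
typedef const void *device_handle;

/* Return the index of the first context device that is missing from the
 * build list, or -1 if the program was built for every context device. */
static int first_unbuilt_device(const device_handle *context_devices,
                                size_t num_context,
                                const device_handle *built_devices,
                                size_t num_built)
{
    for (size_t i = 0; i < num_context; ++i) {
        int found = 0;
        for (size_t j = 0; j < num_built; ++j) {
            if (context_devices[i] == built_devices[j]) {
                found = 1;
                break;
            }
        }
        if (!found)
            return (int)i;  /* this device has no program executable */
    }
    return -1;
}
```

The simpler way to avoid the mismatch is to pass 0 and NULL as the device count and device list to clBuildProgram, which builds the program for all devices associated with it.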
I had the same issue. It’s more specific than the explanation above suggests. When N devices are in a context (where N > 0), there need to be N cl_device_id handles, N cl_command_queue handles, at least 1 program, and N*K cl_kernel handles (where K is the number of kernels in your .cl file). Each kernel is compiled for a specific device. In my case I had only K cl_kernel handles. When I build for 1 device, it works fine; when I build against 2 devices, it fails with CL_INVALID_KERNEL_DEFINITION. This was happening because clGetDeviceIDs was returning 2 devices when I asked for only 1. I consider this a BUG in the NVIDIA OpenCL code, since it can result in a buffer overrun.
A call like this should be able to query the maximum number of devices available:
cl_uint numDevices = 0;
// NULL platform means default implementation (the spec leaves this implementation-defined)
// 0 size indicates no array
// NULL is acceptable as pointer
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
However, when I’m actively asking for a specific number of device handles, the call should not overfill my buffer!
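One defensive pattern, sketched here in plain C: as I read the OpenCL specification, the num_devices output reports how many matching devices are available, which can be larger than the num_entries you passed in. Treating that output as the number of filled slots is what leads to reading or writing past the array. The helper name valid_device_count is hypothetical.

```c
/* Number of slots in the devices array that are safe to use after a call
 * such as:
 *     clGetDeviceIDs(platform, type, num_entries, devices, &num_reported);
 * num_reported is the count of available devices and may exceed
 * num_entries, so the usable count is the smaller of the two. */
static unsigned valid_device_count(unsigned num_entries, unsigned num_reported)
{
    return num_reported < num_entries ? num_reported : num_entries;
}
```

With an array sized for 1 device on a machine that reports 2, valid_device_count(1, 2) gives 1, so later loops over devices, queues, and kernels stay within bounds.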