I’ve written a very simple test kernel that just copies input to output:
typedef uint4 vec_t;
__kernel
void SimpleKernel(constant vec_t* input, global vec_t* output, uint size)
{
    uint i = 0;
    for(; i < size; i++)
        output[i] = input[i];
}
To test it, I initialize input to { 0, 1, 2, … n - 1 } (where each element is of type uint, n % 4 == 0, and size = n / 4), and it behaves exactly as expected: the output is { 0, 1, 2, … n - 1 }.
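For anyone who wants to reproduce this, here is a simplified sketch of the host-side test. The function and variable names (RunCopyTest, ctx, queue, kernel, inBuf, outBuf) and the choice of n = 1024 are just illustrative, and error checking is stripped out, but it follows the description above (input initialized to 0 … n - 1 with _aligned_malloc, size = n / 4):

#include <CL/cl.h>
#include <malloc.h>   /* _aligned_malloc / _aligned_free (MSVC) */
#include <stdio.h>

/* Assumes the context, command queue and a built "SimpleKernel" are passed in. */
static void RunCopyTest(cl_context ctx, cl_command_queue queue, cl_kernel kernel)
{
    const size_t n = 1024;                   /* illustrative; n % 4 == 0 */
    const cl_uint size = (cl_uint)(n / 4);   /* number of uint4 elements */
    cl_int err = CL_SUCCESS;

    /* input = { 0, 1, 2, ... n - 1 }, aligned to the vector size */
    cl_uint* input  = (cl_uint*)_aligned_malloc(n * sizeof(cl_uint), sizeof(cl_uint4));
    cl_uint* output = (cl_uint*)_aligned_malloc(n * sizeof(cl_uint), sizeof(cl_uint4));
    for (cl_uint i = 0; i < (cl_uint)n; i++)
        input[i] = i;

    cl_mem inBuf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   n * sizeof(cl_uint), input, &err);
    cl_mem outBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                   n * sizeof(cl_uint), NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &size);

    /* One work-item is enough, since the kernel loops over the whole buffer itself. */
    size_t global = 1;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, outBuf, CL_TRUE, 0, n * sizeof(cl_uint), output, 0, NULL, NULL);

    /* verify output == { 0, 1, 2, ... n - 1 } */
    for (cl_uint i = 0; i < (cl_uint)n; i++)
        if (output[i] != i)
            printf("mismatch at %u: got %u\n", i, output[i]);

    clReleaseMemObject(inBuf);
    clReleaseMemObject(outBuf);
    _aligned_free(input);
    _aligned_free(output);
}

With the uint4 version of the kernel this prints nothing, i.e. the copy is exact.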
However, if I change vec_t to ulong4 instead of uint4 (or to any other ulong vector width) and change the rest of the types accordingly, I get the following output:
{ 0, 0, 2, 2, 4, 4, … n - 2, n - 2 }
The output is the same regardless of the vector width of vec_t (i.e. vec_t can be ulong, ulong2, ulong4, etc.), and regardless of whether the host program is built as 32-bit or 64-bit.
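For reference, the failing variant is just the same kernel with the typedef switched; on the host side the element type becomes cl_ulong and size becomes n divided by the vector width:

typedef ulong4 vec_t;   /* ulong, ulong2, etc. fail the same way */

__kernel
void SimpleKernel(constant vec_t* input, global vec_t* output, uint size)
{
    uint i = 0;
    for(; i < size; i++)
        output[i] = input[i];
}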
I have tested this same kernel on an ATI Radeon 4870 and it works exactly as expected, so the problem is probably with NVIDIA’s OpenCL implementation. Before you ask: yes, I am changing the input types in the host program whenever I change them in the kernel, and yes, I am using _aligned_malloc() for input.
Here are the specs of my machine:
Windows 7 Pro 64-bit
MSVS 2008 Pro
OpenCL SDK ver. 3.2.12_win_64
NVIDIA driver ver. 260.93_desktop_win7_winvista_64bit_international
NVIDIA Quadro FX 3800
ATI Radeon 4870
Intel Core 2 Quad Q9400
I have both the NVIDIA and ATI OpenCL implementations installed; however, I have been careful to keep them separate, so they shouldn’t be interfering with each other.
Does anyone else have this issue? Is it a known bug or am I simply lucky/stupid? I’d appreciate it if someone else could test this kernel and see if they get the same results.