ulong kernel parameters not working

So I’ve made a very simple test kernel that simply copies input to output:

typedef uint4 vec_t;

__kernel

void SimpleKernel(constant vec_t* input, global vec_t* output, uint size)

{

    uint i = 0;

    for(; i < size; i++)

	output[i] = input[i];

}

To test it I initialize input to { 0, 1, 2, … n - 1 } (where each element is of type uint, n % 4 == 0, and size = n / 4), and it behaves exactly as expected; the output is { 0, 1, 2, … n - 1 }.

However, if I change vec_t to ulong4 instead of uint4 (or any other vector size), and change the rest of the types accordingly, I get the following output:

{ 0, 0, 2, 2, 4, 4, … n - 2, n - 2 }

The output is the same regardless of the vector width for vec_t (i.e. vec_t can be type ulong, ulong2, ulong4, etc), and regardless of whether the host program is built as 32 or 64 bit.

I have tested this same kernel on an ATI Radeon 4870 and it works exactly as expected, so the problem is probably with NVIDIA’s OpenCL implementation. Before you ask: yes, I am changing the input types in the host program whenever I change them in the kernel, and yes, I am using _aligned_malloc() for input.

Here are some specs about my machine:

Windows 7 Pro 64 bit

MSVS 2008 Pro

OPENCL SDK ver. 3.2.12_win_64

NVIDIA Driver ver. 260.93_desktop_win7_winvista_64bit_international

NVIDIA QUADRO FX 3800

ATI RADEON 4870

INTEL Core2 Quad Q9400

I have both the NVIDIA and ATI OpenCL implementations installed, however I have been careful to keep them separate so they shouldn’t be interfering with each other.

Does anyone else have this issue? Is it a known bug or am I simply lucky/stupid? I’d appreciate it if someone else could test this kernel and see if they get the same results.

So I’ve made a very simple test kernel that simply copies input to output:

typedef uint4 vec_t;

__kernel

void SimpleKernel(constant vec_t* input, global vec_t* output, uint size)

{

    uint i = 0;

    for(; i < size; i++)

	output[i] = input[i];

}

To test it I initialize input to { 0, 1, 2, … n - 1 } (where each element is of type uint, n % 4 == 0, and size = n / 4), and it behaves exactly as expected; the output is { 0, 1, 2, … n - 1 }.

However, if I change vec_t to ulong4 instead of uint4 (or any other vector size), and change the rest of the types accordingly, I get the following output:

{ 0, 0, 2, 2, 4, 4, … n - 2, n - 2 }

The output is the same regardless of the vector width for vec_t (i.e. vec_t can be type ulong, ulong2, ulong4, etc), and regardless of whether the host program is built as 32 or 64 bit.

I have tested this same kernel on an ATI Radeon 4870 and it works exactly as expected, so the problem is probably with NVIDIA’s OpenCL implementation. Before you ask: yes, I am changing the input types in the host program whenever I change them in the kernel, and yes, I am using _aligned_malloc() for input.

Here are some specs about my machine:

Windows 7 Pro 64 bit

MSVS 2008 Pro

OPENCL SDK ver. 3.2.12_win_64

NVIDIA Driver ver. 260.93_desktop_win7_winvista_64bit_international

NVIDIA QUADRO FX 3800

ATI RADEON 4870

INTEL Core2 Quad Q9400

I have both the NVIDIA and ATI OpenCL implementations installed, however I have been careful to keep them separate so they shouldn’t be interfering with each other.

Does anyone else have this issue? Is it a known bug or am I simply lucky/stupid? I’d appreciate it if someone else could test this kernel and see if they get the same results.

I haven’t tested your code yet, but the kernel parameters looks weird. I suppose it should look like

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

Maybe you are copying between different memory spaces… or is it just mistyped?

I haven’t tested your code yet, but the kernel parameters looks weird. I suppose it should look like

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

Maybe you are copying between different memory spaces… or is it just mistyped?

If you read the OpenCL language spec, you’ll see that constant and global are synonyms for __constant and __global respectively. That said, I tried your version of the function header and it fixes the bug! The thing that makes a difference is declaring input as “const global vec_t* input” instead of “constant vec_t* input”]. So this:

__kernel void SimpleKernel(__constant vec_t* input, __global vec_t* output, uint size)

produces the strange output described above (but only for ulong vectors) whereas this:

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

works fine. I guess my question now becomes, why doesn’t the __constant address space qualifier work as expected?

Also, there’s a mistake in my original post, when vec_t is of type ulong the output is correct, it’s only different when vec_t is of type ulongn.

If you read the OpenCL language spec, you’ll see that constant and global are synonyms for __constant and __global respectively. That said, I tried your version of the function header and it fixes the bug! The thing that makes a difference is declaring input as “const global vec_t* input” instead of “constant vec_t* input”]. So this:

__kernel void SimpleKernel(__constant vec_t* input, __global vec_t* output, uint size)

produces the strange output described above (but only for ulong vectors) whereas this:

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

works fine. I guess my question now becomes, why doesn’t the __constant address space qualifier work as expected?

Also, there’s a mistake in my original post, when vec_t is of type ulong the output is correct, it’s only different when vec_t is of type ulongn.

My mistake, you are right. const __global saves parameter only to global, __constant saves it into constant memory (I looked into generated PTX code)

However, when I tried to use typedef ulong4 vec_t attribute ((aligned)); the compiler received Segmentation fault.

My mistake, you are right. const __global saves parameter only to global, __constant saves it into constant memory (I looked into generated PTX code)

However, when I tried to use typedef ulong4 vec_t attribute ((aligned)); the compiler received Segmentation fault.

Just played around with __constant memory and found out that putting more data in it will result in incorrect behavior without any warring. Isn’t your case as well? Exceeding constant limits?

Just played around with __constant memory and found out that putting more data in it will result in incorrect behavior without any warring. Isn’t your case as well? Exceeding constant limits?

Hey,

I don’t think my problem is overflowing constant memory for a number of reasons. 1) The largest test I’ve done is with 256 ulongs = 2 KB, and my card has 1 GB VRAM, 2) if I change the type to ulong, or to uint and double the number of elements, it works fine. I’m inclined to conclude that this is a compiler/implementation bug, and to simply not use the constant keyword when working with ulong vectors, and to be cautious with it when working with anything more than 32 bits. It’s kind of annoying, but an acceptable workaround for now. Hopefully Nvidia will fix this in the near future.

As for your compiler error, try “typedef ulong4 attribute((aligned)) vec_t;” instead of “typedef ulong4 vec_t attribute ((aligned));”

Were you able to replicate my bug, or is it just me?

Hey,

I don’t think my problem is overflowing constant memory for a number of reasons. 1) The largest test I’ve done is with 256 ulongs = 2 KB, and my card has 1 GB VRAM, 2) if I change the type to ulong, or to uint and double the number of elements, it works fine. I’m inclined to conclude that this is a compiler/implementation bug, and to simply not use the constant keyword when working with ulong vectors, and to be cautious with it when working with anything more than 32 bits. It’s kind of annoying, but an acceptable workaround for now. Hopefully Nvidia will fix this in the near future.

As for your compiler error, try “typedef ulong4 attribute((aligned)) vec_t;” instead of “typedef ulong4 vec_t attribute ((aligned));”

Were you able to replicate my bug, or is it just me?

Sorry, the compiler fails with your suggested typedef as well so I can’t reproduce the error.

The constant memory is much more scarce resource than you think, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE says you exactly the size (for my GTX 275 it is 64KB although it has got 895MB of global memory).

Well, seems it is not the overflow. Are you familiar with CUDA? Try to compare PTX’s codes derived from OpenCL and CUDA equivalent kernel and it’ll be clear, whether it is a compiler’s failure.

Sorry, the compiler fails with your suggested typedef as well so I can’t reproduce the error.

The constant memory is much more scarce resource than you think, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE says you exactly the size (for my GTX 275 it is 64KB although it has got 895MB of global memory).

Well, seems it is not the overflow. Are you familiar with CUDA? Try to compare PTX’s codes derived from OpenCL and CUDA equivalent kernel and it’ll be clear, whether it is a compiler’s failure.