ulong kernel parameters not working

behuber · November 7, 2010, 11:43am

So I’ve made a very simple test kernel that simply copies input to output:

typedef uint4 vec_t;

__kernel

void SimpleKernel(constant vec_t* input, global vec_t* output, uint size)

{

    uint i = 0;

    for(; i < size; i++)

	output[i] = input[i];

}

To test it I initialize input to { 0, 1, 2, … n - 1 } (where each element is of type uint, n % 4 == 0, and size = n / 4), and it behaves exactly as expected; the output is { 0, 1, 2, … n - 1 }.

However, if I change vec_t to ulong4 instead of uint4 (or any other vector size), and change the rest of the types accordingly, I get the following output:

{ 0, 0, 2, 2, 4, 4, … n - 2, n - 2 }

The output is the same regardless of the vector width for vec_t (i.e. vec_t can be type ulong, ulong2, ulong4, etc), and regardless of whether the host program is built as 32 or 64 bit.

I have tested this same kernel on an ATI Radeon 4870 and it works exactly as expected, so the problem is probably with NVIDIA’s OpenCL implementation. Before you ask: yes, I am changing the input types in the host program whenever I change them in the kernel, and yes, I am using _aligned_malloc() for input.

Here are some specs about my machine:

Windows 7 Pro 64 bit

MSVS 2008 Pro

OPENCL SDK ver. 3.2.12_win_64

NVIDIA Driver ver. 260.93_desktop_win7_winvista_64bit_international

NVIDIA QUADRO FX 3800

ATI RADEON 4870

INTEL Core2 Quad Q9400

I have both the NVIDIA and ATI OpenCL implementations installed, however I have been careful to keep them separate so they shouldn’t be interfering with each other.

Does anyone else have this issue? Is it a known bug or am I simply lucky/stupid? I’d appreciate it if someone else could test this kernel and see if they get the same results.

behuber · November 7, 2010, 11:43am

So I’ve made a very simple test kernel that simply copies input to output:

typedef uint4 vec_t;

__kernel

void SimpleKernel(constant vec_t* input, global vec_t* output, uint size)

{

    uint i = 0;

    for(; i < size; i++)

	output[i] = input[i];

}

To test it I initialize input to { 0, 1, 2, … n - 1 } (where each element is of type uint, n % 4 == 0, and size = n / 4), and it behaves exactly as expected; the output is { 0, 1, 2, … n - 1 }.

However, if I change vec_t to ulong4 instead of uint4 (or any other vector size), and change the rest of the types accordingly, I get the following output:

{ 0, 0, 2, 2, 4, 4, … n - 2, n - 2 }

The output is the same regardless of the vector width for vec_t (i.e. vec_t can be type ulong, ulong2, ulong4, etc), and regardless of whether the host program is built as 32 or 64 bit.

I have tested this same kernel on an ATI Radeon 4870 and it works exactly as expected, so the problem is probably with NVIDIA’s OpenCL implementation. Before you ask: yes, I am changing the input types in the host program whenever I change them in the kernel, and yes, I am using _aligned_malloc() for input.

Here are some specs about my machine:

Windows 7 Pro 64 bit

MSVS 2008 Pro

OPENCL SDK ver. 3.2.12_win_64

NVIDIA Driver ver. 260.93_desktop_win7_winvista_64bit_international

NVIDIA QUADRO FX 3800

ATI RADEON 4870

INTEL Core2 Quad Q9400

I have both the NVIDIA and ATI OpenCL implementations installed, however I have been careful to keep them separate so they shouldn’t be interfering with each other.

Does anyone else have this issue? Is it a known bug or am I simply lucky/stupid? I’d appreciate it if someone else could test this kernel and see if they get the same results.

karbous · November 7, 2010, 12:08pm

I haven’t tested your code yet, but the kernel parameters looks weird. I suppose it should look like

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

Maybe you are copying between different memory spaces… or is it just mistyped?

karbous · November 7, 2010, 12:08pm

I haven’t tested your code yet, but the kernel parameters looks weird. I suppose it should look like

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

Maybe you are copying between different memory spaces… or is it just mistyped?

behuber · November 7, 2010, 3:37pm

If you read the OpenCL language spec, you’ll see that constant and global are synonyms for __constant and __global respectively. That said, I tried your version of the function header and it fixes the bug! The thing that makes a difference is declaring input as “const global vec_t* input” instead of “constant vec_t* input”]. So this:

__kernel void SimpleKernel(__constant vec_t* input, __global vec_t* output, uint size)

produces the strange output described above (but only for ulong vectors) whereas this:

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

works fine. I guess my question now becomes, why doesn’t the __constant address space qualifier work as expected?

Also, there’s a mistake in my original post, when vec_t is of type ulong the output is correct, it’s only different when vec_t is of type ulongn.

behuber · November 7, 2010, 3:37pm

If you read the OpenCL language spec, you’ll see that constant and global are synonyms for __constant and __global respectively. That said, I tried your version of the function header and it fixes the bug! The thing that makes a difference is declaring input as “const global vec_t* input” instead of “constant vec_t* input”]. So this:

__kernel void SimpleKernel(__constant vec_t* input, __global vec_t* output, uint size)

produces the strange output described above (but only for ulong vectors) whereas this:

__kernel void SimpleKernel(const __global vec_t* input, __global vec_t* output, uint size)

works fine. I guess my question now becomes, why doesn’t the __constant address space qualifier work as expected?

Also, there’s a mistake in my original post, when vec_t is of type ulong the output is correct, it’s only different when vec_t is of type ulongn.

karbous · November 7, 2010, 4:43pm

My mistake, you are right. const __global saves parameter only to global, __constant saves it into constant memory (I looked into generated PTX code)

However, when I tried to use typedef ulong4 vec_t attribute ((aligned)); the compiler received Segmentation fault.

karbous · November 7, 2010, 4:43pm

My mistake, you are right. const __global saves parameter only to global, __constant saves it into constant memory (I looked into generated PTX code)

However, when I tried to use typedef ulong4 vec_t attribute ((aligned)); the compiler received Segmentation fault.

karbous · November 7, 2010, 5:07pm

Just played around with __constant memory and found out that putting more data in it will result in incorrect behavior without any warring. Isn’t your case as well? Exceeding constant limits?

karbous · November 7, 2010, 5:07pm

Just played around with __constant memory and found out that putting more data in it will result in incorrect behavior without any warring. Isn’t your case as well? Exceeding constant limits?

behuber · November 7, 2010, 6:55pm

Hey,

I don’t think my problem is overflowing constant memory for a number of reasons. 1) The largest test I’ve done is with 256 ulongs = 2 KB, and my card has 1 GB VRAM, 2) if I change the type to ulong, or to uint and double the number of elements, it works fine. I’m inclined to conclude that this is a compiler/implementation bug, and to simply not use the constant keyword when working with ulong vectors, and to be cautious with it when working with anything more than 32 bits. It’s kind of annoying, but an acceptable workaround for now. Hopefully Nvidia will fix this in the near future.

As for your compiler error, try “typedef ulong4 attribute((aligned)) vec_t;” instead of “typedef ulong4 vec_t attribute ((aligned));”

Were you able to replicate my bug, or is it just me?

behuber · November 7, 2010, 6:55pm

Hey,

I don’t think my problem is overflowing constant memory for a number of reasons. 1) The largest test I’ve done is with 256 ulongs = 2 KB, and my card has 1 GB VRAM, 2) if I change the type to ulong, or to uint and double the number of elements, it works fine. I’m inclined to conclude that this is a compiler/implementation bug, and to simply not use the constant keyword when working with ulong vectors, and to be cautious with it when working with anything more than 32 bits. It’s kind of annoying, but an acceptable workaround for now. Hopefully Nvidia will fix this in the near future.

As for your compiler error, try “typedef ulong4 attribute((aligned)) vec_t;” instead of “typedef ulong4 vec_t attribute ((aligned));”

Were you able to replicate my bug, or is it just me?

karbous · November 8, 2010, 1:11pm

Sorry, the compiler fails with your suggested typedef as well so I can’t reproduce the error.

The constant memory is much more scarce resource than you think, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE says you exactly the size (for my GTX 275 it is 64KB although it has got 895MB of global memory).

Well, seems it is not the overflow. Are you familiar with CUDA? Try to compare PTX’s codes derived from OpenCL and CUDA equivalent kernel and it’ll be clear, whether it is a compiler’s failure.

karbous · November 8, 2010, 1:11pm

Sorry, the compiler fails with your suggested typedef as well so I can’t reproduce the error.

The constant memory is much more scarce resource than you think, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE says you exactly the size (for my GTX 275 it is 64KB although it has got 895MB of global memory).

Well, seems it is not the overflow. Are you familiar with CUDA? Try to compare PTX’s codes derived from OpenCL and CUDA equivalent kernel and it’ll be clear, whether it is a compiler’s failure.

Topic		Replies	Views
__constant memory issues CUDA Programming and Performance	32	12536	September 15, 2010
Undefined behavior or compiler bug CUDA NVCC Compiler	5	99	December 6, 2024
performance question CUDA Programming and Performance	9	9935	August 4, 2010
Bugs in OpenCL samples? CUDA Programming and Performance	3	3864	October 5, 2009
CL_INVALID_COMMAND_QUEUE, possible Nvidia bug? Writing to a single __local uint variable causes erro CUDA Programming and Performance	2	2582	July 7, 2010
What could be wrong with this kernel? code provided CUDA Programming and Performance	6	7826	February 1, 2010
OpenCL CL_OUT_OF_RESOURCES on event.wait() CUDA Programming and Performance	63	25103	November 19, 2010
strange behavior writing to global memory CUDA Programming and Performance	7	871	January 29, 2020
Why doesn't overlapping data transfers and kernel execution work here? CUDA Programming and Performance	88	292	April 8, 2025
OpenCl very odd crash on compiling CUDA Programming and Performance	3	864	April 18, 2013

ulong kernel parameters not working

Related topics