Memory - where does it go? Help me understand why this doesn't work...

sonicx · May 7, 2010, 2:14pm

Hello!
I'm writing a bunch of tools using OpenCL, and I'm having the same problem with all of my kernels: random-ass out-of-resources errors!
My card has 128 MB of allocatable memory. The buffers I use are usually pretty small, the datasets pretty large, so I run a kernel many times, each time with a little buffer data, some constants and some privates.
There seems to be a critical combination of buffer size (not even close to 128 MB, though) and the number of iterations done in a single kernel. It's not so much that it doesn't work as I expect; it's more that I don't understand it.
Take a look at this test kernel:

__kernel void test()
{
    float count = 10000000;
    while (count--) { continue; }
    return;
}
A count of 10M works, a count of 100M crashes. I would have expected that even count = MAXFLOAT would be possible.

It doesn't even use any buffers, but increasing the iteration depth will at some point result in "out of resources". If I enable the unrolling extension, the behaviour changes somewhat, but at some point I still get out-of-resources. The more buffers I use and the deeper the iteration, the sooner the error appears.
I want to be able to make sure that this doesn't happen, preferably by knowing why it happens and setting my depths and buffer sizes according to limits I can calculate from my GPU's stats.
What I imagine the problem to be right now is something like this: iterations get unrolled at some point, and if an iteration is too deep, the memory it needs grows too large.

Somebody please enlighten me about what's going on here... How can I predict this behaviour?

195.36.15 on Quadro FX 1700

Lev · May 7, 2010, 4:37pm

You are probably not calling the kernel properly.

laughingrice · May 7, 2010, 7:36pm

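It's because of the way floats work: a single-precision float has a 24-bit significand (23 bits stored), so around 1e8 neighbouring floats are 8 apart and subtracting 1 rounds straight back to the same value. The counter just never goes to 0. To show you what happens, here is the output from MATLAB (same under CUDA):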

single(100000000) - single(1)

ans = 100000000

single(100000000) - single(2)

ans = 100000000

single(100000000) - single(3)

ans = 100000000

single(100000000) - single(4)

ans = 100000000

single(100000000) - single(5)

ans = 99999992
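
The same rounding can be reproduced outside MATLAB. Here is a minimal C sketch, assuming IEEE 754 single precision with the default round-to-nearest mode; the helper names `sub_single` and `print_table` are made up for illustration:

```c
#include <stdio.h>

/* Subtract two single-precision values and force the result to be
 * rounded to float. The volatile store keeps the compiler from
 * carrying extra precision in a wider register.
 * Hypothetical helper, for illustration only. */
float sub_single(float a, float b)
{
    volatile float r = a - b;
    return r;
}

/* Print 1e8 - k for k = 1..5, mirroring the MATLAB session above. */
void print_table(void)
{
    for (int k = 1; k <= 5; k++) {
        printf("1e8 - %d = %.0f\n", k, sub_single(100000000.0f, (float)k));
    }
}
```

Calling `print_table()` from a `main` should show the first four results stuck at 100000000 and only the fifth dropping to 99999992, because the nearest floats to 1e8 are 8 apart.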

If you want to make sure that you reach 0 you should use integers. With floats what you should do is

__kernel void test()
{
    float count = 10000000;
    while (count-- > 0) { continue; }
    return;
}

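AND make sure that you are in the relevant size range. A float carries only about 7 significant decimal digits (every integer up to 2^24 = 16777216 is exact), so the two operands of an addition or subtraction should differ by less than roughly 7 orders of magnitude: 10M = 1e7 is OK, 100M = 1e8 is too much.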
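
Both pieces of advice can be checked on the host before launching anything. What follows is a rough sketch in plain C (not OpenCL C), again assuming IEEE 754 floats with round-to-nearest; `float_countdown_stalls` and `count_down_int` are hypothetical names introduced here, not part of any API:

```c
#include <stdbool.h>

/* Returns true if a `while (count-- > 0)` loop starting at `start`
 * would stop making progress, i.e. count - 1 rounds back to count. */
bool float_countdown_stalls(float start)
{
    volatile float count = start;
    while (count > 0.0f) {
        volatile float next = count - 1.0f;
        if (next == count) {
            return true;   /* the decrement was lost to rounding */
        }
        count = next;
    }
    return false;          /* reached 0, so the loop would terminate */
}

/* Integer counter: every decrement is exact, so this always reaches 0.
 * Returns the number of iterations performed. */
unsigned long count_down_int(unsigned long count)
{
    unsigned long iterations = 0;
    while (count-- > 0) {
        iterations++;
    }
    return iterations;
}
```

Under these assumptions, `float_countdown_stalls(100000000.0f)` reports a stall on the very first decrement, while a start value of `10000000.0f` counts all the way down to zero. The integer version has no such limit short of the counter type's range.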

http://en.wikipedia.org/wiki/Floating_point

_Big_Mac · May 8, 2010, 3:14pm

BTW, I don't know how well the OpenCL compiler optimizes, but under CUDA this kernel would end up empty: the loop never writes anything beyond local scope, so dead-code elimination removes it.

sonicx · June 14, 2010, 1:55pm

Just to round this one off: my problem was the runtime limit, which is active on my NVIDIA GPU when a graphical UI is running on it alongside my OpenCL code. Shut down X and all is well! You can "predict" that by checking your device for CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV (queried through clGetDeviceInfo, from the cl_nv_device_attribute_query extension).