Shared memory exact usable size: 16KB less 256?

Somewhere, unfortunately I cannot remember exactly where, there was a claim that the top nnn bytes of shared memory are used by CUDA and cannot be used by my kernel. I remember nnn as 256, but that was a while ago and I have switched to a newer version of CUDA since. Does this ring a bell with anyone? I am using 16KB less 768 bytes and odd things are happening in the top 2KB.
Any help or suggestions would be most welcome.
Bill

I believe it depends on the number of local variables your kernel needs as well.
So if your kernel would need 20 registers but only has 16, then 4 variables (maybe more) get mapped into shared memory, per thread.
The number of parameters you pass to the kernel also counts.
However, all of that should be taken into account when the GPU decides how many blocks it can run per MP, i.e. 1, 2 or 3 (or even none).

Are you reading beyond the end of an array or variable?

Regards,
kbam

Register spills go to local memory (i.e., off-chip), not shared memory.

If I remember the numbers correctly off the top of my head, shared memory is 16KB - 16 bytes (16368 bytes) on 1.x devices, minus the space needed for kernel parameters.

I agree though that the problem looks more like an out-of-bounds access, as the kernel should simply refuse to run if the available resources are insufficient. The one exception would be dynamically allocated shared memory with an incorrect amount of memory specified at kernel launch. Do you have extern shared declarations in your kernel?
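
For reference, here is a minimal sketch (not part of the original thread; device index 0 is just an example) of how one could query the shared memory actually available per block at runtime with the CUDA runtime API, rather than relying on a remembered constant:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 used as an example
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // sharedMemPerBlock is the usable shared memory per block in bytes;
    // on 1.x devices, space for kernel parameters still comes out of this amount.
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}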

Dear kbam and tera,

Thank you for your helpful replies.

So it looks like having 768 bytes free is more than enough.

Yip, I agree out-of-bounds array access may be the problem.

I will double check this.

Yip, I am using

extern __shared__ unsigned int shared_array[];

and

kernel<<<grid_size, block_size, shared_size>>>.

This was recommended by earlier CUDA documentation.

Is it no longer the preferred approach?

Many thanks

Bill

No, that’s fine, as long as you check that shared_size is large enough so that no out-of-bounds accesses occur.
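
To illustrate that point, here is a hedged sketch, not Bill's actual code: only the names kernel, grid_size, block_size and shared_size are taken from the post, and the kernel body and sizes are made up. It shows the pattern under discussion, extern __shared__ plus the third launch parameter, with shared_size derived from block_size so that no thread indexes past the end of the dynamically allocated array:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(unsigned int *out)
{
    // Size of shared_array is set at launch via the third <<<>>> parameter.
    extern __shared__ unsigned int shared_array[];

    unsigned int i = threadIdx.x;
    shared_array[i] = i * i;              // each thread touches exactly one element
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = shared_array[i];
}

int main()
{
    const int grid_size  = 4;
    const int block_size = 256;
    // One unsigned int per thread, so indices 0..block_size-1 stay in bounds.
    size_t shared_size = block_size * sizeof(unsigned int);

    unsigned int *d_out;
    cudaMalloc(&d_out, grid_size * block_size * sizeof(unsigned int));

    kernel<<<grid_size, block_size, shared_size>>>(d_out);
    if (cudaGetLastError() != cudaSuccess)
        fprintf(stderr, "kernel launch failed\n");
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}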
