Setting bits in shared memory

I am trying to set specific bits in CUDA but seem to hit an issue with the following code:

__shared__ bool table[8192];

__device__ void set_bit(unsigned int bit) {
    unsigned offset = bit >> 3;
    unsigned mask = 1 << (bit & 7);
    table[offset & 13] |= mask;
}

Is there something wrong here, or is there an easier way to set bits in shared memory?


You are assuming that bool is actually an unsigned char (because on the CPU it usually is). I haven’t checked that on the G80 hardware, but I would give

__shared__ char table[8192];

a try to be sure (because if the compiler chooses an int for bool, you are out of shared memory).
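
For what it's worth, here is a minimal sketch of the byte-table version suggested above. It assumes the `offset & 13` in the question was meant to wrap the index to the 8192-entry table (mask 8191, i.e. the low 13 bits); that reading is a guess and was not confirmed in the thread. The names `HD`, `TABLE_BYTES`, `bit_offset`, `bit_mask`, and `example_kernel` are made up for illustration. The index helpers also compile as plain C++ so they can be checked on the host; the `__shared__` part is only built by nvcc.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

enum { TABLE_BYTES = 8192 };  // table size taken from the question

// Byte index for a bit number, wrapped to the table size
// (assumed intent of "& 13" in the question).
HD inline unsigned bit_offset(unsigned bit) {
    return (bit >> 3) & (TABLE_BYTES - 1);
}

// Mask selecting the bit inside its byte.
HD inline unsigned char bit_mask(unsigned bit) {
    return (unsigned char)(1u << (bit & 7));
}

#ifdef __CUDACC__
// Plain read-modify-write, not atomic: two threads that update the same
// byte at the same time can lose one of the updates.
__device__ void set_bit(unsigned char* table, unsigned bit) {
    table[bit_offset(bit)] |= bit_mask(bit);
}

// Hypothetical kernel showing the declaration with an explicit byte type.
__global__ void example_kernel(const unsigned* bits, unsigned n) {
    __shared__ unsigned char table[TABLE_BYTES];
    // Clear the table cooperatively before anything is set.
    for (unsigned i = threadIdx.x; i < TABLE_BYTES; i += blockDim.x)
        table[i] = 0;
    __syncthreads();
    // A single writer per block, so there is no race on the bytes.
    if (threadIdx.x == 0)
        for (unsigned i = 0; i < n; ++i)
            set_bit(table, bits[i]);
    __syncthreads();
}
#endif
```

With an explicit `unsigned char` element type, the table is 8192 bytes regardless of how the compiler represents bool.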


Thanks for the suggestion. The sizeof(bool) is 1. I am actually starting to think that the problem is not this code but some odd case of running out of resources (even though there is no compiler warning). Without the call to this routine it works, and under emudebug it works, but with the call my system hangs. Adding some simple for loops to my code also hangs my system.

I found out the hard way in my own code that the compiler uses shared memory for things other than what you declare. Since you are already declaring an 8kB array, the problem is likely that you are going past the 8kB limit. Try declaring an array of half that size and see what happens.
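
One way to shrink the declared array, sketched under the assumption that the table only has to hold one flag per bit number: pack the flags into 32-bit words and size the shared array at launch time. If only 8192 flags are needed (an assumption, the thread does not state the range of `bit`), that comes to 1024 bytes instead of 8192. The names `words_for_bits`, `bytes_for_bits`, and `flags_kernel` are hypothetical.

```cpp
#include <cstddef>

// HD expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Number of 32-bit words needed to hold n_bits one-bit flags (rounded up).
HD inline std::size_t words_for_bits(std::size_t n_bits) {
    return (n_bits + 31) / 32;
}

// Shared memory footprint, in bytes, of such a table.
HD inline std::size_t bytes_for_bits(std::size_t n_bits) {
    return words_for_bits(n_bits) * sizeof(unsigned);
}

#ifdef __CUDACC__
// Hypothetical kernel: the table size is passed as the third launch
// parameter, e.g.
//   flags_kernel<<<grid, block, bytes_for_bits(n_bits)>>>(n_bits);
__global__ void flags_kernel(unsigned n_bits) {
    extern __shared__ unsigned words[];
    // Clear the table cooperatively before any thread sets a bit.
    for (unsigned i = threadIdx.x; i < words_for_bits(n_bits); i += blockDim.x)
        words[i] = 0u;
    __syncthreads();
    // ... set and test bits in words[] here ...
}
#endif
```

Sizing the array at launch makes it easy to try smaller tables without recompiling, which is a quick way to test whether shared memory is really the limit being hit.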

As for the ‘for loop’ issue: is it possible that the kernel takes longer than 5 seconds to run? If the GeForce card that you are running your code on is also attached to the monitor, then a kernel that takes longer than 5 seconds to run will cause the computer to lock up. This happens in both Linux and Windows. One solution is to connect the monitor to a different video card and use the GeForce exclusively for CUDA. Another is to break your calculation into multiple smaller pieces and call the kernel multiple times from the host, caching any intermediate results in global memory. This should also help with your shared memory problem, because each call could use less shared memory and thus be less likely to run into the 8k maximum.
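
The "multiple smaller pieces" idea could look roughly like the sketch below. Everything here is illustrative: `chunk_count`, `chunk_len`, `process_chunk`, and `run_in_chunks` are made-up names, the doubling in the kernel is placeholder work, and the block size of 256 is an arbitrary choice. The chunk arithmetic compiles as plain C++; the launch loop is only built by nvcc.

```cpp
#include <cstddef>

// Number of chunks needed to cover total items (chunk must be > 0).
inline std::size_t chunk_count(std::size_t total, std::size_t chunk) {
    return (total + chunk - 1) / chunk;
}

// Length of chunk number i; the last chunk may be shorter, and indices
// past the end give 0.
inline std::size_t chunk_len(std::size_t total, std::size_t chunk, std::size_t i) {
    std::size_t begin = i * chunk;
    if (begin >= total) return 0;
    std::size_t remaining = total - begin;
    return remaining < chunk ? remaining : chunk;
}

#ifdef __CUDACC__
// Placeholder kernel: works on len items starting at begin. Intermediate
// results stay in global memory (data) between launches.
__global__ void process_chunk(float* data, std::size_t begin, std::size_t len) {
    std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
    if (i < len) data[begin + i] *= 2.0f;
}

// Launch one short kernel per chunk instead of one long kernel.
cudaError_t run_in_chunks(float* d_data, std::size_t total, std::size_t chunk) {
    for (std::size_t c = 0; c < chunk_count(total, chunk); ++c) {
        std::size_t len = chunk_len(total, chunk, c);
        unsigned blocks = (unsigned)((len + 255) / 256);
        process_chunk<<<blocks, 256>>>(d_data, c * chunk, len);
        // Catch launch errors first, then wait for the chunk to finish.
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
#endif
```

Checking the error code after every launch also helps tell a real resource failure apart from a timeout.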

Hope this helps.

Actually, what I did was reduce the number of variables in the device functions (which are called two levels down). That seems to do it (but there is no more room left). What is funny is that if you split the for loop into two for loops (each half the size), the machine hangs too, and if you make the n in the for loop greater than 1, which just adds one call to the code, it hangs. I used shared memory because of speed considerations, but this stops me from making multiple kernels. I suspect caching the results in global memory is just going to kill my speed. The compute times are very small (as in milliseconds), so I don’t think I came anywhere close to the 5 second limit, and adding a second GeForce card for display does not seem easy: I think I would have to remove the 8800, install the new card, and then start from scratch.

Thanks for the help.