Should out of bounds shared memory accesses be errors?

So in the process of running the CUDA SDK applications through my GPU emulator, I run into cases where programs declare shared memory arrays of a certain size with:

__global__ void random()
{
    extern __shared__ int memory[];
}

random<<<blocks, threads, size>>>();

And then end up writing more than size bytes to the array. Right now, my emulator treats these accesses as errors (the GPU equivalent of a segfault) even if they fall within the shared memory size of the emulated GPU, because they were not explicitly allocated. Is this the correct behavior, or is the shared memory size parameter given when launching a kernel more or less meaningless?
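Concretely, the pattern looks something like this (the kernel and sizes below are made up for illustration): the launch reserves fewer bytes of dynamic shared memory than the threads end up writing.

__global__ void fill()
{
    extern __shared__ int memory[];
    memory[threadIdx.x] = threadIdx.x;   // nothing stops an index past the reserved size
}

// Reserves room for 32 ints, but 64 threads each write one int:
fill<<<1, 64, 32 * sizeof(int)>>>();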

What do people think about this? Current GPUs silently allow it, but it seems to me like it should be an error…

It’s likely best treated the way you are treating it: as a segfault. By design, the shared memory a block uses is abstracted as one chunk, and you’re not allowed to go outside of it.

This allows the GPU to schedule multiple blocks on one multiprocessor and map their shared memory any way it wishes.

Perhaps CUDA doesn’t give immediate fatal errors just because it’d be expensive to detect them. But any write outside the block’s allocated shared memory is clearly a program bug and shouldn’t be ignored by an emulator.

In fact, that would be one of the biggest uses of an emulator: catching errors like an out-of-bounds write that the hardware might not immediately flag. One frequent problem with the current CUDA emulator is that it does no checking of memory space or range at all, so programs that accidentally mix device, shared, or even host pointers still work, when they’ll die hard on the device. (A very frequent phrase here on the forums is “this code works in the emulator, but…”) Emulators are most useful when they’re as strict and correct as possible.
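To give a made-up but concrete example of the pointer mix-up: the code below runs happily in device emulation, where the kernel is just host code, but on the device it dereferences a host address.

#include <cuda_runtime.h>

__global__ void scale(float* ptr, float f)
{
    ptr[threadIdx.x] *= f;           // in emulation this is an ordinary host access
}

int main()
{
    float data[256] = {0};           // host array, never cudaMalloc'd or copied
    scale<<<1, 256>>>(data, 2.0f);   // "works" in the emulator, dies hard on the device
    cudaThreadSynchronize();
    return 0;
}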

Thanks, that was more or less my reaction. However, some of my group members also brought up the point that “bugs” that exist in production hardware tend over time to evolve into “features”, because a significant portion of developers begin to rely on them, and fixing the bug becomes something akin to breaking backwards compatibility.

In this case, our regression test suite consists of the CUDA SDK and three other libraries. About 5 SDK examples have this problem, as do a few of our internal libraries. One of our goals for this project was to be able to run the entire set of existing CUDA applications on our emulator, which forces us either to fix the bugs in the SDK or to drop support for those examples in the final release…

When compiled with --deviceemu, writing past the end of the allocated shared memory like that usually results in a segfault; failing that, valgrind catches the problem. On the device it may silently continue running, but now you’ve got some blocks writing into the shared memory space of others. This is definitely a program bug and will lead to undefined program execution. I would take the SDK programs that have the error and submit bug reports to the forum.

The fact that your emulator detects this is a good thing. If you do go forward with the release of your emulator, I look forward to using it to test HOOMD. This particular issue of writing past the end of allocated shared memory has cost me hours of debugging in the past (e.g., forgetting to specify the shared memory amount, or needing an extra *sizeof(float4)).
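For what it’s worth, the mistake usually looks something like this (kernel and variable names are invented here, not HOOMD’s actual code): the dynamic shared memory amount is written in elements rather than bytes.

__global__ void reduceForces(float4* out)
{
    extern __shared__ float4 sdata[];
    sdata[threadIdx.x] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    // ... reduction over sdata ...
    if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
}

dim3 grid(128), block(256);
float4* d_out;                                                   // assume cudaMalloc'd elsewhere
reduceForces<<<grid, block, block.x>>>(d_out);                   // wrong: one byte per thread
reduceForces<<<grid, block, block.x * sizeof(float4)>>>(d_out);  // right: one float4 per thread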

@Greg,

Note that shared memory is allocated in units of 512 bytes… Check the CUDA occupancy calculator for the exact details.

So any GPU hardware check for out-of-bounds shared accesses might only care about a crossing of a 512-byte boundary…
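If that granularity is right (treat the 512 as an assumption and verify it against the occupancy calculator for your hardware), small overruns can land inside the rounded-up allocation and never trip a hardware check. A quick sketch of the rounding:

// Round a requested dynamic shared memory size up to the assumed 512-byte unit.
size_t roundUpSharedMem(size_t requestedBytes)
{
    const size_t granularity = 512;
    return ((requestedBytes + granularity - 1) / granularity) * granularity;
}

// e.g. a request for 1000 bytes occupies 1024 on the multiprocessor, so writes to
// bytes 1000..1023 stay inside the hardware allocation even though they are past
// what the kernel launch asked for.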

BUT what you do in your emulator is the RIGHT thing to do.

Thanks for the responses. We will definitely leave this checking feature in and fix our regression tests and the SDK examples on our side. Also, we will probably end up with a similar check for accesses to global memory.
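For anyone curious what the check amounts to, here is a rough sketch (illustrative only, not our actual emulator code): every shared memory access is validated against the extent the block explicitly reserved at launch, regardless of how much physical shared memory the emulated GPU has.

#include <cstddef>

struct SharedAllocation
{
    size_t base;     // start of the block's dynamic shared memory region
    size_t extent;   // bytes reserved by the third launch parameter
};

// Returns true only if the access stays entirely inside the reserved region.
bool sharedAccessOk(const SharedAllocation& a, size_t address, size_t bytes)
{
    return address >= a.base && address + bytes <= a.base + a.extent;
}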