Should out of bounds shared memory accesses be errors?

So in the process of running the CUDA SDK applications through my GPU emulator, I run into cases where programs declare shared memory arrays of a certain size with:

__global__ void random()
{
    extern __shared__ int memory[];
}

random<<<blocks, threads, size>>>();

And then end up writing more than size bytes to the array. Right now, my emulator treats these accesses as errors (the GPU equivalent of a segfault) even if they fall within the shared memory size of the emulated GPU, because they were not explicitly allocated. Is this the correct behavior, or is the shared memory size parameter given when launching a kernel more or less meaningless?
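Concretely, the pattern looks something like this (the kernel and sizes below are made up for illustration): the launch reserves fewer bytes of dynamic shared memory than the threads end up writing.

__global__ void fill()
{
    extern __shared__ int memory[];
    memory[threadIdx.x] = threadIdx.x;   // nothing stops an index past the reserved size
}

// Reserves room for 32 ints, but 64 threads each write one int:
fill<<<1, 64, 32 * sizeof(int)>>>();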

What do people think about this? Current GPUs silently allow it, but it seems to me like it should be an error…

It’s likely best treated the way you are treating it: as a segfault. By design, the shared memory a block uses is abstracted as one chunk, and you’re not allowed to go outside of it.

This allows the GPU to schedule multiple blocks on one multiprocessor and map their shared memory any way it wishes.

Perhaps CUDA doesn’t give immediate fatal errors just because it’d be expensive to detect them. But any write outside the block’s allocated shared memory is clearly a program bug and shouldn’t be ignored by an emulator.

In fact, that would be one of the biggest uses of an emulator: catching errors like an out-of-bounds write that the hardware might not immediately flag. One frequent problem with the current CUDA emulator is that it does no checking of memory space or range at all, so programs that accidentally mix device, shared, or even host pointers still work, when they’ll die hard on the device. (A very frequent phrase here on the forums is “this code works in the emulator, but…”) Emulators are most useful when they’re as strict and correct as possible.
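To give a made-up but concrete example of the pointer mix-up: the code below runs happily in device emulation, where the kernel is just host code, but on the device it dereferences a host address.

#include <cuda_runtime.h>

__global__ void scale(float* ptr, float f)
{
    ptr[threadIdx.x] *= f;           // in emulation this is an ordinary host access
}

int main()
{
    float data[256] = {0};           // host array, never cudaMalloc'd or copied
    scale<<<1, 256>>>(data, 2.0f);   // "works" in the emulator, dies hard on the device
    cudaThreadSynchronize();
    return 0;
}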

Thanks, that was more or less my reaction. However, some of my group members also brought up the point that “bugs” that exist in production hardware tend over time to evolve into “features”, because a significant portion of developers begin to rely on them, and fixing the bug becomes something akin to breaking backwards compatibility.

In this case, our regression test suite consists of the CUDA SDK and three other libraries. About 5 SDK examples have this problem, as do a few of our internal libraries. One of our goals for this project was to be able to run the entire set of existing CUDA applications on our emulator, which forces us either to fix the bugs in the SDK or to drop support for those examples in the final release…

When compiled with --deviceemu, writing past the end of the allocated shared memory like that usually results in a segfault; failing that, valgrind catches the problem. On the device it may silently continue running, but now you’ve got some blocks writing into the shared memory space of others. This is definitely a program bug and will lead to undefined program execution. I would take the SDK programs that have the error and submit bug reports to the forum.

The fact that your emulator detects this is a good thing. If you do go forward with the release of your emulator, I look forward to using it to test HOOMD. This particular issue of writing past the end of allocated shared memory has cost me hours of debugging in the past (e.g., forgetting to specify the shared memory amount, or needing an extra *sizeof(float4)).
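For what it’s worth, the mistake usually looks something like this (kernel and variable names are invented here, not HOOMD’s actual code): the dynamic shared memory amount is written in elements rather than bytes.

__global__ void reduceForces(float4* out)
{
    extern __shared__ float4 sdata[];
    sdata[threadIdx.x] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    // ... reduction over sdata ...
    if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
}

dim3 grid(128), block(256);
float4* d_out;                                                   // assume cudaMalloc'd elsewhere
reduceForces<<<grid, block, block.x>>>(d_out);                   // wrong: one byte per thread
reduceForces<<<grid, block, block.x * sizeof(float4)>>>(d_out);  // right: one float4 per thread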

@Greg,

Note that shared memory is allocated in units of 512 bytes… Check the CUDA occupancy calculator for the exact details.

So any GPU hardware check for out-of-bounds shared accesses might only care about a crossing of a 512-byte boundary…
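If that granularity is right (treat the 512 as an assumption and verify it against the occupancy calculator for your hardware), small overruns can land inside the rounded-up allocation and never trip a hardware check. A quick sketch of the rounding:

// Round a requested dynamic shared memory size up to the assumed 512-byte unit.
size_t roundUpSharedMem(size_t requestedBytes)
{
    const size_t granularity = 512;
    return ((requestedBytes + granularity - 1) / granularity) * granularity;
}

// e.g. a request for 1000 bytes occupies 1024 on the multiprocessor, so writes to
// bytes 1000..1023 stay inside the hardware allocation even though they are past
// what the kernel launch asked for.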

BUT what you do in your emulator is the RIGHT thing to do.

Thanks for the responses. We will definitely leave this checking feature in and fix our regression tests and the SDK examples on our side. Also, we will probably end up with a similar check for accesses to global memory.
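For anyone curious what the check amounts to, here is a rough sketch (illustrative only, not our actual emulator code): every shared memory access is validated against the extent the block explicitly reserved at launch, regardless of how much physical shared memory the emulated GPU has.

#include <cstddef>

struct SharedAllocation
{
    size_t base;     // start of the block's dynamic shared memory region
    size_t extent;   // bytes reserved by the third launch parameter
};

// Returns true only if the access stays entirely inside the reserved region.
bool sharedAccessOk(const SharedAllocation& a, size_t address, size_t bytes)
{
    return address >= a.base && address + bytes <= a.base + a.extent;
}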