allocatable size of shared memory

eelsen · February 20, 2007, 10:43pm

I’ve found that it is fairly unclear exactly how large a portion of the shared memory one can allocate. Some of my programs seem to get away with up to 3970 floats, while others fail with significantly fewer. (Note that the failure is silent in release mode, and the emulation modes work completely fine.) Only in debug mode does the very helpful error, “unknown error”, occur. I’m nearly 100% certain it is due to trying to allocate too much shared memory.

Is this because the registers for each fragment also come out of the shared memory? If so, would the more registers the kernel needs (and/or the more threads it uses) mean less shared memory is available?

Would it be possible to give us some way of knowing how much shared memory we can allocate?
Or give us a more useful error message than “unknown error” when a kernel fails for this reason?
Or make the emulation modes do some sort of check and warn you when this failure would probably happen when running on the device?

Thanks,
Erich

PS sorry about the multiple posts, the forums were acting a little wacky. Feel free to delete the first two.
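On the question of how to find out how much shared memory can be allocated, here is a minimal sketch, assuming a current CUDA toolkit (these runtime calls may not exist in the release discussed in this thread). It queries the per-block shared memory limit of the device, reads back how much static shared memory a compiled kernel uses, and separates launch errors from execution errors. `demoKernel` and the three helper functions are hypothetical names used only for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to inspect.
__global__ void demoKernel(float *out)
{
    __shared__ float buf[1024];                 // 4096 bytes of static shared memory
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

// Per-block shared memory limit of `device` in bytes, or 0 if the query fails.
size_t sharedMemPerBlock(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}

// Static shared memory used by demoKernel in bytes, or 0 if the query fails.
size_t staticSharedBytes()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, demoKernel) != cudaSuccess) return 0;
    return attr.sharedSizeBytes;
}

// Launches demoKernel on one block and returns the first error seen.
cudaError_t launchChecked(float *d_out, int threads)
{
    assert(threads > 0 && threads <= 1024);     // buf holds 1024 floats
    demoKernel<<<1, threads>>>(d_out);
    cudaError_t err = cudaGetLastError();       // errors from the launch itself
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();             // errors raised while the kernel ran
}
```

Comparing `staticSharedBytes()` plus any dynamic allocation against `sharedMemPerBlock(0)` before launching gives an early warning, and passing the returned `cudaError_t` to `cudaGetErrorString` yields a readable message. The `cudaFuncAttributes` struct also reports `numRegs`, the per-thread register count of the kernel.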

eelsen · February 20, 2007, 11:12pm

As a follow-up, the maximum size appears to depend on more than just the actual requested size. For example, if I request 1024 float2s, that works, but if I ask for only 1000, that breaks. Is this expected? Can you explain this?

I don’t have a simple example I can throw at you right now (it’s integrated into a larger app), but I can break it out if necessary.

Thanks,
Erich

Mark_Harris · February 21, 2007, 11:40am

A simple example would be great, or use the NVIDIA bug report if you are a registered developer. For some extra info on shared memory allocation (possibly not helpful for this issue), see my post on this thread:

http://forums.nvidia.com/index.php?showtop…ndpost&p=162304

Thanks,

Mark

eelsen · February 21, 2007, 7:05pm

I submitted a bug report under the corporate account for the company I’m working for.

But here is the simple kernel, just in case anyone can see what might be going on:

#define SIZE 1536

__global__ void
pass1Kernel(...)
{
  // shared memory
  __shared__ float2 sdata[SIZE];

  // access thread id
  const unsigned int tid = threadIdx.x;
  // access number of threads in this block
  const unsigned int num_threads = blockDim.x;
  const unsigned int bid         = blockIdx.x;

  float2 temp[8];

  int cp = bid * num_threads + tid; // calculate absolute position

  // fill up shared memory
  for (int i = 0; i < (int)(SIZE / num_threads); i++) {
    sdata[i * num_threads + tid] = g_LUT[LUT_section * SIZE + i * num_threads + tid];
  }

  int i = (SIZE % num_threads);

  // in case num_threads doesn't evenly divide SIZE
  if (tid < i) {
    // CODE FAILS HERE!
    sdata[SIZE - tid - 1] = g_LUT[(LUT_section + 1) * SIZE - tid - 1];
    // code fails on the above line when it is executed (SIZE % num_threads != 0)
  }

  __syncthreads();

  // Do some work which has been mostly removed in this test case
  ....
}

When SIZE is an exact multiple of the number of threads, so that the condition (tid < i) is false, there is no problem. However, when the branch is executed (say, num_threads = 512 and SIZE = 1535), the code exits with an “unknown error”. Emulation modes work fine.
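For comparison, the two-step fill in the kernel above (whole passes first, then the remainder) can also be written as a single block-stride loop that needs no special case when the block size does not divide SIZE. This is only a sketch of an equivalent fill pattern, not a diagnosis of the failure. `fillKernel`, `runFill`, and the `g_out` parameter are hypothetical names, and the reversed write-back exists only so the result can be checked on the host.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define SIZE 1535   // deliberately not a multiple of the block size

// Each thread copies elements tid, tid + blockDim.x, tid + 2*blockDim.x, ...
// so the same loop also covers the remainder.
__global__ void fillKernel(const float2 *g_LUT, int LUT_section, float2 *g_out)
{
    __shared__ float2 sdata[SIZE];              // 1535 * 8 = 12280 bytes

    for (int k = threadIdx.x; k < SIZE; k += blockDim.x)
        sdata[k] = g_LUT[LUT_section * SIZE + k];
    __syncthreads();

    // Write the tile back reversed so every thread reads data that
    // other threads loaded into shared memory.
    for (int k = threadIdx.x; k < SIZE; k += blockDim.x)
        g_out[k] = sdata[SIZE - 1 - k];
}

// Hypothetical host helper: runs one block over one LUT section.
// Error checking is omitted to keep the sketch short.
std::vector<float2> runFill(const std::vector<float2> &lut, int section, int threads)
{
    assert(threads > 0 && lut.size() >= (size_t)(section + 1) * SIZE);
    float2 *d_lut = nullptr, *d_out = nullptr;
    std::vector<float2> out(SIZE);
    cudaMalloc(&d_lut, lut.size() * sizeof(float2));
    cudaMalloc(&d_out, SIZE * sizeof(float2));
    cudaMemcpy(d_lut, lut.data(), lut.size() * sizeof(float2), cudaMemcpyHostToDevice);
    fillKernel<<<1, threads>>>(d_lut, section, d_out);
    cudaMemcpy(out.data(), d_out, SIZE * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_lut);
    cudaFree(d_out);
    return out;
}
```

Calling `runFill(lut, 1, 512)` exercises the case described above, 512 threads with SIZE = 1535, where the last pass of the loop is only partially populated.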

jhanweck · March 13, 2007, 5:46pm

Erich,

I’ve encountered a similar problem with code of the following nature:

typedef struct _node
{
  float a;
  float b;
  float dummy;  // for odd alignment to prevent bank conflicts?
} NODE;

extern __shared__ NODE nodes[];

__global__ void kernel(NODE *g_nodes)
{
  ....
  nodes[tid].a = some_function(some_other_parameters);
  nodes[tid].b = another_function(more_parameters);
  ....
  g_nodes[tid] = nodes[tid];
}

Works fine in the emulator, but fails with “unknown error” on the card.

I know that it’s not caused by running out of shared memory, because if I restructure my code to use the same amount of memory, but access shared memory as an equivalent array of floats, it works:

extern __shared__ float shmem[];

#define nodes_a(i) shmem[i]
#define nodes_b(i) shmem[nElem + i]

__global__ void kernel(NODE *g_nodes)
{
  ....
  nodes_a(tid) = some_function(some_other_parameters);
  nodes_b(tid) = another_function(more_parameters);
  ....
  g_nodes[tid].a = nodes_a(tid);
  g_nodes[tid].b = nodes_b(tid);
}

I suspect maybe bank conflicts or a problem handling arrays of structures in shared memory – haven’t got through the PTX file yet… Any other ideas?
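To make the two layouts above easier to compare side by side, here is a self-contained sketch that runs both the array-of-structures and the flat-float version with dynamic shared memory. It is not a diagnosis of the failure. It assumes a current CUDA toolkit, replaces the elided `some_function` and `another_function` calls with simple placeholder arithmetic, and sizes the third launch parameter to what each layout needs. `kernelAoS`, `kernelSoA`, and `runNodes` are hypothetical names.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

struct NODE { float a; float b; float dummy; };    // 12-byte layout, as in the post

// Array-of-structures layout: one NODE per thread in dynamic shared memory.
__global__ void kernelAoS(NODE *g_nodes)
{
    extern __shared__ NODE nodes[];
    const unsigned int tid = threadIdx.x;
    nodes[tid].a = 2.0f * tid;                     // placeholder for some_function(...)
    nodes[tid].b = 1.0f + tid;                     // placeholder for another_function(...)
    nodes[tid].dummy = 0.0f;
    __syncthreads();
    g_nodes[tid] = nodes[tid];
}

// Flat-float layout: all a values first, then all b values.
__global__ void kernelSoA(NODE *g_nodes, int nElem)
{
    extern __shared__ float shmem[];
    const unsigned int tid = threadIdx.x;
    shmem[tid]         = 2.0f * tid;
    shmem[nElem + tid] = 1.0f + tid;
    __syncthreads();
    g_nodes[tid].a = shmem[tid];
    g_nodes[tid].b = shmem[nElem + tid];
    g_nodes[tid].dummy = 0.0f;
}

// Hypothetical host helper: runs one block with the chosen layout.
// Error checking is omitted to keep the sketch short.
std::vector<NODE> runNodes(int nThreads, bool useSoA)
{
    assert(nThreads > 0 && nThreads <= 1024);
    NODE *d_nodes = nullptr;
    std::vector<NODE> out(nThreads);
    cudaMalloc(&d_nodes, nThreads * sizeof(NODE));
    if (useSoA) {
        size_t bytes = 2 * nThreads * sizeof(float);   // two float arrays
        kernelSoA<<<1, nThreads, bytes>>>(d_nodes, nThreads);
    } else {
        size_t bytes = nThreads * sizeof(NODE);        // one NODE per thread
        kernelAoS<<<1, nThreads, bytes>>>(d_nodes);
    }
    cudaMemcpy(out.data(), d_nodes, nThreads * sizeof(NODE), cudaMemcpyDeviceToHost);
    cudaFree(d_nodes);
    return out;
}
```

Running `runNodes(256, false)` and `runNodes(256, true)` and comparing the results element by element is one way to check whether the two layouts behave the same on a given card and toolkit.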