System reserved shared memory? How can it be possible?

Hi everybody!

I’m having problems with shared memory. I’m trying to multiply many small square matrices. In general here is the forumla for computing ONLY ONE result:

X1 = F(L0 * X0 * R0 + B0)

X2 = F(L1 * X1 * R1 + B1)

Result = F(Ln * Xn * Rn + Bn)

Every ‘L’, ‘X’, ‘R’, ‘B’ matrix dimensions are equal, square and have one of the following sizes: 4x4, 8x8, 16x16, 32x32.

F - is a non linear function that is applied to each element of a matrix

As you can see the computation of ‘Result’ is sequential, but there are many Results to compute.

I see this problem solved by putting as many matrices as possible to shared memory and to multiply them there.

I have come out with the following source:

[codebox]#include <stdio.h>

#include <cutil_inline.h>

#define StateRows 16

#define StateCols 16

#define LayersCount 4

#define SpiciesCount 256

#define SharedMemSize (16384)

#define SharedMemFloats (SharedMemSize / sizeof(float))

#define ThreadsPerBlock 512

#define SpiciesPerBlock (SharedMemSize / (2 * StateRows * StateCols * sizeof(float)))

#define Cast(Object, Type) ((Type)(Object))

typedef char Byte;

typedef struct


float s[SpiciesCount][StateRows][StateCols];

float l[LayersCount][SpiciesCount][StateRows][StateCols];

float r[LayersCount][SpiciesCount][StateRows][StateCols];

float b[LayersCount][SpiciesCount][StateRows][StateCols];

} Spicies;

typedef struct


float a[SpiciesPerBlock][StateRows][StateCols];

float b[SpiciesPerBlock][StateRows][StateCols];

} Multiplication;

global void VecAdd(void* globals)


__shared__ Byte SharedMem[SharedMemSize];

unsigned int localIndex = threadIdx.x;

unsigned int globalIndex = blockIdx.x * (sizeof(float[SpiciesPerBlock][StateRows][StateCols]) / sizeof(float));

while (localIndex < (sizeof(float[SpiciesPerBlock][StateRows][StateCols]) / sizeof(float)))


	Cast(Cast(SharedMem, Multiplication*)->a, float *)[localIndex] = Cast(Cast(globals, Spicies*)->s[globalIndex], float *)[localIndex];

	Cast(Cast(SharedMem, Multiplication*)->b, float *)[localIndex] = Cast(Cast(globals, Spicies*)->l[globalIndex], float *)[localIndex];


	localIndex += blockDim.x;



// for debug purposes

int i = blockDim.x * blockIdx.x + threadIdx.x;

if (i == 0)


	for (i = 0; i < SharedMemFloats; i++)


		Cast(globals, float *)[i] = Cast(SharedMem, float *)[i];




int main(int argc, char** argv)


float* h_data;

float* d_data;

int n = 32768;

size_t size = n * sizeof(float);

h_data = (float*)malloc(size);

// for (int i = 0; i < n; i++) h_data[i] = i;

cutilSafeCall(cudaMalloc((void**)&d_data, size));

cutilSafeCall(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

int threadsPerBlock = 512;

int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_data);

cutilSafeCall(cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost));

for (int i = 0; i < n; i++)


	printf("%f ", h_data[i]);


if (h_data) free(h_data);

if (d_data) cudaFree(d_data);




which says “ptxas error : Entry function ‘_Z6VecAddPv’ uses too much shared data (0x4004 bytes + 0x10 bytes system, 0x4000 max)” it seems like there is some reserved data is being added to shared memory, how to get away with that?

#define SharedMemSize (16384)

__shared__ Byte SharedMem[SharedMemSize];

you use 16K byte shared memory, do you use GT200?

if you use Fermi, then you should compile with sm20

You don’t have a full 16K of shared memory. A few dozen bytes can be used by the system for things like threadIdx or kernel arguments. The exact amount is variable but you can query the attributes at runtime if it’s critical.

Fermi obviously can have up to 48K of shared memory.

Yes I’m trying to use 16k bytes of shared memory, and no, I’m not using Fermi. The problem is when the matrix size is 32x32, we have

32 rows * 32 cols * 4 bytes/float = 4096 bytes/matrix, and since we are multiplying two different matrices this means that we will need 8192 bytes per two matrices, which means that we can do 2 multiplications (in one block), and 16384 bytes of shared memory will be just enough for fitting these two multiplications.

So when the system uses at least one byte of shared memory the second multiplication won’t fit, and this will cause that we won’t be using around 8kb of shared memory.

According to the docs, up to compute 1.3, kernel parameters are passed via shared memory. Fermi should use constant memory for that so it may solve your situation

And if you don’t want to switch to Fermi, you can use tiling techniques. This will cause some memory accesses. But as long as these are not enough to make your kernel memory bound, it likely won’t affect speed much.

In your special case, as the results will immediately be reused (so it makes sense not to move them through global memory) and you are only about 16 or so bytes short per kernel, you can also put just one column or row of the matrix into registers instead of shared memory. This will make your code ugly and more difficult to read, but it would completely avoid reloading values from global memory.