Shared memory and device functions

Hi,

I want to use shared memory to transfer several integer arrays between a kernel function and a device function.

There are 7 arrays, each with 100 integer elements. I defined them as __shared__ int A[100], B[100], …

In my kernel function I coded:

if (threadIdx.x == 0) {
    // load data from the global memory arrays into the shared memory arrays A[], B[], ...
}
__syncthreads();

Then I pass these arrays to a device function device_sub(int *A, int *B, …)

My device function is within a loop.

The problem seems to be that the device function can only be called once. When it is called a second time within the loop, the kernel returns an “out of memory” error.

Does the code keep allocating resources each time the device function is called? Otherwise, it should not exceed the shared memory limit, which is at least 16 KB (or 48 KB) on my Tesla.

Thanks!

Hi,

if (threadIdx.x == 0)

makes me wonder whether you are using a single thread to do all of the copying from global to shared. My apologies if there is a good reason for doing that.

But in case it’s just an oversight, wouldn’t it be better to use 100 threads to copy from global to shared? That would also take advantage of contiguous (coalesced) memory access, which is faster, and I think it would eliminate the need for the function.

e.g. ( A is shared, d_A is global )

  int loc = threadIdx.y*blockDim.x + threadIdx.x;  // linear thread index within the block

  if ( loc < 100 )
  {
    A[loc] = d_A[loc];
    B[loc] = d_B[loc];
    ...
  }

  __syncthreads();

The above assumes you have 100 or more threads per block. If you are using fewer than 100 threads, use just the first 32 threads and have each thread copy 3 (or 4) cells of each array.
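
For the case with fewer than 100 threads, here is a rough sketch of how the copy could look with a block-stride loop, so the same code works for any 1D block size. It also calls a device function repeatedly inside a loop with pointers to the shared arrays. Passing those pointers does not allocate anything by itself: statically declared shared arrays are reserved once per block at launch, and 7 arrays of 100 ints come to only 7 x 100 x 4 = 2800 bytes. Note that the body of device_sub here is a made-up stand-in (the real one is not shown in this thread), only two of the seven arrays are used to keep it short, and demo_kernel, run_demo and the test data are hypothetical names and values chosen just for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 100  // elements per array, as in the question

// Hypothetical stand-in for device_sub: the real body is not shown in the thread.
// It only reads through the pointers it is given, which may point to shared memory.
__device__ int device_sub(const int *A, const int *B, int i)
{
    return A[i] + B[i];
}

__global__ void demo_kernel(const int *d_A, const int *d_B, int *d_out, int iters)
{
    // Statically sized shared arrays: reserved once per block when the kernel launches.
    __shared__ int A[N], B[N];

    // Block-stride copy: each thread copies elements threadIdx.x, threadIdx.x + blockDim.x, ...
    // Neighbouring threads read neighbouring addresses, so the global loads are contiguous.
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        A[i] = d_A[i];
        B[i] = d_B[i];
    }
    __syncthreads();  // every element must be loaded before any thread reads A or B

    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        int acc = 0;
        for (int k = 0; k < iters; ++k)
            acc += device_sub(A, B, i);  // repeated calls with the same shared pointers
        d_out[i] = acc;
    }
}

// Hypothetical host helper: launches one block of `threads` threads and returns the
// number of wrong results, or -1 if any CUDA call reported an error.
int run_demo(int threads, int iters)
{
    int h_A[N], h_B[N], h_out[N];
    for (int i = 0; i < N; ++i) { h_A[i] = i; h_B[i] = 2 * i; }

    int *d_A = nullptr, *d_B = nullptr, *d_out = nullptr;
    const size_t bytes = N * sizeof(int);

    cudaError_t err = cudaMalloc(&d_A, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_B, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        demo_kernel<<<1, threads>>>(d_A, d_B, d_out, iters);
        err = cudaGetLastError();  // reports launch failures
    }
    // This blocking copy waits for the kernel, so execution errors show up here.
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_out);

    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    int bad = 0;
    for (int i = 0; i < N; ++i)
        if (h_out[i] != iters * 3 * i) ++bad;  // A[i] + B[i] == 3 * i for this test data
    return bad;
}
```

Calling run_demo(32, 5) from a main() exercises the "fewer than 100 threads" case, and run_demo(128, 5) the "100 or more" case. Checking cudaGetLastError() right after the launch, as done here, should also help narrow down where an error like “out of memory” actually comes from.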