Shared Memory Again: What is happening here?

I have written a small program:

__global__ void resident_evil(float *d_A, float *d_B)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float smemA[BLOCK_SIZE];
    smemA[idx] = d_A[idx]; // This copy is deliberately done so to produce the behaviour below

    smemA[idx] += 10;
    d_B[idx] = smemA[idx];
}


and called it with

resident_evil <<< 5,192 >>> (dev_A, dev_B);

and the answer is


The question is: “How did the underlined parts come up?” Why does it not stop at position 191? You guys must have some insight?

btw, 5*192 is big, no?

How much are you copying out?

Possibly, stale values in global memory populated by previous invocations…

Most likely, successive launches of the same CUDA application allocate the same memory addresses on the card (cudaMalloc), when other CUDA apps are not running…

It’s always a good idea to “memset” host buffers, “cudaMemset” device buffers, and copy correct sizes.
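As a sketch of that advice (assuming the variable names `inpt`, `rslt_B`, `dev_A`, `dev_B`, and `mem_size` from the full code later in the thread; `cudaMemset` and `memset` are the standard calls):

```cuda
// Zero host buffers before use, so stale heap contents can't masquerade as results.
memset(inpt, 0, mem_size);
memset(rslt_B, 0, mem_size);

// Zero device buffers: cudaMalloc does NOT initialize memory, so a fresh
// allocation can contain values left over from a previous run or application.
CUDA_SAFE_CALL(cudaMemset(dev_A, 0, mem_size));
CUDA_SAFE_CALL(cudaMemset(dev_B, 0, mem_size));
```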

Surely the result of that kernel is going to depend completely on the initial values in d_A. You seem to be working on the assumption that it is zeroed. Are you sure about that?

Yup, the number is not big… just for finding out something.
The copying is also wrong, as mentioned in the comment in the code (done deliberately).
Stale values are there, as you said.
smemA[idx] = d_A[idx]; // This copy is deliberately done so to produce the behaviour below
This is the statement for this behaviour…
But why this behaviour…

Oops, did I forget to mention? The input is 0, 1, 2, 3, 4… up to the limit.


int id = threadIdx.x;

smemA[id] = d_A[idx];

It will work as expected.

smemA[idx] = d_A[idx];

What does the compiler do in the latter case? I don’t know if you all got my questions right. :o
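For what it’s worth, a sketch of the corrected kernel as implied by the thread: each block gets its own copy of `smemA`, so the shared array must be indexed with the per-block `threadIdx.x`, while the global arrays keep the grid-wide `idx` (names taken from the posts above; the comments are additions):

```cuda
#define BLOCK_SIZE 192

__global__ void resident_evil(float *d_A, float *d_B)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // grid-wide: 0..959 for <<<5,192>>>
    int tid = threadIdx.x;                           // per-block:  0..191

    __shared__ float smemA[BLOCK_SIZE];              // each block owns its own 192 floats

    smemA[tid] = d_A[idx];   // shared memory indexed with the per-block tid
    __syncthreads();
    smemA[tid] += 10;
    d_B[idx] = smemA[tid];   // global arrays keep the grid-wide idx
}
```

The compiler does nothing special in either case; with `smemA[idx]` the generated code simply reads and writes past the 192-float shared allocation, and what lands there is not specified.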

How about you post the complete code of a minimal case that reproduces your problem, with your expected result and the actual result the code produces.

[codebox]#define BLOCK_SIZE 192
#define BLOCK_NUM 5
#define DATA_SIZE ((BLOCK_SIZE)*(10)) // note: 10 blocks' worth of data, but only BLOCK_NUM (5) blocks are launched

__global__ void trfm_floFFT(float *d_A, float *d_B)
{
      int idx = blockIdx.x*blockDim.x + threadIdx.x;

      __shared__ float smemA[BLOCK_SIZE];

      smemA[idx] = d_A[idx];
      __syncthreads();
      smemA[idx] += 10;
      d_B[idx] = smemA[idx];
}

int main() {
 size_t mem_size = (DATA_SIZE * sizeof(float));
 printf("mem_size=%d\n", (int) mem_size);

 float *inpt;
 if ((inpt = (float*) malloc(mem_size)) == NULL) {
      printf("inpt: Malloc Failed\n");
      return 1;
 }

 printf("The INPUT array is\n");
 for (int i = 0; i < DATA_SIZE; i++) {
      inpt[i] = (float) i;
      printf("inpt[%d]=%f\n", i, inpt[i]);
 }

 // allocate device memory
 float* dev_A;
 CUDA_SAFE_CALL(cudaMalloc((void**) &dev_A, mem_size));
 CUDA_SAFE_CALL(cudaMemcpy(dev_A, inpt, mem_size, cudaMemcpyHostToDevice));

 float* dev_B;
 CUDA_SAFE_CALL(cudaMalloc((void**) &dev_B, mem_size));

 // Space for result
 float *rslt_B;
 if ((rslt_B = (float*) malloc(mem_size)) == NULL) {
      printf("rslt_B: Unable to Allocate Memory\n");
      return 1;
 }

 // setup execution parameters
 dim3 threads(BLOCK_SIZE, 1, 1);
 dim3 grid(BLOCK_NUM, 1, 1);

 trfm_floFFT <<< grid, threads >>> (dev_A, dev_B);

 CUDA_SAFE_CALL(cudaMemcpy(rslt_B, dev_B, mem_size, cudaMemcpyDeviceToHost));

 printf("The Transformed Result:\n");
 for (int i = 0; i < DATA_SIZE; i++) {
      printf("rslt[%d]=%f\n", i, rslt_B[i]);
 }
 printf(" \n");

 // clean up memory
 free(inpt);
 free(rslt_B);
 cudaFree(dev_A);
 cudaFree(dev_B);

 return 0;
}[/codebox]

Here is the complete code. I was looking for a way to see resident threads.

Get the input into an array, and the input is incremented by 10; that’s it, nothing else.

You are indexing shared memory with HUGE indices that are out of bounds

Yes, that’s correct. My worry is that in the result from 192 to 201 the value does not come out incremented by 10; thereafter it does. I also know that it is because of the copy statement. Why and how does it come out like that, any idea?
Yeah, the indexing is wrong; I just wanted to know the behaviour of nvcc. nvcc can’t detect this.
OK, thanks Sarnath, avidday, got it.

There is no way a C compiler can detect runtime out of bounds like that. If there was, there wouldn’t be such a thing as buffer overflows or runtime segmentation faults.

What I got is the following…
The minimum smem that nvcc reserves is 16 bytes (for a simple variable), and 20 if the parameter is a pointer, when there is 1 parameter. Thereafter it increases by 4 per parameter.
In the code posted above, the 10 extra nonsensical values got reduced by one each time the parameter list grew by one.
This is an observation without a clue,

Forget about it. It is not documented what happens when you use “smem” indices outside what is allowed. Don’t contemplate it; it is a waste of time.