Shared Memory Again What is happening Here

dlmeetei · June 4, 2009, 10:06am

I have written a small program:

__global__ void resident_evil(float *d_A, float *d_B)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float smemA[BLOCK_SIZE];

    smemA[idx] = d_A[idx]; // This Copy is deliberately done so to produce below behaviour
    __syncthreads();

    smemA[idx] += 10;
    d_B[idx] = smemA[idx];
}

It is called with
resident_evil <<< 5,192 >>> (dev_A, dev_B);
and the output is

rslt[187]=197.000000
rslt[188]=198.000000
rslt[189]=199.000000
rslt[190]=200.000000
rslt[191]=201.000000
rslt[192]=202.000000
rslt[193]=203.000000
rslt[194]=204.000000
rslt[195]=205.000000
rslt[196]=206.000000
rslt[197]=207.000000
rslt[198]=208.000000
rslt[199]=209.000000
rslt[200]=210.000000
rslt[201]=211.000000
rslt[202]=10.000000
rslt[203]=10.000000
The question is: how did the values at positions 192 to 201 come about? Why does the pattern not stop at position 191? You guys must have some insight.

Sarnath · June 4, 2009, 10:34am

By the way, 5*192 is big, no?

How much are you copying out?

Possibly these are stale values in global memory, populated by previous invocations…

Most likely, successive launches of the same CUDA application get the same memory addresses on the card from cudaMalloc (when no other CUDA apps are running)…

It’s always a good idea to “memset” host buffers, “cudaMemset” device buffers, and copy the correct sizes.
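
The advice above could look roughly like the following. This is only a minimal sketch: `alloc_zeroed_device` and `copy_back` are hypothetical helper names, not something from the posted program. The device buffer is zeroed right after allocation so that stale values from an earlier run cannot show up in the output, and the copy back transfers exactly the number of elements asked for.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not from the thread): allocate n floats on the
// device and zero them. Returns NULL on failure.
float *alloc_zeroed_device(size_t n)
{
    float *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, n * sizeof(float)) != cudaSuccess)
        return NULL;
    if (cudaMemset(d_buf, 0, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_buf);
        return NULL;
    }
    return d_buf;
}

// Hypothetical helper: allocate a zeroed host buffer and copy exactly
// n floats back from the device. Returns NULL on failure.
float *copy_back(const float *d_buf, size_t n)
{
    float *h_buf = (float *)malloc(n * sizeof(float));
    if (h_buf == NULL)
        return NULL;
    memset(h_buf, 0, n * sizeof(float));  // no stale host values either
    if (cudaMemcpy(h_buf, d_buf, n * sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        free(h_buf);
        return NULL;
    }
    return h_buf;
}
```

With both buffers cleared, any element the kernel never writes reads back as 0 rather than as a leftover from a previous launch.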

avidday · June 4, 2009, 10:42am

Surely the result of that kernel is going to depend completely on the initial values in d_A. You seem to be working on the assumption that it is zeroed. Are you sure about that?

dlmeetei · June 4, 2009, 10:44am

Yup, the number is not big… it is just for finding something out.
The copying is also wrong, as mentioned in the comment in the code (done deliberately).
Stale values are there, as you said.
smemA[idx] = d_A[idx]; // This Copy is deliberately done so to produce below behaviour
This is the statement responsible for this behaviour…
But why this behaviour…
Regards
Dev

dlmeetei · June 4, 2009, 10:52am

Oops, did I forget to mention? The input is 0, 1, 2, 3, 4… up to the limit.

If I use

int id = threadIdx.x;

smemA[id] = d_A[idx];

it works as expected. But with

smemA[idx] = d_A[idx];

what does the compiler do in the latter case? I don’t know if you all understood my question correctly. :o
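
The working variant described above can be written out in full as follows. This is a sketch, not the poster's exact code: the kernel name `add_ten`, the element count `n`, and the host wrapper `run_add_ten` are assumptions added for illustration. The point is that the shared array is indexed with the block-local `threadIdx.x`, which always stays below `BLOCK_SIZE`, while global memory is indexed with the grid-wide `idx`.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 192

// Each block has its own smemA, so it is indexed with the block-local
// thread index, never with the grid-wide idx.
__global__ void add_ten(const float *d_A, float *d_B, int n)
{
    __shared__ float smemA[BLOCK_SIZE];

    int tid = threadIdx.x;                            // 0 .. BLOCK_SIZE-1
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // position in global memory

    if (idx < n)
        smemA[tid] = d_A[idx];
    __syncthreads();  // kept outside the branch so every thread reaches it

    if (idx < n) {
        smemA[tid] += 10.0f;
        d_B[idx] = smemA[tid];
    }
}

// Hypothetical host wrapper: launches enough blocks of BLOCK_SIZE threads
// to cover n elements and returns the launch/execution status.
cudaError_t run_add_ten(const float *d_A, float *d_B, int n)
{
    if (n <= 0)
        return cudaSuccess;  // nothing to do; a zero-block launch is invalid
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    add_ten<<<blocks, BLOCK_SIZE>>>(d_A, d_B, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;
    return cudaDeviceSynchronize();
}
```

With the input 0, 1, 2, … from this thread, every output element `i` below `n` comes out as `i + 10`, whichever block handled it.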

avidday · June 4, 2009, 11:08am

How about you post the complete code of a minimal case that reproduces your problem, with your expected result and the actual result the code produces.

dlmeetei · June 4, 2009, 11:28am

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>  // CUDA SDK utility header that provides CUDA_SAFE_CALL

#define BLOCK_SIZE 192
#define BLOCK_NUM 5
#define DATA_SIZE ((BLOCK_SIZE)*(10))

__global__ void trfm_floFFT(float *d_A, float *d_B)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ float smemA[BLOCK_SIZE];

    // Deliberately indexed with the global idx, which is out of bounds
    // for every block after the first; this is the behaviour in question
    smemA[idx] = d_A[idx];
    __syncthreads();

    smemA[idx] += 10;
    d_B[idx] = smemA[idx];
}

int main() {
    size_t mem_size = (DATA_SIZE * sizeof (float));
    printf("mem_size=%zu\n", mem_size);

    // Input
    float *inpt;
    if ((inpt = (float*) malloc(mem_size)) == NULL) {
        printf("inpt: Malloc Failed\n");
        return 1;
    }

    printf("The INPUT array is\n");
    for (int i = 0; i < DATA_SIZE; i++) {
        inpt[i] = (float) i;
        printf("inpt[%d]=%f\n", i, inpt[i]);
    }

    // allocate device memory
    float* dev_A;
    CUDA_SAFE_CALL(cudaMalloc((void**) & dev_A, mem_size));
    CUDA_SAFE_CALL(cudaMemcpy(dev_A, inpt, mem_size, cudaMemcpyHostToDevice));

    float* dev_B;
    CUDA_SAFE_CALL(cudaMalloc((void**) & dev_B, mem_size));

    // Space for result
    float *rslt_B;
    if ((rslt_B = (float*) malloc(mem_size)) == NULL) {
        printf("rslt_B: Unable to Allocate Memory\n");
        return 1;
    }

    // setup execution parameters
    dim3 threads(BLOCK_SIZE,1,1);
    dim3 grid(BLOCK_NUM,1,1);

    trfm_floFFT <<< grid, threads >>> (dev_A, dev_B);

    CUDA_SAFE_CALL(cudaMemcpy(rslt_B, dev_B, mem_size, cudaMemcpyDeviceToHost));

    printf("The Transformed Result:\n");
    for (int i = 0; i < DATA_SIZE; i++) {
        printf("rslt[%d]=%f\n", i, rslt_B[i]);
    }
    printf(" \n");

    // clean up memory
    CUDA_SAFE_CALL(cudaFree(dev_A));
    CUDA_SAFE_CALL(cudaFree(dev_B));
    free(inpt);
    free(rslt_B);
    return 0;
}
[/codebox]

Here is the complete code. I was looking for a way to see resident threads.

It takes the input in an array and increments each element by 10; that’s it, nothing else.

Sarnath · June 4, 2009, 11:32am

You are indexing shared memory with HUGE indices that are out of bounds
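
To put numbers on that: with the launch `<<< 5,192 >>>` and a shared array of `BLOCK_SIZE` (192) elements, `idx` runs from 0 to 959, but only 0 to 191 are valid shared memory indices. The host-only sketch below simply repeats the kernel's index arithmetic on the CPU and counts how many threads fall outside the array. The function name `count_out_of_bounds` is made up for illustration.

```cuda
// Host-side arithmetic only; no kernel is launched here.
// Hypothetical helper: counts the threads whose grid-wide index idx would
// land outside a shared array of smem_len elements.
int count_out_of_bounds(int num_blocks, int block_dim, int smem_len)
{
    int bad = 0;
    for (int block = 0; block < num_blocks; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int idx = block * block_dim + thread;  // same formula as in the kernel
            if (idx >= smem_len)
                bad++;
        }
    }
    return bad;
}
```

For the launch in this thread, `count_out_of_bounds(5, 192, 192)` gives 768: every thread in blocks 1 to 4 indexes past the end of its block's array, and only block 0 stays in bounds.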

dlmeetei · June 4, 2009, 11:50am

Yes, that’s correct. My worry is that in the result from 192 to 201 the value does not come out as 10; after that it does. I also know that it is because of the copy statement. Any idea why and how this happens?
Yeah, the indexing is wrong; I just wanted to know the behaviour of nvcc. nvcc can’t detect this.
OK, thanks Sarnath and Avidday, got it.

avidday · June 4, 2009, 12:03pm

There is no way a C compiler can detect runtime out-of-bounds accesses like that. If there were, there wouldn’t be such a thing as buffer overflows or runtime segmentation faults.

dlmeetei · June 5, 2009, 6:03am

What I found is the following…
The minimum smem that nvcc reserves is 16 bytes (for a simple variable) and 20 bytes if it is a pointer, when there is one parameter. Thereafter it increases by 4 for each parameter.
In the code posted above, the 10 extra nonsensical values were reduced by 1 each time you increased the parameter list by 1.
This is an observation without an explanation.

Sarnath · June 5, 2009, 7:05am

Forget about it. What happens when you use “smem” indices outside the allowed range is not documented. Don’t dwell on it; it’s a waste of time.
