strange error about shared memory

I wrote a simple function that uses pointers into shared memory, like this:

In main.cu:

#define BLOCK 100
#define THREAD 256

int main(int argc, char **argv)
{
    unsigned int *h_data, *d_data;
    float *h_result;
    int dataSize = THREAD * sizeof(unsigned int);

    h_data = (unsigned int *)malloc(dataSize);
    for (int i = 0; i < THREAD; i++)
    {
        h_data[i] = i + 1;
    }
    h_result = (float *)malloc(THREAD * sizeof(float));

    CUDA_SAFE_CALL( cudaMalloc((void **)&d_data, dataSize) );
    CUDA_SAFE_CALL(cudaMemcpy(h_ResultGPU,d_ResultGPU, dataSize, cudaMemcpyHostToDevice));

    mykernel<<<BLOCK, THREAD>>>(d_data, d_result, BLOCK*THREAD);
    ...........
}

And in mykernel.cu, I use shared memory pointers to operate on the data, following the Programming Guide:

extern __shared__ char array[];

__global__ void mykernel(unsigned int *d_Data, float *d_Result, int dataN)
{
    unsigned int *s_Data = (unsigned int *)array;
    float *s_Result = (float *)&s_Data[THREAD];
    const int tid = threadIdx.x;

    s_Data[tid] = d_Data[tid];
    __syncthreads();

    s_Result[tid] = (float)s_Data[tid] / (float)dataN;
    __syncthreads();

    d_Result[tid] = s_Result[tid];
}

When I run this program, the returned result is very strange; it looks as if the shared memory computation has overflowed, and the printed result is always 0.000000 or a very large number. Then I added the shared memory size to the execution configuration, like this:

mykernel<<<BLOCK, THREAD, dataSize+sizeof(float)*THREAD>>>(d_data, d_result, BLOCK*THREAD);

but the result is still not correct. Also, I moved the

extern __shared__ char array[];

into the kernel function, but the error still exists.

So, can anyone tell me what’s wrong with my program, or how to use shared memory pointers correctly?

thanks for any reply.

I don’t see the code that declares and allocates h_ResultGPU and d_ResultGPU, but if the names imply their location, your pointers are switched in your memcopy:

CUDA_SAFE_CALL(cudaMemcpy(h_ResultGPU,d_ResultGPU, dataSize, cudaMemcpyHostToDevice));

If all your computations are in floats, why do you pass ints? Also, why do you get the s_Result pointer through s_Data? Seems unnecessarily complicated.
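(For what it’s worth, partitioning the extern array with a plain byte offset reads a bit more directly; the names and sizes below just repeat the ones from your own kernel for illustration:)

extern __shared__ char array[];
unsigned int *s_Data   = (unsigned int *)array;
float        *s_Result = (float *)(array + THREAD * sizeof(unsigned int));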

All your thread blocks read and write the same locations, meaning that thread blocks 1 through 99 will read/write exactly the same locations as thread block 0. So, if you’re checking results past the 256th element in the output, you’ll get whatever happened to be in memory before the launch.
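To give each block its own slice of global memory, the global index is normally built from both blockIdx and threadIdx. A minimal sketch of your kernel body with that change (gid is just an illustrative name; it assumes d_Data and d_Result hold BLOCK*THREAD elements):

    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the whole grid

    s_Data[tid] = d_Data[gid];      // each block stages its own slice into shared memory
    __syncthreads();
    d_Result[gid] = (float)s_Data[tid] / (float)dataN;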

Also, the snippet doesn’t show d_result being allocated.

Have you tried something simpler to get used to extern shared memory? For example, simply read data into smem, then write the same data to the output. Minimize complexity when getting comfortable with an unfamiliar feature.
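A minimal sketch of that kind of smoke test (copyThrough, d_in and d_out are placeholder names; it assumes buffers of BLOCK*THREAD unsigned ints):

extern __shared__ unsigned int smem[];

__global__ void copyThrough(const unsigned int *d_in, unsigned int *d_out)
{
    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;

    smem[tid] = d_in[gid];     // stage one element per thread in shared memory
    __syncthreads();
    d_out[gid] = smem[tid];    // write the same element straight back out
}

// launched with the dynamic shared memory size as the third parameter:
copyThrough<<<BLOCK, THREAD, THREAD * sizeof(unsigned int)>>>(d_in, d_out);

If the output comes back identical to the input, the extern declaration and the launch configuration are fine, and the problem is elsewhere.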

Paulius

You’re using an extern shared declaration but not specifying the shared memory size when calling the kernel.
The execution configuration is defined as <<<a,b,c>>>, where a and b define the grid and block configuration and c defines the number of shared memory bytes allocated for the kernel.
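For the two arrays in this kernel, that third parameter has to cover both of them, along these lines (same names and sizes as in your code):

size_t smemBytes = THREAD * sizeof(unsigned int)   // s_Data
                 + THREAD * sizeof(float);         // s_Result
mykernel<<<BLOCK, THREAD, smemBytes>>>(d_data, d_result, BLOCK*THREAD);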

Also, as Paulius noticed, all of your thread blocks perform the very same operations. It doesn’t matter whether you run 1 block or 1000; the result in device memory will be the same. While the index into shared memory typically depends only on threadIdx, the index into global memory (source and result) usually depends on both blockIdx and threadIdx.

I have tried specifying the shared memory size, but the result is still wrong. What I want to do is this: there is a common array d_data[THREAD]; I copy it into shared memory and use it to compute a common divide result, stored in s_Result[THREAD]. These two arrays are shared by all threads in a block. I then use the two result arrays to operate on the global data. In other words, I first compute a common data table, and then all threads in a block use this table to process the global data. The sample code is an abbreviated version of my real code. I have been puzzled by this problem for many days and cannot make any progress.

Do you still have the problem after fixing the pointers in cudaMemcpy? The first argument is the destination.

Paulius