Strange Behavior of shared memory

Hi all,

I am using shared memory to hold two different data types: one is int and the other is a user-defined type (say MyType) of size 32 bytes. I first load about 5 KB of int data into shared memory; this data is not modified afterwards. The MyType data is loaded in a loop, and the number of elements loaded per iteration is not fixed but is always less than 256 (the number of threads in a block). When I write all the values in shared memory back to the CPU, I find that up to loop count 12 all the values are correct, but after that they no longer match; mostly they are zero or random. My GPU is a GeForce GT 335M with 72 cores and 16 KB of shared memory per SM. I am not sure why this is happening. If someone can help me with this I would be really grateful.

The following is a rough version of my kernel. I launch it with blockDim(16,16,1). Please help me with this, I am stuck because of it.

__global__ void testKernel32Bytes(int *type1, MyType *type2)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ int shar_type1[36*36];
    __shared__ MyType shar_type2[256];

    int threadNum = threadIdx.x + threadIdx.y * blockDim.x;

    for (int i = 0; i < 6; i++) // each thread loads 6 ints into shared memory
    {
        if (threadNum < 6*36)
        {
            shar_type1[6*threadNum + i] = type1[threadNum + i]; // load some data from global memory into shared memory
        }
    }
    __syncthreads();

    for (int i = 0; i < 22; i++)
    {
        if (threadNum < cnt[i])
        {
            shar_type2[threadNum] = type2[threadNum];
        }
        __syncthreads();

        // I am writing back all the values of shared memory back after this step.
    }
}
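As a quick sanity check on the numbers in the question, the static shared memory requested by the two arrays can be added up on the host. This is only a sketch: it takes the 32 bytes quoted in the post as the size of MyType, assumes a 4-byte int, and the names kSharedPerSM and sharedBytesUsed are made up for the example. On compute capability 1.x devices, which I believe includes this GPU, kernel arguments are also passed through shared memory, so the real figure is slightly higher.

```cpp
#include <cstddef>

// Sizes taken from the post; all names here are hypothetical and
// exist only for this estimate.
constexpr std::size_t kSharedPerSM = 16 * 1024; // 16 KB per SM, as stated
constexpr std::size_t kIntElems    = 36 * 36;   // elements in shar_type1
constexpr std::size_t kMyTypeElems = 256;       // elements in shar_type2
constexpr std::size_t kMyTypeBytes = 32;        // size quoted in the post

// Static shared memory requested by the two arrays, in bytes.
constexpr std::size_t sharedBytesUsed()
{
    return kIntElems * sizeof(int) + kMyTypeElems * kMyTypeBytes;
}

// With a 4-byte int this is 5184 + 8192 = 13376 bytes.
static_assert(sharedBytesUsed() <= kSharedPerSM,
              "the two arrays exceed the shared memory of one SM");
```

So the arrays by themselves should fit in 16 KB, which suggests the wrong values are not simply a matter of running out of shared memory.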

Someone please help, I am stuck at this point. I can't figure out why this is happening.

Hi, mohit30 –

I see your code, and on the surface it looks OK, but this line confuses me:

if ( threadNum <cnt[i])

Where is cnt[i] coming from? Is this a constant variable? Global? What are its contents?

Could you also post the code that writes back the shared memory?

Ben

Dear BenW,

Thanks for replying :). Everything used in the code that is not declared in the function is in constant memory. cnt is just an array that tells me how many elements of type2 I have to load from global memory into shared memory. I use cnt[i] threads to load cnt[i] elements from global to shared memory. The maximum value of cnt[i] is 220, so the corruption should not be caused by that. I am resending the code with the write-back, but the write-back is only there to test that I load everything correctly. For each call I check the output for a single cnt only, to see if the loaded values are correct (so I used if i == 0 for the first load). What currently happens is that up to i == 12 all data is correct, i.e. both shar_type1 and shar_type2 match what is in global memory, but for i == 13 everything is incorrect. I am not sure why this is happening, so I am looking for some help here.

__global__ void testKernel32Bytes(int *type1, MyType *type2, MyType *out)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ int shar_type1[36*36];
    __shared__ MyType shar_type2[256];

    int threadNum = threadIdx.x + threadIdx.y * blockDim.x;

    for (int i = 0; i < 6; i++) // each thread loads 6 ints into shared memory
    {
        if (threadNum < 6*36)
        {
            shar_type1[6*threadNum + i] = type1[threadNum + i]; // load some data from global memory into shared memory
        }
    }
    __syncthreads();

    for (int i = 0; i < 22; i++)
    {
        if (threadNum < cnt[i])
        {
            shar_type2[threadNum] = type2[threadNum];
        }
        __syncthreads();

        // I am writing back all the values of shared memory back after this step.
        if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0) // write the values back only for 1st block
        {
            if (i == 0)
            {
                for (int i = 0; i < cnt[i]; i++) // writing values back to global memory
                {
                    out[i] = shar_type2[i];
                }
            }
        }
    }
}

Hello,

Could it be because the datatype created by me is of the following type:

typedef struct {
    char x;
    char y;
    char w;
    char h;
    char a;
} MyRect;

typedef struct {
    MyRect rect[3];
    float t1;
    float t2;
    float t3;
} MyType;

Here the whole datatype is less than 32 bytes. But could those char values be the reason for what is happening?

How does the hardware copy data from global memory to shared memory? If it is byte by byte, then I think it shouldn't create any problems, but if it is 4 bytes at a time then I am not sure how it will copy this datatype (i.e. MyType) from global to shared memory. Someone please let me know if you have any idea about this.
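One way to look at the layout question is to let the compiler report sizes and offsets. The sketch below is plain host code; it assumes the usual rules (char is 1 byte with 1-byte alignment, float is 4 bytes with 4-byte alignment), and the helper function names are made up for the example. It does not by itself explain the wrong values, it only shows where the padding goes.

```cpp
#include <cstddef>

// Same members as in the post; MyRect is defined first so MyType can use it.
typedef struct {
    char x, y, w, h, a;
} MyRect;

typedef struct {
    MyRect rect[3];
    float t1;
    float t2;
    float t3;
} MyType;

// Hypothetical helpers that expose the layout chosen by the compiler.
inline std::size_t rectBytes()    { return sizeof(MyRect); }       // size of one MyRect
inline std::size_t typeBytes()    { return sizeof(MyType); }       // size of one MyType
inline std::size_t firstFloatAt() { return offsetof(MyType, t1); } // offset of t1
inline std::size_t typeAlign()    { return alignof(MyType); }      // alignment of MyType
```

With a typical compiler these come out as 5, 28, 16 and 4: the three rects occupy 15 bytes, one padding byte follows, and the floats start at offset 16. The rects themselves start at offsets 0, 5 and 10, so they are not 4-byte aligned; only the floats are.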

I concur with BenW's question: what is the content of cnt, element by element, and what is your intention in using it? The loop condition in for (int i = 0; i < cnt[i]; i++) looks highly suspicious and is almost certainly wrong.

Note that you have two different variables named i, and the inner one shadows the outer one. I guess what you wanted to write is

// I am writing back all the values of shared memory back after this step.
if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0) // write the values back only for 1st block
{
    if (i == 0)
    {
        for (int j = 0; j < cnt[i]; j++) // writing values back to global memory
        {
            out[j] = shar_type2[j];
        }
    }
}

which would be the same as

// I am writing back all the values of shared memory back after this step.
if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0) // write the values back only for 1st block
{
    if (i == 0)
    {
        for (int i = 0; i < cnt[0]; i++) // writing values back to global memory
        {
            out[i] = shar_type2[i];
        }
    }
}
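To make the shadowing concrete, here is a small host-side sketch that counts how many elements each form of the loop would copy. The function names are hypothetical, and the extra bound on n is an addition for the demo so that it never reads past the end of cnt; the posted loop has no such guard.

```cpp
#include <cstddef>

// Mimics the posted write-back loop: the inner i shadows the outer i,
// so cnt[i] in the condition is indexed by the inner counter.
// The "i < n" guard is an addition for this demo only.
inline int elementsCopiedShadowed(const int *cnt, int n)
{
    int copied = 0;
    for (int i = 0; i < 1; i++) {                   // outer i, only the i == 0 pass
        for (int i = 0; i < n && i < cnt[i]; i++) { // inner i hides the outer i
            copied++;
        }
    }
    return copied;
}

// Intended version: a distinct counter j, with the bound fixed at cnt[outer].
inline int elementsCopiedIntended(const int *cnt, int outer)
{
    int copied = 0;
    for (int j = 0; j < cnt[outer]; j++) {
        copied++;
    }
    return copied;
}
```

With cnt = {2, 1, 5} the shadowed loop copies 1 element while the intended one copies 2 for the first entry. If every cnt[i] is larger than i, the shadowed loop is only stopped by the guard; without it the loop would keep reading past the end of cnt.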

Dear Tera,
thanks for your reply.

  1. cnt is an array that stores how many elements of type2 I have to read from global memory into shared memory. I have an array of data (type2) which is huge and hence cannot be loaded into shared memory all at once. So I load only the set of elements needed at a particular time, use them, and once they are used I load the next set of elements required for further processing. cnt is an array of, say, 22 elements. For each index, cnt gives the number of elements of type2 that I need to load from global memory into shared memory. For example:
    cnt = {20,50,70,90…,215} means I will first load only 20 elements of MyType into shared memory and use them, then load the next 50 elements into shared memory, and so on.
    I hope this clarifies the use of cnt.

  2. Regarding the write-back, it is just for testing whether the values loaded each time are correct. The code as written checks the first cnt[0] = 20 elements. To check the next segment I just use cnt[1], which gives me the next 50 elements. So the loop you are talking about only checks the values of one segment of data. I am not checking the whole data at once, so I write back only the values I am interested in.

Hope this makes things clearer.
Regards
Mohit
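The chunked loading described in point 1 can be sketched on the host as a running sum: if the chunks are stored one after another in type2, chunk i starts at cnt[0] + ... + cnt[i-1]. That back-to-back layout is an assumption, and chunkStart is a hypothetical helper, not something from the posted kernel. Note that the kernel as posted reads type2[threadNum] in every iteration, so as written it would fetch the same leading elements each time rather than the next chunk.

```cpp
#include <cstddef>

// Start index of chunk `chunk` inside type2, assuming the chunks are
// stored back to back (an assumption, not stated in the thread).
inline std::size_t chunkStart(const int *cnt, std::size_t chunk)
{
    std::size_t start = 0;
    for (std::size_t k = 0; k < chunk; ++k) {
        start += static_cast<std::size_t>(cnt[k]); // skip all earlier chunks
    }
    return start;
}
```

For cnt = {20, 50, 70, 90} the chunks would start at 0, 20, 70 and 140, so under this assumption the load for chunk i would read type2[chunkStart(cnt, i) + threadNum].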

This is what I expected you were trying to achieve. Compare your code to the examples I gave to see why it is not what is actually happening.

Dear tera,

Thanks for pointing that out, but it is not really the issue. The two indexes are actually different: by mistake I used the same index here, but in my actual kernel they are different. Sorry for the confusion. Originally I didn't post the write-back code, as it was just for testing; later I added it but forgot to change the index.

I have resolved the issue, but I don't know the reason for it. The MyType array I used had char members in it. When I replaced all chars with ints, the problem was resolved. I still don't know the reason; most likely it has to do with reading char values from global into shared memory.

Now I have removed all chars and all data is correct. If anyone can explain the reason for it I would be grateful.

Thanks a lot to everyone for replying to my questions.
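For what it is worth, the size of the char-free variant can be checked on the host. The sketch below assumes every char member simply became a 4-byte int; MyRectInt, MyTypeInt and intVariantBytes are hypothetical names, since the changed structs were not posted. It does not explain why the char version misbehaved; it only shows how the element size and the shared memory footprint change.

```cpp
#include <cstddef>

// Hypothetical int-only versions of the posted structs.
struct MyRectInt { int x, y, w, h, a; };
struct MyTypeInt {
    MyRectInt rect[3];
    float t1, t2, t3;
};

// Bytes needed to hold `count` elements of MyTypeInt.
inline std::size_t intVariantBytes(std::size_t count)
{
    return count * sizeof(MyTypeInt);
}
```

With 4-byte int and float, sizeof(MyTypeInt) is 72, so 256 elements would need 18432 bytes, which is more than the 16 KB of shared memory per SM mentioned in the first post. Under these assumptions the shared array would need fewer elements than before.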
