Shared memory coherence problem within a thread

Hello there,

I am having a problem that I cannot fix so far. I am trying to use shared memory as local memory per thread, so that each thread in a block has its own “private memory” within shared memory. To do that, I dynamically allocate as much shared memory as all the threads in a block need, and then compute an index so that each thread accesses a different segment of shared memory.

extern __shared__ volatile unsigned int shared_k1[];

unsigned int sharedindex = tid * elemSharedPerThread;

if (index < total_number) {
    volatile unsigned int *l = &shared_k1[0];
    unsigned int cellShared, bitShared;

    for (unsigned int i = 0; i < elemSharedPerThread; i++) {
        l[sharedindex + i] = 0;
    }

    unsigned int c   = d[index];
    unsigned int beg = c;
    d[bindex] = beg;

    cellShared = beg / 32;
    bitShared  = beg % 32;

    volatile unsigned int aux = (unsigned int)MACROBITS >> bitShared;
    l[sharedindex + cellShared] = aux;

The problem is that l does not hold the value aux at the end of the computation: only some threads store it, and the rest remain at 0. In fact, if I remove the volatile qualifier from the variable aux, the same thing happens to that variable as well. But even declaring it volatile, the result never reaches shared memory.

I have also tried using __syncthreads(), but it does not help, because I don't actually want to share values among the threads in a block; each thread only uses its own values.

I guess it is a compiler problem. I am using the latest release, 3.2, V0.2.1221.

Any ideas?

Cheers

Ok Solved!

It wasn't that anything above is wrong. I was simply calling the kernel without setting the shared-memory size parameter in the kernel launch. These are the consequences of using older versions :).

What I cannot really understand now is why the compiler did not complain about this: I was using shared memory declared as an extern variable, yet the launch never set that parameter. I thought previous compiler versions complained about this.

I've also noticed that with a very small thread block (16 threads or so), if you use shared memory and exceed the 16 KB available on the T10, the application still runs, just much more slowly. For larger block sizes, however, the compiler does complain about it.

I was wondering whether this is related to the flexibility on Fermi to trade off L1 cache against shared memory, since the driver is actually the same on both platforms.

Cheers
