__syncthreads() limitation.. Help please..

All,

I have this kernel of mine that runs a big for loop like this:

for (index = function(threadIdx.x, blockIdx.x);
     index < function2(threadIdx.x, blockIdx.x);
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        cacheDataInSharedMemory(/* from global mem */);

        for (anotherBigLoop2)
        {
            // operate on shared memory
        }
    }
}

This code obviously cannot work, because “cacheDataInSharedMemory” requires a __syncthreads(), and some threads may already have exited the outermost FOR loop (and hence the kernel) by then.

However, this code runs for minutes at a stretch, so caching the data in shared memory is of utmost importance to me.

Why doesn't __syncthreads() require only LIVE threads to participate? (This is my understanding of how it should work.)

Why does it also require DEAD threads to participate, and hence hang?

Can someone suggest a better approach?

A possible approach is to run the FOR loop safely with shared memory and then run a small FOR loop for the remainder without using shared memory… but this one does not appeal to me as it involves code duplication and so on…
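Roughly, that split would look like this (just a sketch – “blockMin”, “start”, “stride” and “myMax” are made-up placeholders standing in for the function()/function2()/function3() results):

int safeMax = blockMin;   // hypothetical block-wide minimum of every thread's loop bound

// Part 1: every thread in the block is still alive here, so the
// __syncthreads() inside the caching step is safe.
for (index = start; index < safeMax; index += stride)
{
    cacheDataInSharedMemory(/* from global mem */);   // contains __syncthreads()
    // operate on shared memory
}

// Part 2: the remainder, where some threads have already finished.
// No __syncthreads() allowed here, so work on global memory directly.
for (; index < myMax; index += stride)
{
    // operate on global memory (the duplicated logic I mentioned)
}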

Appreciate any help

Best Regards,

Sarnath

If the biggest iteration count of the outermost loop within a block is known at coding time, maybe you could change the loop bound to that constant and use an “if” to mask out the caching part and the innermost loop, leaving only the __syncthreads(). I think this can avoid code duplication.

Thanks for the answer. That would mean that I need to use a “ptr” that points to either shared memory OR global memory, depending on the FOR loop iteration.

However, PTX code cannot handle such a thing. I need to explicitly use one pointer for global memory AND another one for shared memory – which means code duplication.
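i.e. something like the following (an illustrative snippet – “useShared”, “sharedBuf” and “globalBuf” are made-up names), which the compiler cannot turn into a single load on this hardware, because the address space of a pointer must be known at compile time:

__shared__ float sharedBuf[256];
float *globalBuf = ...;                          // points into global memory
float *p = useShared ? sharedBuf : globalBuf;    // which address space does p live in?
float x = p[i];                                  // no generic load on sm_1x – hence the duplication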

My even greater concern is: why is the hardware NOT smart enough to detect this situation? Maybe because it would have to maintain per-block state information, which is costly… But somehow I think there must be another way to do this…

It would be good if they have this in their next gen hardware

Tim, any comments?

I don't understand why you need a “ptr” to point to either shared memory or global memory. The threads that were supposed to exit the outermost loop early will now keep on running, doing nothing except waiting for synchronization; the threads that were supposed to exit late can still use shared mem for caching as usual. There is no need to use global mem directly.

sheepy means something like this:

int myMax = function2(threadIdx.x, blockIdx.x);
int warpMax = warpReduction(myMax);

for (index = function(threadIdx.x, blockIdx.x); index < warpMax;
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        if (index < myMax)
        {
            cacheDataInSharedMemory(/* from global mem */);

            for (anotherBigLoop2)
            {
                // operate on shared memory
            }
        }
        else
        {
            __syncthreads();
        }
    }
}

I agree, though, that syncthreads should not wait on exited threads.

Another suggestion is for there to be a 2nd-level syncthreads (or 3rd-level). It would look something like this:

for ()
{
    if ()
        break;
    __syncthreads(1);
}

__syncthreads(2);

The way it would work: if a thread exits the for() and hits __syncthreads(2), it would block for another __syncthreads(2), but automatically pass on any __syncthreads(1)s.

Sheepy,

Initially I did NOT understand what you said… but that night, during a disturbed sleep, it suddenly occurred to me what you were referring to. Thanks.

@ Alex,

Thanks for your explanation. The second method looks fishy… I will look at it in detail again and get back to you!

Thank you guys!

Best Regards,
Sarnath

This would still HANG! You can't sync on __syncthreads() in two different places. __syncthreads() has a locational dependency (my understanding): all threads have to wait on the same __syncthreads(). The same goes for case II as well. However, if you move “cacheDataInSharedMemory” outside the “index < myMax” check, things would work…

i.e. the DEAD threads would participate only in getting data from global to shared memory, but will NOT participate in the other operations!

Oh! Nice to know…

NVIDIA guys, Can you all pass on this message to the hardware team please? Thank you.

Thanks

Best Regards,

Sarnath

[codebox]for (index = function(threadIdx.x, blockIdx.x); index < warpMax;
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        if (index < myMax)
        {
            cacheDataInSharedMemory(/* from global mem */);
        }

        __syncthreads();

        if (index < myMax)
        {
            for (anotherBigLoop2)
            {
                // operate on shared memory
            }
        }
    }
}[/codebox]

will then work, right?

Yes, it should! This is what “sheepy” was saying. Alex was almost right…

Thanks

I was thinking along the same lines as E.D.R. Does it work?

BTW, your code brought up an issue I had not thought of. Is __syncthreads() location dependent, or do threads just wait on whichever barrier they are supposed to meet first? According to 4.4.2 of the Programming Guide: “__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.” It seems that there should not be any divergence across the whole block if there is a __syncthreads() inside it, so my way is wrong! Would anyone like to comment on this?

It works. The right approach will be to make all threads participate in “cacheDataInSharedMemory()” – this way, we can leverage the dead threads in the global data-loading (and storing) part. So, E.D.R.'s code has to change a bit, and it will then reflect what you originally suggested.
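Concretely, the change is just to pull the caching out of the guard so that every thread helps with the load (a sketch only, using the same placeholder names as E.D.R.'s code):

for (anotherBigLoop)
{
    cacheDataInSharedMemory(/* from global mem */);   // ALL threads, dead or alive, help load

    __syncthreads();                                  // loads complete before anyone computes

    if (index < myMax)
    {
        for (anotherBigLoop2)
        {
            // operate on shared memory – live threads only
        }
    }

    __syncthreads();   // stop dead threads from starting the next load too early
}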

If there is a __syncthreads(), all threads of a block should execute it in the same control path. It is location dependent. You cannot write code that says one half of the block will wait on this __syncthreads() and the other half will wait on another __syncthreads() – that approach will just hang. That's all.
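For instance, in a pattern like this (an illustrative snippet, not from the code above), the two halves of the block wait at two different barriers, which per the Programming Guide is likely to hang or produce unintended side effects:

if (threadIdx.x < 16)
{
    __syncthreads();   // half of a warp waits here…
}
else
{
    __syncthreads();   // …the other half here – undefined behaviour at best
}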

The Programming Guide is clear on that. Your excerpt points it out correctly!

Thanks for your suggestions and ideas!

Best Regards,

Sarnath

Hmm, I can see how syncthreads in the same warp are location-dependent, since the warp obviously can’t execute the other side of the conditional until it passes through the barrier. I hadn’t thought of that. But maybe syncthreads of different warps can match up even from different points in the code? Of course, relying on such poorly defined behavior is probably a bad idea.

Btw, check out the PTX manual regarding the bar.sync instruction.
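(From memory, so treat the exact wording as an assumption: the PTX manual documents bar.sync as taking a barrier number from 0 to 15, and says barriers are executed on a per-warp basis as if all threads of the warp were active. In PTX that would look like:

bar.sync 0;   // what __syncthreads() compiles to
bar.sync 1;   // a second, independent named barrier – not exposed in CUDA C

which is roughly what the __syncthreads(1) / __syncthreads(2) idea above would map to.)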

Anyone understand what it’s talking about, or how it would work if exposed in C?

Warps have got nothing to do with __syncthreads(). All threads in a block are expected to sync up at the barrier, which is location dependent, i.e. if you have a __syncthreads(), then all threads in your block have to reach that one. Otherwise, you have a hang.

The PTX quote looks interesting… However, it does NOT translate the same way into the CUDA manual…

The easiest way to do this is with a goto. (Oh no, here come all the “goto considered harmful” posts.)

(Actually, I'm not 100% sure that gotos work in kernels – I assume they do, but I don't remember ever having tried it. Regardless, they're the easiest way to do this.)

Can you show how the code would look? Because I cannot see how a goto would get rid of the double if().

As simple as “if (index >= myMax) continue;” – “continue” is an implicit goto statement, right?

In this case, there must be a __syncthreads() at the start of the FOR loop so that all threads begin their journey together from the start of the FOR loop…
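Putting the pieces together, the continue-based version might look like this (again just a sketch, with the same placeholder names as above):

for (index = function(threadIdx.x, blockIdx.x); index < warpMax;
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        __syncthreads();                                  // everyone re-converges at the top
        cacheDataInSharedMemory(/* from global mem */);   // all threads help with the load

        __syncthreads();                                  // loads complete before anyone computes

        if (index >= myMax)
            continue;                                     // dead threads skip only the compute

        for (anotherBigLoop2)
        {
            // operate on shared memory
        }
    }
}

Every thread executes the inner loop the same number of times, so all of them reach both barriers on every iteration, whether they took the continue or not.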