__syncthreads() limitation.. Help please..

All,

I have this kernel of mine that runs a big for loop like this:

for (index = function(threadIdx.x, blockIdx.x);
     index < function2(threadIdx.x, blockIdx.x);
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        cacheDataInSharedMemory(/* from global mem */);

        for (anotherBigLoop2)
        {
            // operate on shared memory
        }
    }
}

This code obviously cannot work, because “cacheDataInSharedMemory” requires a __syncthreads(), and some threads may already have exited the outermost FOR loop (and hence the kernel) by then.

However, this code runs for minutes at a stretch, so caching the data in shared memory is of utmost importance to me.

Why doesn't __syncthreads() require only LIVE threads to participate? (This is my understanding of how it should work.)

Why does it also require DEAD threads to participate, and hence hang?

Can someone suggest a better approach?

A possible approach is to run the FOR loop safely with shared memory and then run a small FOR loop for the remainder without using shared memory… but this one does not appeal to me as it involves code duplication and so on…
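Roughly, that split would look like this (just a sketch – “blockMin”, “start”, “stride” and “myMax” are made-up placeholders standing in for the function()/function2()/function3() results):

int safeMax = blockMin;   // hypothetical block-wide minimum of every thread's loop bound

// Part 1: every thread in the block is still alive here, so the
// __syncthreads() inside the caching step is safe.
for (index = start; index < safeMax; index += stride)
{
    cacheDataInSharedMemory(/* from global mem */);   // contains __syncthreads()
    // operate on shared memory
}

// Part 2: the remainder, where some threads have already finished.
// No __syncthreads() allowed here, so work on global memory directly.
for (; index < myMax; index += stride)
{
    // operate on global memory (the duplicated logic I mentioned)
}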

Appreciate any help

Best Regards,

Sarnath

If the biggest iteration count of the outermost loop within a block is known at coding time, maybe you could change the loop bound to that constant and use an “if” to mask out the caching part and the innermost loop, leaving only the __syncthreads(). I think this can avoid code duplication.

Thanks for the answer. That would mean that I need to use a “ptr” that points to either shared memory OR global memory, depending on the FOR loop iteration.

However, PTX code cannot handle such a thing. I need to explicitly use one pointer for global memory AND another one for shared memory – which means code duplication.
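i.e. something like the following (an illustrative snippet – “useShared”, “sharedBuf” and “globalBuf” are made-up names), which the compiler cannot turn into a single load on this hardware, because the address space of a pointer must be known at compile time:

__shared__ float sharedBuf[256];
float *globalBuf = ...;                          // points into global memory
float *p = useShared ? sharedBuf : globalBuf;    // which address space does p live in?
float x = p[i];                                  // no generic load on sm_1x – hence the duplication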

My even greater concern is: why is the hardware NOT smart enough to detect this situation? Maybe because it would have to maintain per-block state information, which is costly… But somehow I think there must be another way to do this…

It would be good if they have this in their next gen hardware

Tim, any comments?

I don't understand why you need a “ptr” to point to either shared memory or global memory. The threads that were supposed to exit the outermost loop early will now keep on running, doing nothing except waiting for synchronization; the threads that were supposed to exit late can still use shared mem for caching as usual. There is no need to use global mem directly.

sheepy means something like this:

int myMax = function2(threadIdx.x, blockIdx.x);
int warpMax = warpReduction(myMax);

for (index = function(threadIdx.x, blockIdx.x); index < warpMax;
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        if (index < myMax)
        {
            cacheDataInSharedMemory(/* from global mem */);

            for (anotherBigLoop2)
            {
                // operate on shared memory
            }
        }
        else
        {
            __syncthreads();
        }
    }
}

I agree, though, that syncthreads should not wait on exited threads.

Another suggestion is for there to be a 2nd-level syncthreads (or 3rd-level). It would look something like this:

for ()
{
    if ()
        break;
    __syncthreads(1);
}

__syncthreads(2);

The way it would work: if a thread exits the for() and hits __syncthreads(2), it would block for another __syncthreads(2), but automatically pass on any __syncthreads(1)s.

Sheepy,

Initially I did NOT understand what you said… but that night, during a disturbed sleep, it suddenly occurred to me what you were referring to. Thanks.

@ Alex,

Thanks for your explanation. The second method looks fishy… I will look at it in detail again and get back to you!

Thank you guys!

Best Regards,
Sarnath

This would still HANG! You can't sync on __syncthreads() in two different places. __syncthreads() has a locational dependency (my understanding): all threads have to wait on the same __syncthreads(). The same goes for case II as well. However, if you move “cacheDataInSharedMemory” outside the “index < myMax” check, things would work…

i.e. the DEAD threads would participate only in getting data from global to shared memory, but will NOT participate in the other operations!

Oh! Nice to know…

NVIDIA guys, Can you all pass on this message to the hardware team please? Thank you.

Thanks

Best Regards,

Sarnath

[codebox]for (index = function(threadIdx.x, blockIdx.x); index < warpMax;
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        if (index < myMax)
        {
            cacheDataInSharedMemory(/* from global mem */);
        }

        __syncthreads();

        if (index < myMax)
        {
            for (anotherBigLoop2)
            {
                // operate on shared memory
            }
        }
    }
}[/codebox]

will then work, right?

Yes, it should! This is what “sheepy” was saying. Alex was almost right…

Thanks

I was thinking along the same lines as E.D.R. Does it work?

BTW, your code brought up an issue I had not thought of. Is __syncthreads() location dependent, or do threads just wait on whichever barrier they are supposed to meet first? According to 4.4.2 of the Programming Guide: “__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.” It seems that there should not be any divergence across the whole block if there is a __syncthreads() inside it, so my way is wrong! Would anyone like to comment on this?

It works. The right approach will be to make all threads participate in “cacheDataInSharedMemory()” – this way, we can leverage the dead threads in the global data-loading (and storing) part. So, E.D.R.'s code has to change a bit, and it will then reflect what you originally suggested.
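Concretely, the change is just to pull the caching out of the guard so that every thread helps with the load (a sketch only, using the same placeholder names as E.D.R.'s code):

for (anotherBigLoop)
{
    cacheDataInSharedMemory(/* from global mem */);   // ALL threads, dead or alive, help load

    __syncthreads();                                  // loads complete before anyone computes

    if (index < myMax)
    {
        for (anotherBigLoop2)
        {
            // operate on shared memory – live threads only
        }
    }

    __syncthreads();   // stop dead threads from starting the next load too early
}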

If there is a __syncthreads(), all threads of a block should execute it in the same control path. It is location dependent. You cannot write code that says one half of the block will wait on this __syncthreads() and the other half will wait on another __syncthreads() – that approach will just hang. That's all.
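For instance, in a pattern like this (an illustrative snippet, not from the code above), the two halves of the block wait at two different barriers, which per the Programming Guide is likely to hang or produce unintended side effects:

if (threadIdx.x < 16)
{
    __syncthreads();   // half of a warp waits here…
}
else
{
    __syncthreads();   // …the other half here – undefined behaviour at best
}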

The Programming Guide is clear on that. Your excerpt points it out correctly!

Thanks for your suggestions and ideas!

Best Regards,

Sarnath

Hmm, I can see how syncthreads in the same warp are location-dependent, since the warp obviously can’t execute the other side of the conditional until it passes through the barrier. I hadn’t thought of that. But maybe syncthreads of different warps can match up even from different points in the code? Of course, relying on such poorly defined behavior is probably a bad idea.

Btw, check out the PTX manual regarding the bar.sync instruction.
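(From memory, so treat the exact wording as an assumption: the PTX manual documents bar.sync as taking a barrier number from 0 to 15, and says barriers are executed on a per-warp basis as if all threads of the warp were active. In PTX that would look like:

bar.sync 0;   // what __syncthreads() compiles to
bar.sync 1;   // a second, independent named barrier – not exposed in CUDA C

which is roughly what the __syncthreads(1) / __syncthreads(2) idea above would map to.)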

Anyone understand what it’s talking about, or how it would work if exposed in C?

Warps have got nothing to do with __syncthreads(). All threads in a block are expected to sync up at the barrier, which is location dependent, i.e. if you have a __syncthreads(), then all threads in your block have to reach that one. Otherwise, you have a hang.

The PTX quote looks interesting… However, it does NOT translate the same way into the CUDA manual…

The easiest way to do this is with a goto. (Oh no, here come all the “goto considered harmful” posts.)

(Actually, I'm not 100% sure that gotos work in kernels – I assume they do, but I don't remember ever having tried it. Regardless, they're the easiest way to do this.)

Can you show how the code would look? Because I cannot see how a goto would get rid of the double if().

As simple as “if (index >= myMax) continue;” – “continue” is an implicit goto statement, right?

In this case, there must be a __syncthreads() at the start of the FOR loop so that all threads begin their journey together from the start of the FOR loop…
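Putting the pieces together, the continue-based version might look like this (again just a sketch, with the same placeholder names as above):

for (index = function(threadIdx.x, blockIdx.x); index < warpMax;
     index += function3(threadIdx.x, blockIdx.x))
{
    __shared__ float sharedmem[SHARED_SIZE];

    for (anotherBigLoop)
    {
        __syncthreads();                                  // everyone re-converges at the top
        cacheDataInSharedMemory(/* from global mem */);   // all threads help with the load

        __syncthreads();                                  // loads complete before anyone computes

        if (index >= myMax)
            continue;                                     // dead threads skip only the compute

        for (anotherBigLoop2)
        {
            // operate on shared memory
        }
    }
}

Every thread executes the inner loop the same number of times, so all of them reach both barriers on every iteration, whether they took the continue or not.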