Test Multi Threading Spinning

Hello,

Apparently CUDA keeps spinning an entire thread block and never moves on to the next thread block.

Is there a way/trick to make it move on? (Perhaps make it stall on a memory access or so? But if the value is cached that could be unreliable.) It would be nice if there was some "yield" function so it moves on to the next block…

(These are the kinds of tests the SDK should contain! :) Nonetheless, doing it myself is best.

Kernel:

extern "C" 

{ // extern c begin

/*

Test Cuda Multi Threading Spinning

version 0.01 created on 19 july 2011 by Skybuck Flying

This code seems to work fine for an entire thread block

but when there are multiple thread blocks it stops working, the first thread block while hang...

cuda will simply keep executing the first thread block endlessly, is there a way to fix it and make it move on with the next thread block ?

*/

// simply pass in a "total thread count" and pass out a "integer" which will contain the last index.

__global__ void Kernel( int Count, int *Result)

{

	int MyIndex;

	

	// the shared index which is to be passed on from thread to thread

	__shared__ int Index;

	

	// the shared running variable

	__shared__ bool Running;

	

	

	MyIndex = 

		(threadIdx.x) + 

		(threadIdx.y * blockDim.x) + 

		(threadIdx.z * blockDim.x * blockDim.y) + 

		(blockIdx.x * blockDim.x * blockDim.y * blockDim.z) + 

		(blockIdx.y * blockDim.x * blockDim.y * blockDim.z * gridDim.x) +

		(blockIdx.z * blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y);

		

	// ignore threads outside of the problem zone.

	if (MyIndex < Count)

	{

		// each thread should spindle around to see what happens...

	

		// when it's his turn he should advanced some counter/index

		// the last thread sets Running to false

		// first thread sets running to true

		// this little motherfreaker will also initialize the shared index.

		if (MyIndex == 0)

		{

			Running = true;

			Index = 0;

		}

		while (Running)

		{

			// if index = last thread then		

			if (Index == (Count-1))

			{

				// last thread will stop the running.

				Running = false;				

			}	

			// this needs to be below otherwise it will be incremented before the last check can be done...

			// but it must be done first <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' /> :) so it's now above.			

			// if current index = my index then

			if (Index == MyIndex)		

			{

				// pass index to next thread.

				Index = Index + 1;		

			}

		}

	

		// we also have to do something usefull/output otherwise compiler is going to be a bitch.

		// so here we stuff the last index into the output to verify.

		// and only last thread will do this ! <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' /> :)

		// it make sense you know ! <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' /> :)

		if (MyIndex == (Count-1))

		{

			*Result = Index;

		}

	}

}

} // extern c end
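For completeness, a minimal host-side test driver could look something like this (a hypothetical sketch, not from the original post; it assumes the kernel above is compiled in the same .cu file and is launched with the single-block configuration that reportedly works):

#include <cstdio>
#include <cuda_runtime.h>

extern "C" __global__ void Kernel(int Count, int *Result); // the kernel defined above

int main()
{
	const int Count = 256;   // one block of 256 threads: the single-block case
	int *dResult = 0;
	int hResult = -1;

	cudaMalloc(&dResult, sizeof(int));

	// single block, so the __shared__ Index/Running variables are visible to every participating thread
	Kernel<<<1, Count>>>(Count, dResult);

	cudaMemcpy(&hResult, dResult, sizeof(int), cudaMemcpyDeviceToHost);
	printf("last index reported by kernel = %d\n", hResult);

	cudaFree(dResult);
	return 0;
}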

Bye,

Skybuck.

One solution which comes to mind is to exit the thread when it's done; however, that would only work in this simple example.

In a more complex example or a real world situation, a thread would have to communicate with other threads many times.

This example only does communication once.

So a solution for this example should assume that all threads in all blocks should remain spinning until the very last thread has executed and sets running to false.

There should be some way to put threads to “sleep” and to “wake” them when they need to do something… ?!?

Or cuda simply needs to start executing all threads in a round robin fashion… it’s kinda surprising that it doesn’t do that.

This apparently has to do with performance… CUDA likes to keep executing a single thread block on a single multiprocessor until it's done executing… otherwise it would probably need to swap "thread" resources in and out too many times… so from a performance perspective I can understand that… but from a coding perspective this is bad and doesn't make much sense.

Another solution would be to make kernels aware of this and use the "thread block limit" (also known as max threads per block) to adapt the algorithm to the hardware.

However, for graphics cards with multiple streaming multiprocessors the algorithm would also need to keep track of and adapt to the number of multiprocessors present.

While probably all doable, this does require passing two extra parameters and/or extra register usage, unless it's a built-in variable?

Of course this also puts (unnecessary) strain/stress on algorithm design…

In other words for algorithm design it would be better if the algorithm is oblivious to the hardware and can simply assume a million or a billion threads all executing in parallel, without having to worry about these weird/whacky hardware limitations which suddenly cause blocking behaviour/hangs.

So to me it’s starting to seem that the number of threads cuda can execute concurrently is:

MaxThreadsPerBlock * MultiProcessors (* Devices).

After that it’s out of resources (thread resources) like thread contexts.

A CPU could theoretically have a practically unlimited number of thread contexts, while CUDA seems limited to a certain number; see the formula above.

Good to know this.
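A quick way to check that formula against what the runtime reports is to query the device properties. A rough sketch (which figure is the right one to multiply is an assumption here):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	// rough upper bound on threads the device can keep resident at once,
	// using the per-multiprocessor figure rather than the per-block one
	int resident = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;

	printf("multiprocessors               : %d\n", prop.multiProcessorCount);
	printf("max threads per block         : %d\n", prop.maxThreadsPerBlock);
	printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
	printf("max resident threads (device) : %d\n", resident);
	return 0;
}

Whether the right figure to multiply is max threads per block (1024 on compute 2.x) or max threads per multiprocessor (1536) is exactly the question raised further down.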

This behavior is documented in the programming guide and should come as no surprise. I will only add to this that there is no guarantee that even running a small number of blocks will prevent deadlock - imagine if the GPU were to launch two kernels concurrently.

This code can obviously work only for a single block, as Index and Running are shared variables (which means they are shared between all threads of a block). Thus, with Index starting over from zero for each block, (Index == MyIndex) can only become true for threads in the first block.
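A tiny hypothetical kernel illustrates that point: each block gets its own private copy of a __shared__ variable, so a write by block 0 is never seen by any other block.

__global__ void SharedScopeDemo(int *Out)
{
	__shared__ int Index;              // every block gets its own copy of Index

	if (threadIdx.x == 0)
		Index = blockIdx.x * 100;      // block 0 stores 0, block 1 stores 100, ...

	__syncthreads();

	if (threadIdx.x == 0)
		Out[blockIdx.x] = Index;       // each block reads back only its own value
}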

No, there should not, as it would be highly inefficient. Instead of trying to wake up an old thread, just use a new one.

Ok, I think you got me there Tera, indeed I just remembered "shared" only applies to blocks, which in itself is another kind of weirdness. It would have been nice if all blocks shared it; then it would have been truly shared.

Anyway now I see where a potential problem might be, I shall have to re-consider the algorithm and come up with a new one if possible.

Perhaps some kind of global memory. However I was hoping to use shared memory because it’s fast, but now I can’t because it’s “block limited” which kinda sucks, but ok.

Perhaps the shared variables can be passed on to the next block; I think global memory could possibly be used for that.

I can imagine the last thread of a block passes it on.

However this again illustrates that the algorithms have to be adapted to hardware limitations :(

volatile
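Presumably this refers to adding the volatile qualifier to the spin variables, so the compiler cannot keep a cached register copy of them inside the while (Running) loop. A hedged sketch of what that would look like for the global variables of version 0.02 below:

// hedged sketch: every read of these now goes back to memory instead of a cached register copy
__device__ volatile int  Index;
__device__ volatile bool Running;

Even with volatile, though, a spin loop that waits on a thread in a block that is never scheduled still deadlocks, which matches what version 0.02 below runs into.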

While scheduling in CUDA is a grey area, it doesn’t schedule threads in round-robin fashion (nor should it). Instead it switches between threads based on memory accesses in order to hide latency.

So having no inter-block dependencies is imperative, although I would prefer having a well-defined set of rules which would allow for some inter-block locking/synchronization. I'm talking about some guarantee (based on hardware specs) that running at most a certain number of blocks will not lead to deadlock. It seems reasonable to me that such a thing can easily be defined: having fewer concurrent blocks than what can be mapped to the hardware seems like a sufficient rule to me.

But then again there may be a lot more to this that I’m not considering.

why this is a bad idea: The Official NVIDIA Forums | NVIDIA

Version 0.02 uses global memory instead of shared memory; the problem remains, however.

Other blocks are never executed; the first block keeps spinning :(

code for version 0.02:

extern "C" 

{ // extern c begin

/*

Test Cuda Multi Threading Spinning

version 0.01 created on 19 july 2011 by Skybuck Flying

This code seems to work fine for an entire thread block

but when there are multiple thread blocks it stops working, the first thread block while hang...

cuda will simply keep executing the first thread block endlessly, is there a way to fix it and make it move on with the next thread block ?

Version 0.01 could not work because shared memory is only shared between the threads of a block. The threads in all other blocks are oblivious of it.

*/

/*

version 0.02 created on 20 july 2011 by Skybuck Flying

Version 0.02 will now attempt to solve this mistake. An efficient method would be to pass the shared

result to global memory by the last thread and then back to the first thread of the next block.

However I would also like to test global memory consistency to see if there are any race conditions.

So I will move the shared variables to global memory and then see what happens.

The problem remains, so it just seemed as if using shared memory was the problem, it might also have been a problem.

But there are further problems. Block 2 never gets executed it seems ?!?

Again quite surprising ?!?

*/

// the shared index which is to be passed on from thread to thread

__device__ int Index;

	

// the shared running variable

__device__ bool Running;

// simply pass in a "total thread count" and pass out a "integer" which will contain the last index.

__global__ void Kernel( int Count, int *Result)

{

	int MyIndex;

	

	MyIndex = 

		(threadIdx.x) + 

		(threadIdx.y * blockDim.x) + 

		(threadIdx.z * blockDim.x * blockDim.y) + 

		(blockIdx.x * blockDim.x * blockDim.y * blockDim.z) + 

		(blockIdx.y * blockDim.x * blockDim.y * blockDim.z * gridDim.x) +

		(blockIdx.z * blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y);

		

	// ignore threads outside of the problem zone.

	if (MyIndex < Count)

	{

		// each thread should spindle around to see what happens...

	

		// when it's his turn he should advanced some counter/index

		// the last thread sets Running to false

		// first thread sets running to true

		// this little motherfreaker will also initialize the shared index.

		if (MyIndex == 0)

		{

			Running = true;

			Index = 0;

		}

		while (Running)

		{

			// if index = last thread then		

			if (Index == (Count-1))

			{

				// last thread will stop the running.

				Running = false;				

			}	

			// this needs to be below otherwise it will be incremented before the last check can be done...

			// but it must be done first <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' /> :) so it's now above.			

			// if current index = my index then

			if (Index == MyIndex)		

			{

				// pass index to next thread.

				Index = Index + 1;		

			}

		}

	

		// we also have to do something usefull/output otherwise compiler is going to be a bitch.

		// so here we stuff the last index into the output to verify.

		// and only last thread will do this ! <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' /> :)

		// it make sense you know ! <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' /> :)

		if (MyIndex == (Count-1))

		{

			*Result = Index;

		}

	}

}

} // extern c end

I’m a bit fuzzy on this… which section(s) of the guide mention this behaviour and further explain that hardware can only execute up to “max threads per block * multi processors (* devices)” ?

(In this sense perhaps a "block" can be interpreted as a multiprocessor, so it could also have been called "max threads per multiprocessor"; however, that variable already exists. The first one (max threads per block) is 1024, the second one (max threads per multiprocessor) is 1536 for my GT 520. So it seems a block is something special, like a piece of memory which can hold a certain amount of thread contexts; a block is like a data structure.)

Hmm some sections of the guide mention: “resident blocks” “resident warps” and so forth, resident this, resident that etc.

Section 4.2 mentions this residency issue… it then refers to appendix F, where further tables with further restrictions are shown, it seems.

For example one restriction reads:

“Maximum number of resident blocks per multi processor = 8”.

This seems like another constraint to me… a single multiprocessor can only execute 8 blocks at once; if all of these blocks require, say, a 9th block (which would also have to become resident on the multiprocessor) in order to terminate, then these 8 blocks will never terminate and another hang/spinning problem is created.

So this is interesting information for people/programmers using tiny little blocks, which will probably be rare since blocks should be at least 32 threads in size, but still.

32*8 = 256 threads, so there is a potential for issues.

64*8 = 512 threads, still potential issues.

Conclusion: on any given graphics card there could be a situation where so many threads or blocks are running that it exceeds the total running capacity of the hardware, and thus "deadlocks" might occur.

The only way to be sure to prevent this is to make sure each block terminates by itself; each block should be completely independent of the results of other blocks.

This seems like quite a heavy requirement and constraint.
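For this particular toy example, one way to satisfy that independence requirement is to drop the hand-off loop entirely and let every thread claim its index with an atomic, so no block ever waits on another. A hedged sketch (it gives up the strict thread-to-thread ordering of the original, keeping only the "report the last index" result):

// hedged alternative sketch: each thread atomically claims the next index,
// so no thread ever spins waiting on a thread in another block
__device__ int NextIndex = 0;

__global__ void KernelAtomic(int Count, int *Result)
{
	int MyIndex = threadIdx.x + blockIdx.x * blockDim.x;

	if (MyIndex < Count)
	{
		// claim the next index without waiting on any other thread or block
		int Claimed = atomicAdd(&NextIndex, 1);

		// whichever thread happens to claim the last index reports it
		if (Claimed == Count - 1)
			*Result = Claimed;
	}
}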

Yup I tested this little theory/restriction:

It was quite easy to test: thread dimensions were set to (32,1,1).

Number of total threads was set to 1024.

Which leads to: 1024/32 = 32 blocks needed.

The GT 520 has only one multiprocessor, so it can only execute 8 blocks at a time; the CUDA device summary shows 10 blocks/warps to be present.

It probably hangs on block 9…
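For reference, the launch for that test would look something like this (hypothetical; only the launch configuration changes relative to the host driver sketched earlier, with dResult being the device pointer allocated there):

// 1024 threads total, 32 threads per block -> 32 blocks,
// far more than the 8 that can be resident on the GT 520's single multiprocessor
dim3 block(32, 1, 1);
dim3 grid(32, 1, 1);
Kernel<<<grid, block>>>(1024, dResult);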

So another lesson learned:

Even if the hardware says "max threads per block = 1024", there are further constraints which must be respected/met.

So just changing the "kernel launch dimensions" and assuming all is well can lead to deadlocks.

Ok,

Now things are starting to make sense to me,

Another constraint is:

“Maximum number of resident warps per multiprocessor” which is 48 for compute capability 2.0.

Normally each warp will have 32 threads.

So 48*32 = 1536. (I have seen this number before!)

Now the question is:

Can an example be made which would still deadlock a thread block of (1024,1,1)?

Somehow the example would need to force a warp to be just a few threads… I am not sure if that can be done…

It's also a bit strange that compute 2.0 has "maximum threads per block" set to 1024… instead of 1536… perhaps it was a design issue (they forgot or weren't able to increase contexts per thread block from 1024 to 1536), or perhaps it was an aesthetic decision. Nonetheless, this does hurt performance a bit: at least for kernels where shared memory is used, those blocks could have done 50% more work…

I have also seen strange behaviour of graphics cards with malfunctioning kernels: even after everything was shut down, the graphics card seemed to lag the system, probably because the watchdog is disabled.

Another explanation for the number 1536 could be to have some "buffer room" for malfunctioning threads or so… so the hardware might be able to deal with slightly malfunctioning kernels.