Bug report: Incorrect block scheduling

Operating System: Windows XP 64-bit, but host and device code compiled as 32 bit

CUDA toolkit: v. 2.3

SDK: v. 2.3

Compiler: Visual Studio 2008, v. 9.0.21022.8 RTM


CPU Intel Core2 Duo E8600 3.33GHz,

GPU NVIDIA GeForce 9600GT (for display only),

GPU NVIDIA GeForce GTX 285 (for CUDA computation)

Driver: 191.07

Problem: Under certian conditions (see below) not all blocks of a kernel are executed at once despite the device being capable of doing so


Consider launching two kernels (both with occupancy=1) on a device of compute capability 1.3 and with 30 stream multiprocessors:



Let q be equal to 60.

If p is smaller than X (X equals #SM*2=60 in my case, but cannot assert that particular value elsewhere) then kernelB is initially launched with less than q blocks, and only after some blocks end their work, new are scheduled in. This can lead to serious drop in performance and in some cases may cause a program to work incorrectly - for example, if the kernel assumes all its blocks are active.


__global__ void emptyKernel() {}

__global__ void myKernel(int *control, int *output) {

	int numberOfAttempts=0;

	int terminate=1234567;



//attempt to synchronise all blocks

	if (threadIdx.x==0) {


		volatile int v;

		do {


			v=*(volatile int *)control; //volatile to force read from global

			if (numberOfAttempts==terminate) {

				output[blockIdx.x]=0; //we waited too long. Report the problem and quit the loop



		} while (v<gridDim.x);




int main() {

/* [...] initialisation which I skipped here, e.g. choosing the best GPU */

	const int maxBlocks=60;

	int *gpuControl;

	cudaMalloc((void**)&gpuControl, sizeof(int));

	int cpuControl=0;


	int *gpuOutput;

	cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks);

	int cpuOutput[maxBlocks];

	int thr=480;

	int p=13;

	int q=60;


	cudaThreadSynchronize(); //not needed but to be on the safe side...

	myKernel<<<q,512>>>(gpuControl, gpuOutput);



	for (int i=0; i<q; ++i)

		printf("%d ",cpuOutput[i]);


Expected output:

Working on GeForce GTX 285, 30 SMs, CC=1.3

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Obtained output:

Working on GeForce GTX 285, 30 SMs, CC=1.3

1 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1

Different parameters:

The output depends on parameteres p, q and thr (see the code). For some values, the output is correct, for others - it is not.

  • If thr is between 481 and 512, the output is correct regardless of values p and q. That particular value of thr causes both kernels to have the same number of warps per block.

  • If q>=60 this leads to an obvious and expected error as GTX 285 is unable to launch more than 60 blocks in parallel.

  • If p multiple of 60 the behaviour is correct

  • If q<=60-p the behaviour is also correct

  • So the bug occurs if p<60, 60-p<q<=60

Does the problem go away by putting the CPU to sleep for a few seconds after a cudaThreadSynchronize() ?

Very crude theory:
Maybe it takes a while for the SMs to be cleared of their “busy” state after the empty kernel launch, so an immediate
launch of another kernel doesn’t schedule work on all SMs initially.


No, I may even add, that you can put arbitrary many other “proper” kernel calls in between:

Try this:




	for (int j=0; j<1024; ++j) {





	myKernel<<<q,512>>>(gpuControl, gpuOutput);

It will cause incorrect results, as long as thr is lower than 481.

My crude theory is that some active warp counters are not reset after the end of kernel call. So, all SM have to execute a work of 32 warps before 33rd warp can be launched regardless on the boundaries of the kernels.

Inter-block communication is totally unsupported. (actually it’s a violation of the programing model we lay out in the programming guide, for whatever that’s worth)

Inter-block communication is only a tool to give a visualisation to the problem and not the sole topic of this bug.

I am talking about how work is scheduled between stream multiprocessors. If I launch 60 blocks I expect them to run all in parallel, but obviously - as shown in this example - that is not the case.

Atomic operations are supported since CC 1.1, and I believe their purpose is to provide some kind of communication between blocks - for example, so that each block fetches a different data, or something… If you assume absolutely no interactions between the blocks, why would atomic operations on global memory exist in the first place? Shared memory atomics would suffice…

I spent a lot of time tracking the problem in my kernel and quite some to finally describe to you what is going on.

Therefore, please look into the problem of scheduling and not just dismiss it with statement about inter-block communication.

P.S. To leave the problem of inter-block synchronisation completly, consider the following change in the above example:

Replace myKernel with the following:

__global__ void myKernel(int *control, int *output) {

	for (int i=0; i<12345678; ++i) {

		if (threadIdx.x==0)




and kernel calls (after cudaThreadSynchronize()):

CudaTimer t;



	myKernel<<<q,512>>>(gpuControl, gpuOutput);



	printf("Elapsed time: %lf\n",t.get());

The CudaTimer is my class which computes elapsed time based on events. You can use one provided from cutil or something else you have…

If p=13 and thr=220, the run time of myKernel on my machine is:

Working on GeForce GTX 285, 30 SMs, CC=1.3

Elapsed time: 16623.390625

If p=60 and thr=512, the run time of myKernel on my machine is:

Working on GeForce GTX 285, 30 SMs, CC=1.3

Elapsed time: 13168.300781

Which means, the second code is 26% faster than the first one, just because of the kernel configuration of the call on that empty kernel just before it!

So, umm can you provoke a test case where on the second kernel only a single MP is provided with work? What has been the worst configuration that you’ve encountered so far?

My suggestion: Until nVidia provide a fix, always launch an empty “cleanup” kernel with a “known good” block configuration (e.g. x times number of MPs blocks) before your time-critical kernels. ;)

In my view this is a driver problem and should be addressed because it does affect the following kernel’s timings in an unpredictable way.

I have reasons to believe that if I launch:
followed by
then initially only one block of myKernel will be launched and only one stream multiprocessor will do something.
However, once it ends, all other 59 blocks will be scheduled at once. So, for typical kernels I wouldn’t expect drops in efficiency worse than x2.

I say “I have reasons” because I use a more sophisticated debugging approach which shows that, but since it is not 100% reliable I decided not to report it and write a simpler, more direct code.

However block execution may depend on other’s work. For example, a suggested approach (by NVIDIA as well) is to launch one “uber-kernel” where each block reads a chunk of data from input FIFO queue using an atomic instruction, process it and then reads next chunk in a loop. The approach automatically balance the workload between blocks, but what if accidently only one block is launched instead?

Why don’t you use the clock function to write at which time your block was scheduled? You can see how to in the clock example ( if (threadIdx.x==0) g_time_arr[blockIdx.x] = clock(); in the beginning of your kernel )

That way you know much more accurately at what time each block started to run.

The results of clock() are not comparable between SMs.

Although clock() is not comparable, I decided to follow Riedijk’s idea of checking the scheduling order. To that end I use a single global variable which is atomically incremented by each block at beginning and near its end:

__global__ void emptyKernel() {}

__global__ void myKernel3(int *control, int *output) {

	if (threadIdx.x==1) {

	int enter=atomicAdd(control,1); //register that we enter


	for (int i=0; i<12345678; ++i) { //some intensive and long task



	int exit=atomicAdd(control,1); //register that we quit




int main() {

/* [...] initialisation */

	const int maxBlocks=60;

	int *gpuControl;

	cudaMalloc((void**)&gpuControl, sizeof(int));

	int cpuControl=0;


	int *gpuOutput;

	cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);

	int cpuOutput[maxBlocks*2];

	for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side


	const int thr=479;

	const int p=13;

	const int q=60;



	myKernel3<<<q,512>>>(gpuControl, gpuOutput);



	for (int i=0; i<q; ++i)

		printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);


The output:

Working on GeForce GTX 285, 30 SMs, CC=1.3

0: 14-63

1: 15-67

2: 22-75

3: 33-81

4: 39-90

5: 46-102

6: 34-106

7: 2-95

8: 35-93

9: 7-98

10: 12-62

11: 13-66

12: 21-74

13: 32-80

14: 38-89

15: 45-101

16: 25-103

17: 1-48

18: 31-54

19: 5-50

20: 8-59

21: 11-65

22: 18-71

23: 24-77

24: 36-55

25: 42-57

26: 23-52

27: 0-47

28: 29-53

29: 4-49

30: 10-61

31: 17-69

32: 20-73

33: 30-79

34: 41-92

35: 44-100

36: 27-105

37: 3-96

38: 37-94

39: 6-97

40: 9-60

41: 16-68

42: 19-72

43: 26-78

44: 40-91

45: 43-99

46: 28-104

47: 51-107

48: 58-109

49: 56-108

50: 64-110

51: 70-111

52: 76-112

53: 82-113

54: 83-114

55: 88-119

56: 86-118

57: 84-115

58: 85-117

59: 87-116

This shows that exactly p blocks are scheduled only when one of the previous blocks ends its work. That behaviour holds for 1<=p<30. For p between 30 and 60 there are 30 blocks scheduled later - not more.

This also shows that in my previous statement “I have reasons to believe…” I was mistaken.

It is now few weeks since I posted it.
Any news? Someone working on it or was it simply discarded?

Maybe coming up with a simple, yet shocking code sample that demonstrates how the runtime of a kernel can be doubled by previously launching a kernel with a specific grid dimension would wake them up. Maybe some folks already suffer from this effect in their applications without knowing it.


That could be done, but it needs time. I believe current hints about the problem are enough.
Despite patronizing statements from some of NVIDIA people I hope they already look into this problem.

Block Scheduling is an old discussed topic. Me and a few people worked on it and finally found that the order in which blocks are scheduled.

Here is the gist. I dont know if it relates to your problem… Just my 2 cents:
The SM scheduler schedules the block one by one on consecutive MPs in an interleaved fashion until the active set is saturated. It then kick starts kernel execution. When ALL the blocks in the active set gets over, the SM scheduler schedules the next active set batch and so on…
if the work load is not balanced, this will surely expose latencies…

May be, FERMI changes all of these… IDK…

Best Regards,

Interesting observation.
Maybe it is the case, that something is not flushed out after previous kernel and warps/SM are seen as if they are already occupied when the new kernel is launched.

This looks kinda like a bug that I reported here, in case that helps anyone.

Best way would be to simply open a bug for nVidia in their developer site.

Make sure you post a code to reproduce it and as much information possible to prevent wasting time on



Code is here and some NVIDIA guys are reading it. And there is even a special thread on the forum on how to report bugs here.
So it is just up to them to acknowledge the problem and do something about it…

I appologise for digging up an old thread. I am however interested if anyone at all is working on this problem (or maybe it is already solved?).
Or - if it is for example a hardware problem - and cannot be helped?