Possible CUDA bug with multiple active blocks

Hi everyone,

I would like to reproduce a weird CUDA bug here. It causes problems when there is more than one active block on an MP (multiprocessor).

The full test code is as follows:

#include <stdio.h>
#include <stdlib.h>

#define GRIDSIZE  28
#define BLOCKSIZE 256

__device__ int sumval = 2000;

/* Only used to pad shared memory: varying the size of buf controls how much
   shared memory each block occupies. */
typedef struct {
#if (0)
	int buf[1786];
#else
	int buf[1785];
#endif
	int flag;
} patch_st;

__global__ void mykernel()
{
	volatile __shared__ patch_st patch;
	volatile __shared__ int points[BLOCKSIZE + 1];
	volatile __shared__ int changed;

	// Initialize values: points[0] = 1, everything else 0
	if (!threadIdx.x)
	{
		points[0] = 1;
		points[BLOCKSIZE] = 0;
		patch.flag = 1;
	}
	else
		points[threadIdx.x] = 0;
	__syncthreads();

	// Ripple the leading 1 across the array until no element changes any more
	do
	{
		if (!threadIdx.x)
			changed = 0;
		__syncthreads();

		if (points[threadIdx.x] != points[threadIdx.x + 1])
		{
			points[threadIdx.x + 1] = 1;
			changed = 1;
		}
		__syncthreads();
	} while (changed);

	// Thread 0 sums the array; the expected result is BLOCKSIZE (256)
	if (!threadIdx.x)
	{
		int sum = 0, i;
		for (i = 0; i < BLOCKSIZE; i++)
			sum += points[i];
		atomicMin(&sumval, sum);
		patch.flag = 0;
	}
}

int main(void)
{
	dim3 grid, block;
	int hostsum;

	grid.x = GRIDSIZE;
	block.x = BLOCKSIZE;
	//cudaSetDevice(0);

	mykernel<<<grid, block>>>();
	cudaThreadSynchronize();
	printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));

	cudaMemcpyFromSymbol(&hostsum, "sumval", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("Min sum = %d\n", hostsum);

	return 0;
}

The struct patch_st is only there to control the amount of shared memory occupied by each block (by varying the size of the buf field).

First of all, I set GRIDSIZE to the number of multiprocessors available on the GPU (27 for my GTX 260) and the shared memory usage to 8192 bytes (half of the space available on each MP). With these settings, the program yields the correct result (Min sum = 256). In this case each block runs on its own MP, so there is only one active block per MP.

$ nvcc --optimize 3 -arch=sm_11 --ptxas-options=-v -o testbug1 testbug1.cu

testbug1.cu

tmpxft_00002348_00000000-3_testbug1.cudafe1.gpu

tmpxft_00002348_00000000-8_testbug1.cudafe2.gpu

ptxas info	: Compiling entry function '_Z8mykernelv'

ptxas info	: Used 2 registers, 8192+8188 bytes smem, 8 bytes cmem[1], 4 bytes cmem[14]

tmpxft_00002348_00000000-3_testbug1.cudafe1.cpp

tmpxft_00002348_00000000-13_testbug1.ii

Next, I increase GRIDSIZE to 28 while keeping the shared memory usage at 8192 bytes. Now one MP runs 2 blocks, and those 2 blocks can run simultaneously (2 active blocks) because each requires only half of the available shared memory (and the register usage is small enough). With these settings, the program yields random incorrect results (less than 256). When there is more than one active block, the do-while loop in the code above does not work as expected.

Then, I increase the shared memory usage to 8196 bytes (more than half of the available space) while GRIDSIZE stays unchanged (28 on my GTX 260). In this case, although one MP still runs 2 blocks, those 2 blocks cannot run simultaneously. With these settings, the program works properly again (Min sum = 256).

$ nvcc --optimize 3 -arch=sm_11 --ptxas-options=-v -o testbug1 testbug1.cu

testbug1.cu

tmpxft_0000147c_00000000-3_testbug1.cudafe1.gpu

tmpxft_0000147c_00000000-8_testbug1.cudafe2.gpu

ptxas info	: Compiling entry function '_Z8mykernelv'

ptxas info	: Used 2 registers, 8196+8192 bytes smem, 8 bytes cmem[1], 4 bytes cmem[14]

tmpxft_0000147c_00000000-3_testbug1.cudafe1.cpp

tmpxft_0000147c_00000000-13_testbug1.ii

The issue also happens on my 8200 mGPU.

I’ve run the test on WinXP 32-bit, CUDA 2.2.
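For reference, here is a quick way to double-check the MP count and the per-block shared memory limit on a given card (just a small sketch using cudaGetDeviceProperties, not part of the repro itself):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	cudaDeviceProp prop;

	// Query device 0; sharedMemPerBlock is 16384 bytes on compute 1.x parts.
	cudaGetDeviceProperties(&prop, 0);
	printf("MPs: %d, shared mem per block: %u bytes, regs per block: %d\n",
	       prop.multiProcessorCount,
	       (unsigned int)prop.sharedMemPerBlock,
	       prop.regsPerBlock);
	return 0;
}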

Can anybody explain this issue? I am wondering whether it is a hardware or toolchain bug.

This may be very relevant: you do not really have 16384 bytes of shared memory to play with. Each block uses a few extra words to store the thread and block IDs and the grid/block shape. If you're asking for 8192 bytes of shared memory and CUDA is assigning two blocks per MP, then something is wrong, since the real use is 8192 bytes plus overhead and there isn't really enough shared RAM to support that.

Exactly how much per-block shared memory overhead there is remains a little mysterious, but I believe it is something like 12 words per block. A very old forum thread from over a year ago worked this out through some sneaky analysis using raw pointer-probe experiments.

Now, whether CUDA is doing something wrong or there's a problem in your code, I don't know, but your statement that each block uses 8192 bytes and two blocks run at once immediately throws up "err… that shouldn't be true" thoughts.
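If you want to see what the runtime actually charges that kernel for, something along these lines should show it (a sketch only; cudaFuncGetAttributes may not be available on every toolkit version, and whether the reported size includes the launch overhead can differ between architectures):

// Drop into main() after the kernel is defined; on very old toolkits the
// second argument may have to be the kernel name as a string instead.
cudaFuncAttributes attr;
if (cudaFuncGetAttributes(&attr, mykernel) == cudaSuccess)
	printf("static smem per block: %u bytes, regs per thread: %d\n",
	       (unsigned int)attr.sharedSizeBytes, attr.numRegs);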

you're not checking to see if the kernel actually gets launched, and it shouldn't be launching. the correct error-checking code would be:

mykernel<<<grid, block>>>();
if (cudaGetLastError() != cudaSuccess)
{
	printf("error in launching kernel\n");
	return 1;
}
cudaThreadSynchronize();
...

I've added your code to the test. The kernel is launched successfully in all cases.

intriguing… okay now I will look into this a lot more.

If I reduce the shared memory usage (to much less than 8192 bytes), the problem still happens. I chose the sizes 8192 and 8196 just to make the difference easier to see.

Hello CVN, Good to see you back in action Mate…

Coming to the point… the IF statement in the do-while loop that does the comparison has a race condition at warp boundaries, no? The way you are updating points[threadIdx.x + 1] causes a race. If you were updating only points[threadIdx.x], then it would not matter…

Isn't it?

Sorry, I cannot follow your idea, Sarnath. Could you suggest working code for that?

Sorry, useless post… (edited)

No, the execution order of threads does not affect the program logic, Sarnath. All threads must synchronize at the __syncthreads() at the end of the loop body.

Sorry, CVN. I missed that second __syncthreads() out there. It's all fine.

I'm not sure why you are seeing less than 256, though. Have you tried just doing an atomicAdd to see if that works out fine?

The problem is reproducible on my Tesla C1060 (a retail card, not a pre-production unit).

For me, the problem becomes clearly visible when GRIDSIZE reaches 92. It is reproducible at 91 as well, but less frequently.

OK,

I just declared a per-block array and wrote each block's SUM out to it.
I find that the SUM computed for certain blockIds is simply NOT 256. The problem appears even with atomicMin commented out.
Thus, atomics can safely be ruled out of this equation.

The problem occurs with smaller shared memory sizes as well (as claimed by CVN).

The blockIds exhibiting this behaviour on my Tesla are 60, 61, 90, 91 (for a total of 92 blocks).
Increasing the grid size to 600 destroys this pattern, and many more blocks exhibit the behaviour.
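Roughly, the change looks like this (a simplified sketch; devSumArray/hostSumArray are just the names I used for the dump, sized for up to 600 blocks):

__device__ int devSumArray[600];          /* one slot per block */

	/* at the end of mykernel(), inside the existing "if (!threadIdx.x)" block: */
	devSumArray[blockIdx.x] = sum;

/* on the host, after the kernel has completed: */
int hostSumArray[600];
cudaMemcpyFromSymbol(hostSumArray, "devSumArray", sizeof(hostSumArray), 0,
                     cudaMemcpyDeviceToHost);
for (int i = 0; i < GRIDSIZE; i++)
	if (hostSumArray[i] != BLOCKSIZE)
		printf("hostSumArray[%d] = %d -- not equal to BLOCKSIZE\n", i, hostSumArray[i]);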

A screenshot:
$ ./a
err = no error
Min sum = 2000 /* atomicMin was commented out… Hence */
hostSumArray[97] = 209 – not equal to BLOCKSIZE
hostSumArray[98] = 97 – not equal to BLOCKSIZE
hostSumArray[99] = 209 – not equal to BLOCKSIZE
hostSumArray[100] = 161 – not equal to BLOCKSIZE
hostSumArray[101] = 97 – not equal to BLOCKSIZE
hostSumArray[102] = 209 – not equal to BLOCKSIZE
hostSumArray[103] = 209 – not equal to BLOCKSIZE
hostSumArray[104] = 209 – not equal to BLOCKSIZE
hostSumArray[105] = 129 – not equal to BLOCKSIZE
hostSumArray[106] = 209 – not equal to BLOCKSIZE
hostSumArray[107] = 97 – not equal to BLOCKSIZE
hostSumArray[108] = 97 – not equal to BLOCKSIZE
hostSumArray[109] = 97 – not equal to BLOCKSIZE
hostSumArray[110] = 209 – not equal to BLOCKSIZE
hostSumArray[111] = 129 – not equal to BLOCKSIZE
hostSumArray[112] = 209 – not equal to BLOCKSIZE
hostSumArray[113] = 209 – not equal to BLOCKSIZE
hostSumArray[114] = 209 – not equal to BLOCKSIZE
hostSumArray[115] = 97 – not equal to BLOCKSIZE
hostSumArray[116] = 129 – not equal to BLOCKSIZE
hostSumArray[117] = 97 – not equal to BLOCKSIZE
hostSumArray[118] = 97 – not equal to BLOCKSIZE
hostSumArray[119] = 97 – not equal to BLOCKSIZE
hostSumArray[148] = 225 – not equal to BLOCKSIZE
hostSumArray[151] = 225 – not equal to BLOCKSIZE
hostSumArray[155] = 225 – not equal to BLOCKSIZE
hostSumArray[156] = 209 – not equal to BLOCKSIZE
hostSumArray[157] = 209 – not equal to BLOCKSIZE
hostSumArray[159] = 225 – not equal to BLOCKSIZE
hostSumArray[161] = 97 – not equal to BLOCKSIZE
hostSumArray[162] = 129 – not equal to BLOCKSIZE
hostSumArray[166] = 161 – not equal to BLOCKSIZE
hostSumArray[171] = 97 – not equal to BLOCKSIZE
hostSumArray[173] = 129 – not equal to BLOCKSIZE
hostSumArray[180] = 193 – not equal to BLOCKSIZE
hostSumArray[181] = 177 – not equal to BLOCKSIZE
hostSumArray[236] = 97 – not equal to BLOCKSIZE
hostSumArray[237] = 129 – not equal to BLOCKSIZE
hostSumArray[238] = 209 – not equal to BLOCKSIZE
hostSumArray[239] = 209 – not equal to BLOCKSIZE
hostSumArray[241] = 145 – not equal to BLOCKSIZE
hostSumArray[245] = 225 – not equal to BLOCKSIZE
hostSumArray[250] = 193 – not equal to BLOCKSIZE
hostSumArray[255] = 129 – not equal to BLOCKSIZE
hostSumArray[258] = 193 – not equal to BLOCKSIZE
hostSumArray[259] = 193 – not equal to BLOCKSIZE
hostSumArray[262] = 193 – not equal to BLOCKSIZE
hostSumArray[266] = 129 – not equal to BLOCKSIZE
hostSumArray[270] = 242 – not equal to BLOCKSIZE
hostSumArray[271] = 145 – not equal to BLOCKSIZE
hostSumArray[286] = 97 – not equal to BLOCKSIZE
hostSumArray[290] = 193 – not equal to BLOCKSIZE
hostSumArray[291] = 97 – not equal to BLOCKSIZE
hostSumArray[294] = 97 – not equal to BLOCKSIZE
hostSumArray[295] = 193 – not equal to BLOCKSIZE
hostSumArray[297] = 193 – not equal to BLOCKSIZE
hostSumArray[299] = 97 – not equal to BLOCKSIZE
hostSumArray[300] = 145 – not equal to BLOCKSIZE
hostSumArray[304] = 179 – not equal to BLOCKSIZE
hostSumArray[306] = 225 – not equal to BLOCKSIZE
hostSumArray[307] = 193 – not equal to BLOCKSIZE
hostSumArray[316] = 225 – not equal to BLOCKSIZE
hostSumArray[318] = 193 – not equal to BLOCKSIZE
hostSumArray[319] = 81 – not equal to BLOCKSIZE
hostSumArray[329] = 225 – not equal to BLOCKSIZE
hostSumArray[330] = 145 – not equal to BLOCKSIZE
hostSumArray[335] = 129 – not equal to BLOCKSIZE
hostSumArray[336] = 129 – not equal to BLOCKSIZE
hostSumArray[339] = 97 – not equal to BLOCKSIZE
hostSumArray[349] = 65 – not equal to BLOCKSIZE
hostSumArray[361] = 198 – not equal to BLOCKSIZE
hostSumArray[364] = 97 – not equal to BLOCKSIZE
hostSumArray[375] = 97 – not equal to BLOCKSIZE
hostSumArray[392] = 161 – not equal to BLOCKSIZE
hostSumArray[397] = 129 – not equal to BLOCKSIZE
hostSumArray[407] = 161 – not equal to BLOCKSIZE
hostSumArray[416] = 97 – not equal to BLOCKSIZE
hostSumArray[418] = 129 – not equal to BLOCKSIZE
hostSumArray[425] = 193 – not equal to BLOCKSIZE
hostSumArray[436] = 193 – not equal to BLOCKSIZE
hostSumArray[444] = 193 – not equal to BLOCKSIZE
hostSumArray[447] = 193 – not equal to BLOCKSIZE
hostSumArray[451] = 145 – not equal to BLOCKSIZE
hostSumArray[452] = 33 – not equal to BLOCKSIZE
hostSumArray[454] = 193 – not equal to BLOCKSIZE
hostSumArray[456] = 225 – not equal to BLOCKSIZE
hostSumArray[457] = 129 – not equal to BLOCKSIZE
hostSumArray[458] = 129 – not equal to BLOCKSIZE
hostSumArray[462] = 161 – not equal to BLOCKSIZE
hostSumArray[463] = 97 – not equal to BLOCKSIZE
hostSumArray[465] = 97 – not equal to BLOCKSIZE
hostSumArray[467] = 97 – not equal to BLOCKSIZE
hostSumArray[468] = 129 – not equal to BLOCKSIZE
hostSumArray[469] = 129 – not equal to BLOCKSIZE
hostSumArray[472] = 129 – not equal to BLOCKSIZE
hostSumArray[473] = 161 – not equal to BLOCKSIZE
hostSumArray[474] = 33 – not equal to BLOCKSIZE
hostSumArray[476] = 129 – not equal to BLOCKSIZE
hostSumArray[478] = 97 – not equal to BLOCKSIZE
hostSumArray[479] = 129 – not equal to BLOCKSIZE
hostSumArray[489] = 193 – not equal to BLOCKSIZE
hostSumArray[498] = 193 – not equal to BLOCKSIZE
hostSumArray[499] = 193 – not equal to BLOCKSIZE
hostSumArray[506] = 193 – not equal to BLOCKSIZE
hostSumArray[507] = 193 – not equal to BLOCKSIZE
hostSumArray[508] = 193 – not equal to BLOCKSIZE
hostSumArray[509] = 193 – not equal to BLOCKSIZE
hostSumArray[512] = 145 – not equal to BLOCKSIZE
hostSumArray[515] = 161 – not equal to BLOCKSIZE
hostSumArray[520] = 149 – not equal to BLOCKSIZE
hostSumArray[526] = 225 – not equal to BLOCKSIZE
hostSumArray[530] = 149 – not equal to BLOCKSIZE
hostSumArray[531] = 149 – not equal to BLOCKSIZE
hostSumArray[544] = 225 – not equal to BLOCKSIZE
hostSumArray[545] = 193 – not equal to BLOCKSIZE
hostSumArray[552] = 129 – not equal to BLOCKSIZE
hostSumArray[553] = 129 – not equal to BLOCKSIZE
hostSumArray[554] = 129 – not equal to BLOCKSIZE
hostSumArray[555] = 129 – not equal to BLOCKSIZE
hostSumArray[557] = 193 – not equal to BLOCKSIZE
hostSumArray[560] = 193 – not equal to BLOCKSIZE
hostSumArray[561] = 193 – not equal to BLOCKSIZE
hostSumArray[562] = 161 – not equal to BLOCKSIZE
hostSumArray[563] = 129 – not equal to BLOCKSIZE
hostSumArray[564] = 129 – not equal to BLOCKSIZE
hostSumArray[565] = 129 – not equal to BLOCKSIZE
hostSumArray[568] = 193 – not equal to BLOCKSIZE
hostSumArray[578] = 129 – not equal to BLOCKSIZE
hostSumArray[582] = 145 – not equal to BLOCKSIZE
hostSumArray[583] = 145 – not equal to BLOCKSIZE
hostSumArray[584] = 145 – not equal to BLOCKSIZE
hostSumArray[585] = 145 – not equal to BLOCKSIZE
hostSumArray[586] = 161 – not equal to BLOCKSIZE
hostSumArray[588] = 161 – not equal to BLOCKSIZE
hostSumArray[589] = 161 – not equal to BLOCKSIZE
hostSumArray[593] = 145 – not equal to BLOCKSIZE
hostSumArray[594] = 145 – not equal to BLOCKSIZE
hostSumArray[595] = 145 – not equal to BLOCKSIZE
hostSumArray[596] = 161 – not equal to BLOCKSIZE
hostSumArray[597] = 161 – not equal to BLOCKSIZE

Continuing…

  1. I dumped the POINTS array to global memory for an erring block (in my setup, with 61 blocks running, block 0 was constantly failing, so I just dumped the POINTS array corresponding to block 0).

    a) I find that after some index, all elements of the POINTS array are just plain 0.
    b) This usually starts at index n*WARP_SIZE + 1, but that is not always the case. At least once, I saw it start at index 79.

Thanks for your tireless efforts, Sarnath. I hope you find the key soon.

do
{
	if (!threadIdx.x)
		changed = 0;
	__syncthreads();

	if (points[threadIdx.x] != points[threadIdx.x + 1])
	{
		points[threadIdx.x + 1] = 1;
		changed = 1;
	}
	__syncthreads();
} while (changed);

how is that not a giant race condition (namely the conditional)? what happens if you do something like

do
{
	int modifier = 0;

	if (!threadIdx.x)
		changed = 0;
	__syncthreads();

	if (points[threadIdx.x] != points[threadIdx.x + 1])
	{
		modifier = 1;
		changed = 1;
	}
	__syncthreads();

	points[threadIdx.x + 1] += modifier;
	__syncthreads();
} while (changed);

I haven't tried this (I should be asleep), but at least it doesn't have that giant race condition. it seems like you're depending on certain vagaries of intra-SM scheduling in your implementation, which is bad; don't do that. but maybe I'm missing something in your code because it's 1:30 (like if this should actually be a reduction that sweeps across no matter what the order of the initial writes is).

Tim,

The race exists, as I pointed out before, but it is innocent; it does not affect the final output.

What he is trying to create is a ripple of ones starting from the 0th index. The race can only make the ripple move faster, that's all; the final output should remain the same. (For example, points starts as 1, 0, 0, …; on every pass each element whose left neighbour is 1 becomes 1, so the array eventually fills with 1s regardless of the order in which the threads run.)

Anyway, good night! See you tomorrow.

@CVN,

Not a problem at all. I somehow think there is some problem hiding inside shared memory. We saw the same thing with shared-memory atomics, and we still don't have a convincing explanation for it… I am hoping this thread will bring it out…

Thanks, tmurray, I’ll try your code.

Actually, I used a similar approach for marking connected nodes in a graph. The first '1' value is the initial node; at the end of the program, all '1' values correspond to nodes connected to the initial node. You can see that I only need the final state of the points array, regardless of the execution order of the threads.
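Roughly, the real code follows this pattern (a heavily simplified sketch: one block, one thread per node, adjacency matrix in global memory; names are only illustrative):

#define NODES 256                          /* one thread per node, single block */

// adj is a NODES x NODES 0/1 adjacency matrix; mark receives the result.
__global__ void markConnected(const int *adj, int *mark, int start)
{
	volatile __shared__ int marked[NODES];
	volatile __shared__ int changed;

	marked[threadIdx.x] = (threadIdx.x == start) ? 1 : 0;
	__syncthreads();

	do
	{
		if (!threadIdx.x)
			changed = 0;
		__syncthreads();

		if (!marked[threadIdx.x])
		{
			// Become marked if any already-marked node is adjacent to me.
			for (int j = 0; j < NODES; j++)
			{
				if (marked[j] && adj[j * NODES + threadIdx.x])
				{
					marked[threadIdx.x] = 1;
					changed = 1;
					break;
				}
			}
		}
		__syncthreads();
	} while (changed);

	mark[threadIdx.x] = marked[threadIdx.x];   // final 1s = nodes connected to 'start'
}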

nope, my code still breaks (I’m bad at this sleeping thing). I’m going to work on this in the morning.

Ah, I raised this problem in another topic last month (http://forums.nvidia.com/index.php?showtopic=96829). There was no meaningful reply, probably because my original code is much more complicated than the reduced version here.