Possible CUDA bug with multiple active blocks

Hi everyone,

I would like to reproduce a weird CUDA bug here. It causes problems when there is more than one active block on an MP (multiprocessor).

The full test code is as follows:

#include <stdio.h>
#include <stdlib.h>

#define GRIDSIZE  28
#define BLOCKSIZE 256

__device__ int sumval = 2000;

/* Only used to pad shared memory: varying the size of buf controls how much
   shared memory each block occupies. */
typedef struct {
#if (0)
	int buf[1786];
#else
	int buf[1785];
#endif
	int flag;
} patch_st;

__global__ void mykernel()
{
	volatile __shared__ patch_st patch;
	volatile __shared__ int points[BLOCKSIZE + 1];
	volatile __shared__ int changed;

	// Initialize values: points[0] = 1, everything else 0
	if (!threadIdx.x)
	{
		points[0] = 1;
		points[BLOCKSIZE] = 0;
		patch.flag = 1;
	}
	else
		points[threadIdx.x] = 0;
	__syncthreads();

	// Ripple the leading 1 across the array until no element changes any more
	do
	{
		if (!threadIdx.x)
			changed = 0;
		__syncthreads();

		if (points[threadIdx.x] != points[threadIdx.x + 1])
		{
			points[threadIdx.x + 1] = 1;
			changed = 1;
		}
		__syncthreads();
	} while (changed);

	// Thread 0 sums the array; the expected result is BLOCKSIZE (256)
	if (!threadIdx.x)
	{
		int sum = 0, i;
		for (i = 0; i < BLOCKSIZE; i++)
			sum += points[i];
		atomicMin(&sumval, sum);
		patch.flag = 0;
	}
}

int main(void)
{
	dim3 grid, block;
	int hostsum;

	grid.x = GRIDSIZE;
	block.x = BLOCKSIZE;
	//cudaSetDevice(0);

	mykernel<<<grid, block>>>();
	cudaThreadSynchronize();
	printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));

	cudaMemcpyFromSymbol(&hostsum, "sumval", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("Min sum = %d\n", hostsum);

	return 0;
}

The struct patch_st is only there to control the amount of shared memory occupied by each block (by varying the size of the buf field).

First of all, I set GRIDSIZE to the number of multiprocessors available on the GPU (27 for my GTX 260) and the shared memory usage to 8192 bytes (half of the space available on each MP). With these settings, the program yields the correct result (Min sum = 256). In this case each block runs on its own MP, so there is only one active block per MP.

$ nvcc --optimize 3 -arch=sm_11 --ptxas-options=-v -o testbug1 testbug1.cu

testbug1.cu

tmpxft_00002348_00000000-3_testbug1.cudafe1.gpu

tmpxft_00002348_00000000-8_testbug1.cudafe2.gpu

ptxas info	: Compiling entry function '_Z8mykernelv'

ptxas info	: Used 2 registers, 8192+8188 bytes smem, 8 bytes cmem[1], 4 bytes cmem[14]

tmpxft_00002348_00000000-3_testbug1.cudafe1.cpp

tmpxft_00002348_00000000-13_testbug1.ii

Next, I increase GRIDSIZE to 28 while keeping the shared memory usage at 8192 bytes. Now one MP runs 2 blocks, and those 2 blocks can run simultaneously (2 active blocks) because each requires only half of the available shared memory (and the register usage is small enough). With these settings, the program yields random incorrect results (less than 256). When there is more than one active block, the do-while loop in the code above does not work as expected.

Then, I increase the shared memory usage to 8196 bytes (more than half of the available space) while GRIDSIZE stays unchanged (28 on my GTX 260). In this case, although one MP still runs 2 blocks, those 2 blocks cannot run simultaneously. With these settings, the program works properly again (Min sum = 256).

$ nvcc --optimize 3 -arch=sm_11 --ptxas-options=-v -o testbug1 testbug1.cu

testbug1.cu

tmpxft_0000147c_00000000-3_testbug1.cudafe1.gpu

tmpxft_0000147c_00000000-8_testbug1.cudafe2.gpu

ptxas info	: Compiling entry function '_Z8mykernelv'

ptxas info	: Used 2 registers, 8196+8192 bytes smem, 8 bytes cmem[1], 4 bytes cmem[14]

tmpxft_0000147c_00000000-3_testbug1.cudafe1.cpp

tmpxft_0000147c_00000000-13_testbug1.ii

The issue also happens on my 8200 mGPU.

I’ve run the test on WinXP 32-bit, CUDA 2.2.
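For reference, here is a quick way to double-check the MP count and the per-block shared memory limit on a given card (just a small sketch using cudaGetDeviceProperties, not part of the repro itself):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	cudaDeviceProp prop;

	// Query device 0; sharedMemPerBlock is 16384 bytes on compute 1.x parts.
	cudaGetDeviceProperties(&prop, 0);
	printf("MPs: %d, shared mem per block: %u bytes, regs per block: %d\n",
	       prop.multiProcessorCount,
	       (unsigned int)prop.sharedMemPerBlock,
	       prop.regsPerBlock);
	return 0;
}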

Can anybody explain this issue? I am wondering whether it is a hardware or toolchain bug.

This may be very relevant: you do not really have 16384 bytes of shared memory to play with. Each block uses a few extra words to store the thread and block IDs and the grid/block shape. If you're asking for 8192 bytes of shared memory and CUDA is assigning two blocks per MP, then something is wrong, since the real use is 8192 bytes plus overhead and there isn't really enough shared RAM to support that.

Exactly how much per-block shared memory overhead there is remains a little mysterious, but I believe it is something like 12 words per block. A very old forum thread from over a year ago worked this out through some sneaky analysis using raw pointer-probe experiments.

Now, whether CUDA is doing something wrong or there's a problem in your code, I don't know, but your statement that each block uses 8192 bytes and two blocks run at once immediately throws up "err… that shouldn't be true" thoughts.
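If you want to see what the runtime actually charges that kernel for, something along these lines should show it (a sketch only; cudaFuncGetAttributes may not be available on every toolkit version, and whether the reported size includes the launch overhead can differ between architectures):

// Drop into main() after the kernel is defined; on very old toolkits the
// second argument may have to be the kernel name as a string instead.
cudaFuncAttributes attr;
if (cudaFuncGetAttributes(&attr, mykernel) == cudaSuccess)
	printf("static smem per block: %u bytes, regs per thread: %d\n",
	       (unsigned int)attr.sharedSizeBytes, attr.numRegs);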

you're not checking to see if the kernel actually gets launched, and it shouldn't be launching. the correct error-checking code would be:

mykernel<<<grid, block>>>();
if (cudaGetLastError() != cudaSuccess)
{
	printf("error in launching kernel\n");
	return 1;
}
cudaThreadSynchronize();
...

I've added your code to the test. The kernel is launched successfully in all cases.

intriguing… okay now I will look into this a lot more.

If I reduce the shared memory usage (to much less than 8192 bytes), the problem still happens. I chose the sizes 8192 and 8196 just to make the difference easier to see.

Hello CVN, Good to see you back in action Mate…

Coming to the point… the IF statement in the do-while loop that does the comparison has a race condition at warp boundaries, no? The way you are updating points[threadIdx.x + 1] causes a race. If you were updating only points[threadIdx.x], then it would not matter…

Isn't it?

Sorry, I cannot follow your idea, Sarnath. Could you suggest working code for that?

Sorry, useless post… (edited)

No, the execution order of threads does not affect the program logic, Sarnath. All threads must synchronize at the __syncthreads() at the end of the loop body.

Sorry, CVN. I missed that second __syncthreads() out there. It's all fine.

I'm not sure why you are seeing less than 256, though. Have you tried just doing an atomicAdd to see if that works out fine?

The problem is reproducible on my Tesla C1060 (a retail card, not a pre-production unit).

For me, the problem becomes clearly visible when GRIDSIZE reaches 92. It is reproducible at 91 as well, but less frequently.

OK,

I just declared a per-block array and wrote each block's SUM out to it.
I find that the SUM computed for certain blockIds is simply NOT 256. The problem appears even with atomicMin commented out.
Thus, atomics can safely be ruled out of this equation.

The problem occurs with smaller shared memory sizes as well (as claimed by CVN).

The blockIds exhibiting this behaviour on my Tesla are 60, 61, 90, 91 (for a total of 92 blocks).
Increasing the grid size to 600 destroys this pattern, and many more blocks exhibit the behaviour.
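Roughly, the change looks like this (a simplified sketch; devSumArray/hostSumArray are just the names I used for the dump, sized for up to 600 blocks):

__device__ int devSumArray[600];          /* one slot per block */

	/* at the end of mykernel(), inside the existing "if (!threadIdx.x)" block: */
	devSumArray[blockIdx.x] = sum;

/* on the host, after the kernel has completed: */
int hostSumArray[600];
cudaMemcpyFromSymbol(hostSumArray, "devSumArray", sizeof(hostSumArray), 0,
                     cudaMemcpyDeviceToHost);
for (int i = 0; i < GRIDSIZE; i++)
	if (hostSumArray[i] != BLOCKSIZE)
		printf("hostSumArray[%d] = %d -- not equal to BLOCKSIZE\n", i, hostSumArray[i]);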

A screenshot:
$ ./a
err = no error
Min sum = 2000 /* atomicMin was commented out… Hence */
hostSumArray[97] = 209 – not equal to BLOCKSIZE
hostSumArray[98] = 97 – not equal to BLOCKSIZE
hostSumArray[99] = 209 – not equal to BLOCKSIZE
hostSumArray[100] = 161 – not equal to BLOCKSIZE
hostSumArray[101] = 97 – not equal to BLOCKSIZE
hostSumArray[102] = 209 – not equal to BLOCKSIZE
hostSumArray[103] = 209 – not equal to BLOCKSIZE
hostSumArray[104] = 209 – not equal to BLOCKSIZE
hostSumArray[105] = 129 – not equal to BLOCKSIZE
hostSumArray[106] = 209 – not equal to BLOCKSIZE
hostSumArray[107] = 97 – not equal to BLOCKSIZE
hostSumArray[108] = 97 – not equal to BLOCKSIZE
hostSumArray[109] = 97 – not equal to BLOCKSIZE
hostSumArray[110] = 209 – not equal to BLOCKSIZE
hostSumArray[111] = 129 – not equal to BLOCKSIZE
hostSumArray[112] = 209 – not equal to BLOCKSIZE
hostSumArray[113] = 209 – not equal to BLOCKSIZE
hostSumArray[114] = 209 – not equal to BLOCKSIZE
hostSumArray[115] = 97 – not equal to BLOCKSIZE
hostSumArray[116] = 129 – not equal to BLOCKSIZE
hostSumArray[117] = 97 – not equal to BLOCKSIZE
hostSumArray[118] = 97 – not equal to BLOCKSIZE
hostSumArray[119] = 97 – not equal to BLOCKSIZE
hostSumArray[148] = 225 – not equal to BLOCKSIZE
hostSumArray[151] = 225 – not equal to BLOCKSIZE
hostSumArray[155] = 225 – not equal to BLOCKSIZE
hostSumArray[156] = 209 – not equal to BLOCKSIZE
hostSumArray[157] = 209 – not equal to BLOCKSIZE
hostSumArray[159] = 225 – not equal to BLOCKSIZE
hostSumArray[161] = 97 – not equal to BLOCKSIZE
hostSumArray[162] = 129 – not equal to BLOCKSIZE
hostSumArray[166] = 161 – not equal to BLOCKSIZE
hostSumArray[171] = 97 – not equal to BLOCKSIZE
hostSumArray[173] = 129 – not equal to BLOCKSIZE
hostSumArray[180] = 193 – not equal to BLOCKSIZE
hostSumArray[181] = 177 – not equal to BLOCKSIZE
hostSumArray[236] = 97 – not equal to BLOCKSIZE
hostSumArray[237] = 129 – not equal to BLOCKSIZE
hostSumArray[238] = 209 – not equal to BLOCKSIZE
hostSumArray[239] = 209 – not equal to BLOCKSIZE
hostSumArray[241] = 145 – not equal to BLOCKSIZE
hostSumArray[245] = 225 – not equal to BLOCKSIZE
hostSumArray[250] = 193 – not equal to BLOCKSIZE
hostSumArray[255] = 129 – not equal to BLOCKSIZE
hostSumArray[258] = 193 – not equal to BLOCKSIZE
hostSumArray[259] = 193 – not equal to BLOCKSIZE
hostSumArray[262] = 193 – not equal to BLOCKSIZE
hostSumArray[266] = 129 – not equal to BLOCKSIZE
hostSumArray[270] = 242 – not equal to BLOCKSIZE
hostSumArray[271] = 145 – not equal to BLOCKSIZE
hostSumArray[286] = 97 – not equal to BLOCKSIZE
hostSumArray[290] = 193 – not equal to BLOCKSIZE
hostSumArray[291] = 97 – not equal to BLOCKSIZE
hostSumArray[294] = 97 – not equal to BLOCKSIZE
hostSumArray[295] = 193 – not equal to BLOCKSIZE
hostSumArray[297] = 193 – not equal to BLOCKSIZE
hostSumArray[299] = 97 – not equal to BLOCKSIZE
hostSumArray[300] = 145 – not equal to BLOCKSIZE
hostSumArray[304] = 179 – not equal to BLOCKSIZE
hostSumArray[306] = 225 – not equal to BLOCKSIZE
hostSumArray[307] = 193 – not equal to BLOCKSIZE
hostSumArray[316] = 225 – not equal to BLOCKSIZE
hostSumArray[318] = 193 – not equal to BLOCKSIZE
hostSumArray[319] = 81 – not equal to BLOCKSIZE
hostSumArray[329] = 225 – not equal to BLOCKSIZE
hostSumArray[330] = 145 – not equal to BLOCKSIZE
hostSumArray[335] = 129 – not equal to BLOCKSIZE
hostSumArray[336] = 129 – not equal to BLOCKSIZE
hostSumArray[339] = 97 – not equal to BLOCKSIZE
hostSumArray[349] = 65 – not equal to BLOCKSIZE
hostSumArray[361] = 198 – not equal to BLOCKSIZE
hostSumArray[364] = 97 – not equal to BLOCKSIZE
hostSumArray[375] = 97 – not equal to BLOCKSIZE
hostSumArray[392] = 161 – not equal to BLOCKSIZE
hostSumArray[397] = 129 – not equal to BLOCKSIZE
hostSumArray[407] = 161 – not equal to BLOCKSIZE
hostSumArray[416] = 97 – not equal to BLOCKSIZE
hostSumArray[418] = 129 – not equal to BLOCKSIZE
hostSumArray[425] = 193 – not equal to BLOCKSIZE
hostSumArray[436] = 193 – not equal to BLOCKSIZE
hostSumArray[444] = 193 – not equal to BLOCKSIZE
hostSumArray[447] = 193 – not equal to BLOCKSIZE
hostSumArray[451] = 145 – not equal to BLOCKSIZE
hostSumArray[452] = 33 – not equal to BLOCKSIZE
hostSumArray[454] = 193 – not equal to BLOCKSIZE
hostSumArray[456] = 225 – not equal to BLOCKSIZE
hostSumArray[457] = 129 – not equal to BLOCKSIZE
hostSumArray[458] = 129 – not equal to BLOCKSIZE
hostSumArray[462] = 161 – not equal to BLOCKSIZE
hostSumArray[463] = 97 – not equal to BLOCKSIZE
hostSumArray[465] = 97 – not equal to BLOCKSIZE
hostSumArray[467] = 97 – not equal to BLOCKSIZE
hostSumArray[468] = 129 – not equal to BLOCKSIZE
hostSumArray[469] = 129 – not equal to BLOCKSIZE
hostSumArray[472] = 129 – not equal to BLOCKSIZE
hostSumArray[473] = 161 – not equal to BLOCKSIZE
hostSumArray[474] = 33 – not equal to BLOCKSIZE
hostSumArray[476] = 129 – not equal to BLOCKSIZE
hostSumArray[478] = 97 – not equal to BLOCKSIZE
hostSumArray[479] = 129 – not equal to BLOCKSIZE
hostSumArray[489] = 193 – not equal to BLOCKSIZE
hostSumArray[498] = 193 – not equal to BLOCKSIZE
hostSumArray[499] = 193 – not equal to BLOCKSIZE
hostSumArray[506] = 193 – not equal to BLOCKSIZE
hostSumArray[507] = 193 – not equal to BLOCKSIZE
hostSumArray[508] = 193 – not equal to BLOCKSIZE
hostSumArray[509] = 193 – not equal to BLOCKSIZE
hostSumArray[512] = 145 – not equal to BLOCKSIZE
hostSumArray[515] = 161 – not equal to BLOCKSIZE
hostSumArray[520] = 149 – not equal to BLOCKSIZE
hostSumArray[526] = 225 – not equal to BLOCKSIZE
hostSumArray[530] = 149 – not equal to BLOCKSIZE
hostSumArray[531] = 149 – not equal to BLOCKSIZE
hostSumArray[544] = 225 – not equal to BLOCKSIZE
hostSumArray[545] = 193 – not equal to BLOCKSIZE
hostSumArray[552] = 129 – not equal to BLOCKSIZE
hostSumArray[553] = 129 – not equal to BLOCKSIZE
hostSumArray[554] = 129 – not equal to BLOCKSIZE
hostSumArray[555] = 129 – not equal to BLOCKSIZE
hostSumArray[557] = 193 – not equal to BLOCKSIZE
hostSumArray[560] = 193 – not equal to BLOCKSIZE
hostSumArray[561] = 193 – not equal to BLOCKSIZE
hostSumArray[562] = 161 – not equal to BLOCKSIZE
hostSumArray[563] = 129 – not equal to BLOCKSIZE
hostSumArray[564] = 129 – not equal to BLOCKSIZE
hostSumArray[565] = 129 – not equal to BLOCKSIZE
hostSumArray[568] = 193 – not equal to BLOCKSIZE
hostSumArray[578] = 129 – not equal to BLOCKSIZE
hostSumArray[582] = 145 – not equal to BLOCKSIZE
hostSumArray[583] = 145 – not equal to BLOCKSIZE
hostSumArray[584] = 145 – not equal to BLOCKSIZE
hostSumArray[585] = 145 – not equal to BLOCKSIZE
hostSumArray[586] = 161 – not equal to BLOCKSIZE
hostSumArray[588] = 161 – not equal to BLOCKSIZE
hostSumArray[589] = 161 – not equal to BLOCKSIZE
hostSumArray[593] = 145 – not equal to BLOCKSIZE
hostSumArray[594] = 145 – not equal to BLOCKSIZE
hostSumArray[595] = 145 – not equal to BLOCKSIZE
hostSumArray[596] = 161 – not equal to BLOCKSIZE
hostSumArray[597] = 161 – not equal to BLOCKSIZE

Continuing…

  1. I dumped the POINTS array to global memory for an erring block (in my setup, with 61 blocks running, block 0 was constantly failing, so I just dumped the POINTS array corresponding to block 0).

    a) I find that after some index, all elements of the POINTS array are just plain 0.
    b) This usually starts at index n*WARP_SIZE + 1, but that is not always the case. At least once, I saw it start at index 79.

Thanks for your tireless efforts, Sarnath. I hope you find the key soon.

do
{
	if (!threadIdx.x)
		changed = 0;
	__syncthreads();

	if (points[threadIdx.x] != points[threadIdx.x + 1])
	{
		points[threadIdx.x + 1] = 1;
		changed = 1;
	}
	__syncthreads();
} while (changed);

how is that not a giant race condition (namely the conditional)? what happens if you do something like

do
{
	int modifier = 0;

	if (!threadIdx.x)
		changed = 0;
	__syncthreads();

	if (points[threadIdx.x] != points[threadIdx.x + 1])
	{
		modifier = 1;
		changed = 1;
	}
	__syncthreads();

	points[threadIdx.x + 1] += modifier;
	__syncthreads();
} while (changed);

I haven't tried this (I should be asleep), but at least it doesn't have that giant race condition. it seems like you're depending on certain vagaries of intra-SM scheduling in your implementation, which is bad; don't do that. but maybe I'm missing something in your code because it's 1:30 (like if this should actually be a reduction that sweeps across no matter what the order of the initial writes is).

Tim,

The race exists, as I pointed out before, but it is innocent; it does not affect the final output.

What he is trying to create is a ripple of ones starting from the 0th index. The race can only make the ripple move faster, that's all; the final output should remain the same. (For example, points starts as 1, 0, 0, …; on every pass each element whose left neighbour is 1 becomes 1, so the array eventually fills with 1s regardless of the order in which the threads run.)

Anyway, good night! See you tomorrow.

@CVN,

Not a problem at all. I somehow think there is some problem hiding inside shared memory. We saw the same thing with shared-memory atomics, and we still don't have a convincing explanation for it… I am hoping this thread will bring it out…

Thanks, tmurray, I’ll try your code.

Actually, I used a similar approach for marking connected nodes in a graph. The first '1' value is the initial node; at the end of the program, all '1' values correspond to nodes connected to the initial node. You can see that I only need the final state of the points array, regardless of the execution order of the threads.
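Roughly, the real code follows this pattern (a heavily simplified sketch: one block, one thread per node, adjacency matrix in global memory; names are only illustrative):

#define NODES 256                          /* one thread per node, single block */

// adj is a NODES x NODES 0/1 adjacency matrix; mark receives the result.
__global__ void markConnected(const int *adj, int *mark, int start)
{
	volatile __shared__ int marked[NODES];
	volatile __shared__ int changed;

	marked[threadIdx.x] = (threadIdx.x == start) ? 1 : 0;
	__syncthreads();

	do
	{
		if (!threadIdx.x)
			changed = 0;
		__syncthreads();

		if (!marked[threadIdx.x])
		{
			// Become marked if any already-marked node is adjacent to me.
			for (int j = 0; j < NODES; j++)
			{
				if (marked[j] && adj[j * NODES + threadIdx.x])
				{
					marked[threadIdx.x] = 1;
					changed = 1;
					break;
				}
			}
		}
		__syncthreads();
	} while (changed);

	mark[threadIdx.x] = marked[threadIdx.x];   // final 1s = nodes connected to 'start'
}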

nope, my code still breaks (I’m bad at this sleeping thing). I’m going to work on this in the morning.

Ah, I raised this problem in another topic last month (http://forums.nvidia.com/index.php?showtopic=96829). There was no meaningful reply, probably because my original code is much more complicated than the reduced version here.