Possible CUDA bug for multiple active blocks

has anyone repro’d this on a non-Windows machine? I tested on Vista 64 and repro’d it on my GTX 280 but I’m unable to on my stock 2.2 installation on Ubuntu 9.04 64-bit with a C1060.

also works on my mac. okay that narrows it down.

The problem is reproducible on my Linux desktop, Kubuntu 8.04LTS 64-bit, CUDA 2.1, for both GTX260 GPU and 8200 mGPU.

CUDA 2.1 doesn’t really help me here. (and I could repro it on my Vista machine at home but not my Server 08 machine at work? argh blargh)

edit: okay now I got it to repro when I increased the number of blocks to 65535. still weird, because at home it was repro’ing every time with 31…

I am a bit scared of the 2.2 beta driver, so I still use CUDA 2.1 on my Linux desktop.

Yes, the problem appears when the grid size is larger than the number of MPs available, but you can certainly try with a much larger grid (say, 300). However, the algorithm should work properly (i.e. Min sum = BLOCKSIZE) for any grid size if the shared memory in use is larger than 8192 bytes.

I can’t reproduce it on my 2.2 installation running Ubuntu 9.04 64-bit with a Quadro FX 1600M.

N.

Nico, did you tinker with the GRIDSIZE variable in the original code while testing? When the number of blocks spawned is small, the problem does NOT show up consistently. If you increase it to 100 or 1000, the problem shows up. Moreover, I run the executable multiple times in a loop for it to show up. Sometimes the first runs tend to be good.

btw, I am able to reproduce with WinXP 32-bit Professional SP3, CUDA 2.2, TESLA C1060 (which was bought)

Best regards,

Sarnath

CVN,

I think I have found another race condition which would actually cause a hang. But somehow, it is not showing up here…

I just found it last night, when I was about to fall asleep…

Here is my doubt:

do
{
	if (!threadIdx.x)
		changed = 0;
	__syncthreads();

	if (points[threadIdx.x] != points[threadIdx.x + 1])
	{
		points[threadIdx.x + 1] = 1;
		changed = 1;
	}
	__syncthreads();
} while (changed);

Look at “while (changed)”: while some threads are still checking it, thread 0 might go past it and change that variable to 0 at the top of the do-while loop. That is a race: some threads can EXIT the do-while loop, and possibly the kernel (depending on the warp scheduler).
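
To spell out one interleaving that I think can happen (the thread numbers here are just illustrative):

// All threads have passed the bottom __syncthreads() of iteration k,
// and changed == 1.
//
//   thread 0  (warp 0): reads "while (changed)" -> 1, re-enters the loop,
//                       executes "changed = 0", then waits at the first
//                       __syncthreads()
//   thread 47 (warp 1): only now reads "while (changed)" -> sees 0 and
//                       EXITS the loop, possibly the kernel
//
// Part of the block is then waiting at __syncthreads() for threads that
// have already left the loop.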

Maybe __syncthreads() does not wait for exited threads…?? But it is not clear what that behaviour would be, although Tim insists that all threads of the block must reach __syncthreads() (otherwise the behaviour is undefined).

And if this race struck, it should ideally result in a hang in __syncthreads(). I don’t know why that is NOT happening, though. What do you say?

Also, I am trying to change the BLOCKSIZE to find out whether smaller blocks are safe (at least a block size of 32 worked yesterday). I will do more tests along these lines and post the results.

OH YES!!! I FIXED IT!!!

Just put a “__syncthreads()” at the top of the do-while to fix that race… IT WORKS! I checked it here on my TESLA C1060.
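
For reference, here is how the loop reads with the fix. I am showing the declarations only as context (assume points and changed are __shared__ ints, as in the snippet above); the one real change is the extra __syncthreads() as the first statement inside the do-while:

__shared__ int points[BLOCKSIZE + 1];   // context only; types assumed
__shared__ int changed;                 // context only

do
{
	__syncthreads();   // the fix: every thread has finished testing
	                   // "while (changed)" before the flag is cleared
	if (!threadIdx.x)
		changed = 0;
	__syncthreads();

	if (points[threadIdx.x] != points[threadIdx.x + 1])
	{
		points[threadIdx.x + 1] = 1;
		changed = 1;
	}
	__syncthreads();
} while (changed);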

ahhhh, excellent. was looking at this, noticed that it always broke across warp boundaries, and decided to think about it tomorrow.

Thanks!

Sometimes, GPU programming is not as simple and straightforward as it might seem.

We have seen issues like this before… but fortunately fixed them in code review.

Good job, Sarnath. That fix also works for my original code.

Wow, that’s subtle… but indeed that’s a code bug! The while() conditional at the end probably threw us all off since you don’t think of it as being connected to the code at the top of the loop, but of course it is.

Now I want to go look at some of my own code where I test flags similar to that!

Sarnath: excellent catch, that was tricky and you were the only one to spot it.

CVN, SPWorley,

Thanks!

We have faced these issues before. It is not obvious at all.

Yep, increased it to 10000 and ran it multiple times; the result stays at 256 with no error.

But I see you already tracked down the problem. Inserting the syncthreads on top of the while loop sounds like a clean solution to eliminate the race condition. Well spotted :)

N.

Sarnath, I have to admit it was very bright of you to make this discovery.

After I read your post, I recalled that I had a very similar issue in my code, in several places, which I fixed in exactly the same way you proposed. I discovered the issue quite unexpectedly, by running my code in emulation mode, after all my unit tests had worked perfectly on the GPU. According to my observations (Linux/CUDA-2.1), the emulation scheduler and the GPU scheduler behave quite differently. The emulation scheduler typically runs each thread until it hits __syncthreads() (those are executed far more frequently than the pthread time-slicing interval in my code), whereas the GPU scheduler typically switches threads on almost every instruction. So the “across the loop” race was not manifesting itself in the GPU runs, but caused a very well reproducible failure of my unit test, or an assert, in emulation mode.
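
Roughly, the two schedules play out like this for a loop of that shape (illustrative only):

// Emulation (each thread runs until it reaches a __syncthreads()):
//   thread 0 : "while (changed)" -> 1, re-enters, sets changed = 0,
//              stops at the top __syncthreads()
//   thread 1 : "while (changed)" -> now 0, leaves the loop
//   ... and so on for the remaining threads ...
//   so the stale flag is read on essentially every run.
//
// GPU (warps interleaved at a much finer grain):
//   whether a given warp reads the flag before or after the reset depends
//   on timing, so the race may or may not show up on a particular run.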