Why Does This Zero-Copy Sequence Lock?

I want to have the CPU communicate with an actively-running kernel on the GPU without going through the CUDA driver. I’ve read that unified memory can result in driver memory copy calls under the hood, so I would need to use zero-copy memory to share information in this case.

To confirm that I understand zero-copy memory, I put together this simple experiment. I create a zero-copy integer that the CPU initializes to zero. The CPU then launches a GPU kernel with the zero-copy GPU pointer. The CPU and GPU then go back and forth with each running these two steps:

  1. Poll rapidly for a value change
  2. Change the value

The idea is that it confirms the CPU sees changes issued by the GPU and vice versa. It is a short sequence only meant to count 0, 1, 2, 3. For some reason, the sequence works once but locks up if I try running it twice in a row.

Kernel Code:

__global__ void kHandshake(volatile int *handshake)
{
	int tid = threadIdx.x;
	if (tid == 0)
	{
		*handshake = 1;

		int timeOutCount = 200000000;
		// Spinwait for CPU to change handshake away from 1,
		// but give up if it takes too long so that the GPU doesn't hang.
		while (timeOutCount > 0)
		{
			timeOutCount--;
			if ((*handshake) != 1)
			{
				*handshake = 3;
				break;
			}
		}
		if (timeOutCount == 0)
		{
			printf("GPU never saw handshake change\n");
		}
	}
}

CPU Code for Test Sequence

void usingZerocpy()
{
	cudaError_t cErr = cudaSuccess;

	int *h_handshake;
	cErr = cudaHostAlloc(&h_handshake, sizeof(float), cudaHostAllocMapped | cudaHostAllocPortable);
	if (cErr) { printf("Ack! An Error!\n"); return; }

	*h_handshake = 0;	

	int *d_handshake;
	cErr = cudaHostGetDevicePointer(&d_handshake, h_handshake, 0);
	if (cErr) { printf("Ack! An Error!\n"); return; }

	printf("Launching kernel\n");
	kHandshake<<<1, 1>>>(d_handshake);

	volatile int *hs = (volatile int *)h_handshake;

	while ((*hs) == 0);	// Wait for GPU to change hs to 1

	printf("CPU saw GPU state change!\n");
	printf("Value of handshake is now %d\n", *hs);

	printf("CPU changing state to 2\n");
	*hs = 2;
	while ((*hs) == 2); // Wait for GPU to change hs to 3

	printf("CPU saw GPU state change!\n");
	printf("Value of handshake is now %d\n", *hs);

	cErr = cudaFreeHost(h_handshake);
}

This works as expected:

usingZerocpy();

However, this locks up on the second call:

usingZerocpy();
usingZerocpy();

This is the console output:

Launching kernel
CPU saw GPU state change!
Value of handshake is now 1
CPU changing state to 2
CPU saw GPU state change!
Value of handshake is now 3
Launching kernel
<this is where it hangs>

If I try debugging through the Nsight plug-in in Visual Studio, the problem disappears; I can call usingZeroCopy() 100 times in a row without a problem in that case. Additionally, it works if I add cudaDeviceReset() before each call to usingZerocpy().

Why doesn’t it work every time in general? This seems like something simple that I am missing.