I want to have the CPU communicate with an actively-running kernel on the GPU without going through the CUDA driver. I’ve read that unified memory can result in driver memory copy calls under the hood, so I would need to use zero-copy memory to share information in this case.
To confirm that I understand zero-copy memory, I put together this simple experiment. I create a zero-copy integer that the CPU initializes to zero. The CPU then launches a GPU kernel with the zero-copy GPU pointer. The CPU and GPU then go back and forth with each running these two steps:
- Poll rapidly for a value change
- Change the value
The idea is that it confirms the CPU sees changes issued by the GPU and vice versa. It is a short sequence only meant to count 0, 1, 2, 3. For some reason, the sequence works once but locks up if I try running it twice in a row.
Kernel Code:
__global__ void kHandshake(volatile int *handshake)
{
int tid = threadIdx.x;
if (tid == 0)
{
*handshake = 1;
int timeOutCount = 200000000;
// Spinwait for CPU to change handshake away from 1,
// but give up if it takes too long so that the GPU doesn't hang.
while (timeOutCount > 0)
{
timeOutCount--;
if ((*handshake) != 1)
{
*handshake = 3;
break;
}
}
if (timeOutCount == 0)
{
printf("GPU never saw handshake change\n");
}
}
}
CPU Code for Test Sequence
void usingZerocpy()
{
cudaError_t cErr = cudaSuccess;
int *h_handshake;
cErr = cudaHostAlloc(&h_handshake, sizeof(float), cudaHostAllocMapped | cudaHostAllocPortable);
if (cErr) { printf("Ack! An Error!\n"); return; }
*h_handshake = 0;
int *d_handshake;
cErr = cudaHostGetDevicePointer(&d_handshake, h_handshake, 0);
if (cErr) { printf("Ack! An Error!\n"); return; }
printf("Launching kernel\n");
kHandshake<<<1, 1>>>(d_handshake);
volatile int *hs = (volatile int *)h_handshake;
while ((*hs) == 0); // Wait for GPU to change hs to 1
printf("CPU saw GPU state change!\n");
printf("Value of handshake is now %d\n", *hs);
printf("CPU changing state to 2\n");
*hs = 2;
while ((*hs) == 2); // Wait for GPU to change hs to 3
printf("CPU saw GPU state change!\n");
printf("Value of handshake is now %d\n", *hs);
cErr = cudaFreeHost(h_handshake);
}
This works as expected:
usingZerocpy();
However, this locks up on the second call:
usingZerocpy();
usingZerocpy();
This is the console output:
Launching kernel
CPU saw GPU state change!
Value of handshake is now 1
CPU changing state to 2
CPU saw GPU state change!
Value of handshake is now 3
Launching kernel
<this is where it hangs>
If I try debugging through the Nsight plug-in in Visual Studio, the problem disappears; I can call usingZeroCopy() 100 times in a row without a problem in that case. Additionally, it works if I add cudaDeviceReset() before each call to usingZerocpy().
Why doesn’t it work every time in general? This seems like something simple that I am missing.