Unresponsive Polling of Pinned Memory

Hi,

I want to preempt a portion of thread blocks of a kernel, but I’m having trouble polling the pinned memory location that counts the number of concurrent thread blocks. The following is a part of my code to give some idea about what I’m doing. I use atomicSub to keep track of the number of active thread blocks.

__global__ void kernel(arg1, arg2, ... , int grid_size, int *block_index, volatile int max_blocks, int *concurrent_blocks)
{
	__shared__ int blockIdx_x;
	while(blockIdx.x < *max_blocks) {
		if(threadIdx.x == 0) {
			blockIdx_x = atomicAdd(block_index, 1);
		}
		__syncthreads();
		if(blockIdx_x >= grid_size) {
			break;
		}
		// Start of kernel code
		...
		// End of kernel code
	}
	if(threadIdx.x == 0) {
		atomicSub(concurrent_blocks, 1);
	}
}

Basically, I override the blockIdx.x by my own blockIdx_x and launch only a number of thread blocks that fit on the device, and in the while loop grab new logical thread blocks to execute. On the host side, I write to max_blocks to control the number of concurrent thread blocks on the device. Then using a while loop:

while(*concurrent_blocks > target_concurrent_blocks);

I wait in a separate host thread until the desired number of blocks have preempted. The problem is that this mechanism works for some kernels but doesn’t work for some others. I know that the excess thread blocks don’t do more work after I write to max_blocks, because I printed the (blockIdx.x, blockIdx_x) pairs, and those thread blocks with blockIdx.x greater than max_blocks had only one value in the output. So the code is working as it’s supposed to on the device, but I’m not observing the desired behavior by polling on the host side. Is it because of internal warp scheduling on the GPU that doesn’t execute the atomicSub until the very end, or am I doing something wrong?

Thanks for your help.

this is well-known “permanent grids” technique. standard way is to run excessive amount of blocks. extra blocks will start only once all work was done, and will immediately exit. no need of concurrent_blocks machinery at all

This is exactly what I intend to do. But there’s also another part to it. When a new kernel wants to run, I want to make room for that so I need to preempt a bunch of thread blocks of the running kernel, that’s what max_blocks and concurrent_blocks are doing.

You can’t atomically add in mapped host memory unless you use the new atomicAdd_system() of compute capability 6.x. Keep the atomic counter in device memory on earlier GPUs, and copy the result to the mapped memory to give the host an approximate indication of where the device counter stands.

I think that deserves some clarification. You can certainly atomically add to mapped host memory, when the operation is originating from a single GPU. The atomic operations originating from a single GPU will be properly handled (i.e. will remain atomic with respect to each other), as this is entirely under the control of the GPU memory interface.

In this scenario, you should still be able to read the location from the host side and get a coherent value. Note that if this read-only-from-the-host were not reliable, then the proposed advice ("Keep the atomic counter in device memory on earlier GPUs, and copy the result to the mapped memory ") would also be unreliable, when read from the host.

The new _system() api allows for both host and device atomic updates, to the same location, at the “same time”. The RMW operations on each side are guaranteed to be atomic.

The answer here:

https://stackoverflow.com/questions/23193151/atomic-operations-in-cuda-kernels-on-mapped-pinned-host-memory-to-do-or-not-to

was written by Nick Wilt, who I believe is a trustworthy source. The specific case of the producer being a single GPU, and the consumer being the CPU, is deemed to work. For ordinary updates (non-atomic) posted by the GPU, the advice given in that answer to use an appropriate memory barrier to force the earliest possible update to the system memory should be considered also.

I tried this, still no luck. What I don’t get is why the exact same code works for one kernel and doesn’t work for another one, and it’s not random at all.

Adding fences didn’t fix it either. Is my approach fundamentally flawed? I noticed that in the answer in the link you mentioned, it’s said that there are better methods for coordination when GPU is the producer. Could you please guide me to what I should look for?

It’s nearly impossible to tell what may be wrong in code you have not shown. In general, I believe the GPU should be able to produce “real-time” updates to data in global memory (mapped/pinned, or not) that are “consumable” by the CPU to good effect, during kernel execution.

Here’s one such example, that I believe works correctly:

https://stackoverflow.com/questions/20345702/how-can-i-check-the-progress-of-matrix-multiplication/20381924#20381924

Another possibly interesting example code is here:

https://stackoverflow.com/questions/33150040/doubling-buffering-in-cuda-so-the-cpu-can-operate-on-data-produced-by-a-persiste/33158954#33158954

Of course I don’t claim that either of these do exactly what you want.

Note that if you don’t use pinned memory, getting the GPU->CPU communication going should still be possible, but can be (in my experience) extremely tricky for GPUs that are in WDDM mode, due to pesky command batching.

I’m sorry I was just being silly, concurrent_blocks variable needed to be defined as volatile, that fixed it. Thanks for your help.