I have a set of n kernels that all write to the same global output memory, and they may happen in any order. To avoid a race condition, I’d like to prevent any pair of kernels from executing at the same time.
The obvious solution is to make the kernels depend on each other. If n=2 and the kernels are A and B, either add A->B or B->A to the compute graph. The problem with this is that I cannot reliably predict whether A or B will happen first.
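Adding such an edge to an already-built graph would look roughly like this (graph, nodeA and nodeB standing in for the existing cudaGraph_t and its two kernel nodes):

// Force B to wait for A by inserting the explicit edge A -> B.
cudaGraphNode_t from[] = { nodeA };
cudaGraphNode_t to[]   = { nodeB };
cudaGraphAddDependencies(graph, from, to, 1);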
Another solution might be to launch A and B with grids so large that kernel concurrency cannot happen. But then, well, kernel concurrency cannot happen elsewhere, either.
I have taken a look at conditional nodes, but it is unclear if something like the following solution would work or would even be valid:
Replace the A kernel with the following conditional graph:

while(A not finished):
    if(B cannot start or B finished):
        execute A
    if(B can start):
        execute A
Replace the B kernel with the following conditional graph:

while(B not finished):
    if(A cannot start or A finished):
        execute B
    if(A can start):
        continue
So you want to have a feature like a mutex for kernel launches across streams?
Correct, or some other way to remove the data race that would happen if multiple kernels are running at the same time.
I also looked at cooperative groups, thinking that if kernels could share the same grid, they couldn’t run at the same time. But that doesn’t seem to be what cooperative groups are for.
Something like cudaLaunchCooperativeKernel? Does it work well with streams and graphs?
The idea being that the kernels each need all SMs by themselves and block until all are available at the same time.
If you could describe the data race in detail (a simplified example), it would perhaps be easier to suggest alternatives. Why do you not write into independent copies?
You could regulate access (to the race-prone data) with a device-wide semaphore.
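A minimal sketch of such a lock, assuming a flag in global memory, compute capability 7.0 or newer (so spinning threads make independent forward progress), and a short critical section; all names here are made up for illustration:

__device__ int g_out_lock = 0;   // 0 = free, 1 = held

__device__ void acquire(int *lock) {
    // Spin until we flip the flag from 0 to 1; atomicCAS returns the old value.
    while (atomicCAS(lock, 0, 1) != 0) { }
    __threadfence();   // make the previous holder's writes visible to us
}

__device__ void release(int *lock) {
    __threadfence();   // publish our writes before handing the lock over
    atomicExch(lock, 0);
}

With a single global flag every conflicting update is serialized, so contention can become the bottleneck.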
My use case is kernels similar to:
__global__ void f(int *out, int const* inn) {
    // get_inn_index / get_out_index are application-specific mappings
    // from (blockIdx, threadIdx) to positions in inn and out.
    int inn_idx = get_inn_index(blockIdx, threadIdx);
    int out_idx = get_out_index(blockIdx, threadIdx);
    out[out_idx] += inn[inn_idx];
}
Typically, the sooner whatever memory is used by inn can be used for other purposes, the better the application will run. If using a semaphore serializes the work to a single call to f at a time, that’d be no good either.
Can you clarify further why the different kernels interfere with each other? Why is serializing f not enough?
So it is not only about the kernels, but also about preparatory work (e.g. you would use the memory of inn for different purposes), and it also encompasses memory copies on the CPU before each kernel is launched?
f will be called several times for the same output data, each time with different input data. After a call to f completes, that input data can then be used for something else and subsequent compute can happen.
It looks something like this. The red dotted line denotes that both calls to f are writing to the same out.
Why is serializing f not enough?
Since that output data will be large, writing to it in f shouldn’t be serialized.
So it is not only about the kernels, but also about preparatory work (e.g. you would use the memory of inn for different purposes), and it also encompasses memory copies on the CPU before each kernel is launched?
Everything is on the GPU
Is your goal to prevent a race condition when executing out[out_idx] += inn[inn_idx]; from different kernels with the same out_idx?
Possibility 1
Have you considered atomicAdd?
(By using atomicAdd you would not have to do anything more to make it compatible with multiple streams and kernels.)
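As a sketch, the kernel from above would only change in its last line (get_inn_index / get_out_index stay your application-specific helpers):

__global__ void f(int *out, int const* inn) {
    int inn_idx = get_inn_index(blockIdx, threadIdx);
    int out_idx = get_out_index(blockIdx, threadIdx);
    // atomicAdd makes the read-modify-write indivisible, so concurrent
    // threads or kernels hitting the same out_idx no longer race.
    atomicAdd(&out[out_idx], inn[inn_idx]);
}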
Possibility 2
Or can you somehow prevent every thread from being able to access arbitrary elements? E.g. you have 100 blocks and each block only accesses its own portion of the out array? Then the threads within a block could work it out, e.g. in shared memory.
(This probably would not work with your multiple streams with different kernels, which each want to access the same output array.)
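Still, a rough sketch of the partitioning idea, assuming out is split into 256-element tiles, each block owns exactly one tile, and block b's inputs are stored contiguously (the layout and the i % TILE mapping are invented for illustration):

constexpr int TILE = 256;   // one 256-thread block owns one 256-element tile of out

__global__ void f_tiled(int *out, int const *inn, int n_per_tile) {
    __shared__ int acc[TILE];
    acc[threadIdx.x] = 0;
    __syncthreads();

    // Assumed layout: block b reads inn[b*n_per_tile .. (b+1)*n_per_tile).
    int const *my_inn = inn + blockIdx.x * n_per_tile;
    for (int i = threadIdx.x; i < n_per_tile; i += blockDim.x) {
        atomicAdd(&acc[i % TILE], my_inn[i]);   // stays in on-chip shared memory
    }
    __syncthreads();

    // Only this block writes this slice of out, so a plain += is race-free
    // within one kernel (but not across concurrently running kernels).
    out[blockIdx.x * TILE + threadIdx.x] += acc[threadIdx.x];
}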
Possibility 3
Write a list of the requested additions into free memory (without reading or adding yet), then sort that list into buckets (partition the out array into regions), then process each bucket in parallel.
(This would work very well with your graph and multiple streams. The individual streams just create this write log list. And only before accessing out, after all streams are finished, all the lists are processed. The distribution into buckets can be done individually per stream or together afterwards.)
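A sketch of the logging half of this, assuming a preallocated log of (index, value) pairs and a single global counter preset to 0; the sorting into buckets is left out and the names are invented:

struct AddRequest { int out_idx; int value; };   // one requested addition to out

// Each producer kernel records its additions instead of touching out.
__global__ void f_log(AddRequest *log, int *log_count, int capacity,
                      int const *inn) {
    int inn_idx = get_inn_index(blockIdx, threadIdx);
    int out_idx = get_out_index(blockIdx, threadIdx);
    int slot = atomicAdd(log_count, 1);        // reserve a unique slot
    if (slot < capacity) {                     // real code would handle overflow
        log[slot].out_idx = out_idx;
        log[slot].value   = inn[inn_idx];
    }
}

// Once every producer stream has finished, replay the log. If the log has
// been bucketed by region of out first, each bucket can be replayed without
// atomics; atomicAdd keeps this sketch short.
__global__ void apply_log(int *out, AddRequest const *log, int n_logged) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_logged)
        atomicAdd(&out[log[i].out_idx], log[i].value);
}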
Possibility 4
See my suggestion of using cudaLaunchCooperativeKernel to have only one kernel running at a time.
(This would only work when the single kernels have no conflicts within themselves. It could enable possibility 2.)
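A minimal sketch of such a cooperative launch, with the grid sized so that it must be fully resident and fills every SM (names invented, kernel body only illustrative, error handling omitted):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void f_coop(int *out, int const *inn, size_t n) {
    cg::grid_group grid = cg::this_grid();
    // grid.sync() would be available here if the update needed phases.
    for (size_t i = grid.thread_rank(); i < n; i += grid.size()) {
        out[i] += inn[i];
    }
}

void launch_exclusive(int *out, int const *inn, size_t n, cudaStream_t stream) {
    int dev = 0, num_sms = 0, blocks_per_sm = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);

    dim3 block(256);
    // Cooperative launch requires the whole grid to be resident at once;
    // sizing it to the occupancy limit means it also occupies every SM,
    // so two such kernels tend not to overlap.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  f_coop, block.x, 0);
    dim3 grid(num_sms * blocks_per_sm);

    void *args[] = { &out, &inn, &n };
    cudaLaunchCooperativeKernel((void *)f_coop, grid, block, args, 0, stream);
}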