I am programming with PTX. Is there a PTX synchronization method that ensures one PTX instruction (instruction A) in a kernel completes execution before a certain PTX instruction (instruction B) in another kernel begins execution? For example, in a scenario where instruction A in kernel A writes data to global memory, and instruction B in kernel B needs to read that data only after kernel A has completely finished writing, how can a synchronization relationship be established between them? Is there such a PTX synchronization instruction? Thank you!
You can fuse both kernels into one kernel, e.g. with a switch on the block index that decides whether a block does kernel A's work or kernel B's.
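A hedged sketch of a fused variant, assuming a cooperative launch (via `cudaLaunchCooperativeKernel`) so that a grid-wide barrier is available; kernel name and data layout are placeholders. Note that dispatching on the block index alone does not order B-blocks after A-blocks; the grid-wide barrier is what provides the ordering here:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Every block runs A's phase, a grid-wide barrier makes A's writes
// visible, then every block runs B's phase.
__global__ void fusedKernel(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase A: write data to global memory
    if (i < n) data[i] = 2.0f * i;

    grid.sync();  // all of phase A's writes are complete and visible grid-wide

    // Phase B: read data written in phase A
    if (i < n) data[i] += 1.0f;
}
```

This requires the grid to fit on the device simultaneously, which is exactly what a cooperative launch guarantees (and what independent kernel launches do not).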
Alternatively, you could look into CUDA Graphs, and also into Dynamic Parallelism.
You could also try to read a flag in global memory (set once the data has finished being written), but I am not sure there is a guarantee that an independent kernel B receives a value written by an independent kernel A in finite time.
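A hedged sketch of this flag approach, assuming a single-block kernel A for simplicity and assuming both kernels are actually resident at the same time (which, as noted, CUDA does not guarantee for independent launches). `flag` is a plain `int` in global memory, initialized to 0 by the host; all names are placeholders:

```cuda
#include <cuda/atomic>

// Kernel A (single block assumed): write the payload, then publish a flag
// with release semantics so the payload writes are visible first.
__global__ void kernelA(float* data, int* flag) {
    data[threadIdx.x] = 2.0f * threadIdx.x;   // payload write
    __syncthreads();                          // all payload writes in the block done
    if (threadIdx.x == 0) {
        cuda::atomic_ref<int, cuda::thread_scope_system> f(*flag);
        f.store(1, cuda::memory_order_release);   // publish
    }
}

// Kernel B: spin on the flag with acquire semantics before reading the payload.
__global__ void kernelB(const float* data, int* flag, float* out) {
    cuda::atomic_ref<int, cuda::thread_scope_system> f(*flag);
    while (f.load(cuda::memory_order_acquire) == 0) { /* spin */ }
    out[threadIdx.x] = data[threadIdx.x];
}
```

If kernel B is not resident while kernel A runs (or vice versa), the spin loop can deadlock the device, which is why the event-based and fused-kernel approaches are usually safer.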
I am puzzled by the question. For kernels issued into the same CUDA stream, the desired synchronization applies automatically, regardless of the programming language used to implement the kernels.
The original question seems to suggest kernel concurrency as a requirement. A basic premise is that CUDA never guarantees that kernels will run concurrently. Therefore, using one of the existing sync mechanisms makes a lot of sense to me, to ensure reliable communication from one kernel to the other. All the other suggestions from Curefab are worth considering as well.
One other typical way would be to split kernel A and kernel B into A1, A2, B1 and B2: the parts before and after the instructions you want synchronized. Instead of synchronizing within the kernels, you then synchronize the kernels themselves, for which there are facilities.
- Start A1 and B1 at the same time from different streams.
- Using CUDA events, have the stream of A1 record an event when A1 finishes, then run A2 in the same stream as A1 (after A1).
- Run B2 in the stream of B1 (after B1), but let it first wait for A1 to finish (i.e. wait for the recorded event).
Graphically, it is like an X (except that A2 can start before B1 finishes).
If you have to do this several times, launch the kernels in a loop; with CUDA Graphs you can reduce the invocation overhead.
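The steps above can be sketched as host-side code with CUDA events; kernel names and launch configurations are placeholders:

```cuda
// A1/A2 run in stream sA, B1/B2 in stream sB; B2 additionally waits on
// the event marking the end of A1 (the "X" pattern described above).
cudaStream_t sA, sB;
cudaEvent_t  a1Done;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);
cudaEventCreate(&a1Done);

kernelA1<<<grid, block, 0, sA>>>(/* ... */);
kernelB1<<<grid, block, 0, sB>>>(/* ... */);   // may run concurrently with A1

cudaEventRecord(a1Done, sA);                   // marks the end of A1 in stream sA
kernelA2<<<grid, block, 0, sA>>>(/* ... */);   // follows A1 in the same stream

cudaStreamWaitEvent(sB, a1Done, 0);            // B2 waits for A1 (not for A2)
kernelB2<<<grid, block, 0, sB>>>(/* ... */);
```

`cudaStreamWaitEvent` makes all subsequent work in `sB` wait for the event without blocking the host, so A2 and B1 remain free to overlap.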
If this is not enough, put A and B into the same kernel and create some producer/consumer relationship.
Thank you for your response. I would like to execute these two kernels in parallel. I have already placed the two kernels in different CUDA streams, but I want to synchronize them using PTX instructions.
There are no PTX instructions to do that, just as there is no facility in CUDA C++ to do that. You would need to use atomics, a device-wide semaphore, or some other indirect memory-based method. I don't have a PTX recipe for you for any of that.
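For completeness, a hedged sketch of what the memory-based handshake could look like when expressed with PTX inline assembly (the `.release`/`.acquire` qualifiers require sm_70 or newer; `flag` is a plain `int*` in global memory and the helper names are placeholders, not an established recipe):

```cuda
// Writer side (called in kernel A after the payload stores): publish the
// flag with system-scope release semantics.
__device__ void publish(int* flag) {
    asm volatile("st.release.sys.global.u32 [%0], %1;"
                 :: "l"(flag), "r"(1) : "memory");
}

// Reader side (called in kernel B before touching the payload): spin with
// system-scope acquire loads until the flag is set.
__device__ void await(int* flag) {
    int v = 0;
    do {
        asm volatile("ld.acquire.sys.global.u32 %0, [%1];"
                     : "=r"(v) : "l"(flag) : "memory");
    } while (v == 0);
}
```

The same caveat applies as with any flag-based scheme between independent kernels: without concurrency guarantees, the reader may spin forever.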