Producer-Consumer in CUDA

How to implement the Producer-Consumer Problem in CUDA?

Also, I specifically want to know: if, say, thread 0 is writing a value to a memory location while thread 5 is reading it at the same time, which happens first?

Threads within a thread block run concurrently, but thread blocks are started in an unspecified order, and it is possible that only a single thread block is executing on the GPU at any given moment. As for your second question: without explicit synchronization, a simultaneous write by thread 0 and read by thread 5 has no defined order, so the read may observe either the old or the new value. This is the so-called "task parallelism" model, and I suggest you learn this idea and restructure your algorithms to take advantage of it, rather than looking for ways to emulate on the GPU other approaches available to CPU code, even where that is possible.
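To make the ordering point concrete, here is a minimal sketch (buffer size and names are illustrative, not from any of the linked answers) of a producer/consumer split inside a single thread block. The `__syncthreads()` barrier is what defines which writes happen before which reads; remove it and the read in the consumer phase becomes a race:

```cuda
#include <cstdio>

#define N 256

__global__ void produceConsume(int *out)
{
    __shared__ int buf[N];

    // Producer phase: every thread writes one element.
    buf[threadIdx.x] = threadIdx.x * 2;

    // Barrier: without this, a thread reading another thread's slot
    // could observe a stale value -- exactly the "which happens
    // first?" race from the question. The barrier orders all writes
    // above it before all reads below it, block-wide.
    __syncthreads();

    // Consumer phase: each thread reads a neighbour's element.
    out[threadIdx.x] = buf[(threadIdx.x + 1) % N];
}

int main()
{
    int *d_out, h_out[N];
    cudaMalloc(&d_out, N * sizeof(int));
    produceConsume<<<1, N>>>(d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}
```

Note that `__syncthreads()` only synchronizes within one block; ordering across blocks needs atomics plus memory fences, or a cooperative launch as discussed below.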

Some examples:

https://stackoverflow.com/questions/43378708/are-atomic-operations-in-cuda-guaranteed-to-be-scheduled-per-warp/43420507#43420507

https://stackoverflow.com/questions/33150040/doubling-buffering-in-cuda-so-the-cpu-can-operate-on-data-produced-by-a-persiste/33158954#33158954

Note that any sort of persistent kernel should use the CUDA cooperative groups feature introduced in CUDA 9 ("cooperative kernel launch"). The examples above predate this and don't reflect it.

https://devblogs.nvidia.com/parallelforall/cooperative-groups/
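As a sketch of what a cooperative launch looks like (a hypothetical two-phase producer/consumer pipeline; the kernel name and sizes are made up, and the device must report the `cooperativeLaunch` attribute for this to work):

```cuda
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

__global__ void pipeline(int *data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Producer phase: all blocks fill their part of the buffer.
    if (i < n) data[i] = i;

    // Grid-wide barrier: only valid when the kernel was started with
    // cudaLaunchCooperativeKernel, which guarantees all blocks are
    // resident on the GPU at the same time.
    grid.sync();

    // Consumer phase: now safe to read values produced by other blocks.
    if (i < n) data[i] += data[(i + 1) % n];
}

int main()
{
    const int n = 1024;
    int *d;
    cudaMalloc(&d, n * sizeof(int));

    void *args[] = { &d, (void *)&n };
    cudaLaunchCooperativeKernel((void *)pipeline, dim3(4), dim3(256), args);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```

Without the cooperative launch, a grid-wide sync is impossible to implement correctly: some blocks may not even have started while others wait, which is why the older persistent-kernel examples linked above are fragile.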