I noticed producer-consumer parallelism mentioned in the blog, but I could not find any relevant examples. I am new to CUDA; could someone kindly guide me? Thank you!
What I have in mind is having a producer do a matmul and pass the intermediate result to a consumer for the next matmul. In particular, I want to pass the data through shared memory rather than global memory.
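To make the question concrete, here is a minimal sketch of the pattern being asked about. It is not taken from the blog. Ordinary shared memory is private to a thread block, so the producer and the consumer have to be threads of the same block. In this sketch the first half of each block acts as producer (computes a tile of `A * B` into shared memory) and the second half as consumer (multiplies that tile by `C`), with a double buffer so the two roles overlap. All names (`chained_matmul`, `run_demo`, `D`, `ROLE_THREADS`) and the fixed 16 x 16 tile size are assumptions made for illustration.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int D = 16;                // assumed tile edge; every tile is D x D
constexpr int ROLE_THREADS = D * D;  // one thread per tile element, per role

// Computes out_t = (A_t * B) * C for each D x D tile A_t.
// Threads [0, ROLE_THREADS) produce A_t * B into shared memory;
// threads [ROLE_THREADS, 2 * ROLE_THREADS) consume it one step later.
__global__ void chained_matmul(const float* A, const float* B, const float* C,
                               float* out, int num_tiles)
{
    __shared__ float inter[2][D][D];  // double buffer for the intermediate tile
    const bool is_producer = threadIdx.x < ROLE_THREADS;
    const int e = threadIdx.x % ROLE_THREADS;
    const int r = e / D, c = e % D;
    const int bid = blockIdx.x, nb = gridDim.x;
    // Number of tiles this block handles (bid, bid + nb, ...); uniform per block.
    const int my_tiles = (num_tiles - bid + nb - 1) / nb;

    // One extra step so the consumer can drain the last produced tile.
    for (int i = 0; i <= my_tiles; ++i) {
        if (is_producer && i < my_tiles) {
            const float* a = A + (size_t)(bid + i * nb) * D * D;
            float acc = 0.0f;
            for (int k = 0; k < D; ++k) acc += a[r * D + k] * B[k * D + c];
            inter[i & 1][r][c] = acc;            // write buffer i % 2
        } else if (!is_producer && i > 0) {
            const int j = i - 1;                 // tile produced in the previous step
            float acc = 0.0f;
            for (int k = 0; k < D; ++k) acc += inter[j & 1][r][k] * C[k * D + c];
            out[(size_t)(bid + j * nb) * D * D + r * D + c] = acc;
        }
        // Every thread reaches this barrier: it publishes the tile just written
        // and guarantees the other buffer has been fully read before reuse.
        __syncthreads();
    }
}

// Hypothetical host helper: runs the kernel and returns the maximum absolute
// error against a CPU reference, or -1.0f if a CUDA call fails.
float run_demo(int num_tiles = 5, int num_blocks = 2)
{
    const size_t tile = D * D, n = tile * num_tiles;
    std::vector<float> hA(n), hB(tile), hC(tile), hOut(n, 0.0f), ref(n);
    for (size_t i = 0; i < n; ++i) hA[i] = 0.25f * (i % 7);
    for (size_t i = 0; i < tile; ++i) { hB[i] = 0.25f * (i % 5); hC[i] = 0.25f * (i % 3); }

    // CPU reference: (A_t * B) * C, one row at a time.
    for (int t = 0; t < num_tiles; ++t)
        for (int r = 0; r < D; ++r) {
            float row[D];
            for (int c = 0; c < D; ++c) {
                float acc = 0.0f;
                for (int k = 0; k < D; ++k) acc += hA[t * tile + r * D + k] * hB[k * D + c];
                row[c] = acc;
            }
            for (int c = 0; c < D; ++c) {
                float acc = 0.0f;
                for (int k = 0; k < D; ++k) acc += row[k] * hC[k * D + c];
                ref[t * tile + r * D + c] = acc;
            }
        }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc(&dA, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dB, tile * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dC, tile * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), tile * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC.data(), tile * sizeof(float), cudaMemcpyHostToDevice);
        chained_matmul<<<num_blocks, 2 * ROLE_THREADS>>>(dA, dB, dC, dOut, num_tiles);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dOut);
    if (!ok) return -1.0f;

    float max_err = 0.0f;
    for (size_t i = 0; i < n; ++i) max_err = fmaxf(max_err, fabsf(hOut[i] - ref[i]));
    return max_err;
}
```

To try it, call `run_demo()` from a `main` function and compile with `nvcc`. The sketch synchronizes both roles with a plain `__syncthreads()` per step, which is the simplest correct choice but makes the faster role wait for the slower one. The inputs and final outputs still live in global memory; only the intermediate tile stays in shared memory. For finer-grained hand-off, the `cuda::barrier` and `cuda::pipeline` primitives in libcu++ may be worth looking at.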