I created multiple host threads, each running the same cuFFT calls and kernel functions. Each thread creates its own cudaStream_t stream, and for the cuFFT functions I call cufftSetStream to assign that stream. However, this doesn’t work as expected.
All cuFFT kernels end up in a single stream, while the normal kernels run in separate streams as expected. The timeline view from Nsight Systems is attached below: