Batched FFTs not launching concurrently on multiple GPUs

Hey everyone,
I am trying to get into CUDA and I’m playing around with some data.

I’m currently trying to run batched cuFFTs on 4 K80 GPUs, where each host thread creates a batched cuFFT plan and executes it on its own set of data. After that, a kernel calculates the magnitude of the FFT output. The data is read from a global host buffer and copied to each device with cudaMemcpy after cudaSetDevice() is called within the thread.
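For context, here is a minimal sketch of what each host thread does. This is not my exact code; the sizes (NUM_GPUS, SIGNAL_LEN, BATCH) and the magnitude kernel are simplified placeholders, and error checking is omitted for brevity:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

#define NUM_GPUS   4
#define SIGNAL_LEN 1024   // placeholder FFT length
#define BATCH      256    // placeholder batch count

// Compute |X| for each complex FFT output sample.
__global__ void magnitude(const cufftComplex* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(in[i].x * in[i].x + in[i].y * in[i].y);
}

void workerThread(int dev, const cufftComplex* hostChunk) {
    cudaSetDevice(dev);  // bind this host thread to one GPU

    int n = SIGNAL_LEN * BATCH;
    cufftComplex* d_data;
    float* d_mag;
    cudaMalloc(&d_data, n * sizeof(cufftComplex));
    cudaMalloc(&d_mag,  n * sizeof(float));

    // Copy this thread's chunk of the global host buffer to its device.
    cudaMemcpy(d_data, hostChunk, n * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Batched 1D C2C plan, executed in place.
    cufftHandle plan;
    cufftPlan1d(&plan, SIGNAL_LEN, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    // Magnitude kernel over all batch elements.
    magnitude<<<(n + 255) / 256, 256>>>(d_data, d_mag, n);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    cudaFree(d_mag);
}

int main() {
    std::vector<cufftComplex> host(NUM_GPUS * SIGNAL_LEN * BATCH);
    std::vector<std::thread> threads;
    for (int d = 0; d < NUM_GPUS; ++d)
        threads.emplace_back(workerThread, d,
                             host.data() + d * SIGNAL_LEN * BATCH);
    for (auto& t : threads) t.join();
}
```

One thing worth noting: since cudaMemcpy on pageable host memory is synchronous, the host-to-device copies from the shared buffer may serialize across threads unless the buffer is pinned (cudaHostAlloc / cudaHostRegister) and the copies use cudaMemcpyAsync on per-device streams.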