I implemented an FFT with cuFFT in three variants: regular pinned memory, zero-copy memory, and unified address space.
Everything worked fine until today, when the zero-copy and unified addressing variants suddenly became about 3 times slower. There are four GPUs in my server and I have been using device 0. If I set the GPU device to 2 or 3, the speed is normal again.
I checked nvidia-smi, and the clock speeds of all four devices are exactly the same.
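To separate the slowdown from cuFFT itself, a per-device comparison could look roughly like the sketch below. This is not my actual FFT code: `copyKernel` is a hypothetical stand-in that just reads zero-copy host memory, and the buffer size and launch configuration are arbitrary. It assumes a 64-bit system with unified virtual addressing, so that pinned allocations are mapped into every device's address space.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e));                           \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the FFT: every read of `in` goes over the bus
// to mapped host memory, every write of `out` goes to device memory.
__global__ void copyKernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) out[i] = in[i];
}

int main() {
    const size_t n = (size_t)1 << 24;          // 16M floats = 64 MiB (arbitrary)
    const size_t bytes = n * sizeof(float);
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));

    for (int dev = 0; dev < count; ++dev) {
        CHECK(cudaSetDevice(dev));

        char busId[32];
        CHECK(cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev));

        // Pinned, mapped (zero-copy) host buffer and a plain device buffer.
        float *hIn = nullptr, *dIn = nullptr, *dOut = nullptr;
        CHECK(cudaHostAlloc((void**)&hIn, bytes, cudaHostAllocMapped));
        memset(hIn, 0, bytes);
        CHECK(cudaHostGetDevicePointer((void**)&dIn, hIn, 0));
        CHECK(cudaMalloc((void**)&dOut, bytes));

        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));

        // Warm-up launch so one-time setup cost is not included in the timing.
        copyKernel<<<256, 256>>>(dIn, dOut, n);
        CHECK(cudaDeviceSynchronize());

        CHECK(cudaEventRecord(start));
        copyKernel<<<256, 256>>>(dIn, dOut, n);
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("device %d (%s): %.3f ms, %.2f GB/s zero-copy read\n",
               dev, busId, ms, bytes / (ms * 1e6));

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
        CHECK(cudaFree(dOut));
        CHECK(cudaFreeHost(hIn));
    }
    return 0;
}
```

If device 0 also shows a much lower rate here than devices 2 and 3, the slowdown is probably in the path between host memory and that GPU rather than in cuFFT. The printed PCI bus ID can be matched against the nvidia-smi output.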
What could be the cause?