CUFFT with multiple GPUs – does anyone have experience with this?

I am wondering if anyone has used CUFFT in a multithreaded, multi-GPU application that performs CUFFT transforms over and over many times.

I am working on an application using two threads to control two GPUs – I have two GTX 280s and I’m using CUDA 2.0 on Linux with the 177.73 driver. My program runs in an ‘endless’ loop where the host acquires input data, the two threads process the data on the GPUs, and the host outputs the processed data. The GPU processing looks like this:

    cudaMemcpy(data buffer to device)
    kernel()
    cufft(plan1)
    for (int i = 0; i < 4; i++)
    {
        cufft(plan2)
    }
    cudaMemcpy(device to data buffer)
    cudaThreadSynchronize()
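
Spelled out in actual CUDA 2.x API calls, each thread’s loop body would look roughly like this sketch (the buffer names, sizes, and the kernel launch are placeholders, not my real code):

    cudaSetDevice(dev);                      /* each thread binds to its own GPU once, before the loop */

    for (;;) {
        /* ... wait for the host to hand over a fresh input buffer ... */

        cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);

        preprocess<<<grid, block>>>(d_data); /* placeholder kernel */

        cufftExecC2C(plan1, d_data, d_data, CUFFT_FORWARD);
        for (int i = 0; i < 4; i++)
            cufftExecC2C(plan2, d_data, d_data, CUFFT_FORWARD);

        cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);

        cudaThreadSynchronize();             /* host-side sync (CUDA 2.x API) */

        /* ... signal the host that the output is ready ... */
    }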

When I run it for several minutes to an hour, one of the threads eventually locks up. It doesn’t appear to be deadlocked: my debug statements show it hanging just before the first cudaMemcpy. I don’t see any CUFFT errors; it’s as though the GPU has become unavailable. I added the cudaThreadSynchronize() as an afterthought; it locks up either way.
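
To make sure a failing return status isn’t slipping past silently, I can wrap every runtime and CUFFT call along these lines (a sketch – the macro names are my own):

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    /* Any non-success status gets logged with file/line instead of
       disappearing silently. */
    #define CUDA_CHECK(call) do {                                  \
        cudaError_t e = (call);                                    \
        if (e != cudaSuccess)                                      \
            fprintf(stderr, "%s:%d: CUDA error %d (%s)\n",         \
                    __FILE__, __LINE__, (int)e,                    \
                    cudaGetErrorString(e));                        \
    } while (0)

    #define CUFFT_CHECK(call) do {                                 \
        cufftResult r = (call);                                    \
        if (r != CUFFT_SUCCESS)                                    \
            fprintf(stderr, "%s:%d: CUFFT error %d\n",             \
                    __FILE__, __LINE__, (int)r);                   \
    } while (0)

    /* Usage:
       CUDA_CHECK(cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice));
       CUFFT_CHECK(cufftExecC2C(plan1, d_data, d_data, CUFFT_FORWARD)); */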

I’ve gone over my CPU/GPU memory allocation and I’m sure I’m not overwriting my data buffers; the results I get for short runs are also correct.

NVRM version: NVIDIA UNIX x86_64 Kernel Module 177.73
GCC version: gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Thanks for any help/insight!

I would upgrade to either 180.xx or 177.70.xx.

Thanks for the suggestion. I installed CUDA 2.1 with the new 180.22 driver, but my problem persists. I am going to download Valgrind and see if that sheds any light on my problem.

I’m attempting to use Valgrind with the Helgrind threading tool to debug my code. I compiled for device emulation and then ran valgrind --tool=helgrind mycode. I got some results I didn’t expect. First question: is device emulation valid for multiple threads driving multiple GPUs (two threads/two GPUs in my case)?

Most of my errors involved possible data races between my two non-host threads while creating FFT plans, for example:

==30935== Possible data race during read of size 4 at 0x508d980 by thread #2
==30935== at 0x4E59BFB: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E53093: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E53B85: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E4E8AC: cufftPlan1d (in /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x402CF8: make_cufft_plan(int, int)
==30935== by 0x4017C5: solverThread3(TGPUplan*)
==30935== by 0x4A0AA48: mythread_wrapper (hg_intercepts.c:194)
==30935== by 0x36334062F6: start_thread (in /lib64/libpthread-2.5.so)
==30935== by 0x36328D1E3C: clone (in /lib64/libc-2.5.so)
==30935== This conflicts with a previous write of size 4 by thread #3
==30935== at 0x4E59A51: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E59B94: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E52DD7: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E53C5A: (within /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x4E4E8AC: cufftPlan1d (in /usr/local/cuda/lib/libcufftemu.so.2.1)
==30935== by 0x402CF8: make_cufft_plan(int, int)
==30935== by 0x4017A1: solverThread3(TGPUplan*)
==30935== by 0x4A0AA48: mythread_wrapper (hg_intercepts.c:194)

Each thread creates its own FFT plans in local variables, so there really can’t be a race condition in my code – unless CUFFT itself isn’t thread-safe, or device emulation mode isn’t valid for a multithreaded/multi-GPU program. I couldn’t find anything in the programming guide about this. Am I totally confused and clueless here?
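
One way to test the thread-safety theory would be to serialize plan creation across the threads with a mutex, roughly like this (a sketch; I’m assuming my make_cufft_plan helper returns the handle, and the C2C transform type is also an assumption). If the Helgrind warnings and the lockups both disappear, that would point at concurrent cufftPlan1d calls:

    #include <pthread.h>
    #include <cufft.h>

    /* Serialize plan creation to test whether concurrent cufftPlan1d
       calls are the problem. */
    static pthread_mutex_t plan_lock = PTHREAD_MUTEX_INITIALIZER;

    cufftHandle make_cufft_plan(int nx, int batch)
    {
        cufftHandle plan;
        pthread_mutex_lock(&plan_lock);
        cufftPlan1d(&plan, nx, CUFFT_C2C, batch);  /* guarded section */
        pthread_mutex_unlock(&plan_lock);
        return plan;
    }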