cufft and OpenMP gives problems

Hello all,

I’m trying to use cufft, but I have a problem. I need to compute many cross-correlations, and I do this using 2D FFTs. The program is compiled with OpenMP support.
If I run the program with only one thread, everything is fine. If I use more threads, then at some point two plans are created with identical handles, and the cufft library starts producing error messages. So far I’ve tried adding omp critical sections (locking) and creating a unique stream for each plan; neither helps.

Note that the cufft lib just has a .h header, the rest of the program is completely unaware of CUDA, and it’s compiled with g++.

Any suggestions ???

Many thanks in advance,
F Beekhof

P.S. I’ve posted this earlier on http://forums.nvidia.com/index.php?showtop…2639&hl=fft , but I guess this forum is more appropriate, so re-posting here…

Extra Information:

Ubuntu 10.04 (64-bit)
CUDA 3.1
NVIDIA X Driver 260.24

Hardware:
nVidia Corporation G84 [Quadro FX 570] rev 161, 256 MB memory
Intel® Core™2 CPU 6400

Sounds as if you should associate your plans with (different) streams? I can’t tell for sure, as I haven’t tried cufftSetStream() myself. So you might create n streams and associate each plan handle with its own stream?

Let us know if this works at all?

Don’t mention it!

Using streams doesn’t help (I just tried it again). That is to be expected, because the calls to ‘cufftPlan2d()’ happen /before/ the plans are associated with streams, and two of those calls return the same handle. That should be impossible: each call should return a unique handle, or fail.

It should not even be a concurrency issue because the call to the function that allocates device memory, creates the plan and creates and associates the stream is protected by a lock:

In pseudocode:

#pragma omp critical(CVMLCPP_FFT)
my_plan = do_cudamalloc_and_make_fftplan_and_create_and_associate_stream_to_set_up_plan_struct( required parameters );

Where my_plan looks like this:

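#include <cstddef>        // std::size_t
#include <cuda_runtime.h> // cudaStream_t
#include <cufft.h>        // cufftHandle, cufftType, cufftResult

struct FFTPlan
{
    typedef cufftHandle plan_type;

    // Starts out as an invalid plan; 'type' is not set here.
    FFTPlan() : plan(-1),
        in(0), out(0), cu_in(0), cu_out(0),
        mem_size_in(0), mem_size_out(0), sign(0),
        status(CUFFT_INVALID_PLAN), stream(0) { }

    plan_type plan;                        // cufft plan handle
    void *in, *out;                        // Main memory
    void *cu_in, *cu_out;                  // CUDA memory on the device
    std::size_t mem_size_in, mem_size_out; // Buffer sizes in bytes
    int sign;                              // C2C only: forward or backward transform?
    cufftType type;                        // Transform type

    bool ok() const { return CUFFT_SUCCESS == status; }

    cufftResult status;                    // Result of the last cufft call
    cudaStream_t stream;                   // Stream associated with the plan
};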

Again, if the program has only 1 thread, i.e. OMP_NUM_THREADS == 1, then all is fine!
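
To confirm that two live plans really do share a handle value, a small thread-safe registry could record every handle as it is created. This is only a sketch: `HandleRegistry` and its members are hypothetical names, not part of cufft, and it assumes the handle fits in a plain integer, as the `plan(-1)` initializer above suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical helper (not part of cufft): remembers every live plan handle
// and reports when the same value is handed out a second time.
class HandleRegistry
{
public:
    // Returns true if the handle was not yet live, false if it is a duplicate.
    bool add(long handle)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.insert(handle).second;
    }

    // Call just before the plan is destroyed, so the value may be reused later.
    void remove(long handle)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        live_.erase(handle);
    }

    // Number of handles currently registered as live.
    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.size();
    }

private:
    mutable std::mutex mutex_;
    std::set<long> live_;
};
```

Inside the critical section one would call `registry.add(my_plan.plan)` right after the plan is created and report an error when it returns false; `remove` belongs next to the call that destroys the plan.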

OpenMP and CUDA do not work well together. Just using streams isn’t good enough; all calls have to use the same context.

The problem is that the CUDA context is thread-bound. If you create a context in a thread, all CUDA calls have to come from that thread. Even allocating page-locked host memory in a different thread doesn’t help: if you then try to copy it, the transfer performs like normal, non-page-locked memory. Launching kernels from another thread doesn’t work at all.

There are two ways around that. One is to have a dedicated CUDA thread. Whenever you have CPU work to do, launch separate CPU threads for the computation, but keep the CUDA thread dedicated to making CUDA calls.

The other is to juggle the context between threads, using a critical-section design. The driver API has two functions, cuCtxPopCurrent() and cuCtxPushCurrent(), which allow you to detach the CUDA context from one thread and attach it to another. This way you can bounce the context between multiple threads. These two functions are also among the few Driver API functions you can use together with the Runtime API without problems, so code written against either API can take advantage of them.

Now, the problem with OpenMP is that it generally uses a pool of threads and picks one to do the next job. It might still be possible to implement this in OpenMP (I’m not an expert), but something like pthreads or boost threads, where you can manage the threads yourself, is better suited, since not all threads are equal. There might be additional issues with OpenMP that I’m not aware of; a forum search would help.

As for HOW to actually implement this, I’m not sure. I played around with it, failed and ultimately decided against it due to my single-threaded implementation being good enough.
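
For what it’s worth, the first workaround (a dedicated CUDA thread) could be sketched roughly as follows. This is a minimal, untested-on-GPU sketch in plain C++11: `DeviceThread` and `run` are hypothetical names, and the jobs here are ordinary callables. The idea is that every CUDA or cufft call would be wrapped in a job, so that all of them are issued from the one thread that owns the context.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: one long-lived worker thread executes every submitted
// job, so all device calls placed inside jobs come from the same thread.
class DeviceThread
{
public:
    DeviceThread() : done_(false), worker_(&DeviceThread::loop, this) { }

    ~DeviceThread()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Queue a job and block the calling thread until the worker has run it.
    void run(std::function<void()> job)
    {
        std::packaged_task<void()> task(std::move(job));
        std::future<void> finished = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        cv_.notify_one();
        finished.get(); // also rethrows an exception thrown by the job
    }

private:
    void loop()
    {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty())
                    return; // only reached once shutdown was requested
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task(); // executed outside the lock, always on the worker thread
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> jobs_;
    bool done_;
    std::thread worker_; // declared last so it starts after the other members exist
};
```

Each OpenMP thread would then call something like `device.run([&] { /* cufftPlan2d(...), cufftExecC2C(...), ... */ });` instead of calling cufft directly. Because `run` blocks until the job has finished, the GPU work is serialized, which matches the single-threaded case that already works.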
