I’m trying to use cuFFT, but I’ve run into a problem. I need to compute many cross-correlations, and I do this using 2D FFTs. The program is compiled with OpenMP support.
If I run the program with only one thread, everything is fine. If I use more threads, then at some point two plans are created with identical handles, and
the cuFFT library starts producing error messages. So far I’ve tried adding omp critical sections (locking) and creating a unique stream for each plan; nothing works.
Note that the cuFFT library just has a .h header; the rest of the program is completely unaware of CUDA and is compiled with g++.
Sounds as if you should associate your plans with (different) streams? I can’t tell for sure, since I haven’t tried cufftSetStream() myself. So you might create n streams and associate each plan handle with its own stream?
Using streams doesn’t help (I just tried it again)… That makes sense, because the calls to ‘cufftPlan2d()’ happen /before/ the plans are associated with streams, and two of those calls return the same handle. That should be impossible:
each call should either return a unique handle or fail.
It should not even be a concurrency issue because the call to the function that allocates device memory, creates the plan and creates and associates the stream is protected by a lock:
OpenMP and CUDA do not work well together. Just using streams isn’t good enough; you have to use the same context.
The problem is that the CUDA context is thread-bound. If you create a context in a thread, all CUDA calls have to come from that thread. Even creating page-locked host memory in a different thread doesn’t work: if you then try to copy it, the transfer performs like ordinary non-page-locked memory. Launching kernels and the like doesn’t work at all.
There are two ways around that. One is to have a CUDA thread. Whenever you have CPU worker jobs to do, launch separate CPU threads to do the computation, but keep the CUDA thread dedicated to making CUDA calls.
The other is to juggle the context between threads, using a critical-section design. The driver API has two functions, cuCtxPopCurrent() and cuCtxPushCurrent(), which allow you to detach the CUDA context from one thread and attach it to another. This way you can bounce the context between multiple threads. These two are also among the few driver API functions you can use alongside the runtime API without problems, so code written against either API can take advantage of this.
Now, the problem with OpenMP is that it generally uses a pool of threads and picks one to do the next job. It might still be possible to implement this in OpenMP (I’m not an expert), but something like pthreads or Boost threads, where you can manage the threads yourself, is better suited, since not all threads are equal. There might be additional issues with OpenMP that I’m not aware of; a forum search would help.
As for HOW to actually implement this, I’m not sure. I played around with it, failed, and ultimately decided against it because my single-threaded implementation was good enough.
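For what it’s worth, the first option (one thread dedicated to CUDA calls) could be sketched roughly like this in plain C++11. Treat it as a sketch under assumptions, not a tested CUDA solution: the names GpuThread, fake_cuda_call and all_jobs_ran_on_gpu_thread are made up for illustration, and fake_cuda_call is only a placeholder for the real work (plan creation, FFT execution, memory copies), which is not shown. The idea is that CPU worker threads never touch CUDA themselves; they hand jobs to the one thread that does.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical helper: owns one thread and runs every submitted job on it,
// in submission order. All CUDA/cuFFT calls would go through submit().
class GpuThread {
public:
    GpuThread() : worker_([this] { run(); }) {}

    ~GpuThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining jobs are drained before the thread exits
    }

    // Queue a job; the returned future yields the job's result.
    template <typename F>
    auto submit(F job) -> std::future<decltype(job())> {
        using R = decltype(job());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(job));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

    std::thread::id id() const { return worker_.get_id(); }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down, nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other threads can keep submitting
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // declared last: the other members must exist first
};

// Placeholder for real GPU work; it only reports which thread executed it.
std::thread::id fake_cuda_call() { return std::this_thread::get_id(); }

// Several CPU threads each submit one job; returns true if every job was
// executed by the single dedicated thread.
bool all_jobs_ran_on_gpu_thread(int num_workers) {
    GpuThread gpu;
    std::vector<std::thread::id> seen(num_workers);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back([&gpu, &seen, i] {
            seen[i] = gpu.submit(fake_cuda_call).get();
        });
    }
    for (auto& w : workers) w.join();
    for (const auto& id : seen) {
        if (id != gpu.id()) return false;
    }
    return true;
}
```

In an OpenMP program the submitting threads would be the OpenMP workers instead of the std::thread objects used here; each would block on the future until its FFT job is finished. Whether this actually cures the duplicate plan handles is untested.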