I have an image processing function that is implemented for both CUDA and OpenMP. Both implementations work fine when run separately.
Then I wrote a benchmark to compare processing times for the two implementations, and I found a problem: once the CUDA implementation has been executed, the OpenMP implementation is no longer parallelized. Instead of being split across 4 threads, the loop runs on a single thread. Processing time goes up and CPU usage drops to 25% instead of 100% (I have a 4-core machine).
What could be causing this? I thought the two APIs were independent. By successively removing portions of the CUDA code, I narrowed it down: OpenMP becomes disabled as soon as I call cudaMallocPitch to allocate an image buffer on the device.
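To make the repro concrete, here is a stripped-down sketch of what the benchmark boils down to (the buffer sizes and the dummy per-pixel work are placeholders, not my real image code):

```cpp
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Time a trivial OpenMP loop and report how many threads actually ran it.
static void run_omp_loop(const char* label)
{
    double t0 = omp_get_wtime();
    int threads_seen = 0;
    #pragma omp parallel for
    for (int y = 0; y < 4096; ++y)
    {
        if (y == 0)
            threads_seen = omp_get_num_threads();
        volatile double acc = 0.0;      // dummy per-pixel work
        for (int x = 0; x < 4096; ++x)
            acc += x * 0.5;
    }
    printf("%s: %d threads, %.3f s\n", label, threads_seen, omp_get_wtime() - t0);
}

int main()
{
    run_omp_loop("before cudaMallocPitch");   // all 4 cores are used here

    void*  devBuf = 0;
    size_t pitch  = 0;
    cudaMallocPitch(&devBuf, &pitch, 1024 * sizeof(float), 768);

    run_omp_loop("after cudaMallocPitch");    // this is where it drops to 1 thread

    cudaFree(devBuf);
    return 0;
}
```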
If anyone has any kind of insight into what is going on, please let me know!
I'm using a GT 240 with driver 197.13 and CUDA 3.0 on Windows XP with Visual Studio 2005. The CUDA implementation runs inside a DLL that creates one thread per GPU found in the machine, so that all requests for a given GPU are serialized on its thread.
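For reference, the dispatch in the DLL looks roughly like this (names and the dummy request are illustrative, not the actual DLL code); the point is that the CUDA calls, including cudaMallocPitch, happen on a dedicated worker thread rather than on the thread that runs the OpenMP loop:

```cpp
#include <cstdio>
#include <windows.h>
#include <process.h>
#include <cuda_runtime.h>

// One worker thread per device; all CUDA work for that device runs here.
static unsigned __stdcall gpuWorker(void* arg)
{
    int device = *static_cast<int*>(arg);
    cudaSetDevice(device);              // this thread owns the device's context

    // Stand-in for one serialized image-processing request:
    void*  buf   = 0;
    size_t pitch = 0;
    cudaMallocPitch(&buf, &pitch, 1024 * sizeof(float), 768);
    cudaFree(buf);
    printf("device %d: request done\n", device);
    return 0;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    HANDLE threads[8];
    int    ids[8];
    int n = deviceCount < 8 ? deviceCount : 8;
    for (int i = 0; i < n; ++i)
    {
        ids[i] = i;
        threads[i] = (HANDLE)_beginthreadex(0, 0, gpuWorker, &ids[i], 0, 0);
    }
    if (n > 0)
        WaitForMultipleObjects(n, threads, TRUE, INFINITE);
    return 0;
}
```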