Does calling a CUDA function disable OpenMP? Can they co-exist in the same application?

I have an image processing function that’s implemented for both CUDA and OpenMP. Both implementations run fine when run separately.

Then I wrote a benchmark to compare the processing times of the two implementations, and I found a problem: once the CUDA implementation has executed, the OpenMP implementation is no longer parallelized. Instead of being split across 4 threads, the loop runs on a single thread. The processing time goes up, and I can see CPU usage drop to 25% instead of 100% (I have a 4-core machine).

What could cause this? I thought the two APIs were independent. I removed portions of the CUDA code one by one and found that OpenMP becomes disabled as soon as I call cudaMallocPitch to allocate an image buffer on the device.

If anyone has any kind of insight on what is going on please let me know!

I’m using a GT 240 with driver 197.13 and CUDA 3.0 on Windows XP with Visual Studio 2005. The CUDA implementation runs inside a DLL that creates one thread per GPU found in the system, in order to serialize all requests for that GPU.

That looks weird… I’m curious to see if there’s a compatibility issue out there…

Are you using the profiler? It sets CPU affinity for timing purposes. As does cutil, probably.

(insert your very own “seriously guys don’t use cutil for anything, you don’t know what it actually does” plea here)