I am having trouble after I do some computation on my GPU and then want to process the data in Matlab. This is the setup:
In Matlab I call a mex function, which invokes several kernels and loads the resulting data from the GPU back into Matlab;
I run matrix inversion in Matlab.
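Roughly, the mex part looks like this (a simplified sketch, not my actual code; myKernel and the single-precision input are stand-ins for the real kernels and data):

```cpp
#include "mex.h"
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;            // stand-in for the real computation
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int n = (int)mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateNumericMatrix(1, n, mxSINGLE_CLASS, mxREAL);
    float *h_in  = (float *)mxGetData(prhs[0]);   // assumes single-precision input
    float *h_out = (float *)mxGetData(plhs[0]);

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    // load the resulting data from the GPU back to Matlab
    cudaMemcpy(h_out, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```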
The problem is that the matrix inversion no longer runs on all 4 processors of my QuadCore (which it normally does). In Windows Task Manager I can see that the inversion runs on only one processor while the other three idle. The result is a long computation time.
I am using CUDA 2.1, Matlab 2007a, an Intel Core2 Quad, and a GeForce GTX 280.
Does anyone have any idea what could be causing this?
If there are no calls to the GPU (or prior to any calls to the GPU), matrix inversion in my Matlab utilizes all four processors, so the preferences regarding multithreading seem to be set correctly. I tried to play with them anyway: I made the GPU call (after which the matrix inversion ran on one processor only), and then tried to disable multithreading and re-enable it, or to change the number of threads Matlab should use, but to no avail. Only one processor was utilized. Only after I closed Matlab and opened it again did the matrix inversion go back to running on all four processors.
I have not tried different versions of Matlab, though, as I only have 2007a.
Today I tried with Matlab 2008a. The behavior was the same. I tried to set maxNumCompThreads(4) after the CUDA call, but with no improvement. It seems that calling cudaMalloc() alone is enough to cause this (see the stripped-down mex function below). Is anyone else experiencing this, or is it just me?
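To illustrate, a mex function as trivial as this sketch seems sufficient to trigger the problem:

```cpp
#include "mex.h"
#include <cuda_runtime.h>

// Does no useful work; merely touching the CUDA runtime (the first call
// implicitly creates a context) appears to be enough to change the affinity.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    void *d = 0;
    cudaMalloc(&d, 16);
    cudaFree(d);
}
```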
I just figured out how to solve it. It would seem that CUDA calls change the processor affinity of the process so that it runs on a single processor. Once you change the affinity back to its original state, everything is OK (i.e. my matrix inversion goes back to running on all four processors).
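Concretely, on Windows this amounts to saving the affinity mask before the first CUDA call and restoring it afterwards. A minimal sketch (error checking omitted; the body is a placeholder for the real mex code):

```cpp
#include <windows.h>
#include <cuda_runtime.h>

void runGpuWork(void)
{
    // Save the process affinity mask before touching CUDA
    DWORD_PTR procMask = 0, sysMask = 0;
    GetProcessAffinityMask(GetCurrentProcess(), &procMask, &sysMask);

    void *d = 0;
    cudaMalloc(&d, 16);   // first CUDA call; this is where the mask changes
    // ... kernels, cudaMemcpy, etc. ...
    cudaFree(d);

    // Restore the mask so Matlab's multithreaded code can use all cores again
    SetProcessAffinityMask(GetCurrentProcess(), procMask);
}
```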
Hmm, this might be because certain CUDA calls spin the processor while a kernel is still running, and pinning the process prevents it from constantly migrating between cores.
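If that is the cause, the scheduling policy can be selected explicitly, assuming your CUDA version exposes the flags (the driver API has CU_CTX_SCHED_* flags on cuCtxCreate(); the runtime call sketched below may not exist in 2.1):

```cpp
#include <cuda_runtime.h>

// Sketch, assuming these flags are available in your CUDA version: ask the
// runtime to yield (or block) instead of spinning while waiting for the GPU.
// Must run before the context is created, i.e. before any other CUDA call.
void setYieldScheduling(void)
{
    cudaSetDeviceFlags(cudaDeviceScheduleYield);
    // or cudaDeviceScheduleBlockingSync to sleep on a sync primitive instead
}
```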
Are you using the runtime API? It might be that this does not happen with the driver API. Also, which version of CUDA are you using?
I have a simulation that used all 4 cores (and also used CUDA) but now runs on only 2 cores, and I think nothing changed apart from the CUDA version (I did not recompile the .mexa64 file).
The change of affinity, however, already happens after the cudaMalloc() call. My personal understanding is that the higher processor usage appears only during the memory transfers (cudaMemcpy()), when the processor copies data between one of the page-locked buffers and the host memory where the user has, or wants to get, their data (I am not 100% sure, though).
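If that is right, allocating the host buffer as page-locked memory yourself should avoid the intermediate copy; a sketch of what I mean (an untested assumption on my part):

```cpp
#include <cuda_runtime.h>

// With a page-locked host buffer there is no intermediate staging copy,
// so cudaMemcpy can transfer directly between host and device memory.
void pinnedTransfer(int n)
{
    float *h = 0, *d = 0;
    cudaMallocHost((void **)&h, n * sizeof(float));   // page-locked host memory
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // direct transfer
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    cudaFreeHost(h);
}
```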
It may be that for some reason it is convenient to pin the process making CUDA calls to one processor; however, I would expect that after CUDA is done, it would change the affinity mask back to its original state. Perhaps this is how it is done in the Linux driver. nVidia people would be better placed to explain.