CUDA slows Matlab down: after GPU computation, Matlab does not use all 4 processors

I am having trouble after I do some computation on my GPU, and then want to process data in Matlab. This is the setup:

  • In Matlab I call a mex function, which invokes several kernels, and loads resulting data from GPU back to Matlab;
  • I run matrix inversion in Matlab.

The problem is that the matrix inversion no longer runs on all 4 processors of my QuadCore (which it normally does). In Windows Task Manager I can see that the inversion runs on only one processor while the other three sit idle. The result is a long computation time.

I am using CUDA 2.1, Matlab 2007a, Intel Core2 Quad, GeForce GTX 280

Does anyone have any idea what can be causing this??

  • did you change your preferences in Matlab?

  • did you change matlab version?

I see big differences between Matlab versions in how well the calculations are distributed over CPUs.

On quadcore with FC8, 2007b I get all 4 cores utilized

On VMware VM with 4 cores, Centos 5.2, 2008b, I get only 1 core utilized.

If there are no calls to the GPU (or prior to any calls to the GPU), matrix inversion in my Matlab utilizes all four processors. So the preferences regarding multithreading seem to be set correctly. However, I tried to play with them. I made the GPU call (after which the matrix inversion ran on one processor only), and then I tried to disable multithreading and enable it back, or change the number of threads Matlab should use, but to no avail. Only one processor was utilized. It was only after I closed Matlab and opened it again that the matrix inversion went back to running on all four processors.

I have not tried different versions of Matlab though, as I only have 2007a.

Today I tried with Matlab 2008a. The behavior was the same. I tried to set maxNumCompThreads(4) after the CUDA call, but with no improvement. It seems that calling cudaMalloc() is enough to cause this. Is anyone else experiencing this or is it just me??

Well, this looks like a matlab problem, so I would suggest asking the mathworks tech support (they are really good)

Actually, I would say it is a CUDA problem.

I just figured out how to solve it. It would seem that CUDA calls are changing processor affinity of the process, so that it runs on a single processor. Once you change the affinity to its original state, everything is OK (i.e. my matrix inversion gets back to running on all four processors).

This is how I did it:

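#include <windows.h>
#include "mex.h"

/* Standard MEX gateway signature. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    HANDLE h = GetCurrentProcess();
    DWORD_PTR PM, SM;  /* process and system affinity masks */

    /* Save the process affinity mask before any CUDA call. */
    GetProcessAffinityMask(h, &PM, &SM);

    /* ... CUDA calls ... */

    /* Restore the original process affinity mask. */
    SetProcessAffinityMask(h, PM);
}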

The question now is whether this can be considered a bug. It somehow does not strike me as an expected behavior…

BTW, I might have forgotten to mention that I use WinXPx64.
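To confirm that the affinity really changed, one option is to compare how many processors the mask allows before and after the CUDA calls. The following is only a sketch: count_mask_bits and report_affinity are hypothetical helper names, not part of CUDA or the MEX API, and the mask is assumed to have been cast to a 64-bit integer (for example the PM value returned by GetProcessAffinityMask).

```c
#include <stdint.h>
#include <stdio.h>

/* Count the set bits in an affinity mask. Each set bit stands for one
   logical processor the process is allowed to run on. */
int count_mask_bits(uint64_t mask)
{
    int count = 0;
    while (mask != 0) {
        mask &= mask - 1;  /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Hypothetical helper: print a labeled mask with its processor count. */
void report_affinity(const char *label, uint64_t mask)
{
    printf("%s: mask=0x%llx, processors=%d\n",
           label, (unsigned long long)mask, count_mask_bits(mask));
}
```

Calling report_affinity("before", (uint64_t)PM) ahead of the CUDA calls and again with a freshly queried mask afterwards would show a count of 4 dropping to 1 if the process has been pinned to a single core.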

Hmm, this might be because certain CUDA calls will spin the processor while a kernel is still running. Doing so prevents the process from migrating constantly.

You are using the runtime API? It might be that this does not happen with the driver API. Also which version of CUDA are you using?

I have a simulation that used all 4 cores (and also used CUDA), that is now only running on 2 cores. And I think nothing changed apart from the CUDA version (I did not recompile the .mexa64 file)

I am on linux64.
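For Linux, the same save-and-restore idea could be sketched with sched_getaffinity and sched_setaffinity. This is an untested assumption about what might help there, not something confirmed in this thread. Note that on Linux the affinity is per thread, and pid 0 means the calling thread. The names run_with_saved_affinity and pin_to_one_cpu are hypothetical; pin_to_one_cpu only stands in for the CUDA calls by mimicking the reported effect.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>

/* Hypothetical wrapper: run work(arg), then restore the CPU affinity mask
   the calling thread had on entry. Returns 0 on success, -1 if the mask
   could not be read or restored. */
int run_with_saved_affinity(void (*work)(void *), void *arg)
{
    cpu_set_t saved;

    if (sched_getaffinity(0, sizeof(saved), &saved) != 0)
        return -1;

    work(arg);  /* e.g. the CUDA calls */

    if (sched_setaffinity(0, sizeof(saved), &saved) != 0)
        return -1;
    return 0;
}

/* Stand-in for the CUDA calls: pins the calling thread to the first CPU
   in its current mask, to mimic the behavior described in this thread. */
void pin_to_one_cpu(void *arg)
{
    cpu_set_t current, one;
    int cpu;

    (void)arg;  /* unused */
    if (sched_getaffinity(0, sizeof(current), &current) != 0)
        return;

    CPU_ZERO(&one);
    for (cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &current)) {
            CPU_SET(cpu, &one);
            break;
        }
    }
    sched_setaffinity(0, sizeof(one), &one);
}
```

A call such as run_with_saved_affinity(pin_to_one_cpu, NULL) leaves the calling thread with the mask it started with. In a real mex file the callback would wrap the CUDA calls instead.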

I am using runtime API and CUDA 2.1.

The change of the affinity, however, already happens after the cudaMalloc call. My personal understanding is that the higher processor usage appears only during the memory transfers (cudaMemcpy), when the processor copies data between page-locked memory and the host memory where the user has or wants to get their data (I am not 100% sure though).

It may be that for some reason it is convenient to pin the process with CUDA calls to one processor; however, I would expect that after CUDA is done, it would change the affinity mask back to its original state. Perhaps, this is how it is done in the Linux driver. nVidia people would be more fit to explain.

As far as I know the driver does nothing to change the affinity, but I’ll investigate this more soon.

Thus far, unable to repro. What driver versions are being used?

Also, if you have CUDA_PROFILE=1, that will force affinity to a single core.

Yep, CUDA_PROFILE=1 was the culprit. After setting CUDA_PROFILE=0, the affinity was not changed (and Matlab is crunching happily on all four cores).
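A mex file could check for this setting at startup so the slowdown does not go unnoticed. The sketch below is an assumption-laden example: the function names are hypothetical, and it only treats the exact value "1" as enabled, since that is the only value discussed in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the given CUDA_PROFILE value is exactly "1", which
   according to this thread forces affinity to a single core.
   Returns 0 otherwise, including when the variable is unset (NULL). */
int cuda_profile_value_enabled(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads CUDA_PROFILE from the environment of the current process. */
int cuda_profile_enabled(void)
{
    return cuda_profile_value_enabled(getenv("CUDA_PROFILE"));
}
```

Inside mexFunction, a nonzero result from cuda_profile_enabled() could be reported with mexWarnMsgTxt before any CUDA call is made.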

Thank you for clarifying this.

hooray, glad that’s all it was

For me the same! Thanks for the hint!