Same code, mex is much slower than pure C, why?

I have a CUDA program for Matlab, but the mex version is much slower than the Visual Studio version, even though the code is identical except for the brief mexFunction that handles the input/output arguments. The mex version takes 3 seconds while the pure C version takes 0.5 seconds.

I am using a Quadro K2000M card, CUDA capability 3.0, CUDA Driver 5.5, runtime 5.0, programming with Visual Studio 2010. I followed the steps given by MATLAB, only changing the setting to -gencode=arch=compute_30,code="sm_30,compute_30" (deleting the lower version flags).

Are there any hints as to why the mex version is so much slower? Is there a solution for this?

I have answered your question on StackOverflow.

Just for reference, assuming the following scenario:

You have a CUDA code that, when compiled as a standalone program under Visual Studio, is faster than when interfaced by the mexFunction and compiled to be invoked under Matlab.

The first call to the mexFunction is generally “slow” because the CUDA context is set up, the kernel is processed by the driver, and the code is uploaded to the GPU.

Accordingly, to have a meaningful estimate of the execution time, one should first “warm up” the GPU by calling the kernel once, and then time the execution of subsequent calls. The timing should be calculated as the average time of many calls if the code is very fast.
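As a rough illustration of the warm-up-then-average idea above, here is a minimal sketch of a standalone timing harness. The kernel `scale`, the array size and the number of runs are all hypothetical placeholders, not taken from the original code, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the real GPU work.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;   // assumed problem size
    const int runs = 100;    // assumed number of timed calls
    float *d_x = NULL;

    // The first runtime call creates the CUDA context (one-off cost).
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch: deliberately not timed.
    scale<<<grid, block>>>(d_x, n, 1.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time many launches and report the average per call.
    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        scale<<<grid, block>>>(d_x, n, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per call: %f ms\n", ms / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

For the mex version the same pattern would apply from the Matlab side: call the mex function once without timing it, then time a loop of further calls and divide by the number of calls.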

Hi JFSebastian, I edited my question on StackOverflow. Please have a look. Both the pure C version and the mex version call the same C function that uses the GPU. Do you think I still need to warm something up for a fair comparison?

The StackOverflow link