I have a CUDA program for MATLAB, but the mex version is much slower than the Visual Studio version, even though the code is identical except for the brief mexFunction that handles the input/output arguments. The mex version takes 3 seconds, while the pure C version takes 0.5 seconds.
I am using a Quadro K2000M card (CUDA compute capability 3.0) with CUDA driver 5.5 and runtime 5.0, programming in Visual Studio 2010. I followed the steps for MATLAB's mexGPUExample.cu, changing only the setting to -gencode=arch=compute_30,code="sm_30,compute_30" (deleting the flags for lower versions).
Are there any hints as to why the mex version is so much slower? Is there a solution for this?