Help me... CUDA program runs slower than on the CPU. Did I miss any settings?

I started coding with CUDA less than a month ago. I am using a GeForce GT 610 to run the code, but I am getting very bad performance results. I could run the sample code SimpleMatrixMul without much effort, but the performance is about 1.33 GFLOPS, which is very low compared to what Wikipedia lists for the GT 610 (155 GFLOPS).

I also wrote some very simple kernels (adding and multiplying together three array variables of 2M elements each), and every time the CPU (Intel i3-4160) is faster than the GPU, even when not counting the time to copy data from host to device or from device to host.

The GT 610 graphics card is in a PCIe 2.0 x16 slot, and the motherboard has its own DVI output, which I am using for the display. Did I miss something basic when setting up CUDA? The OS is Windows 7 64-bit, with CUDA 7.0 integrated into Visual Studio 2013.

How are you compiling the code? Please show the complete nvcc command line. Make sure you are using a release build, not a debug build. In particular, make sure the nvcc command line does not contain the -G switch.

Hi, see below.

c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\0_Simple\matrixMul>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0\bin\nvcc.exe" -gencode=arch=compute_20,code="sm_20,compute_20" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_amd64" -I./ -I../../common/inc -I./ -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0/include" -I../../common/inc -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler "/wd 4819" -use_fast_math -DWIN32 -DWIN32 -D_MBCS -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -o x64/Release/ "c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\0_Simple\matrixMul\"

I have just noticed that the speed is 14 GFLOPS, not 1.3 GFLOPS as I mentioned earlier. But I believe my GPU should still be able to go a lot faster than that.

The output of my matrixMul run is as follows:

c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\bin\win64\Release>matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 610" with compute capability 2.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
Performance= 13.06 GFlop/s, Time= 10.039 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
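
For anyone who wants to sanity-check the figures in that output, here is a small host-only C++ sketch that reproduces them. It assumes the sample counts 2 floating-point operations (one multiply, one add) per inner-product term, which is consistent with the printed Size. The function names are made up for illustration and are not part of the CUDA sample.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for C = A * B, counting one multiply and one add
// per inner-product term (assumption; it matches the Size printed above).
std::uint64_t matmul_flops(std::uint64_t a_width, std::uint64_t a_height,
                           std::uint64_t b_width) {
    return 2ULL * a_width * a_height * b_width;
}

// Achieved throughput in GFlop/s, given the op count and elapsed milliseconds.
double gflops(std::uint64_t flops, double msec) {
    assert(msec > 0.0);
    return static_cast<double>(flops) * 1.0e-9 / (msec / 1000.0);
}
```

With the values from the log, matmul_flops(320, 320, 640) gives 131072000 and gflops(131072000, 10.039) gives roughly 13.06, so the reported Performance is consistent with the reported Time and Size.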

I think a matrix multiplication of that size might be bound by global memory bandwidth.

It is quite possible that the GeForce GT 610's maximum memory bandwidth of 14.4 GB/s is the limiting factor here.

Try using the profiling tools in Nsight to check the memory-related performance metrics.
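
To put a rough number on the bandwidth argument for the simple add/multiply test from the first post, here is a host-only C++ sketch. The actual kernel was not posted, so this assumes something like d[i] = a[i] * b[i] + c[i] on float arrays of 2M (2 * 1024 * 1024) elements, meaning each element moves 16 bytes (three reads, one write) and does 2 flops. The 14.4 GB/s figure is the one quoted above; the function names are hypothetical.

```cpp
#include <cassert>

// Lower bound on kernel time (ms) when limited only by memory bandwidth.
double min_time_ms(double bytes_moved, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return bytes_moved / (bandwidth_gb_per_s * 1.0e9) * 1000.0;
}

// Upper bound on GFlop/s for a bandwidth-bound kernel:
// (flops per byte of memory traffic) * (bandwidth in GB/s).
double bandwidth_bound_gflops(double flops_per_element, double bytes_per_element,
                              double bandwidth_gb_per_s) {
    assert(bytes_per_element > 0.0);
    return flops_per_element / bytes_per_element * bandwidth_gb_per_s;
}
```

Under those assumptions, min_time_ms(16.0 * 2097152, 14.4) is about 2.33 ms and bandwidth_bound_gflops(2.0, 16.0, 14.4) is 1.8 GFlop/s, regardless of how fast the cores are. A desktop CPU with dual-channel DDR3 usually has more memory bandwidth than 14.4 GB/s, which would be one explanation for the CPU winning on kernels like this.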


Is there any source code available that achieves the maximum computation speed? How do they arrive at 155 GFLOPS for the GT 610? Is it a theoretical calculation or a measured result?
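
For reference, a figure like 155 GFLOPS is normally a theoretical peak rather than a measurement. Below is a minimal C++ sketch of the usual calculation, assuming the commonly published GT 610 specs (48 CUDA cores, 1620 MHz shader clock; neither number appears in this thread) and one fused multiply-add, counted as 2 flops, per core per clock. The function name is made up for illustration.

```cpp
#include <cassert>

// Theoretical single-precision peak in GFlop/s:
// cores * shader clock (GHz) * flops per core per clock.
double peak_gflops(int cuda_cores, double shader_clock_mhz,
                   int flops_per_core_per_clock) {
    assert(cuda_cores > 0 && shader_clock_mhz > 0.0);
    return cuda_cores * (shader_clock_mhz / 1000.0) * flops_per_core_per_clock;
}
```

With those assumed specs, peak_gflops(48, 1620.0, 2) comes out at about 155.5, which lines up with the Wikipedia number. A real kernel only approaches that if every core issues an FMA on every cycle without ever waiting on memory, so the roughly 13 GFlop/s from matrixMul is about 8% of the theoretical peak.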