Help me... Cuda program execution is slower than CPU...Did I miss any settings??

naushica · September 24, 2015, 6:15am

I started coding with CUDA less than a month ago. I am using a GeForce GT 610 to run the code, but I am getting very poor performance. I could build and run the matrixMul sample without much effort, but it reaches only about 1.33 GFLOPS, which is very low compared to the 155 GFLOPS that Wikipedia lists for the GT 610.

I also wrote some very simple kernels (adding and multiplying three variables together, with arrays of 2 Mega elements), and every time the CPU (Intel i3-4160) is faster than the GPU, even when I do not count the time to copy data between host and device.

The GT 610 graphics card is in a PCIe 2.0 x16 slot, and I am using the motherboard's own DVI output for the display. Did I miss something basic when setting up CUDA? The OS is 64-bit Windows 7, with CUDA 7.0 integrated into Visual Studio 2013.

njuffa · September 24, 2015, 6:45am

How are you compiling the code? Please show the complete nvcc command line. Make sure you are using a release build, not a debug build. In particular, make sure the nvcc command line does not contain the -G switch.

naushica · September 24, 2015, 9:04am

Hi, see below.

c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\0_Simple\matrixMul>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0\bin\nvcc.exe" -gencode=arch=compute_20,code="sm_20,compute_20" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_amd64" -I./ -I…/…/common/inc -I./ -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0/include" -I…/…/common/inc -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler "/wd 4819" -use_fast_math -DWIN32 -DWIN32 -D_MBCS -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Ox /Zi /MT " -o x64/Release/matrixMul.cu.obj "c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\0_Simple\matrixMul\matrixMul.cu"

naushica · September 24, 2015, 9:11am

I have just noticed that the speed is 14 GFLOPS, not 1.3 GFLOPS as I mentioned earlier, but I believe my GPU should still be capable of much more.

The output of my matrixMul run is as follows:

c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\bin\win64\Release>matrixMul.exe
[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “GeForce GT 610” with compute capability 2.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 13.06 GFlop/s, Time= 10.039 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
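
As a sanity check on the figures in this output, the reported rate can be recomputed from the operation count and the kernel time. The sketch below is only illustrative: it assumes the sample counts 2 operations (one multiply, one add) per inner-product term, and the variable names are made up for this example.

```python
# Matrix dimensions as printed by the sample: MatrixA(320,320), MatrixB(640,320).
# Assumption: the printed pairs are (width, height), so the product has
# 320 rows, 640 columns, and an inner dimension of 320.
rows, inner, cols = 320, 320, 640

# Assumption: 2 floating-point operations (multiply + add) per inner-product term.
flops = 2 * rows * inner * cols

# "Time= 10.039 msec" from the output above, converted to seconds.
time_s = 10.039e-3

# Rate in units of 1e9 floating-point operations per second.
gflops = flops / time_s / 1e9

print(f"operations: {flops}")
print(f"rate: {gflops:.2f} GFlop/s")
```

Under those assumptions the operation count matches the Size= field (131072000) and the rate matches the Performance= field (13.06 GFlop/s), so the sample's number is at least self-consistent.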

cbuchner1 · September 24, 2015, 9:15am

I think a matrix multiplication of that size might be bound by global memory access speed.

There is a certain likelihood that the GeForce GT 610's maximum memory bandwidth of 14.4 GB/s is a limiting factor here.

Try using the profiling tools in Nsight to check the memory-related performance metrics.

Christian
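
The same bandwidth argument can be applied to the element-wise test from the first post. The sketch below is a rough estimate, not a measurement. It assumes single-precision data, a kernel of the form d[i] = a[i] * b[i] + c[i] (the actual code was not posted), "2 Mega elements" meaning 2 * 1024 * 1024, and the 14.4 GB/s figure quoted above.

```python
# Peak memory bandwidth quoted in this thread for the GT 610, in GB/s.
bandwidth_gb_s = 14.4

# Assumption: "2 Mega elements" means 2 * 1024 * 1024 array entries.
n = 2 * 1024 * 1024

# Assumption: d[i] = a[i] * b[i] + c[i] on 4-byte floats,
# i.e. three reads and one write per element.
bytes_per_element = 4 * 4
flops_per_element = 2  # one multiply, one add

# Arithmetic intensity: operations per byte of memory traffic.
intensity = flops_per_element / bytes_per_element

# Upper bound on throughput if memory bandwidth is the only limit.
bound_gflops = bandwidth_gb_s * intensity

# Shortest possible kernel time at full bandwidth, in milliseconds.
min_time_ms = n * bytes_per_element / (bandwidth_gb_s * 1e9) * 1e3

print(f"intensity: {intensity} flop/byte")
print(f"bandwidth-bound rate: {bound_gflops:.2f} GFLOP/s")
print(f"minimum kernel time: {min_time_ms:.2f} ms")
```

With these assumptions such a kernel cannot exceed about 1.8 GFLOP/s however well it is written, because every element needs 16 bytes of memory traffic for only 2 operations. A desktop CPU with faster system memory could plausibly beat that, which would be consistent with what the first post reports.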

naushica · September 24, 2015, 2:20pm

Is there any source code that achieves the maximum computation speed? How do they arrive at 155 GFLOPS for the GT 610? Is it a theoretical calculation or a tested result?
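
On the last question: figures like this are normally theoretical peaks derived from the hardware specification rather than benchmark results. A minimal sketch of that arithmetic follows. The core count and shader clock are commonly listed GT 610 specifications and do not appear in this thread, so treat them as assumptions.

```python
# Assumed GT 610 specifications (not stated in this thread).
cuda_cores = 48
shader_clock_ghz = 1.62

# A fused multiply-add is conventionally counted as 2 floating-point operations.
flops_per_core_per_cycle = 2

# Theoretical single-precision peak: every core issues one FMA every cycle.
peak_gflops = cuda_cores * shader_clock_ghz * flops_per_core_per_cycle

# Rate reported by the matrixMul run earlier in the thread.
measured_gflops = 13.06

print(f"theoretical peak: {peak_gflops:.1f} GFLOP/s")
print(f"matrixMul reached {measured_gflops / peak_gflops:.1%} of that peak")
```

The peak assumes every core issues a multiply-add on every clock cycle with no stalls for memory, so real kernels generally land well below it.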
