I started coding with CUDA less than a month ago. I am using a GeForce GT 610 to run the code, but I am getting very poor performance. I was able to run the sample code simpleMatrixMul without much effort, but the measured performance is about 1.33 GFLOPS, which is very low compared to the figure Wikipedia gives for the GT 610 (155 GFLOPS).
I also wrote some very simple code (adding and multiplying three variables together, with arrays of 2 mega elements), and every time the CPU (an Intel i3-4160) is faster than the GPU, even without counting the time to copy data from host to device or device to host.
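For reference, a minimal sketch of the kind of elementwise kernel described above (all names are hypothetical). Note that each element needs three loads and one store for only two floating-point operations, so a kernel like this is memory-bandwidth-bound rather than compute-bound, which is one reason a fast CPU can keep up on it:

```cuda
// Hypothetical elementwise kernel: out[i] = (a[i] + b[i]) * c[i].
// 3 loads + 1 store per element for 2 flops => bandwidth-bound.
__global__ void addMul(const float *a, const float *b, const float *c,
                       float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (a[i] + b[i]) * c[i];
}

// Launch with enough blocks to cover all n elements, e.g.:
//   int n = 1 << 21;                        // ~2M elements
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;
//   addMul<<<blocks, threads>>>(dA, dB, dC, dOut, n);
```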
The GT 610 graphics card is in a PCIe 2.0 x16 slot, and the motherboard has its own DVI output, which I am using for the display. Did I miss something basic when setting up CUDA? The PC runs Windows 7 64-bit, with the CUDA 7.0 environment integrated into Visual Studio 2013.
How are you compiling the code? Please show the complete nvcc command line. Make sure you are using a release build, not a debug build. In particular, make sure the nvcc command line does not contain the -G switch.
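For reference, a typical release-style compile line might look like the following (file names are assumptions; -arch=sm_21 matches the GT 610's compute capability 2.1):

```shell
# Release build: host optimization on, and no -G (device debug) switch,
# which disables device code optimization and badly hurts performance.
nvcc -O3 -arch=sm_21 -o matrixMul matrixMul.cu
```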
I have just observed that the speed is 14 GFLOPS, not 1.3 GFLOPS as I mentioned earlier, but I believe there are still a lot of speed improvements that can be achieved by my GPU.
The output of my matrixMul run is the following:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\bin\win64\Release>matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 610" with compute capability 2.1
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 13.06 GFlop/s, Time= 10.039 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Is there source code available that achieves the maximum computation speed? And how do they arrive at 155 GFLOPS for the GT 610? Is it a theoretical calculation or a measured result?
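The 155 GFLOPS figure is a theoretical single-precision peak, not a measured result. A sketch of the arithmetic, assuming the commonly published GT 610 specs (48 CUDA cores, one fused multiply-add per core per clock, ~1.62 GHz shader clock):

```python
# Theoretical peak for a GT 610 (Fermi, compute capability 2.1).
# The spec values below are assumptions from published data sheets.
cuda_cores = 48          # one SM with 48 CUDA cores
flops_per_core = 2       # one fused multiply-add = 2 flops per clock
shader_clock_ghz = 1.62  # ~1620 MHz shader clock

peak_gflops = cuda_cores * flops_per_core * shader_clock_ghz
print(peak_gflops)  # ~155.5 GFLOPS, single precision, theoretical
```

Real kernels never reach this peak; a well-tuned dense matrix multiply (e.g. cuBLAS SGEMM) is about as close as practical code gets, and the 13 GFLOPS from the sample is well below that mainly because the sample kernel is written for clarity, not speed.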