Hi,
I’d like to test an algorithm that uses atomics, so I ran the simpleAtomicIntrinsics sample from the SDK on Windows with a GTX 280,
and this is what I got:
[simpleAtomicIntrinsics]
CUDA device [GeForce GTX 280] has 30 Multi-Processors, Compute 1.3
[simpleAtomicIntrinsics]: Using Device 0: "GeForce GTX 280"
Processing time: 102.463837 (ms)
[simpleAtomicIntrinsics] - Test Summary
PASSED
Running the same sample on a Linux system with GTX480 gave the following result:
[simpleAtomicIntrinsics]
> Using CUDA device [0]: GeForce GTX 480
> GPU device has 15 Multi-Processors, SM 2.0 compute capabilities
Processing time: 2162.404053 (ms)
[simpleAtomicIntrinsics] - Test Summary
PASSED
How come the Fermi result is so much slower? Is it just because it is a sample?
Thanks
eyal
avidday
Was the Fermi version compiled for sm_20? There could be some JIT (re)compilation going on. When I run it on a GTX470 on linux, I get this:
avidday@cuda:~/simpleAtomicIntrinsics$ ./simpleAtomicIntrinsics
[simpleAtomicIntrinsics]
CUDA device [GeForce GTX 470] has 14 Multi-Processors, Compute 2.0
[simpleAtomicIntrinsics]: Using Device 0: "GeForce GTX 470"
Processing time: 280.096985 (ms)
[simpleAtomicIntrinsics] - Test Summary
PASSED
Press ENTER to exit...
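If JIT recompilation is the suspect, one way to rule it out of the measurement (a sketch, not the SDK's actual code; the kernel name, launch configuration, and `d_data` argument are placeholders) is to do an untimed warm-up launch before starting the timer:

```cuda
// Hypothetical warm-up to keep one-time PTX JIT / driver costs out of the timing.
testKernel<<<grid, block>>>(d_data);   // first launch: may trigger PTX JIT for this GPU
cudaThreadSynchronize();               // wait for the warm-up to finish (CUDA 3.x API)

cutStartTimer(timer);                  // start timing only now
testKernel<<<grid, block>>>(d_data);   // the launch actually being measured
cudaThreadSynchronize();
cutStopTimer(timer);
```

If the warm-up run makes the measured time collapse, the original number was dominated by one-time startup cost rather than by the kernel itself.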
Avid - thanks for the answer. I guess you’re also not at GTC :(
This is straight out of the SDK samples, but these are the build lines that were used. I guess it’s OK.
g++ -W -Wall -Wimplicit -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function -m64 -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -I../../../shared//inc -DUNIX -O2 -o obj/x86_64/release/simpleAtomicIntrinsics_gold.cpp.o -c simpleAtomicIntrinsics_gold.cpp
/usr/local/cuda/bin/nvcc -gencode=arch=compute_11,code=\"sm_11,compute_11\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" -o obj/x86_64/release/simpleAtomicIntrinsics.cu_11.o -c simpleAtomicIntrinsics.cu -m64 --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -I../../../shared//inc -DUNIX -O2
g++ -fPIC -m64 -o ../../bin/linux/release/simpleAtomicIntrinsics obj/x86_64/release/simpleAtomicIntrinsics_gold.cpp.o obj/x86_64/release/simpleAtomicIntrinsics.cu_11.o -L/usr/local/cuda/lib64 -L../../lib -L../../common/lib/linux -L../../../shared//lib -lcudart -L/usr/local/cuda/lib64 -L../../lib -L../../common/lib/linux -L../../../shared//lib -lcudart -lcutil_x86_64 -lshrutil_x86_64 -ldl -lpthread
In any case, even your result is still twice as slow as the result I got on the GTX 280.
EDIT: Now I’m not so sure it’s OK… :( The output shows simpleAtomicIntrinsics.cu_11.o and not _20, and if I remove the 11 value from the SM_VERSIONS line, the code won’t compile. Maybe it is indeed compiled for sm_11, but why?
EDIT 2: OK, I’ve found out why… weird, but you need to change the line CUFILES_sm_11 := simpleAtomicIntrinsics.cu to CUFILES_sm_20 := simpleAtomicIntrinsics.cu in the Makefile.
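For reference, the Makefile change looks like this (assuming the 3.x SDK's common.mk conventions, where the CUFILES_sm_XX variable decides which build rule, and therefore which lowest target architecture, a .cu file gets):

```make
# Before: the kernel is routed through the sm_11 build rule,
# producing obj/x86_64/release/simpleAtomicIntrinsics.cu_11.o
#CUFILES_sm_11 := simpleAtomicIntrinsics.cu

# After: build the kernel object through the sm_20 rule instead
CUFILES_sm_20  := simpleAtomicIntrinsics.cu
```

After this change the object file in the link line should read simpleAtomicIntrinsics.cu_20.o.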
In any case, the timings I got were the same as in the original post, ~2150 ms.
eyal
Hi,
Found out what the problem was :) cutStartTimer was called at the beginning of the program instead of right before the kernel invocation.
Once the call was moved to just before the kernel launch, these are the timings:
> Using CUDA device [0]: GeForce GTX 295
> GPU device has 30 Multi-Processors, SM 1.3 compute capabilities
Processing time: 45.803001 (ms)
> Using CUDA device [0]: GeForce GTX 480
> GPU device has 15 Multi-Processors, SM 2.0 compute capabilities
Calling testKernel....
Processing time: 1.362000 (ms)
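The same measurement can also be made without the cutil timers at all. A sketch using CUDA events, which are recorded in the GPU's command stream and therefore exclude host-side setup cost by construction (the kernel name, launch configuration, and `d_data` argument are placeholders, not the sample's actual code):

```cuda
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);             // record immediately before the kernel
testKernel<<<grid, block>>>(d_data);
cudaEventRecord(stop, 0);              // record immediately after the launch
cudaEventSynchronize(stop);            // wait until the kernel has finished

cudaEventElapsedTime(&ms, start, stop);
printf("Processing time: %f (ms)\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

cudaEventElapsedTime returns the GPU-side elapsed time between the two events in milliseconds, so it measures only the kernel, not allocation, host-to-device copies, or context creation.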
:)
eyal