Atomic intrinsics

Hi,

I’d like to test an algorithm that uses atomics, so I ran the simpleAtomicIntrinsics sample from the SDK on Windows with a GTX 280,

and this is what I got:

[simpleAtomicIntrinsics]

CUDA device [GeForce GTX 280] has 30 Multi-Processors, Compute 1.3

[simpleAtomicIntrinsics]: Using Device 0: "GeForce GTX 280"

Processing time: 102.463837 (ms)

[simpleAtomicIntrinsics] - Test Summary

PASSED

Running the same sample on a Linux system with a GTX 480 gave the following result:

[simpleAtomicIntrinsics]

> Using CUDA device [0]: GeForce GTX 480

> GPU device has 15 Multi-Processors, SM 2.0 compute capabilities

Processing time: 2162.404053 (ms)

[simpleAtomicIntrinsics] - Test Summary

PASSED

How come the Fermi result is so much slower? Is it just because it is a sample?

Thanks

eyal

Was the Fermi version compiled for sm_20? There could be some JIT (re)compilation going on. When I run it on a GTX470 on linux, I get this:

avidday@cuda:~/simpleAtomicIntrinsics$ ./simpleAtomicIntrinsics 

[simpleAtomicIntrinsics]

CUDA device [GeForce GTX 470] has 14 Multi-Processors, Compute 2.0

[simpleAtomicIntrinsics]: Using Device 0: "GeForce GTX 470"

Processing time: 280.096985 (ms)

[simpleAtomicIntrinsics] - Test Summary

PASSED

Press ENTER to exit...
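
One way to check whether the sm_20 code is actually what ends up running (rather than a JIT of the 1.1 PTX) is to ask the runtime which architecture it loaded for the kernel. This is only a rough standalone sketch, assuming your toolkit's cudaFuncAttributes exposes ptxVersion and binaryVersion; the dummy kernel stands in for the sample's testKernel:

#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel standing in for the sample's testKernel.
// Build with e.g.: nvcc -arch=sm_20 arch_query.cu -o arch_query
// (atomicAdd on int needs at least sm_11).
__global__ void dummyKernel(int *out)
{
    atomicAdd(out, 1);
}

int main()
{
    int *d_out = 0;
    cudaMalloc((void **)&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));

    // Launch once so the module containing the kernel is loaded.
    dummyKernel<<<1, 32>>>(d_out);
    cudaThreadSynchronize();

    // Ask the runtime what it selected for this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, dummyKernel);

    // ptxVersion is the virtual arch of the PTX the function came from
    // (e.g. 11 or 20); binaryVersion is the arch of the machine code that
    // will actually run (e.g. 20 on Fermi). ptxVersion 11 together with
    // binaryVersion 20 would mean the sm_20 code was JIT-compiled from
    // the compute_11 PTX.
    printf("ptxVersion = %d, binaryVersion = %d\n",
           attr.ptxVersion, attr.binaryVersion);

    cudaFree(d_out);
    return 0;
}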

Avid - thanks for the answer. I guess you’re also not at GTC :(

This is straight out of the SDK samples; in any case, these are the lines that were used. I guess it’s ok.

g++ -W -Wall -Wimplicit -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function   -m64 -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -I../../../shared//inc -DUNIX -O2  -o obj/x86_64/release/simpleAtomicIntrinsics_gold.cpp.o -c simpleAtomicIntrinsics_gold.cpp

/usr/local/cuda/bin/nvcc  -gencode=arch=compute_11,code=\"sm_11,compute_11\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" -o obj/x86_64/release/simpleAtomicIntrinsics.cu_11.o -c simpleAtomicIntrinsics.cu  -m64 --compiler-options -fno-strict-aliasing  -I. -I/usr/local/cuda/include -I../../common/inc -I../../../shared//inc -DUNIX -O2

g++ -fPIC   -m64 -o ../../bin/linux/release/simpleAtomicIntrinsics obj/x86_64/release/simpleAtomicIntrinsics_gold.cpp.o	obj/x86_64/release/simpleAtomicIntrinsics.cu_11.o	-L/usr/local/cuda/lib64 -L../../lib -L../../common/lib/linux -L../../../shared//lib -lcudart	 -L/usr/local/cuda/lib64 -L../../lib -L../../common/lib/linux -L../../../shared//lib -lcudart -lcutil_x86_64 -lshrutil_x86_64 -ldl -lpthread

In any case, even your result is still twice as slow as the result I got on the GTX 280.

EDIT: Now I’m not so sure it’s ok… :( the build output shows simpleAtomicIntrinsics.cu_11.o rather than _20, and if I remove the 11 value from the SM_VERSIONS line, the code won’t compile. Maybe it is indeed compiled for sm_11, but why?

EDIT 2: Ok, I’ve found out why… weird, but you need to change the line CUFILES_sm_11 := simpleAtomicIntrinsics.cu to CUFILES_sm_20 := simpleAtomicIntrinsics.cu in the Makefile.

In any case, the timings I got were the same as in the original post, ~2150 ms.
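
Another quick check, from the device side: __CUDA_ARCH__ is baked into the device code at compile time, so reading it back tells you which code path is actually executing (110 if the 1.1 build is still being used, 200 for a real sm_20 build). A minimal sketch, not part of the sample:

#include <cstdio>
#include <cuda_runtime.h>

// Writes back the virtual architecture the *running* device code was compiled for.
__global__ void reportArch(int *out)
{
#ifdef __CUDA_ARCH__
    *out = __CUDA_ARCH__;   // 110 for compute_11, 200 for compute_20, ...
#else
    *out = 0;               // host compilation pass; never executed on the GPU
#endif
}

int main()
{
    int *d_arch = 0;
    int h_arch = -1;

    cudaMalloc((void **)&d_arch, sizeof(int));
    reportArch<<<1, 1>>>(d_arch);
    cudaMemcpy(&h_arch, d_arch, sizeof(int), cudaMemcpyDeviceToHost);

    printf("device code compiled for __CUDA_ARCH__ = %d\n", h_arch);

    cudaFree(d_arch);
    return 0;
}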

eyal

Hi,

Found out what the problem was :) cutStartTimer was called at the beginning of the program instead of right before the kernel invocation.

Once the call was moved to right before the kernel, these are the timings:

> Using CUDA device [0]: GeForce GTX 295

> GPU device has 30 Multi-Processors, SM 1.3 compute capabilities

Processing time: 45.803001 (ms)

> Using CUDA device [0]: GeForce GTX 480

> GPU device has 15 Multi-Processors, SM 2.0 compute capabilities

Calling testKernel....

Processing time: 1.362000 (ms)

:)
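
For reference, this is roughly what the corrected placement looks like. The SDK sample uses the cutil timer (cutStartTimer/cutStopTimer); the sketch below uses CUDA events instead so it is self-contained, and a dummy atomic kernel stands in for testKernel. The point is just that timing starts immediately before the launch and stops after a synchronize, so allocation, host-to-device copies and one-time driver/JIT overhead aren’t counted:

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the sample's testKernel: every thread bumps a shared counter.
__global__ void dummyAtomicKernel(int *counter)
{
    atomicAdd(counter, 1);
}

int main()
{
    const int numThreads = 256, numBlocks = 64;

    int *d_counter = 0;
    cudaMalloc((void **)&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));          // setup: NOT timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Optional warm-up launch so module-load/JIT cost isn't timed either.
    dummyAtomicKernel<<<numBlocks, numThreads>>>(d_counter);
    cudaThreadSynchronize();

    cudaEventRecord(start, 0);                      // start right before the launch
    dummyAtomicKernel<<<numBlocks, numThreads>>>(d_counter);
    cudaEventRecord(stop, 0);                       // stop right after it
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Processing time: %f (ms)\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_counter);
    return 0;
}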

eyal
