Atomic functions: better programming on a cheaper GPU?

I have an NVIDIA GeForce 8800 GTX, but when I wanted to use atomic functions I found that only the 8600 GPUs support them. The more expensive GPU cannot use those functions, so I cannot use global variables the way I want to.

The 8800 GTX architecture is compute capability 1.0, but the 1.1 architecture has to be given to the compiler (-arch sm_11) to use atomic functions.

Will future releases support atomic functions on the 1.0 architecture, or should I downgrade my GPU to an 8600?

Unfortunately, I would suggest you downgrade, as I did.

G80-based cards (GeForce 8800 / Quadro FX 5600) only have hardware for compute capability 1.0 (sm_10). There is no way atomic operations can be enabled by a software upgrade.

It is quite common for NVIDIA to introduce new features on low-end cards first, since these often come later in the product cycle. Future products will support compute capability 1.1 or higher.

If you can tell us what you’re trying to do with atomics, we may be able to suggest workarounds for earlier hardware.

I was trying to parallelize a process that calculates values like “mean” and “standard deviation”. All of these require access to the same variable from different blocks and threads.

For threads within a block the solution seems to be an easy one: use a shared variable and let one thread do the job after a __syncthreads() call.

But communication between different blocks creates a problem. Using global memory may solve it in a roundabout way: allocate an array with one element per block, have each block store its values in its own element (indexed by blockIdx), and then call a second kernel function to sum them up. I cannot have all blocks update a single global variable in the kernel, since two different blocks may access it at the same time and try to change the same value (which must first have been read and then increased, for example), so the calculation comes out wrong.

Is there any faster way of doing this, without calling two different kernel functions?

Yes, it is possible to do these kinds of operations in CUDA without using atomics by using what are called parallel reductions. Here are some good references:…performance.ppt…an/doc/scan.pdf
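To make the idea concrete, here is a minimal sketch of a reduction that computes the mean and standard deviation without atomics. It is not taken from the references above, and everything in it is an assumption for illustration: the kernel name partialSums, the BLOCK_SIZE of 256, the 64 blocks, and the test data. Each block reduces its share of the input in shared memory and writes one partial sum and one partial sum of squares to its own slot in global memory, so no two blocks ever write to the same address. Because there are only a few partial results, this sketch finishes the sum on the host instead of launching a second kernel.

```cuda
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed block size; must be a power of two for the halving loop
#define NUM_BLOCKS 64    // assumed grid size, chosen only for illustration

// Each block writes one partial sum and one partial sum of squares.
// Must be launched with blockDim.x == BLOCK_SIZE.
__global__ void partialSums(const float *x, int n, float *blockSum, float *blockSumSq)
{
    __shared__ float s[BLOCK_SIZE];
    __shared__ float sq[BLOCK_SIZE];

    int tid = threadIdx.x;
    float a = 0.0f, b = 0.0f;

    // Grid-stride loop: each thread accumulates its own elements in registers.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x) {
        float v = x[i];
        a += v;
        b += v * v;
    }
    s[tid] = a;
    sq[tid] = b;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s[tid]  += s[tid + stride];
            sq[tid] += sq[tid + stride];
        }
        __syncthreads();
    }

    // One thread per block stores the result in that block's own slot.
    if (tid == 0) {
        blockSum[blockIdx.x]   = s[0];
        blockSumSq[blockIdx.x] = sq[0];
    }
}

int main()
{
    const int n = 1 << 20;
    float *h_x = (float *)malloc(n * sizeof(float));
    double refSum = 0.0, refSumSq = 0.0;
    for (int i = 0; i < n; ++i) {            // made-up test data
        h_x[i] = (float)(i % 10);
        refSum += h_x[i];
        refSumSq += (double)h_x[i] * h_x[i];
    }

    float *d_x, *d_sum, *d_sumSq;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_sum, NUM_BLOCKS * sizeof(float));
    cudaMalloc((void **)&d_sumSq, NUM_BLOCKS * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    partialSums<<<NUM_BLOCKS, BLOCK_SIZE>>>(d_x, n, d_sum, d_sumSq);

    float h_sum[NUM_BLOCKS], h_sumSq[NUM_BLOCKS];
    cudaMemcpy(h_sum, d_sum, sizeof(h_sum), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_sumSq, d_sumSq, sizeof(h_sumSq), cudaMemcpyDeviceToHost);

    // Finish on the host: only NUM_BLOCKS values are left to add.
    double sum = 0.0, sumSq = 0.0;
    for (int b = 0; b < NUM_BLOCKS; ++b) {
        sum += h_sum[b];
        sumSq += h_sumSq[b];
    }
    double mean = sum / n;
    double stddev = sqrt(sumSq / n - mean * mean);   // population standard deviation

    double refMean = refSum / n;
    double refStd = sqrt(refSumSq / n - refMean * refMean);
    printf("GPU: mean = %f, stddev = %f\n", mean, stddev);
    printf("CPU: mean = %f, stddev = %f\n", refMean, refStd);

    cudaFree(d_x); cudaFree(d_sum); cudaFree(d_sumSq);
    free(h_x);
    return 0;
}
```

This uses only shared memory and __syncthreads(), so it should not need -arch sm_11. If you would rather keep everything on the GPU, the same kernel pattern can be launched a second time with a single block over the partial arrays. Note that computing the variance as E[x^2] - mean^2 in single precision can lose accuracy on large or badly scaled data, so treat the formula here as a starting point only.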