CUDA 2.3a/nvcc frustrations

I am also getting an error I saw quoted in the thread relating to CUDA 2.3a and Thrust, but I get it without even using Thrust. It doesn’t seem to take much to produce it.

This bit of code will trigger it:


#include <string>

int main()
{
    std::string s = "Hello, world.";
    return 0;
}



This results in:


/usr/include/c++/4.2.1/ext/atomicity.h(51): error: identifier “__sync_fetch_and_add” is undefined

/usr/include/c++/4.2.1/ext/atomicity.h(55): error: identifier “__sync_fetch_and_add” is undefined


Any ideas on how to fix this?

Not using std::string is not an option. :)


As a workaround, try making this the first line of your program:

#undef _GLIBCXX_ATOMIC_BUILTINS
Unfortunately, this workaround does not always work. Some external libraries, e.g. shared_ptr from Boost 1.40.0, seem to require functions that are disabled by nvcc. This code, for example,




#include <iostream>
#include <boost/shared_ptr.hpp>

int main() {
    boost::shared_ptr< int > i( new int );
    *i = 10;
    std::cout << *i << std::endl;
    return 0;
}

produces these warnings and errors:

/opt/boost/boost_1_40_0/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp(49): warning: “cc” clobber ignored

/opt/boost/boost_1_40_0/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp(65): warning: “cc” clobber ignored

/opt/boost/boost_1_40_0/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp(91): warning: “cc” clobber ignored

/opt/boost/boost_1_40_0/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp(75): warning: variable “tmp” was set but never used

/opt/boost/boost_1_40_0/include/boost/smart_ptr/detail/spinlock_sync.hpp(40): error: identifier “__sync_lock_test_and_set” is undefined

/opt/boost/boost_1_40_0/include/boost/smart_ptr/detail/spinlock_sync.hpp(54): error: identifier “__sync_lock_release” is undefined


Compiling the same code with g++ does not even generate a warning. Does anybody have an idea what to do about this, other than wait for a fix from NVIDIA?

Why are you compiling C++ code with nvcc? I only compile CUDA kernels and C host functions that call them with nvcc. I compile all of my C++ code with gcc and then link to the C functions from the nvcc compile.

If you’ve carefully separated out all of your host code into its own files, and made a pure C interface between them so they don’t need to see any C++ data types or call any C++ functions or methods, that’s great. Not all CUDA programs are written that way!


I think one of NVIDIA’s greatest errors was creating two APIs for using CUDA: the driver API and the runtime API.

The runtime API is really just an invitation to trouble.

Maybe I just like to live dangerously, but I actually quite like the runtime API :)

It’s true that the runtime has some shortcomings and limitations. On the other hand, achieving close coupling between host/device code is a hard problem, and significantly more ambitious than the driver API. Consider how hard it would be to implement something like Thrust [1] with only the CUDA driver API (or OpenCL, for that matter). At minimum you’d need a C++ compiler front-end that was able to instantiate templates for both the host and device.

In other words, you’d need something roughly equivalent to nvcc :)

Anyway, I don’t mean to dismiss or trivialize the criticism of the runtime API. We’ve certainly encountered our fair share of nvcc bugs in developing Thrust. However, as we reported these issues, support for templates, namespaces, and other C++ features rapidly improved. This bug on Snow Leopard notwithstanding, nvcc is now a fairly competent C++ compiler and the preferred interface for many smaller-scale CUDA developers.


This will be fixed in the 3.0 release.

Until then, not using this feature in .cu files is the only workaround.

I see a big flaw with CUDA in that NVCC is called at compile time, so no matter how nice support for templates etc. is, you always need to recompile your code. That makes it impossible to generalize complex algorithms. It also makes it impossible to take advantage of the actual GPU capabilities, no matter which API you use.

So much development time has been spent on CUDA, but nobody saw that as its major shortcoming?

OpenCL has gone a much better route here. Through the use of #defines you can write algorithms in OpenCL that are not possible in CUDA right now. OpenCL is still missing a nice template preprocessor to make it even more flexible, but that should be a minor task to develop.

You can argue that the runtime API makes some things easier to develop. But does that justify the added complexity of having to maintain two different APIs?

I really dislike all these SDK samples that ship in a single .CU file. All would be much easier to understand if device and host code were never mixed into one file, which is exactly what dwalthour suggests.

In my opinion the runtime API makes no sense at all, and I can imagine how much development time it takes from the CUDA development team. I don’t want NVIDIA compiling my host code; I trust other tools for that. I want NVCC to excel at device code generation - nothing else.

It will be interesting to see how many CUDA developers leave for OpenCL in the coming months and years.

Could you elaborate? What exactly is the feature we’re not supposed to use in .cu files? The code that Brett posted as triggering the error was little more than an empty main(). Was it the use of std::string? Something else?


Including standard C++ headers in .cu files on Snow Leopard is problematic. The workarounds are to:

(1) add #undef _GLIBCXX_ATOMIC_BUILTINS to the top of your .cu files, or

(2) move the C++ code to a .cpp file, or

(3) revert to an earlier compiler (i.e. not GCC 4.2 on Snow Leopard).

Left as an exercise for the reader?

Of course - mine is working fine already :)

Thanks! That was the clue I needed to get things working. I created a symbolic link to gcc 4.0 in my build directory:

ln -s /usr/bin/gcc-4.0 gcc

I then used the --compiler-bindir option to nvcc to tell it to use that as the compiler. I can finally do CUDA development on my Mac again!