Necessary includes for atomicAdd?

Hey there,

Turns out I am too stupid to use atomicAdd. I can't get my kernel to compile as soon as I add a line with a call to "atomicAdd":

line 44: error: identifier "atomicAdd" is undefined

This is what I get. I've got an 8600 GT. The kernel has the following line at the very top:

#include "device_functions.h"

Is there anything else I need to include or take care of? Any compiler flags? I also tried using the absolute path "C:\CUDA\include\device_functions.h" for the include, but it still wouldn't work.

Once again I should've been thinking a bit harder before posting. I got this problem resolved by looking at the histogram64 sample in the SDK. It's just another command line argument that needs to be added.

Next time you solve something, please actually post the answer: the nvcc flag is "--gpu-name compute_11", as listed in man nvcc.

On CUDA 2.3, it’s changed to “-arch compute_11” to include global memory atomics, and “-arch compute_12” for global and shared memory atomics.
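For anyone landing here later, a minimal sketch of what the flag enables. The file name counter.cu and the kernel name count_threads are made up for this example; the build line uses the "-arch compute_11" form quoted above. Which architecture names nvcc accepts depends on the toolkit version (as far as I know, recent toolkits have dropped the 1.x targets), so treat the flag value as an assumption about a toolkit of that era.

```cuda
// counter.cu: hypothetical file name, used only for this sketch.
// Build with a target that has global memory atomics, for example:
//   nvcc -arch compute_11 counter.cu -o counter
// Without an -arch option, old toolkits default to compute capability 1.0,
// where atomicAdd is not declared, hence the "undefined" error.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count_threads(unsigned int *counter)
{
    // Every thread adds 1 to the same word in global memory.
    atomicAdd(counter, 1u);
}

int main()
{
    unsigned int *dev_counter = NULL;
    unsigned int host_counter = 0;

    if (cudaMalloc((void**)&dev_counter, sizeof(unsigned int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dev_counter, 0, sizeof(unsigned int));

    // 4 blocks of 64 threads; each thread contributes one increment.
    count_threads<<<4, 64>>>(dev_counter);

    cudaMemcpy(&host_counter, dev_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dev_counter);

    printf("counter = %u (launched %d threads)\n", host_counter, 4 * 64);
    return 0;
}
```

If the kernel ran, the counter should match the number of launched threads; a mismatch usually means the launch failed, for instance because the binary targets an architecture the card does not support.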

Does anyone know where I should add the "-arch compute_11" option? Thanks.

Below is my CUDA C code. I get the error: identifier "atomicAdd" is undefined

Can anyone advise? Many thanks.

#include "device_functions.h"
#include <cuda_runtime.h>
#include "cuda.h"
#include "C:\Users\a0034508\Desktop\Research\CUDA in VC\common\book.h"
#include "C:\Users\a0034508\Desktop\Research\CUDA in VC\common\cpu_anim.h"

#define GPU_ARCH 10
#define GPU_arch_sm_10 10
#define GPU_arch_sm_11 11
#define GPU_arch_sm_12 12
#define GPU_arch_sm_13 13
#define architecture(s) GPU_arch_sm##s##_

#define SIZE (100*1024*1024)

__global__ void histo_kernel( unsigned char *buffer, long size, unsigned int *histo )
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &(histo[buffer[i]]), 1 );
        i += stride;
    }
}

int main( void )
{
    unsigned char *buffer = (unsigned char*)big_random_block( SIZE );

    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo, 256 * sizeof( long ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0, 256 * sizeof( int ) ) );

    cudaDeviceProp prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );

    unsigned int histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo, 256 * sizeof( int ), cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

I also changed the GPU architecture to sm_12 in CUDA Build Rule v3.0.0.
However, my CUDA version is v3.2. Does that matter?
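A side note on the #define lines in the code above: as far as I can tell, macros like GPU_ARCH in the source do not influence which architecture nvcc compiles for; that is decided by the -arch option or the build rule setting. What device code can see is the built-in __CUDA_ARCH__ macro, which nvcc sets during the device compilation pass (110 for compute_11, 120 for compute_12). A small sketch, with a made-up kernel name bump, that turns the cryptic "undefined" message into an explicit one:

```cuda
// Compile only, for example: nvcc -arch compute_11 -c guard.cu
// (guard.cu is a hypothetical file name for this sketch)
#include <cuda_runtime.h>

// __CUDA_ARCH__ is defined only while nvcc compiles device code.
// If the selected target is below 1.1, stop with a readable message.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 110)
#error "global memory atomicAdd needs -arch compute_11 or higher"
#endif

__global__ void bump(unsigned int *counter)
{
    // Needs global memory atomics (compute capability 1.1 or higher).
    atomicAdd(counter, 1u);
}
```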


If you're using Visual Studio, right-click the solution and select Properties. When you get to the CUDA Runtime API - GPU dialog box (see attached jpg), change GPU Architecture (1) to sm_13.

You can add "-arch compute_11" when you compile your code with nvcc, e.g. nvcc -arch compute_11
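Whichever way the flag is set, the chosen architecture should not exceed what the card actually supports. A short sketch for checking that at run time, using cudaGetDeviceProperties on device 0 (the device index is an assumption; adjust it on multi-GPU machines). The thresholds follow the rule given earlier in the thread: 1.1 for global memory atomics, 1.2 for shared memory atomics.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Fold major.minor into one number, e.g. 1.1 -> 11, 1.3 -> 13.
    int cc = prop.major * 10 + prop.minor;

    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("global memory atomics: %s\n", cc >= 11 ? "yes" : "no");
    printf("shared memory atomics: %s\n", cc >= 12 ? "yes" : "no");
    return 0;
}
```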

How should I change the GPU architecture in Code::Blocks?

To silence errors reported by code analyzers (before compilation) such as Visual Studio IntelliSense or ReSharper:

#if defined(__INTELLISENSE__) || defined(__RESHARPER__)
// Dummy declaration seen only by the analyzer, never by nvcc.
template<class T1, class T2>
__device__ void atomicAdd(T1 x, T2 y);
#endif