How can multiple threads modify a common variable in device memory?

Hi all,
I am trying to write a kernel which performs some tasks and then increments a common variable defined in device memory.

#include <cuda.h>
#include <cstdio>
#include <cstring>
#include <iostream>
#include "matio.h"

using namespace std;

#define N 30
int InitGPUSet()
{
    char GPU[100] = "GPU: ";
    cudaDeviceProp tCard;
    int num = 0;
    if (cudaSuccess == cudaGetDeviceCount(&num))
    {
        for (int i = 0; i < num; ++i)
        {
            cudaGetDeviceProperties(&tCard, i);
            puts(strcat(GPU, tCard.name));
        }
    }
    else return 0;
    return 1;
}

__global__ void findStd(double A0, double B0, int *data)
{
    unsigned int x = blockIdx.x;
    unsigned int y = threadIdx.x;
    *data++;   // intended: each thread increments the common counter
}

int main(void)
{
    if (!InitGPUSet())
    {
        puts("device is not ready!");
        cout << "error" << endl;
        return 1;
    }
    int *Hdata;
    int *Ddata;

    Hdata = (int*)malloc(sizeof(int));
    cudaMalloc((void**)&Ddata, sizeof(int));

    *Hdata = 0;
    cudaMemcpy(Ddata, Hdata, sizeof(int), cudaMemcpyHostToDevice);
    findStd<<<N, N>>>(0, 0, Ddata);
    cudaMemcpy(Hdata, Ddata, sizeof(int), cudaMemcpyDeviceToHost);
    cout << *Hdata << endl;

    cudaFree(Ddata);
    free(Hdata);
    return 0;
}

What I did above is just for testing. In the real application, the kernel will do some work to determine whether the common variable (*data) should be incremented or not. The code above runs without any error, but it always returns 0 instead of the 900 I expected. Am I missing anything? Thanks.

Should “*data++” be “(*data)++” or “++*data”?

I don’t know what you’re trying to accomplish, but you probably want to look at using atomic operations if more than one thread is updating the same word.

Oh, how careless I am. I fixed that, but it still gives me only 1 for (*data), even when I launch the kernel with a grid of 10 blocks and 10 threads per block. I expected it to give me 100 instead.

p.s. my card is Tesla C2075

It’s 1 because 100 threads are reading the value 0, incrementing it locally, and then writing it back. If you were on a smaller GPU, or increased your number of blocks and threads, it might start producing slightly higher numbers (but still wrong ones).
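Conceptually, the single statement “(*data)++” is not one operation but three, and those steps from different threads can interleave freely. This is only an illustration of what the compiler generates, not the actual PTX:

    // What "(*data)++" expands to, conceptually:
    int tmp = *data;   // many threads load the same value 0
    tmp = tmp + 1;     // each thread increments its private copy
    *data = tmp;       // every thread stores 1, overwriting the others

Any increment that happens between another thread’s load and store is lost, which is why the final count stays near 1 no matter how many threads run.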

Check out the chapter on atomics.

If you replace “(*data)++” with “atomicAdd(data,1)” it will safely increment.
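For reference, a minimal corrected kernel could look like the sketch below (same names as the posted code; the unused arguments are kept only to match its signature):

    __global__ void findStd(double A0, double B0, int *data)
    {
        // atomicAdd performs the read-modify-write as one indivisible
        // operation, so no increment from any thread is lost
        atomicAdd(data, 1);
    }

With a <<<30, 30>>> launch this counts every one of the 900 threads exactly once.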

wow, that’s cool. It solved it immediately. I come from serial programming and know pretty little about parallel programming. I think I need to learn more :)

This is a reduction problem, which is a little trickier on the GPU than in serial implementations. If you just want to learn more, check these slides
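As a rough sketch of the idea from those slides (a hypothetical kernel, assuming blockDim.x is a power of two and at most 256): each block first sums its threads’ contributions in fast shared memory, then issues a single atomicAdd per block instead of one per thread.

    __global__ void countReduce(int *data)
    {
        __shared__ int partial[256];        // assumes <= 256 threads per block
        unsigned int t = threadIdx.x;
        partial[t] = 1;                     // each thread contributes 1
        __syncthreads();

        // Tree reduction: halve the number of active threads each step
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (t < s) partial[t] += partial[t + s];
            __syncthreads();
        }

        if (t == 0) atomicAdd(data, partial[0]);  // one atomic per block
    }

This cuts global atomic traffic from one operation per thread to one per block, which matters once the counts get large.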

Programming CUDA is not just about knowing all the possible functions; one also needs to think in a “parallel” way. (I started with the CUDA by Example book.)