How can multiple threads modify a common variable in device memory?

Hi all,
I am trying to write a kernel which performs some tasks and then increments a common variable defined in device memory.

#include <cuda.h>
#include <cstdio>
#include <iostream>
#include "matio.h"

using namespace std;

#define N 30
int InitGPUSet()
{
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      printf("GPU: %s\n", tCard.name);  // strcat into one shared buffer would keep appending names
    }
  }
  else return 0;
  return 1;
}

__global__ void findStd(double A0, double B0, int *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  __syncthreads();
  *data++;
}

int main(void)
{
  if(!InitGPUSet())  
  {
    puts("device is not ready!");  
    cout << "error" << endl;
  }
  else  
  {  
    int *Hdata;
    int *Ddata;

    Hdata = (int*)malloc(sizeof(int));
    cudaMalloc((void**)&Ddata, sizeof(int));

    *Hdata = 0;
    cudaMemcpy(Ddata, Hdata, sizeof(int), cudaMemcpyHostToDevice);
    findStd<<<N, N>>>(0, 0, Ddata);
    cudaMemcpy(Hdata, Ddata, sizeof(int), cudaMemcpyDeviceToHost);
    cout << *Hdata << endl;
    free(Hdata);
    cudaFree(Ddata);
  }
}

What I did above is just for testing. In the real application, the kernel will do some tasks to determine whether the common variable (*data) should be incremented or not. The code above runs without any error, but it always returns 0 instead of 900 as I expected. Is there anything I am missing here? Thanks.

Should “*data++” be “(*data)++” or “++*data”?

I don’t know what you’re trying to accomplish, but you probably want to look at using atomic operations if more than one thread is updating the same word.

Oh, how careless I am. I fixed that, but I still don’t know why it only gives me 1 for (*data), even when I launch the kernel with #grid=10, #threads=10. I expect it to give me 100 instead.

p.s. my card is Tesla C2075

It’s 1 because all 100 threads are reading the value 0, incrementing it locally, and then writing it back. If you were on a smaller GPU, or if you increased your number of blocks and threads, it might start producing slightly higher numbers (but still wrong ones).
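In other words, the unsynchronized `(*data)++` is a separate load, add, and store, so two threads can interleave like this (a schematic illustration, not from the thread):

```cuda
// Thread A                  Thread B
// r = *data;   (r = 0)
//                            r = *data;   (r = 0)
// r = r + 1;   (r = 1)
//                            r = r + 1;   (r = 1)
// *data = r;   (*data = 1)
//                            *data = r;   (*data = 1)  <- A's increment is lost
```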

Check out the chapter on atomics.

If you replace “(*data)++” with “atomicAdd(data,1)” it will safely increment.
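For reference, here is a minimal sketch of the corrected test program (kernel name and launch configuration kept from the post above; the unused kernel parameters are dropped, and atomicAdd performs the read-modify-write as one indivisible operation on device memory):

```cuda
#include <cstdio>
#include <cuda.h>

// Each of the 900 threads safely increments the shared counter.
__global__ void findStd(int *data)
{
    atomicAdd(data, 1);
}

int main()
{
    int h = 0;
    int *d;
    cudaMalloc((void**)&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);

    findStd<<<30, 30>>>(d);   // 30 blocks x 30 threads = 900 increments

    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h);        // prints 900
    cudaFree(d);
    return 0;
}
```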

Wow, that’s cool. It solved the problem immediately. I came over from serial programming and know pretty little about parallel programming. I think I need to learn more :)

This is a reduction problem, which is a little trickier on the GPU than in serial implementations. If you want to learn more, check out these slides: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
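For what it’s worth, the usual alternative to having every thread hit one counter with an atomic is a tree reduction: each block sums its portion in shared memory, and only one atomic per block combines the partial results. A minimal sketch along the lines of those slides (the names here are mine, not from the thread; it assumes a power-of-two block size):

```cuda
// Each block reduces blockDim.x input flags to one partial sum in shared
// memory, then thread 0 of each block adds that partial sum to the total.
__global__ void countFlags(const int *flags, int n, int *total)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? flags[i] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(total, sdata[0]);  // one atomic per block instead of per thread
}
```

Launched as `countFlags<<<blocks, threads, threads * sizeof(int)>>>(flags, n, total)`, so the shared-memory size matches the block size.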

Programming CUDA is not just about knowing all the possible functions; one also needs to think in a “parallel” way. (I started with the CUDA by Example book.)