Wrong results with CUDA threads writing to private locations in global memory

Dear all,
I posted this question to StackOverflow but unfortunately got no satisfactory answer. Maybe you could help me.
I need each thread to write to and read from a private location in global memory. Below I link to compilable code that reproduces my problem. First, here are the main variables and structures involved.

Variables:

  • srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
  • auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result on the device
  • auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result on the host
  • c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
  • idx (device) : index of the current thread

My problem: in the kernel, I update c_glob[idx] for each value ic (ic ∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic], and then use c_glob[idx] to compute the final result g0, which is stored in auxD. The final results are wrong: the values copied to auxH are at least one order of magnitude larger than expected, and some are strange enough to suggest that the computation overflows.
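To make the intended access pattern concrete, here is a minimal sketch of "one private slice per thread" in a flat global array. It is not the pastebin code: the kernel name privateSlices, the helper runPrivateSlices, the dummy per-level computation, and COLORLEVELS = 8 are all placeholders chosen for illustration. It assumes c_glob[idx][ic] is meant as the flat index idx * COLORLEVELS + ic.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define COLORLEVELS 8  // placeholder value, not taken from the pastebin code

// Each thread owns COLORLEVELS consecutive floats of c_glob.
__global__ void privateSlices(float *c_glob, float *out, int numThreads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numThreads) return;  // threads past the end must not touch memory

    // Start of this thread's slice: flat index idx * COLORLEVELS.
    float *mine = c_glob + (size_t)idx * COLORLEVELS;

    // cudaMalloc does not zero memory, so clear the slice before accumulating.
    for (int ic = 0; ic < COLORLEVELS; ++ic)  // ic < COLORLEVELS, not <=
        mine[ic] = 0.0f;

    // Dummy update standing in for the real per-level computation.
    for (int ic = 0; ic < COLORLEVELS; ++ic)
        mine[ic] += (float)ic;

    // Reduce the slice to one value per thread.
    float sum = 0.0f;
    for (int ic = 0; ic < COLORLEVELS; ++ic)
        sum += mine[ic];
    out[idx] = sum;
}

// Hypothetical host helper: allocates, launches, copies the result back.
std::vector<float> runPrivateSlices(int numThreads)
{
    float *c_glob_d = nullptr, *out_d = nullptr;
    cudaMalloc(&c_glob_d, (size_t)numThreads * COLORLEVELS * sizeof(float));
    cudaMalloc(&out_d, (size_t)numThreads * sizeof(float));

    const int block = 256;
    const int grid = (numThreads + block - 1) / block;  // round up
    privateSlices<<<grid, block>>>(c_glob_d, out_d, numThreads);

    std::vector<float> out_h(numThreads);
    cudaError_t err = cudaMemcpy(out_h.data(), out_d,
                                 (size_t)numThreads * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(c_glob_d);
    cudaFree(out_d);
    return out_h;
}

int main()
{
    std::vector<float> r = runPrivateSlices(1000);
    // Every slice holds 0, 1, ..., 7, so every sum should be 28.
    printf("%g %g\n", r[0], r[999]);
    return 0;
}
```

Two things are worth checking against this sketch. If ic really runs over the closed range [0, COLORLEVELS], each thread touches COLORLEVELS + 1 floats and spills into its neighbour's slice. And if a slice is accumulated into without first being zeroed, it starts from whatever was in memory. Either would be consistent with values that are too large or look like garbage, though without the actual kernel this is only a guess.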

Help: what am I doing wrong? How can I make each thread write to and read from its own private location in global memory? Right now I’m debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM ~ 10^4, which means creating a c_glob array for 10^4 * 10^4 threads. I expect to run into the limit on the total number of threads allowed per launch, so I would also appreciate any suggestion for a different approach to my problem.
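One possible way around the scaling problem is a grid-stride loop: launch a fixed number of threads, let each thread process several elements in turn, and reuse its private slice for each one. The scratch array then has gridSize * blockSize * COLORLEVELS floats regardless of ARRDIM. For scale, one slice per element at ARRDIM = 10^4 with, say, 256 levels (an assumed figure) would need 10^8 * 256 * 4 bytes, about 102 GB. The sketch below is an assumption-laden illustration, not the pastebin code: processAll, runGridStride, the dummy computation, and COLORLEVELS = 8 are placeholders.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define COLORLEVELS 8  // placeholder value, not taken from the pastebin code

// A fixed-size grid walks over all n elements; scratch is sized per thread.
__global__ void processAll(const float *src, float *aux, float *scratch, size_t n)
{
    size_t tid    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;  // total threads launched
    float *mine   = scratch + tid * COLORLEVELS;     // this thread's private slice

    for (size_t i = tid; i < n; i += stride) {
        // Reset the slice for every new element so values do not carry over.
        for (int ic = 0; ic < COLORLEVELS; ++ic)
            mine[ic] = 0.0f;

        // Dummy update standing in for the real per-level computation.
        for (int ic = 0; ic < COLORLEVELS; ++ic)
            mine[ic] += src[i] * (float)ic;

        float g0 = 0.0f;
        for (int ic = 0; ic < COLORLEVELS; ++ic)
            g0 += mine[ic];
        aux[i] = g0;
    }
}

// Hypothetical host helper: fills src with 1.0f and returns the result array.
std::vector<float> runGridStride(size_t n, int grid, int block)
{
    std::vector<float> src_h(n, 1.0f), aux_h(n);
    float *src_d = nullptr, *aux_d = nullptr, *scratch_d = nullptr;
    size_t numThreads = (size_t)grid * block;

    cudaMalloc(&src_d, n * sizeof(float));
    cudaMalloc(&aux_d, n * sizeof(float));
    // Scratch depends on the launch size only, not on n.
    cudaMalloc(&scratch_d, numThreads * COLORLEVELS * sizeof(float));
    cudaMemcpy(src_d, src_h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    processAll<<<grid, block>>>(src_d, aux_d, scratch_d, n);

    cudaError_t err = cudaMemcpy(aux_h.data(), aux_d, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(src_d);
    cudaFree(aux_d);
    cudaFree(scratch_d);
    return aux_h;
}

int main()
{
    const size_t n = 512 * 512;  // ARRDIM = 512, as in the debugging case
    std::vector<float> r = runGridStride(n, 64, 256);
    // With src = 1, every element should be 0 + 1 + ... + 7 = 28.
    printf("%g %g\n", r[0], r[n - 1]);
    return 0;
}
```

The grid of 64 blocks by 256 threads is arbitrary. If COLORLEVELS is small enough, a per-thread local array inside the kernel may replace the global scratch entirely, but that depends on details not given here.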
Thank you.

The code is here -> http://pastebin.com/9sQ08aZb

… No one can help? It would be great to get some insight into how to approach the problem in a different way, if my approach is incorrect or inefficient.
Thank you again