I posted this question on Stack Overflow but unfortunately got no satisfactory answer. Maybe you could help me.
I need each thread to write to and read from its own private location in global memory. Below I post working code that shows my problem. First, I'll list the main variables and structures involved.
- srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
- auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result on the device
- auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result on the host
- c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
- idx (device) : identification number of current thread
My problem: in the kernel, I update c_glob[idx] for each value ic (ic ∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic], and then use c_glob[idx] to compute the final result g0, which is stored in auxD. The final results are wrong: the values copied to auxH are at least one order of magnitude bigger than expected, and some are weird numbers that suggest the operation is overflowing.
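To make the intended access pattern concrete, here is a minimal sketch of one thread writing and reading its own slice of a flat global array. It is not the posted kernel. It assumes a 1D launch, that c_glob[idx][ic] on the flat array means c_glob[idx * COLORLEVELS + ic], and a demo value of COLORLEVELS = 8. The names privateOffset and usePrivateSlice are made up for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define COLORLEVELS 8  // demo value only (assumption), not the value used in the post

// Offset of the first float in thread idx's private slice (hypothetical helper).
__host__ __device__ inline size_t privateOffset(size_t idx, size_t levels)
{
    return idx * levels;
}

__global__ void usePrivateSlice(float *c_glob, float *out, int numThreads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numThreads) return;  // threads past the end must not touch memory

    float *c = c_glob + privateOffset(idx, COLORLEVELS);

    // cudaMalloc does not zero memory, so clear the slice before accumulating.
    for (int ic = 0; ic < COLORLEVELS; ++ic) c[ic] = 0.0f;

    // Write, then read back, only this thread's COLORLEVELS entries.
    for (int ic = 0; ic < COLORLEVELS; ++ic) c[ic] += (float)(idx + ic);

    float g0 = 0.0f;
    for (int ic = 0; ic < COLORLEVELS; ++ic) g0 += c[ic];
    out[idx] = g0;
}

int main()
{
    const int numThreads = 1000;
    float *c_glob_d = nullptr, *out_d = nullptr;
    cudaMalloc(&c_glob_d, numThreads * COLORLEVELS * sizeof(float));
    cudaMalloc(&out_d, numThreads * sizeof(float));

    const int block = 128;
    const int grid = (numThreads + block - 1) / block;  // round up
    usePrivateSlice<<<grid, block>>>(c_glob_d, out_d, numThreads);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> out_h(numThreads);
    cudaMemcpy(out_h.data(), out_d, numThreads * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Each thread should see only its own writes: sum over ic of (idx + ic).
    int mismatches = 0;
    for (int i = 0; i < numThreads; ++i) {
        float expected = (float)(COLORLEVELS * i + COLORLEVELS * (COLORLEVELS - 1) / 2);
        if (out_h[i] != expected) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(c_glob_d);
    cudaFree(out_d);
    return 0;
}
```

If two threads ever computed the same offset, or a thread indexed past its COLORLEVELS entries, the host-side check would report mismatches, so the same kind of check may be useful against the real kernel.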
Help: what am I doing wrong? How can I make each thread write and read its own private location in global memory? Right now I'm debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM ~ 10^4, which means creating a c_glob array for 10^4 * 10^4 threads. I guess I will have issues with the total number of threads allowed per run, so I was wondering if you could suggest any other solution to my problem.
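For a sense of scale, the sketch below estimates how large c_glob becomes at both values of ARRDIM and compares it with the memory the device reports. It assumes one thread per array element (ARRDIM * ARRDIM threads) and COLORLEVELS = 256, which is a placeholder and not necessarily the value in the posted code. scratchBytes is a hypothetical helper name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bytes needed for one private slice of `levels` floats per thread
// (hypothetical helper, not part of the posted code).
static size_t scratchBytes(size_t numThreads, size_t levels)
{
    return numThreads * levels * sizeof(float);
}

int main()
{
    const size_t levels = 256;            // ASSUMED value of COLORLEVELS
    const size_t dims[] = {512, 10000};   // debugging size and target size
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    size_t freeB = 0, totalB = 0;
    if (cudaMemGetInfo(&freeB, &totalB) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }

    for (int i = 0; i < 2; ++i) {
        size_t need = scratchBytes(dims[i] * dims[i], levels);
        printf("ARRDIM = %zu: c_glob needs %.2f GiB (device: %.2f GiB free of %.2f GiB)\n",
               dims[i], need / GiB, freeB / GiB, totalB / GiB);
    }
    return 0;
}
```

Under these assumptions the array takes 268,435,456 bytes (0.25 GiB) at ARRDIM = 512 and 102,400,000,000 bytes at ARRDIM = 10^4, so the size of the per-thread storage may become a constraint in addition to the thread count.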
The code is here -> http://pastebin.com/9sQ08aZb