I’m having problems with my first real CUDA program, which is attached to this post. (You can compile it with nvcc dataLoader.cu -o myprog if you need to. Sorry, most of the variables have French names, the code is badly written, and the constants are not defined very cleanly.)
A little explanation: this program takes a kind of matrix (int *data) and has to count the occurrences of each element of this matrix. Each block has its own region of memory in which to store its result (int *result). The threads of a block work in shared memory (__shared__ int sharedVar).
It works fine with at most 32 threads per block, whatever the number of blocks. But when I use 1 block with more than 32 threads, the results are completely wrong. I can’t figure out why.
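To make the scheme described above concrete, here is a minimal sketch of a per-block count in shared memory. It is not the attached code: the names countKernel, countOccurrences, sharedCount and NUM_VALUES are hypothetical, and it assumes every element of data lies in [0, NUM_VALUES). It also assumes a device that supports atomicAdd on shared memory (compute capability 1.2 or later). Error checking is left out to keep it short. Threads in different warps do not run in lockstep, so the sketch uses __syncthreads() around the counting phase and atomicAdd for the increments.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: every element of data lies in [0, NUM_VALUES).
constexpr int NUM_VALUES = 16;

// Each block counts part of data in shared memory, then writes its partial
// counts to its own region: result[blockIdx.x * NUM_VALUES + v].
__global__ void countKernel(const int *data, int n, int *result)
{
    __shared__ int sharedCount[NUM_VALUES];

    // Zero the shared counters cooperatively.
    for (int v = threadIdx.x; v < NUM_VALUES; v += blockDim.x)
        sharedCount[v] = 0;
    __syncthreads();  // every counter is zero before any thread increments

    // Grid-stride loop: each thread handles several elements if needed.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&sharedCount[data[i]], 1);  // safe concurrent increment
    __syncthreads();  // all increments are done before the copy-out

    // Copy this block's counts to its region of global memory.
    for (int v = threadIdx.x; v < NUM_VALUES; v += blockDim.x)
        result[blockIdx.x * NUM_VALUES + v] = sharedCount[v];
}

// Host helper: runs the kernel and sums the per-block partial counts.
std::vector<int> countOccurrences(const std::vector<int> &data,
                                  int blocks, int threads)
{
    int n = static_cast<int>(data.size());
    int *dData = nullptr;
    int *dResult = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dResult, blocks * NUM_VALUES * sizeof(int));
    cudaMemcpy(dData, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    countKernel<<<blocks, threads>>>(dData, n, dResult);

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<int> partial(blocks * NUM_VALUES);
    cudaMemcpy(partial.data(), dResult, partial.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);

    std::vector<int> total(NUM_VALUES, 0);
    for (int b = 0; b < blocks; ++b)
        for (int v = 0; v < NUM_VALUES; ++v)
            total[v] += partial[b * NUM_VALUES + v];
    return total;
}
```

Calling countOccurrences(data, 1, 64) corresponds to the failing case above (1 block, more than 32 threads), and countOccurrences(data, 4, 32) to the case that works.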
Can you help me?
I’m using CUDA on 64-bit Debian.