CUDA thread privacy

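// CUDA code
if (comm < d_g.nodes) {
    // Walk every node that belongs to community `comm`.
    for (int node = d_community_pos[comm]; node < d_community_pos[comm + 1]; node++) {
        // Walk the outgoing edges of that node.
        for (int neighbor = d_g.out_col[d_community_list[node]];
             neighbor < d_g.out_col[d_community_list[node] + 1];
             neighbor++) {

            int n_neig = d_g.child_out[neighbor];    // neighbor node id
            int neig_comm = d_p.node_comm[n_neig];   // community of that neighbor

            // d_count is one array shared by all threads, so every thread
            // adds into the same counters here.
            d_count[d_fake_ids[neig_comm]] += 1;
        }
    }
}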


Each thread (i) in this code should get its own copy of d_count, which must not be visible to or modified by other threads. In my case, however, all 5 threads modify the same copy and keep adding to the existing values. How can I solve this problem?

I tried declaring d_count in local memory, and that worked on small data. On big data it does not work, because local memory is limited: a thread cannot use more than 512 KB.

I also tried resetting d_count to zero for each i, but that did not work either.

Any suggestions on how to make d_count a private array for each thread without using local memory?

Any CUDA experts, please?

I already posted 2 suggestions there, as comments.

I am very new to GPU coding, and increasing the heap size does not work for me; the kernel just exits.

None of my suggestions were about heap size. The two suggestions were here and here.
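For readers following along, one common way to get per-thread counters without local memory or the device heap is to allocate a single global buffer from the host and give each thread its own slice of it. This is only a minimal sketch and not necessarily either of the linked suggestions. The names `count_private`, `run_count_private`, `d_labels`, `d_offsets`, `d_count_all` and `num_slots` are hypothetical stand-ins and do not come from the original code. In the original kernel, `tid` would play the role of `comm` and `my_count` would replace `d_count`.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread `tid` counts only into its own slice of
// d_count_all, so no other thread ever sees or modifies its counters.
__global__ void count_private(const int *d_labels,   // label of each item, in [0, num_slots)
                              const int *d_offsets,  // item range per thread, num_threads + 1 entries
                              int *d_count_all,      // num_threads * num_slots counters
                              int num_threads, int num_slots)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;

    // Private slice of the global buffer for this thread.
    int *my_count = d_count_all + (size_t)tid * num_slots;

    // Each thread clears its own slice before counting.
    for (int s = 0; s < num_slots; ++s) my_count[s] = 0;

    for (int k = d_offsets[tid]; k < d_offsets[tid + 1]; ++k)
        my_count[d_labels[k]] += 1;
}

// Host helper (hypothetical): copies inputs to the device, launches one
// logical thread per range, and returns all counters as a flat array
// laid out as [thread 0 slots..., thread 1 slots..., ...].
std::vector<int> run_count_private(const std::vector<int> &labels,
                                   const std::vector<int> &offsets,
                                   int num_slots)
{
    assert(offsets.size() >= 2 && num_slots > 0);
    int num_threads = (int)offsets.size() - 1;
    size_t count_bytes = (size_t)num_threads * num_slots * sizeof(int);

    int *d_labels = nullptr, *d_offsets = nullptr, *d_count_all = nullptr;
    cudaMalloc(&d_labels, labels.size() * sizeof(int));
    cudaMalloc(&d_offsets, offsets.size() * sizeof(int));
    cudaMalloc(&d_count_all, count_bytes);

    cudaMemcpy(d_labels, labels.data(), labels.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    int block = 128;
    int grid = (num_threads + block - 1) / block;
    count_private<<<grid, block>>>(d_labels, d_offsets, d_count_all,
                                   num_threads, num_slots);

    // This blocking copy also waits for the kernel to finish.
    std::vector<int> result((size_t)num_threads * num_slots);
    cudaMemcpy(result.data(), d_count_all, count_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_labels);
    cudaFree(d_offsets);
    cudaFree(d_count_all);
    return result;
}
```

The cost of this approach is `num_threads * num_slots` integers of global memory, so for a large graph it may be necessary to process the threads in batches or to shrink the number of slots each thread needs. Error checking on the CUDA calls is omitted here to keep the sketch short.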

Thanks, Robert.