I have 64*128=8192 threads.
For each thread, I need to pass in a list of 624 integers.
Each thread has to work (read/write) on its own list without interacting with
the lists of the other threads.
I pass the lists to the __global__ function as function(…, int *d_my_list),
where d_my_list is a cudaMemcpy HostToDevice copy of h_my_list:
CUDA_SAFE_CALL( cudaMalloc((void **)&d_my_list, 8192*624*sizeof(int)) );
CUDA_SAFE_CALL( cudaMemcpy(d_my_list, h_my_list, 8192*624*sizeof(int), cudaMemcpyHostToDevice) );
8192*624*sizeof(int) is about 20 MB. (Each thread works only with its own 624 integers,
and I want to use the fastest memory.)
In the __global__ function(…, int *d_my_list) I put:
unsigned int local_list;
const int THREAD_N = blockDim.x * gridDim.x; // 128 threads * 64 blocks = 8192
const int tid = blockDim.x * blockIdx.x + threadIdx.x;
and it crashes with that.
If I put
__shared__ unsigned int local_list there is no crash, but the
local_list seems to be really "shared": I need each thread to work (read/write) independently of the others.
In EmuDebug, I always see the same address for my local_list, and in Release I get random results, so I think there are conflicts between the local_lists that each thread modifies.
Can someone help me solve this? It would solve my previous post too.
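
A minimal sketch of the per-thread layout described above, for illustration only. A __shared__ variable exists once per block and is seen by all 128 threads of that block, which matches the identical address seen in EmuDebug. A private shared-memory copy per thread would need 128*624*4 = 319,488 bytes per block, far more than the 16 KB to 48 KB of shared memory typically available to a block. The sketch therefore lets each thread work directly on its own 624-integer slice of the global buffer. The names per_thread_lists, run_demo, LIST_LEN, N_BLOCKS and N_THREADS are hypothetical, the "+= tid" update is only placeholder work, and unsigned int is assumed for the element type.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

#define LIST_LEN  624   // integers per thread (from the post)
#define N_BLOCKS  64    // blocks in the grid (from the post)
#define N_THREADS 128   // threads per block (from the post)

// Hypothetical kernel: each thread reads and writes only its own slice
// of the global buffer, so no two threads touch the same element.
__global__ void per_thread_lists(unsigned int *d_my_list)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int *my_list = d_my_list + (size_t)tid * LIST_LEN;

    for (int i = 0; i < LIST_LEN; ++i)
        my_list[i] += tid;   // placeholder work on the private slice
}

// Hypothetical host driver: returns the number of mismatching elements,
// or -1 if a CUDA call fails.
int run_demo()
{
    const size_t n     = (size_t)N_BLOCKS * N_THREADS * LIST_LEN;
    const size_t bytes = n * sizeof(unsigned int);

    // Element i of every thread's list starts with the value i.
    std::vector<unsigned int> h_my_list(n);
    for (size_t k = 0; k < n; ++k)
        h_my_list[k] = (unsigned int)(k % LIST_LEN);

    unsigned int *d_my_list = NULL;
    if (cudaMalloc((void **)&d_my_list, bytes) != cudaSuccess)
        return -1;

    cudaError_t err = cudaMemcpy(d_my_list, h_my_list.data(), bytes,
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        per_thread_lists<<<N_BLOCKS, N_THREADS>>>(d_my_list);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_my_list.data(), d_my_list, bytes,
                         cudaMemcpyDeviceToHost);
    cudaFree(d_my_list);
    if (err != cudaSuccess)
        return -1;

    // Thread t should have added t to each of its own 624 elements.
    int mismatches = 0;
    for (size_t t = 0; t < (size_t)N_BLOCKS * N_THREADS; ++t)
        for (size_t i = 0; i < LIST_LEN; ++i)
            if (h_my_list[t * LIST_LEN + i] != (unsigned int)(i + t))
                ++mismatches;
    return mismatches;
}
```

Calling run_demo() from main should return 0 if every thread modified only its own list. With this layout neighbouring threads access addresses 624 integers apart, so the global accesses are not coalesced; storing element i of thread tid at d_my_list[i * THREAD_N + tid] is a common alternative layout.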