I am working on an NVIDIA Tesla card. I have a 1D array and I would like to assign every element to its own thread, so the number of threads equals the array size (the exact mapping I have in mind is sketched at the end of this post). Whatever thread/block/grid structure I use, I can only ever access the first 8 elements, never the rest.
I have written several CUDA programs with similar and different data structures on other platforms and never ran into anything like this. What am I missing?
This is really basic: I am just trying to read the data back. I have even tried some deliberately silly thread structures, and the result is always the same.
main:
int *hostX = (int*)malloc(sizeof(int) * N);
for (int i = 0; i < N; i++){
    hostX[i] = i;   // fill the host array with its own indices
}
int *deviceX = NULL;
CUDA_SAFE_CALL(cudaMalloc((void**)&deviceX, N));
CUDA_SAFE_CALL(cudaMemcpy(deviceX, hostX, N, cudaMemcpyHostToDevice));
//dim3 block(N/8, 1, 1); // whatever I put in
//dim3 threads(N, 1, 1); // whatever I put in
dim3 threads(N, 1, 1); // whatever I put in, let's leave it this time
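The launch itself is just one block of N threads (I'm writing it out from memory here, so treat the exact call as approximate):

access<<<1, threads>>>(deviceX);
cudaDeviceSynchronize(); // make sure the kernel's printf output is flushed before main exits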
kernel:
__global__ void access(int *x){
printf("I am reading x[%d] = %d\n", threadIdx.x, x[threadIdx.x]); // I change the array/thread index according to the block structure
}
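For reference, this is the one-thread-per-element mapping I expect to work, including the multi-block case. It is only a sketch of the pattern, not my exact code (BLOCK_SIZE and the extra length parameter are names I'm introducing here for illustration):

#define BLOCK_SIZE 256

__global__ void access_all(int *x, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global element index
    if (i < n){                                    // guard: the last block may have extra threads
        printf("I am reading x[%d] = %d\n", i, x[i]);
    }
}

// host side: launch enough blocks of BLOCK_SIZE threads to cover all N elements
dim3 threads(BLOCK_SIZE, 1, 1);
dim3 blocks((N + BLOCK_SIZE - 1) / BLOCK_SIZE, 1, 1);
access_all<<<blocks, threads>>>(deviceX, N);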