I can access only the first 8 elements of the array, not every element

Hi,

I am working on an NVIDIA Tesla. I have a 1D array and I would like to assign every element to a thread, so that the number of threads equals the array size. Whatever thread/block/grid structure I use, I can only access the first 8 elements, never the rest.

I have written several CUDA programs with similar and different data structures on other platforms and never ran into anything like this. What am I missing?

Thanks in advance,

Best

Some code might be useful. “Hello my program doesn’t work, how do I fix it?” is not an easy question to answer without at least a modicum of detail…

This is really basic, I am just trying to access the data. I even tried some trivial thread structures, and the result is always the same.

main:
int *hostX = (int *)malloc(sizeof(int) * N);
for (i = 0; i < N; i++){
    hostX[i] = i;
}

int *deviceX = NULL;
CUDA_SAFE_CALL(cudaMalloc((void **) &deviceX, N));

CUDA_SAFE_CALL(cudaMemcpy(deviceX, hostX, N, cudaMemcpyHostToDevice));

//dim3 block(N/8, 1, 1); // whatever I put in
//dim3 threads(N, 1, 1); // whatever I put in
dim3 threads(N, 1, 1); // whatever I put in, let’s leave it this time

access<<<1, threads>>> (deviceX);
cudaThreadSynchronize();

kernel:
__global__ void access(int *x){
    printf("I am reading x[%d] = %d\n", threadIdx.x, x[threadIdx.x]); // I change the array/thread index according to the block structure
}

CUDA_SAFE_CALL(cudaMemcpy(deviceX, hostX, N, cudaMemcpyHostToDevice));

How about N * sizeof(int)?

You got that right in the host allocation (malloc), so why not in the memcpy too?

Same with the cudaMalloc call: you only allocate N bytes instead of sizeof(int) * N bytes.
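
To see why exactly 8 elements came through, here is a small host-only sketch in plain C. It uses memcpy, which takes its size in bytes just as cudaMalloc and cudaMemcpy do. The helper name elements_copied is made up for illustration, and N = 32 is only a guess (the thread never states N), but it would match the symptom if int is 4 bytes on that machine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the original post.
   Copies byte_count bytes out of an array of n ints into a destination
   pre-filled with -1, and returns how many leading elements arrived intact. */
size_t elements_copied(size_t n, size_t byte_count)
{
    /* Never copy more than the buffers hold. */
    assert(byte_count <= n * sizeof(int));

    int *src = malloc(n * sizeof(int));
    int *dst = malloc(n * sizeof(int));
    size_t i, matched = 0;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0;
    }

    for (i = 0; i < n; i++) {
        src[i] = (int)i;   /* same fill pattern as hostX[i] = i */
        dst[i] = -1;       /* sentinel that never equals a source value */
    }

    /* The third argument is a byte count, not an element count. */
    memcpy(dst, src, byte_count);

    /* Count the leading elements that actually made it across. */
    while (matched < n && dst[matched] == src[matched])
        matched++;

    free(src);
    free(dst);
    return matched;
}
```

Assuming a 4-byte int, elements_copied(32, 32) returns 8, while elements_copied(32, 32 * sizeof(int)) returns 32. The same reasoning applies to the posted code: passing N to cudaMalloc and cudaMemcpy moves only N / sizeof(int) elements, and passing N * sizeof(int) to both moves them all.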

Sooo simple! Thanks :)