Hi, I’ve been running into a weird problem and I was hoping someone could help me.

I’m not too fancy with words but here goes…

I’m calculating the indices for a cube: in the __global__ kernel, I calculate index.x, index.y, and index.z for each point of a cubic volume, along with the value at that point.

For example, when index[0].x = 0, index[0].y = 0, and index[0].z = 0, then value[0] = 5, and so on.

Each component of index[n] goes from 0 to, say, 63, so the cubic volume has 64*64*64 elements.
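As a quick sanity check on the layout, here is a minimal C++ sketch of the flattening used for total_index in the kernel further down (x varies fastest, then y, then z), together with its inverse. Coord is a hypothetical stand-in for int3 and kSide is an assumed name for the edge length, so this compiles without CUDA.

```cpp
#include <cassert>

// Edge length of the cube; 64 matches the example in this post.
constexpr int kSide = 64;

// Hypothetical stand-in for CUDA's int3 so the sketch builds on the host.
struct Coord { int x, y, z; };

// Flatten (x, y, z) into a linear offset, with x varying fastest.
inline int flatten(Coord c)
{
    assert(c.x >= 0 && c.x < kSide);
    assert(c.y >= 0 && c.y < kSide);
    assert(c.z >= 0 && c.z < kSide);
    return c.z * (kSide * kSide) + c.y * kSide + c.x;
}

// Inverse of flatten: recover (x, y, z) from a linear offset.
inline Coord unflatten(int i)
{
    assert(i >= 0 && i < kSide * kSide * kSide);
    return Coord{ i % kSide, (i / kSide) % kSide, i / (kSide * kSide) };
}
```

With these definitions, flatten(Coord{63, 63, 63}) is the last valid offset, 64*64*64 - 1.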

The trouble is that the number of elements in int3 *index is different from the number of elements in float *value.

So the rough code looks something like this:

    dim3 blocks(16,16,16);
    dim3 threads(4,4,4);
    kernel<<<blocks, threads>>>(index, values);

where:

    __global__ void kernel(int3 *index, float *value)
    {
        // Global 3D coordinates of this thread
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int z = threadIdx.z + blockIdx.z * blockDim.z;

        // Linear thread index
        int n = x + y * blockDim.x * gridDim.x
                  + z * blockDim.x * gridDim.x * blockDim.y * gridDim.y;

        index[n].x = /* some calculation for values between 0 and 63 */;
        index[n].y = /* some calculation for values between 0 and 63 */;
        index[n].z = /* some calculation for values between 0 and 63 */;

        float cubic_value = /* some other calculation that has the same number of elements as n, and calculated from index[n] */;

        // Linear offset into the 64*64*64 volume
        int total_index = index[n].z * (64*64) + index[n].y * (64) + index[n].x;

        // total_index is a scalar here, so it is used directly as the offset
        value[total_index] = cubic_value;
    }

I can’t seem to get the code to work properly. It compiles, runs, and gets most of the values right, but every time I compile and run it with exactly the same input (I did not change a thing), the values come out differently. I know I’m doing something wrong, and this is probably not the best way to code it, but I can’t figure out a different way to do it. I can cudaMemcpy total_index and cubic_value to host memory and write a loop in C++ to solve the problem.

For example:

    for (int a = 0; a < N; a++)
    {
        value[ total_index[a] ] = cubic_value[a];
    }
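A note on why the order might matter here: if several a map to the same total_index[a], the loop above keeps whichever one is written last, and it is repeatable only because a is visited in a fixed order. The kernel's threads have no guaranteed order, which would be consistent with values changing from run to run (assuming duplicates really occur; cells that are never written would also keep whatever happened to be in device memory). Below is a minimal CPU-only C++ sketch of that effect, plus one deterministic tie rule (largest a wins) written as two order-independent passes. The function names, the owner array, and fill_value are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain scatter, visiting the source elements forward or backward.
// With duplicate targets, the result depends on the visiting order.
std::vector<float> scatter_in_order(const std::vector<int>& total_index,
                                    const std::vector<float>& cubic_value,
                                    std::size_t volume_size,
                                    bool reverse_order)
{
    assert(total_index.size() == cubic_value.size());
    std::vector<float> value(volume_size, 0.0f);
    const std::size_t count = total_index.size();
    for (std::size_t k = 0; k < count; ++k) {
        const std::size_t a = reverse_order ? count - 1 - k : k;
        const std::size_t cell = static_cast<std::size_t>(total_index[a]);
        assert(cell < volume_size);
        value[cell] = cubic_value[a];  // last write to a cell wins
    }
    return value;
}

// Deterministic variant: for each cell, the largest source index a wins.
// Cells that no source element targets are set to fill_value.
std::vector<float> scatter_last_wins(const std::vector<int>& total_index,
                                     const std::vector<float>& cubic_value,
                                     std::size_t volume_size,
                                     float fill_value)
{
    assert(total_index.size() == cubic_value.size());

    // Pass 1: record the winning source index per cell. Taking a maximum
    // gives the same answer in any visiting order.
    std::vector<int> owner(volume_size, -1);
    for (std::size_t a = 0; a < total_index.size(); ++a) {
        const std::size_t cell = static_cast<std::size_t>(total_index[a]);
        assert(cell < volume_size);
        owner[cell] = std::max(owner[cell], static_cast<int>(a));
    }

    // Pass 2: each cell copies from its single winner, so no two writes
    // ever target the same cell.
    std::vector<float> value(volume_size, fill_value);
    for (std::size_t cell = 0; cell < volume_size; ++cell) {
        if (owner[cell] >= 0) {
            value[cell] = cubic_value[static_cast<std::size_t>(owner[cell])];
        }
    }
    return value;
}
```

With total_index = {0, 2, 2, 1} and cubic_value = {5, 1, 7, 3}, the forward and reverse scatters disagree at cell 2 (7 versus 1), while scatter_last_wins puts 7 there either way. On the GPU, pass 1 might map to atomicMax on an int array and pass 2 to a second kernel, but that mapping is untested here.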

But is there a way to solve it in CUDA without having to copy the values to host memory? I have to do this inside a fairly large loop, and N is more than 1 million points.

Thanks a ton!!!