CUDA array: going from a 3D index to a linear index

Hi, I’ve been running into a weird problem and I was hoping someone could help me.

I’m not too fancy with words but here goes…

I’m calculating the indices for a cube: inside a __global__ kernel I compute index.x, index.y, and index.z for a cubic volume, along with a value for each point.
For example, when index[0].x = 0, index[0].y = 0, and index[0].z = 0, then value[0] = 5, and so on.
Each component of index[n] goes from 0 to, say, 63, so the cubic volume has 64*64*64 elements.
The trouble is that the number of elements in int3 *index is different from the number of elements in float *value.

so a rough code would look something like this:

dim3 blocks(16,16,16);
dim3 threads(4,4,4);

kernel<<<blocks, threads>>>(index, value);

where:
__global__ void kernel(int3 *index, float *value)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;

    // linear thread id over the whole 64x64x64 launch
    int n = x + y * blockDim.x * gridDim.x +
            z * blockDim.x * gridDim.x * blockDim.y * gridDim.y;

    index[n].x = some calculation for values between 0 and 63;
    index[n].y = some calculation for values between 0 and 63;
    index[n].z = some calculation for values between 0 and 63;

    // one result per thread, computed from index[n]
    float cubic_value = some other calculation;

    // flatten the computed (x,y,z) back to a linear index
    int total_index = index[n].z * (64*64) + index[n].y * 64 + index[n].x;

    value[total_index[n]] = cubic_value;
}

I can’t seem to get the code to work properly. It compiles and runs and gets most of the values right, but every single time I compile and run with exactly the same input (I didn’t change a thing) the values come out differently. I know I’m doing something wrong; this is probably not the best way to code it, but I can’t figure out a different way to do it. I can cudaMemcpy total_index and cubic_value to host memory and resolve it with a loop in C++.
For example:
for (int a = 0; a < N; a++)
{
    value[total_index[a]] = cubic_value[a];
}
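In full, the workaround I mean is roughly this, assuming the kernel stores each thread’s total_index and cubic_value into per-thread arrays instead of writing value directly (the h_/d_ names are just how I’m labeling host/device copies here):

const int N = 64 * 64 * 64;

// host staging buffers (names are illustrative)
int   *h_total_index = new int[N];
float *h_cubic_value = new float[N];

// copy back the per-thread results the kernel wrote
cudaMemcpy(h_total_index, d_total_index, N * sizeof(int),
           cudaMemcpyDeviceToHost);
cudaMemcpy(h_cubic_value, d_cubic_value, N * sizeof(float),
           cudaMemcpyDeviceToHost);

// sequential scatter: later iterations deterministically overwrite earlier ones
for (int a = 0; a < N; a++)
    value[h_total_index[a]] = h_cubic_value[a];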

But is there a way to solve it in CUDA without having to copy the values to host memory? I have to do this in a fairly large loop, and N is > 1 million points.

Thanks a ton!!!

You have defined what total_index is, but you haven’t defined what total_index[n] is. Regardless, my suspicion is that you have threads stepping on each other as they update value. If you have multiple threads writing to the same location in value, then I would say your code, and indeed your algorithm, makes no sense.

Stated another way, it’s not clear to me that different threads will necessarily compute unique values for total_index (or total_index[n], whatever that is). If they do not all create unique values, then in some cases more than one thread may write to the same location in value. If that happens, the results will be dependent on order of execution, which doesn’t make much sense to me and is a no-no in parallel programming. A typical suggestion might be to use atomics, but a simple replacement of the final setting of value with an atomic would not eliminate the problem.
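One way to test that suspicion: after your kernel runs, launch a second kernel that just counts how many threads target each slot of value. This is only a sketch of the idea; hits is a hypothetical int array of 64*64*64 elements, zeroed beforehand, and the launch configuration is the same as yours:

__global__ void count_hits(const int3 *index, int *hits)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    int n = x + y * blockDim.x * gridDim.x +
            z * blockDim.x * gridDim.x * blockDim.y * gridDim.y;

    // recompute the scatter target exactly as your kernel does
    int total_index = index[n].z * (64*64) + index[n].y * 64 + index[n].x;

    // one atomic increment per thread; any slot that ends up > 1 is
    // being written by more than one thread
    atomicAdd(&hits[total_index], 1);
}

Copy hits back to the host and scan it; any entry greater than 1 confirms that total_index is not unique across threads.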

If you can provide a simple, complete code that demonstrates the variability, I’m sure someone can help you with it.

Certainly, replacing things with a sequential loop on the host will eliminate any variability in the results, even if multiple loop iterations are writing to the same location. But this doesn’t make things sensible to me.

Thank you, what you said made a lot more sense! I’ll whip up a complete version of the code right now.

The objective isn’t completely clear. What I gather is that you want to

  • Use threadIdx/blockIdx-derived values (i,j,k) to compute (x,y,z) = G(i,j,k)
  • Get the linear index n: n = i + 64 j + 4096 k
  • Get the linear index m: m = x + 64 y + 4096 z
  • Compute F(x,y,z)
  • Store index[n] = G(i,j,k)
  • Store value[m] = F(G(i,j,k)) = F(x,y,z)

for i, j, and k each in {0, …, 63}.

If this is correct, then as txbob said, unless G is 1:1 (every (i,j,k) mapped to a unique (x,y,z)), writes to value[m] will collide no matter what.
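In kernel form, the structure above would look something like the sketch below, where G and F are placeholder device functions standing in for your actual calculations:

// Sketch only: G and F are stand-ins for your own calculations.
__device__ int3 G(int i, int j, int k)
{
    return make_int3(i, j, k);             // placeholder: identity mapping
}

__device__ float F(int x, int y, int z)
{
    return (float)(x + y + z);             // placeholder
}

__global__ void kernel(int3 *index, float *value)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int k = threadIdx.z + blockIdx.z * blockDim.z;

    int n = i + 64 * j + 4096 * k;         // unique for every thread

    int3 p = G(i, j, k);                   // (x,y,z), each in 0..63
    int  m = p.x + 64 * p.y + 4096 * p.z;  // unique ONLY if G is 1:1

    index[n] = p;                          // never collides: n is unique
    value[m] = F(p.x, p.y, p.z);           // collides whenever G is not 1:1
}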

CUDA global writes aren’t sequentially consistent, which is why you observe pseudorandom results for value[m]. The C++ loop resolves the race condition and generates a consistent result, but it doesn’t solve the problem of G not being 1:1: the collisions when writing value[m] still occur, they just occur in a predictable order.
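That said, if a deterministic “last write wins” is acceptable, you can get it on the device without the host round-trip: have the kernel store each thread’s (total_index, cubic_value) pair into scratch arrays, stable-sort the pairs by key, and then let only the last entry of each key write. A rough sketch, where d_keys and d_vals are hypothetical N-element scratch arrays the kernel has filled:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// After a stable sort by key, equal keys are adjacent and keep their
// original (thread-id) order, so "last occurrence wins" is well defined.
__global__ void write_last(const int *keys, const float *vals,
                           float *value, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    if (i == N - 1 || keys[i] != keys[i + 1])   // last entry of its run
        value[keys[i]] = vals[i];               // distinct targets: no race
}

void scatter_deterministic(int *d_keys, float *d_vals, float *d_value, int N)
{
    // group equal keys together, preserving thread order within ties
    thrust::stable_sort_by_key(thrust::device, d_keys, d_keys + N, d_vals);
    write_last<<<(N + 255) / 256, 256>>>(d_keys, d_vals, d_value, N);
}

Because the stable sort preserves thread order within equal keys, this reproduces exactly the last-write-wins semantics of your sequential host loop, just without the device-to-host copy. It still doesn’t change the underlying fact that collisions mean G isn’t 1:1.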