N-dimensional array reordering: strange errors for large array reordering on GPUs

Hello,

Some short background on what I am doing…

I am optimizing a few tensor contraction kernels for GPUs. The generic algorithm is to reorder the n-dimensional tensor so that the contraction indices line up and a single matrix multiply performs the contraction; another reordering then puts the resulting matrix into the correct final form.

For example, contracting over indices i and j in X1[a][i][c][j] * X2[b][j][i][d] = Y[a][b][c][d],

I would reorder to X1[i][j][a][c] * X2[b][d][i][j] = Y[b][d][a][c] and then reorder Y to the proper order (a rough sketch of the resulting GEMM call is below)

  • note that cuBLAS is column-major ordered
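Just to make that concrete, the GEMM step for this example would look roughly like the following. This is only a sketch: na, nb, nc, nd, ni, nj, the cuBLAS handle and the device pointers d_X1, d_X2, d_Y are placeholder names, and d_X1/d_X2 are assumed to already hold the tensors reordered to [i][j][a][c] and [b][d][i][j] in ordinary row-major C layout.

#include <cublas_v2.h>

// Sketch (untested): the reordered row-major [i][j][a][c] buffer, viewed column-major,
// is an (na*nc) x (ni*nj) matrix; the [b][d][i][j] buffer is (ni*nj) x (nb*nd);
// so Y[b][d][a][c] falls out of a single no-transpose DGEMM as (na*nc) x (nb*nd).
void contractExample(cublasHandle_t handle,
                     const double* d_X1, const double* d_X2, double* d_Y,
                     int na, int nb, int nc, int nd, int ni, int nj)
{
    int m = na * nc;              // rows of Y in cuBLAS' column-major view
    int n = nb * nd;              // columns of Y
    int k = ni * nj;              // contracted length (i and j fused)
    double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_X1, m,
                        d_X2, k,
                &beta,  d_Y,  m);
}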

The problem…

On a Tesla C2070 it works flawlessly, while on a GeForce GTX 590 I get strange memory issues as soon as I scale the number of thread blocks above 1.

The problem is in the reordering kernel I have written, and I do not understand why it works fine on the C2070 (compute capability 2.0) and not on the GTX 590 (also 2.0).

Below is a kernel for a specific case of reordering a 4D tensor. This one is simple enough: it is equivalent to a matrix transpose of the array treated as a 2D (a*b) by (c*d) matrix.

a, b, c, and d are the dimension sizes and the length of the array is a*b*c*d

newX and oldX are allocated buffers via cudaMalloc

oldX is filled by a cudaMemcpy from the host and is confirmed to hold the correct values (in my test cases, all 1.000s)

newX is where I want the reordered tensor to end up (in my test cases it should again be all 1.000s)

__global__ void permuteABCD_CDAB(double* newX, double* oldX, int a, int b, int c, int d) {
  int oldIndex = blockIdx.x * blockDim.x + threadIdx.x;
  while (oldIndex < a*b*c*d) {
    // scatter: read oldX linearly (ABCD order) and write each element to its CDAB position
    newX[(((oldIndex % (c*d)) / d) * a*b*d) + ((oldIndex % d) * a*b) +
         ((oldIndex / (b*c*d)) * b) + ((oldIndex % (b*c*d)) / (c*d))] = oldX[oldIndex];
    oldIndex += gridDim.x * blockDim.x;
  }
}
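For reference, the surrounding test code is roughly the following fragment (not the exact code; a, b, c, d and the block count are set elsewhere, and error checking is trimmed):

size_t len = (size_t)a * b * c * d;
double* hostX = (double*)malloc(len * sizeof(double));
for (size_t i = 0; i < len; ++i) hostX[i] = 1.0;            // fill with 1.000s

double *oldX, *newX;
cudaMalloc(&oldX, len * sizeof(double));
cudaMalloc(&newX, len * sizeof(double));
cudaMemcpy(oldX, hostX, len * sizeof(double), cudaMemcpyHostToDevice);

permuteABCD_CDAB<<<blocks, 256>>>(newX, oldX, a, b, c, d);  // blocks = 1, 2, 3, ...
cudaDeviceSynchronize();

cudaMemcpy(hostX, newX, len * sizeof(double), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < len; ++i)                            // every element should still be 1.0
    if (hostX[i] != 1.0) printf("%f at index %zu\n", hostX[i], i);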

C2070: works perfectly

GTX 590:

for permuteABCD_CDAB<<<1,256>>>(newX, oldX, a, b, c, d); it works perfectly

for permuteABCD_CDAB<<<2,256>>>(newX, oldX, a, b, c, d); 0 to 5 indices are messed up

newX value, oldX value, broken index

0.250000, 1.000000, index=11085

0.003906, 1.000000, index=128554

for permuteABCD_CDAB<<<3,256>>>(newX, oldX, a, b, c, d); 10ish indices are messed up

0.250000, 1.000000, index=11085

0.062500, 1.000000, index=24460

0.000000, 1.000000, index=41107

0.000000, 1.000000, index=61725

0.000000, 1.000000, index=87669

0.500000, 1.000000, index=93196

1.000000, 0.062500, index=154410

0.062500, 1.000000, index=166242

0.003906, 1.000000, index=187258

0.000000, 1.000000, index=266921

Now we can also see that one of the values in oldX has gone bad (index 154410). It doesn't happen every time, but similar indices tend to break more often; for example, index 11085 broke in two consecutive runs with launch configurations of 2 and 3 blocks.

As I increase the number of blocks contributing to the reordering, the number of errors ramps up quickly, with some errors being inf or nan and a lot being 0.000, 0.500, or 0.250, until there are thousands of incorrect values.

Again, this only happens on the GTX 590; the C2070 works perfectly.

The last thing to add is that the C2070 has ECC enabled while the GTX 590 currently has it turned off. I don't know if that has any effect, but it seems worth noting. The main thing to take away is that scaling this kernel on the C2070 was fine with however many blocks I used, while on the GTX 590 it only works with 1 block.

Anyone able to make any sense of this?

(Just to check the trivial things: Are you sure you are calling the kernel with a==c and b==d?)

Assuming there is no trivial explanation, this sounds to me like a cache coherency issue. It might manifest only on the GTX 590 and not on the C2070 because of the different number of memory channels, and thus a different mapping of addresses to memory controllers. Try switching off the L1 cache by compiling with -Xptxas -dlcm=cg.
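For example, with a plain nvcc build that would be something like (the file name is just a placeholder):

nvcc -arch=sm_20 -Xptxas -dlcm=cg -o permute permute.cu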

In general, a ‘gather’ style kernel produces far fewer cacheline bounces and less cache coherency traffic than a ‘scatter’ style one. So this code might even execute a bit faster:

__global__ void permuteABCD_CDAB(double* newX, double* oldX, int a, int b, int c, int d) {
  int newIndex = blockIdx.x * blockDim.x + threadIdx.x;
  while (newIndex < a*b*c*d) {
    // gather: write newX linearly and read the corresponding element of oldX
    newX[newIndex] = oldX[(((newIndex % (c*d)) / d) * a*b*d) + ((newIndex % d) * a*b) +
                          ((newIndex / (b*c*d)) * b) + ((newIndex % (b*c*d)) / (c*d))];
    newIndex += gridDim.x * blockDim.x;
  }
}

(I’m wondering a bit about all those divisions and modulus operations, which are slow and difficult to read. On a memory bandwidth bound kernel like this it probably doesn’t make a difference though.)
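For what it's worth, spelling the four sub-indices out would make the mapping easier to follow without changing the memory traffic. A sketch (untested), assuming newX is in CDAB order and oldX in ABCD order:

__global__ void permuteABCD_CDAB_readable(double* newX, double* oldX, int a, int b, int c, int d) {
  int newIndex = blockIdx.x * blockDim.x + threadIdx.x;
  while (newIndex < a*b*c*d) {
    // decompose the linear index of newX (CDAB ordering) into the four sub-indices
    int cIdx =  newIndex / (d*a*b);
    int dIdx = (newIndex / (a*b)) % d;
    int aIdx = (newIndex / b) % a;
    int bIdx =  newIndex % b;
    // recombine them with the ABCD strides of oldX
    newX[newIndex] = oldX[aIdx*(b*c*d) + bIdx*(c*d) + cIdx*d + dIdx];
    newIndex += gridDim.x * blockDim.x;
  }
}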

While being tested the kernel is indeed called with a == b == c == d, and errors start occurring once this size goes above roughly 20. I tried compiling with -Xptxas -dlcm=cg and the error persisted. When I changed to the gather kernel instead of the scatter one, the errors remained but at a much lower frequency: with the block count set to 32 and the dimensions at a = b = c = d = 45, there were only about 100 faulty memory locations, with a noticeably larger share of the errors in the source array rather than the destination one. There were still errors in both, however. Previously there were around 10,000 errors for a matrix of that size, most of them in the destination array.

The math for the kernel index calculation is simply: calculate the values of the A, B, C, and D indices and sum up aIndex*AStride + bIndex*BStride + … for the layout being indexed, i.e.

oldX[aIndex * oldAStride + bIndex * oldBStride + …]

newX ordering = CDAB

oldX ordering = ABCD

aIndex = (newIndex % (a*b)) / b

oldAStride = b*c*d

I'm fairly certain that is all correct, and I am not worried about the speed of the reordering so much as its accuracy, since the dgemm takes significantly more time than the reordering. Most of the values are reordered correctly, but having a few 0.00s and INFs in there really messes the entire thing up.

__global__ void permuteABCD_CDAB(double* newX, double* oldX, int a, int b, int c, int d) {
  int newIndex = blockIdx.x * blockDim.x + threadIdx.x;
  while (newIndex < a*b*c*d) {
    // gather: newX is indexed linearly (CDAB order), oldX via its ABCD strides
    newX[newIndex] = oldX[(((newIndex % (a*b)) / b) * b*c*d) + ((newIndex % b) * c*d) +
                          ((newIndex / (a*b*d)) * d) + ((newIndex % (d*a*b)) / (a*b))];
    newIndex += gridDim.x * blockDim.x;
  }
}
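For checking the index math itself, independent of the GPU issue, a plain CPU reference of the same ABCD -> CDAB permutation would be just (a sketch):

void permuteABCD_CDAB_cpu(double* newX, const double* oldX, int a, int b, int c, int d) {
    for (int ai = 0; ai < a; ++ai)
        for (int bi = 0; bi < b; ++bi)
            for (int ci = 0; ci < c; ++ci)
                for (int di = 0; di < d; ++di) {
                    int oldIndex = ((ai*b + bi)*c + ci)*d + di;   // ABCD layout
                    int newIndex = ((ci*d + di)*a + ai)*b + bi;   // CDAB layout
                    newX[newIndex] = oldX[oldIndex];
                }
}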

Sorry, I didn’t intend to imply the index calculations are wrong (I think they are correct).

What do you mean by “errors in the source array”? Are the contents of oldX getting changed even though they should only be read?

Could you try running the code on the other device of the GTX590? Maybe you indeed have a defective device.
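Selecting the other GPU of the GTX 590 is just a matter of calling cudaSetDevice before any allocations, for example:

cudaSetDevice(1);   // pick the 590's second device before any cudaMalloc or kernel launch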

Well what do you know… device 1 works fine and doesn’t have these errors

Very interesting… thank you for the help

And yes, there were errors in both the source and destination arrays of the reordering after running the kernel.