Hello,
Some short background on what I am doing…
I am optimizing a few tensor contraction kernels for GPUs. The general approach is to reorder the n-dimensional tensor arrays so that the contraction indices line up, letting a single matrix multiply perform the contraction; a second reordering then puts the resulting matrix into its final form.
For example, to contract over indices i and j in X1[a][i][c][j] * X2[b][j][i][d] = Y[a][b][c][d],
I would reorder to X1[i][j][a][c] * X2[b][d][i][j] = Y[b][d][a][c] and then reorder Y into the proper order
- note that cuBLAS is column-major ordered
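To make that concrete, once the operands are reordered the contraction itself is a single GEMM. Here is a sketch of that call (the handle, the d_* pointers, and the n* dimension extents are illustrative names, not my actual code):

#include <cublas_v2.h>

// Sketch of the contraction step after reordering. The reordered X1
// ([i][j][a][c], row-major) reads as a column-major (na*nc) x (ni*nj)
// matrix, and the reordered X2 ([b][d][i][j], row-major) as a column-major
// (ni*nj) x (nb*nd) matrix, so one plain GEMM yields Y[b][d][a][c] row-major.
void contract(cublasHandle_t handle, const double* d_X1, const double* d_X2,
              double* d_Y, int na, int nb, int nc, int nd, int ni, int nj) {
    const double one = 1.0, zero = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                na * nc,           // m: the (a,c) block of the result
                nb * nd,           // n: the (b,d) block of the result
                ni * nj,           // k: the contracted (i,j) pair
                &one,
                d_X1, na * nc,     // lda
                d_X2, ni * nj,     // ldb
                &zero,
                d_Y,  na * nc);    // ldc
}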
The problem…
On a Tesla C2070 it works flawlessly, while on a GeForce GTX 590 I get strange memory issues when I scale the number of thread blocks beyond 1.
The problem is in the reordering kernel I have written, and I do not understand why it works fine on the C2070 (compute capability 2.0) and not on the GTX 590 (also 2.0).
Below is the kernel for one specific case of reordering a 4D tensor. It is simple enough to be equivalent to a matrix transpose on the array, treating it as a 2D (a*b) by (c*d) matrix (a short sketch of that equivalent form follows these notes).
a, b, c, and d are the dimension sizes, and the length of the array is a*b*c*d
newX and oldX are device buffers allocated via cudaMalloc
oldX is filled by a cudaMemcpy from the host and is confirmed to hold the correct values (all 1.000s in my test cases)
newX is where I want the reordered tensor to end up (again, all 1.000s in my test cases)
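To spell out the 2D-transpose view before the kernel itself: writing row = oldIndex / (c*d) and col = oldIndex % (c*d), the whole permutation is just newIndex = col*(a*b) + row. A standalone sketch of that equivalent form (names are illustrative):

// Equivalent 2D view (sketch): the [a][b][c][d] -> [c][d][a][b] move is a
// plain transpose of an (a*b) x (c*d) row-major matrix.
__global__ void transposeView(double* newX, double* oldX, int rows, int cols) {
    // rows = a*b, cols = c*d
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (i < rows * cols) {
        int row = i / cols;               // position in the (a,b) block
        int col = i % cols;               // position in the (c,d) block
        newX[col * rows + row] = oldX[i];
        i += gridDim.x * blockDim.x;
    }
}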
__global__ void permuteABCD_CDAB(double* newX, double* oldX, int a, int b, int c, int d) {
    // Grid-stride loop over the flat [a][b][c][d] input.
    int oldIndex = blockIdx.x * blockDim.x + threadIdx.x;
    while (oldIndex < a*b*c*d) {
        // oldIndex decomposes as ((ia*b + ib)*c + ic)*d + id; the write goes
        // to the [c][d][a][b] layout at ((ic*d + id)*a + ia)*b + ib.
        newX[(((oldIndex % (c*d)) / d) * a*b*d) +  // ic * (d*a*b)
             ((oldIndex % d) * a*b) +              // id * (a*b)
             ((oldIndex / (b*c*d)) * b) +          // ia * b
             ((oldIndex % (b*c*d)) / (c*d))]       // ib
            = oldX[oldIndex];
        oldIndex += gridDim.x * blockDim.x;
    }
}
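For reference, my test harness boils down to something like this (a trimmed sketch; the dimension values are placeholders, not my real sizes). The compare loop at the end is what produces the output below:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// permuteABCD_CDAB as defined above

int main() {
    int a = 24, b = 24, c = 24, d = 24;   // placeholder sizes for illustration
    size_t n = (size_t)a * b * c * d;

    double* h = (double*)malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) h[i] = 1.0;          // all 1.000s

    double *oldX, *newX;
    cudaMalloc(&oldX, n * sizeof(double));
    cudaMalloc(&newX, n * sizeof(double));
    cudaMemcpy(oldX, h, n * sizeof(double), cudaMemcpyHostToDevice);

    permuteABCD_CDAB<<<2, 256>>>(newX, oldX, a, b, c, d);
    cudaDeviceSynchronize();              // no errors are reported here

    // Copy both buffers back and report every entry that is no longer 1.0.
    double* hNew = (double*)malloc(n * sizeof(double));
    cudaMemcpy(hNew, newX, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(h, oldX, n * sizeof(double), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < n; ++i)
        if (hNew[i] != 1.0 || h[i] != 1.0)
            printf("%f, %f, index=%lu\n", hNew[i], h[i], (unsigned long)i);

    free(h); free(hNew);
    cudaFree(oldX); cudaFree(newX);
    return 0;
}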
C2070: works perfectly
GTX 590:
for permuteABCD_CDAB<<<1,256>>>(newX, oldX, a, b, c, d); it works perfectly
for permuteABCD_CDAB<<<2,256>>>(newX, oldX, a, b, c, d); 0 to 5 indices come out wrong:
newX value, oldX value, broken index
0.250000, 1.000000, index=11085
0.003906, 1.000000, index=128554
for permuteABCD_CDAB<<<3,256>>>(newX, oldX, a, b, c, d); roughly 10 indices come out wrong:
0.250000, 1.000000, index=11085
0.062500, 1.000000, index=24460
0.000000, 1.000000, index=41107
0.000000, 1.000000, index=61725
0.000000, 1.000000, index=87669
0.500000, 1.000000, index=93196
1.000000, 0.062500, index=154410
0.062500, 1.000000, index=166242
0.003906, 1.000000, index=187258
0.000000, 1.000000, index=266921
Notice that this time one of the values in oldX has also gone bad (index 154410). It doesn't happen every run, but similar indices tend to break repeatedly; for example, index 11085 failed in two consecutive runs, once with 2 blocks and once with 3.
As I increase the number of blocks contributing to the reordering, the number of bad values ramps up quickly. Some errors are inf or nan, and many are values like 0.000, 0.500, or 0.250; eventually there are thousands of incorrect entries.
Again, this happens only on the GTX 590; the C2070 works perfectly.
The last thing to add is that the C2070 has ECC enabled while the GTX 590 currently has it turned off. I don't know whether that has any effect, but it seems worth noting. The main takeaway is that scaling this kernel on the C2070 worked with however many blocks I used, while on the GTX 590 it only works with 1 block.
Anyone able to make any sense of this?