The attached program performs a basic copy between two areas of global device memory. When executed on a Tesla C1060 card, the copy fails with incorrect results after a seemingly random number of successive copies. On average it takes ~200 iterations to fail, but it can fail in as few as 10 or as many as 1000. When the size of the array is increased by a factor of 10 (from 157 x 176 to 157 x 1760), it typically fails after only ~10 iterations.
The copy kernel assigns a single thread to each element of the array. The incorrectly copied elements always appear at indices corresponding to the last two threads within a half-warp (see the offset % 16 column in the output). Furthermore, the errors typically occur on arrays where the number of columns is a multiple of 16, although some sizes which are multiples of 16 do still pass. These may pass simply because they are very small (for example, 16 x 32 never seems to fail), since the error frequency seems to be proportional to array size.
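To make the offset % 16 column concrete, here is a minimal host-side sketch of the check. It is not the attached program: `Complex2f`, `element_offset`, `half_warp_lane` and `mismatch_histogram` are hypothetical names, and it assumes a row-major array with the ‘x’ thread dimension running along the row. Under those assumptions, and when both the number of columns and the block's ‘x’ dimension are multiples of 16, offset % 16 is the same as the thread's position within its half-warp, so bins 14 and 15 correspond to the last two threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for cuComplex: two packed 32-bit floats (8 bytes).
struct Complex2f {
    float x;  // real part
    float y;  // imaginary part
};

// Linear offset of element (row, col) in a row-major array with `cols` columns.
std::size_t element_offset(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// Position of thread (tx, ty) within its half-warp. Threads in a block are
// linearised with x varying fastest; a half-warp is 16 consecutive threads.
std::size_t half_warp_lane(std::size_t tx, std::size_t ty, std::size_t block_x) {
    return (ty * block_x + tx) % 16;
}

// Compares source and destination bit-for-bit and counts the mismatching
// elements in 16 bins indexed by (offset % 16).
std::vector<int> mismatch_histogram(const std::vector<Complex2f>& src,
                                    const std::vector<Complex2f>& dst) {
    assert(src.size() == dst.size());
    std::vector<int> bins(16, 0);
    for (std::size_t offset = 0; offset < src.size(); ++offset) {
        if (std::memcmp(&src[offset], &dst[offset], sizeof(Complex2f)) != 0) {
            ++bins[offset % 16];
        }
    }
    return bins;
}
```

Running `mismatch_histogram` on the host copies of the source and destination after each iteration would show whether the bad elements really do pile up in the last two bins.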
The error also occurs in a 1-D copy with a 1-D grid and block. In that case, however, it doesn’t seem to be restricted to lengths which are multiples of 16, or of any other obvious factor that I could find. A 1-D test case is also attached. Changing the block dimensions seems to have an effect as well: in the 1-D case any block size of fewer than 64 threads appears to work, and in the 2-D case any block size whose ‘x’ dimension (which in this program runs along the row) is less than 16 also works.
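As a rough way to characterise the block sizes that do and don't fail, the sketch below computes two properties of a launch configuration on the host. The function names are hypothetical, and it only describes the geometry; it is not meant as an explanation of the failure. It relies on the warp size being 32 (half-warp 16) and on threads in a block being linearised with ‘x’ varying fastest.

```cpp
#include <cstddef>

const std::size_t kWarpSize = 32;      // threads per warp
const std::size_t kHalfWarpSize = 16;  // threads per half-warp

// Number of completely filled warps in a block of the given size.
std::size_t full_warps_per_block(std::size_t threads_per_block) {
    return threads_per_block / kWarpSize;
}

// Number of distinct threadIdx.y values covered by the first half-warp of a
// 2-D block that is block_x threads wide (block_x >= 1, and the block is
// assumed to hold at least 16 threads in total).
std::size_t rows_in_first_half_warp(std::size_t block_x) {
    return (kHalfWarpSize - 1) / block_x + 1;
}
```

By this measure, the 1-D block sizes that work (fewer than 64 threads) never contain two full warps, and the 2-D blocks that work (‘x’ less than 16) are those where a half-warp spans more than one row of the block. Whether either property is actually related to the failure is only a guess.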
This error doesn’t seem to occur on a GeForce 9800 GX2 in either the 1-D or the 2-D case. It also doesn’t occur on either device, in either case, when copying arrays of float rather than cuComplex.