For a `__global__` kernel like this:
__global__ void ComputeOutput(float * const C, int const num_in)
{
    // Grid-stride loop
    // learned from https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int j_ = blockIdx.x * blockDim.x + threadIdx.x;
         j_ < num_in;
         j_ += blockDim.x * gridDim.x) {
        C[j_] = float(j_);
    }
}
is it possible that the output `C[j] != j`?
I am seeing exactly that: most `C[j]` equal `j`, but a few of them do not.
The bug is present even when I launch the kernel with a single thread:
ComputeOutput<<<1,1>>>( d_C, num_in);
You can reproduce the error with the kernel above.
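(My full script is not included here; a minimal standalone harness along these lines, where the array size and variable names are illustrative, should reproduce what I see:)

```cuda
#include <cstdio>

__global__ void ComputeOutput(float * const C, int const num_in)
{
    for (int j_ = blockIdx.x * blockDim.x + threadIdx.x;
         j_ < num_in;
         j_ += blockDim.x * gridDim.x) {
        C[j_] = float(j_);
    }
}

int main()
{
    const int num_in = 20000000;  // illustrative size, past the point where I see errors
    float *d_C = nullptr;
    cudaMalloc(&d_C, num_in * sizeof(float));

    ComputeOutput<<<1, 1>>>(d_C, num_in);  // single thread, as in the question
    cudaDeviceSynchronize();

    float *h_C = new float[num_in];
    cudaMemcpy(h_C, d_C, num_in * sizeof(float), cudaMemcpyDeviceToHost);

    // count entries where C[j], read back as an integer, differs from j
    long long bad = 0;
    for (int j = 0; j < num_in; ++j)
        if ((long long)h_C[j] != j) ++bad;
    printf("mismatches: %lld out of %d\n", bad, num_in);

    delete[] h_C;
    cudaFree(d_C);
    return 0;
}
```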
My environment is Matlab 2017a, Ubuntu 16.04 64-bit, CUDA-8.0, Tesla K80.
Update: I find that the error only occurs when j is relatively large (on the order of 16 million). It's common for me to deal with such large indices.