Sigmoid kernel getting wrong results

I have the following kernel:

template<typename T>
__global__ void _sigmoidKernel( int rows, int cols, const T* src, T* dst )
{
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.y*blockDim.y + threadIdx.y;

    j = j < cols ? j : cols - 1;
    i = i < rows ? i : rows - 1;
    
    int index = i * cols + j;

    dst[index] = 1.0 / ( 1.0 + exp( -src[index] ) );
}

and it’s called this way:

dim3 threadsPerBlock(32, 32);
dim3 blockSize( iDivUp( cols, threadsPerBlock.x ), iDivUp( rows, threadsPerBlock.y ) );
_sigmoidKernel<<<blockSize, threadsPerBlock>>>( rows, cols, src, dst );
cudaDeviceSynchronize();

I’m validating this kernel by comparing its results with a correct cpu version. The problem is that when src == dst, i.e., when input matrix and output matrix is the same I get wrong results in some values. For example:

src[i] == -0.366083
cpu result == 0.409488
gpu result == 0.600965

Not all values in the matrix are incorrect just a few. From a matrix of dimensions 3000x5000 I got around 30 wrong values.

It doesn’t seem to be a problem of read/write concurrency as far as I understand. I would appreciate any help to find the problem.

Thanks.

PS.: cuda-memcheck didn’t return any error.

Something interesting I’ve just found: if I change the division operator by a __fdividef function or for a product operator (changing the cpu code also of course), I get the correct results.

I’m using CUDA release 6.5, V6.5.12 in a GTX 760.

This will certainly confuse the results:

j = j < cols ? j : cols - 1;
    i = i < rows ? i : rows - 1;

this allows various i,j indices to write their result to locations that you wouldn’t expect.

If your intent is to prevent out-of-bounds writes, just use:

template<typename T>
__global__ void _sigmoidKernel( int rows, int cols, const T* src, T* dst )
{
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.y*blockDim.y + threadIdx.y;
    
    if ((i < rows) && (j < cols)){
    
      int index = i * cols + j;

      dst[index] = 1.0 / ( 1.0 + exp( -src[index] ) );}
}

Thanks txbob.

I’ve made the suggested modifications but the problem still occurs though less frequently.

My attempt to craft a complete code around what you have shown (plus my modified kernel) runs without error that I can see:

$ cat t1152.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TOL 0.00001

const int trows = 3000;
const int tcols = 5000;
typedef float mytype;

template<typename T>
__global__ void _sigmoidKernel( int rows, int cols, const T* src, T* dst )
{
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.y*blockDim.y + threadIdx.y;

    if ((i < rows) && (j < cols)){

      int index = i * cols + j;

      dst[index] = 1.0 / ( 1.0 + exp( -src[index] ) );}
}

template <typename T>
T cpu_sigmoid(T d){
  return 1.0 / ( 1.0 + exp( -d ) );
}

int main(){

  size_t dsize = trows*tcols*sizeof(mytype);
  mytype *h_data = (mytype *)malloc(dsize);
  mytype *r_data = (mytype *)malloc(dsize);
  mytype *d_data;
  cudaMalloc(&d_data, dsize);
  for (int i = 0; i < trows*tcols; i++)
    h_data[i] = rand()/(mytype)RAND_MAX;
  cudaMemcpy(d_data, h_data, dsize, cudaMemcpyHostToDevice);
  dim3 threadsPerBlock(32, 32);
  dim3 blockSize( (tcols+threadsPerBlock.x-1)/threadsPerBlock.x, (trows+threadsPerBlock.y-1)/threadsPerBlock.y );
  _sigmoidKernel<<<blockSize, threadsPerBlock>>>( trows, tcols, d_data, d_data );
  cudaDeviceSynchronize();
  cudaMemcpy(r_data, d_data, dsize, cudaMemcpyDeviceToHost);
  for (int i = 0; i < trows*tcols; i++)
    if(fabs(cpu_sigmoid(h_data[i])-r_data[i]) > TOL) {printf("mismatch at %d, cpu: %f, gpu: %f\n", i, cpu_sigmoid(h_data[i]), r_data[i]); return 1;}
  printf("Success!\n");
  for (int i = 0; i < 10; i++)
    printf("%d: in: %f  cpu: %f  gpu: %f\n", i, h_data[i], cpu_sigmoid(h_data[i]), r_data[i]);
  return 0;
}

$ nvcc -o t1152 t1152.cu
$ cuda-memcheck ./t1152
========= CUDA-MEMCHECK
Success!
0: in: 0.840188  cpu: 0.698505  gpu: 0.698505
1: in: 0.394383  cpu: 0.597337  gpu: 0.597337
2: in: 0.783099  cpu: 0.686348  gpu: 0.686348
3: in: 0.798440  cpu: 0.689641  gpu: 0.689641
4: in: 0.911647  cpu: 0.713337  gpu: 0.713337
5: in: 0.197551  cpu: 0.549228  gpu: 0.549228
6: in: 0.335223  cpu: 0.583030  gpu: 0.583030
7: in: 0.768230  cpu: 0.683138  gpu: 0.683138
8: in: 0.277775  cpu: 0.569001  gpu: 0.569001
9: in: 0.553970  cpu: 0.635056  gpu: 0.635056
========= ERROR SUMMARY: 0 errors
$

my guess is the problem lies in something you haven’t shown or communicated.

In the future, my suggestion would be to provide a short, completely worked example, just as I have done here, demonstrating your observed problem. This will allow others to help you most efficiently.

I was trying to make an example like yours when I realized that I fixed the kernel with the ‘if’ clause you recommeded but didn’t removed the statements:

j = j < cols ? j : cols - 1;
i = i < rows ? i : rows - 1;

Now everything is working fine.

Thank you very much txbob.