Sigmoid kernel getting wrong results

Eder · May 13, 2016, 6:31pm

I have the following kernel:

template<typename T>
__global__ void _sigmoidKernel( int rows, int cols, const T* src, T* dst )
{
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.y*blockDim.y + threadIdx.y;

    j = j < cols ? j : cols - 1;
    i = i < rows ? i : rows - 1;
    
    int index = i * cols + j;

    dst[index] = 1.0 / ( 1.0 + exp( -src[index] ) );
}

and it’s called this way:

dim3 threadsPerBlock(32, 32);
dim3 blockSize( iDivUp( cols, threadsPerBlock.x ), iDivUp( rows, threadsPerBlock.y ) );
_sigmoidKernel<<<blockSize, threadsPerBlock>>>( rows, cols, src, dst );
cudaDeviceSynchronize();

I’m validating this kernel by comparing its results with a correct cpu version. The problem is that when src == dst, i.e., when input matrix and output matrix is the same I get wrong results in some values. For example:

src[i] == -0.366083
cpu result == 0.409488
gpu result == 0.600965

Not all values in the matrix are incorrect just a few. From a matrix of dimensions 3000x5000 I got around 30 wrong values.

It doesn’t seem to be a problem of read/write concurrency as far as I understand. I would appreciate any help to find the problem.

Thanks.

PS.: cuda-memcheck didn’t return any error.

Eder · May 13, 2016, 7:46pm

Something interesting I’ve just found: if I change the division operator by a __fdividef function or for a product operator (changing the cpu code also of course), I get the correct results.

I’m using CUDA release 6.5, V6.5.12 in a GTX 760.

Robert_Crovella · May 13, 2016, 7:50pm

This will certainly confuse the results:

j = j < cols ? j : cols - 1;
    i = i < rows ? i : rows - 1;

this allows various i,j indices to write their result to locations that you wouldn’t expect.

If your intent is to prevent out-of-bounds writes, just use:

template<typename T>
__global__ void _sigmoidKernel( int rows, int cols, const T* src, T* dst )
{
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.y*blockDim.y + threadIdx.y;
    
    if ((i < rows) && (j < cols)){
    
      int index = i * cols + j;

      dst[index] = 1.0 / ( 1.0 + exp( -src[index] ) );}
}

Eder · May 13, 2016, 9:01pm

Thanks txbob.

I’ve made the suggested modifications but the problem still occurs though less frequently.

Robert_Crovella · May 13, 2016, 9:33pm

My attempt to craft a complete code around what you have shown (plus my modified kernel) runs without error that I can see:

$ cat t1152.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TOL 0.00001

const int trows = 3000;
const int tcols = 5000;
typedef float mytype;

template<typename T>
__global__ void _sigmoidKernel( int rows, int cols, const T* src, T* dst )
{
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.y*blockDim.y + threadIdx.y;

    if ((i < rows) && (j < cols)){

      int index = i * cols + j;

      dst[index] = 1.0 / ( 1.0 + exp( -src[index] ) );}
}

template <typename T>
T cpu_sigmoid(T d){
  return 1.0 / ( 1.0 + exp( -d ) );
}

int main(){

  size_t dsize = trows*tcols*sizeof(mytype);
  mytype *h_data = (mytype *)malloc(dsize);
  mytype *r_data = (mytype *)malloc(dsize);
  mytype *d_data;
  cudaMalloc(&d_data, dsize);
  for (int i = 0; i < trows*tcols; i++)
    h_data[i] = rand()/(mytype)RAND_MAX;
  cudaMemcpy(d_data, h_data, dsize, cudaMemcpyHostToDevice);
  dim3 threadsPerBlock(32, 32);
  dim3 blockSize( (tcols+threadsPerBlock.x-1)/threadsPerBlock.x, (trows+threadsPerBlock.y-1)/threadsPerBlock.y );
  _sigmoidKernel<<<blockSize, threadsPerBlock>>>( trows, tcols, d_data, d_data );
  cudaDeviceSynchronize();
  cudaMemcpy(r_data, d_data, dsize, cudaMemcpyDeviceToHost);
  for (int i = 0; i < trows*tcols; i++)
    if(fabs(cpu_sigmoid(h_data[i])-r_data[i]) > TOL) {printf("mismatch at %d, cpu: %f, gpu: %f\n", i, cpu_sigmoid(h_data[i]), r_data[i]); return 1;}
  printf("Success!\n");
  for (int i = 0; i < 10; i++)
    printf("%d: in: %f  cpu: %f  gpu: %f\n", i, h_data[i], cpu_sigmoid(h_data[i]), r_data[i]);
  return 0;
}

$ nvcc -o t1152 t1152.cu
$ cuda-memcheck ./t1152
========= CUDA-MEMCHECK
Success!
0: in: 0.840188  cpu: 0.698505  gpu: 0.698505
1: in: 0.394383  cpu: 0.597337  gpu: 0.597337
2: in: 0.783099  cpu: 0.686348  gpu: 0.686348
3: in: 0.798440  cpu: 0.689641  gpu: 0.689641
4: in: 0.911647  cpu: 0.713337  gpu: 0.713337
5: in: 0.197551  cpu: 0.549228  gpu: 0.549228
6: in: 0.335223  cpu: 0.583030  gpu: 0.583030
7: in: 0.768230  cpu: 0.683138  gpu: 0.683138
8: in: 0.277775  cpu: 0.569001  gpu: 0.569001
9: in: 0.553970  cpu: 0.635056  gpu: 0.635056
========= ERROR SUMMARY: 0 errors
$

my guess is the problem lies in something you haven’t shown or communicated.

In the future, my suggestion would be to provide a short, completely worked example, just as I have done here, demonstrating your observed problem. This will allow others to help you most efficiently.

Eder · May 13, 2016, 10:10pm

I was trying to make an example like yours when I realized that I fixed the kernel with the ‘if’ clause you recommeded but didn’t removed the statements:

j = j < cols ? j : cols - 1;
i = i < rows ? i : rows - 1;

Now everything is working fine.

Thank you very much txbob.

Topic		Replies	Views
LARGE 2D arrays CUDA Programming and Performance	10	8555	August 11, 2011
Cuda giving wrong result CUDA Programming and Performance cuda	1	302	May 4, 2020
what's wrong with my code? CUDA Programming and Performance	6	3955	July 7, 2009
Wrong result after a certain number of iterations Execution doesn't give the same results as the CUDA Programming and Performance	11	4534	June 29, 2009
Desperate help needed! Sum rows of matrix CUDA Programming and Performance	16	40579	April 24, 2011
Varying results in different runs under the same condition CUDA Programming and Performance	3	423	September 5, 2019
Same Code (really, it is) - Much Different Results CUDA Programming and Performance	38	13850	September 30, 2010
Static shared memory Tiling (performance optimization) CUDA Programming and Performance	0	34	November 2, 2024
Issue with Nested for-loops in kernel CUDA Programming and Performance	13	2287	April 4, 2019
CUDA Matrix Multiplication Kernel Results Inconsistent when blockDim.z >1 CUDA Programming and Performance	2	738	January 28, 2018

Sigmoid kernel getting wrong results

Related topics