This code doesn't work; maybe too many threads assigned?

Kilong · February 2, 2014, 7:25pm

Hi there,
I am porting some code from Thrust to CUDA. I followed some examples online, but since I am just starting I am not sure whether this is correct. The following code compiles without any problem, and I ran the compiled program on an NVIDIA Tesla C2075. No error is reported at run time, but I think something is wrong with the code because the returned data is not what I expected.

#include <cuda.h>
#include <iostream>
#include "matio.h"

using namespace std;

#define N 600


int InitGPUSet()  
{  
  char GPU[100] = "GPU: ";  
  cudaDeviceProp tCard;  
  int num = 0;  
  if (cudaSuccess == cudaGetDeviceCount(&num))  
  {  
    for (int i = 0; i < num; ++ i)  
    {  
      cudaSetDevice(i);  
      cudaGetDeviceProperties(&tCard, i);  
      puts(strcat(GPU , tCard.name));
     }  
   }  
   else return 0;  
   return 1;  
}

__global__ void findStd(double A0, double B0, double *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  double A[100000], B[100000];
  double phi=0, sum=0.0, sum2=0.0, mean=0.0, slope=0.0, stddev=0.0;

  A[0] = A0;
  B[0] = B0;

  for (int n=0; n<100000-1; n++)
  {
    A[n+1] = A[n] + B[n];
    B[n+1] = B[n] + A[n+1];

    sum += A[n];
    sum2 += A[n]*A[n];
  }
  mean = sum/100000.0;
  stddev = (sum2 - 2*sum*mean + 100000.0*mean*mean)/(double)100000.0;
  if (stddev>=5) data[y+x*N] = 255.0;
  else data[y+x*N] = -12.0;
}

int main(void)
{
  if(!InitGPUSet())  puts("device is not ready!");  
  else  
  {  
    int ret=0;
    double data[N*N];
    cudaEvent_t start, end;

    cout << "staring ..." << endl;
    cudaEventCreate(&start);
    cudaEventCreate(&end);
    cudaEventRecord(start,0);
    
    for (double ii=0; ii<=4; ii+=0.01)
    {
      for (double jj=-1.5; jj<=1.5; jj+=0.01) 
      {
        findStd<<<N, N>>>(ii, jj, &data[0]);
        cudaDeviceSynchronize();
      }
    }
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);
    float elapsed_time;
    cudaEventElapsedTime(&elapsed_time, start, end);
    cout << "Time: " << elapsed_time/1000 << " sec!" << endl;
  }
}

Here is a brief introduction to what the code does. I have a recurrence relation involving A and B. For a given initial condition (A[0] and B[0]), I iterate it 100000 times in the given way, which yields a 100000-element array for each of A and B. I then calculate the standard deviation (stddev) of A and, depending on the stddev, set the value of the corresponding cell in an output array. In total I have 600x600 initial conditions to test, so I have to run the above function 600x600 times. To make it parallel, I set up the above algorithm as a kernel and set both the number of blocks in the grid and the number of threads per block to 600. But after I run the code, I find that the returned array (data) contains seemingly random numbers instead. There must be something wrong, but I cannot tell what is going on due to my limited experience in CUDA programming.
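Editorial sketch (not from the original poster): the statistic described above only needs the running sums, so the two 100000-element per-thread arrays can be replaced by two scalars. The helper name recurrenceStd and its count parameter are hypothetical. The variance expression sum2/count - mean*mean is algebraically equal to the expression in the post. Unlike the posted loop, which accumulates only the first 99999 terms and never takes a square root, this sketch assumes the intent is the population standard deviation over all count terms.

```cuda
#include <cmath>

// Hypothetical helper: population standard deviation of the first `count`
// terms of A, where A[n+1] = A[n] + B[n] and B[n+1] = B[n] + A[n+1].
// Only the current A and B values are kept, so no per-thread arrays are needed.
__host__ __device__ double recurrenceStd(double a0, double b0, int count)
{
  double a = a0, b = b0;
  double sum = 0.0, sum2 = 0.0;

  for (int n = 0; n < count; ++n)
  {
    sum  += a;       // accumulate A[n]
    sum2 += a * a;   // accumulate A[n]^2
    a = a + b;       // A[n+1] = A[n] + B[n]
    b = b + a;       // B[n+1] = B[n] + A[n+1]
  }

  double mean = sum / count;
  double var  = sum2 / count - mean * mean;
  // Clamp tiny negative values caused by rounding before taking the root.
  return sqrt(var > 0.0 ? var : 0.0);
}
```

Inside a kernel, each thread could call such a function with its own initial condition and then write 255.0 or -12.0 to its cell, as in the original code.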

njuffa · February 2, 2014, 7:42pm

As a starting point I would suggest checking the status return of every CUDA API call and every kernel launch. Are any errors reported when you do that? For checking errors after a kernel launch, you could use the macro I posted in the following forum thread:

How to debug kernel throwing an exception? (CUDA Programming and Performance, NVIDIA Developer Forums)

Kilong · February 2, 2014, 7:40pm

Thanks. I copied and pasted the macro into my code and invoke it after the kernel launch as follows:

findStd<<<N, N>>>(ii, jj, &data[0]);
CHECK_LAUNCH_ERROR();

and after running, it shows

Cuda error in file ‘test.cu’ in line 88: invalid argument

njuffa · February 2, 2014, 7:51pm

You would also want to add status checking to each CUDA API call. The CHECK_LAUNCH_ERROR() macro is likely reporting an error on a previous CUDA API call. You would want to find out which call is the one that has the invalid argument, then debug from there.
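Editorial sketch of what per-call status checking might look like. The macro name CUDA_CHECK and the function countDevices are hypothetical and do not come from the thread; only the CUDA runtime calls themselves (cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaGetErrorString) are real API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wraps a single CUDA runtime call and aborts with
// file/line information if the call does not return cudaSuccess.
#define CUDA_CHECK(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (cudaSuccess != err_) {                                       \
      fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",  \
              __FILE__, __LINE__, cudaGetErrorString(err_));         \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

// Prints the name of every visible device and returns how many there are.
// Every API call is checked individually, so a failure is reported at the
// call that caused it rather than at a later, unrelated check.
int countDevices()
{
  int num = 0;
  CUDA_CHECK(cudaGetDeviceCount(&num));
  for (int i = 0; i < num; ++i)
  {
    cudaDeviceProp prop;
    CUDA_CHECK(cudaSetDevice(i));
    CUDA_CHECK(cudaGetDeviceProperties(&prop, i));
    printf("GPU %d: %s\n", i, prop.name);
  }
  return num;
}
```

Wrapping each call this way narrows an error such as "invalid argument" down to one specific line.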

Kilong · February 2, 2014, 7:54pm

OK, to simplify the problem, I removed all unrelated CUDA API calls and got the following code:

#include <cuda.h>
#include <iostream>
#include "matio.h"

using namespace std;

#define N 600

    // Macro to catch CUDA errors in kernel launches
    #define CHECK_LAUNCH_ERROR() \
    do { \
    /* Check synchronous errors, i.e. pre-launch */ \
    cudaError_t err = cudaGetLastError(); \
    if (cudaSuccess != err) { \
    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
    __FILE__, __LINE__, cudaGetErrorString(err) ); \
    exit(EXIT_FAILURE); \
    } \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */ \
    err = cudaThreadSynchronize(); \
    if (cudaSuccess != err) { \
    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
    __FILE__, __LINE__, cudaGetErrorString( err) ); \
    exit(EXIT_FAILURE); \
    } \
    } while (0)

int InitGPUSet()  
{  
  char GPU[100] = "GPU: ";  
  cudaDeviceProp tCard;  
  int num = 0;  
  if (cudaSuccess == cudaGetDeviceCount(&num))  
  {  
    for (int i = 0; i < num; ++ i)  
    {  
      cudaSetDevice(i);  
      cudaGetDeviceProperties(&tCard, i);  
      puts(strcat(GPU , tCard.name));
     }  
   }  
   else return 0;  
   return 1;  
}

__global__ void findStd(double A0, double B0, double *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  double A[100000], B[100000];
  double phi=0, sum=0.0, sum2=0.0, mean=0.0, slope=0.0, stddev=0.0;

  A[0] = A0;
  B[0] = B0;

  for (int n=0; n<100000-1; n++)
  {
    A[n+1] = A[n] + B[n];
    B[n+1] = B[n] + A[n+1];

    sum += A[n];
    sum2 += A[n]*A[n];
  }
  mean = sum/100000.0;
  stddev = (sum2 - 2*sum*mean + 100000.0*mean*mean)/(double)100000.0;
  if (stddev>=5) data[y+x*N] = 255.0;
  else data[y+x*N] = -12.0;
}

int main(void)
{
  if(!InitGPUSet())  
  {
    puts("device is not ready!");  
    cout << "error" << endl;
  }
  else  
  {  
    double data[N*N];
    findStd<<<2, 2>>>(0, 0, &data[0]);
    CHECK_LAUNCH_ERROR();
  }
}

But it still reports “Cuda error in file ‘test1.cu’ in line 81: invalid argument”, where line 81 is where I invoke “CHECK_LAUNCH_ERROR()”.

njuffa · February 2, 2014, 8:10pm

Have you looked at the status returned by the CUDA API calls inside InitGPUSet()?

You may also want to experiment with smaller array sizes in the kernel,

double A[100000], B[100000];

although I would expect a different error message if there were a problem with those (“out of resources”).

Kilong · February 2, 2014, 8:13pm

I checked the status of each CUDA call in InitGPUSet; they all return cudaSuccess. Then I reduced the size of A and B to 1000, and the error from CHECK_LAUNCH_ERROR() became

Cuda error in file ‘test.cu’ in line 83: unspecified launch failure.

I also tried removing the following piece of code from the kernel, and the error went away:

if (stddev>=5) data[y+x*N] = 255.0;
  else data[y+x*N] = -12.0;

So why does this code cause a problem? Here N is 600, the size of data is initialized to 600x600, and x and y are in the range 0 to 599, so y+x*N should be in the range 0 to (600*600-1).

njuffa · February 2, 2014, 8:53pm

If you change the array sizes without adjusting the rest of the code, you will likely get out-of-bounds memory accesses (the cuda-memcheck tool can help you find where they occur), and just as a host program would segfault in such a case, the GPU returns a ULF (unspecified launch failure).

An alternative way to get started programming with CUDA (or any new programming environment, for that matter) may be to start with a minimal working application and then keep adding to it. With that approach you would start with a working program at each step. The book “CUDA by Example” could be a good starting point.

Kilong · February 2, 2014, 10:43pm

Thanks for the reply. I think I figured out what’s going on. I read an article online, and it seems that I need to allocate memory for data in device space first, but I was just passing a data array that was allocated in host space.
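Editorial sketch of the fix described in this last post, under stated assumptions: the output buffer is allocated with cudaMalloc, the kernel receives the device pointer, and the result is copied back with cudaMemcpy. The kernel fillCells is a hypothetical stand-in that only writes each cell's index, and runOnDevice is a hypothetical wrapper; neither appears in the thread. The host buffer is a std::vector rather than the stack array from the post, a choice made here to avoid placing about 2.9 MB on the stack.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define N 600

// Hypothetical stand-in kernel: one block per row, one thread per column,
// using the same indexing as the kernel in the thread.
__global__ void fillCells(double *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  data[y + x * N] = (double)(y + x * N);
}

// Allocates device memory, launches the kernel on it, and copies the
// result back into `host`. Returns true if every step succeeded.
bool runOnDevice(std::vector<double> &host)
{
  host.resize(N * N);
  size_t bytes = host.size() * sizeof(double);

  double *d_data = NULL;
  if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return false;

  fillCells<<<N, N>>>(d_data);                 // pass the DEVICE pointer
  cudaError_t err = cudaGetLastError();        // launch (pre-execution) errors
  if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
  if (err == cudaSuccess)
    err = cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);

  cudaFree(d_data);
  if (err != cudaSuccess)
  {
    fprintf(stderr, "Cuda error: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}
```

A caller would declare a std::vector<double>, pass it to runOnDevice, and read the 600x600 results from it when the function returns true.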
