This code doesn't work, maybe too many threads assigned?

Hi there,
I am porting code from Thrust to CUDA. I followed some examples online, but since I am just starting I am not sure whether this is correct. The following code compiled without any problem, and I ran it on an NVIDIA Tesla C2075; no error comes out at run time, but I think there is something wrong with the code, since the returned data is not what I expected.

#include <cuda.h>
#include <cstdio>    // puts
#include <cstring>   // strcat
#include <iostream>
#include "matio.h"

using namespace std;

#define N 600


int InitGPUSet()  
{  
  char GPU[100] = "GPU: ";  
  cudaDeviceProp tCard;  
  int num = 0;  
  if (cudaSuccess == cudaGetDeviceCount(&num))  
  {  
    for (int i = 0; i < num; ++ i)  
    {  
      cudaSetDevice(i);  
      cudaGetDeviceProperties(&tCard, i);  
      puts(strcat(GPU , tCard.name));
     }  
   }  
   else return 0;  
   return 1;  
}

__global__ void findStd(double A0, double B0, double *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  double A[100000], B[100000];
  double sum=0.0, sum2=0.0, mean=0.0, stddev=0.0;

  A[0] = A0;
  B[0] = B0;

  for (int n=0; n<100000-1; n++)
  {
    A[n+1] = A[n] + B[n];
    B[n+1] = B[n] + A[n+1];

    sum += A[n];
    sum2 += A[n]*A[n];
  }
  mean = sum/100000.0;
  stddev = (sum2 - 2*sum*mean + 100000.0*mean*mean)/(double)100000.0;
  if (stddev>=5) data[y+x*N] = 255.0;
  else data[y+x*N] = -12.0;
}

int main(void)
{
  if(!InitGPUSet())  puts("device is not ready!");  
  else  
  {  
    double data[N*N];
    cudaEvent_t start, end;

    cout << "starting ..." << endl;
    cudaEventCreate(&start);
    cudaEventCreate(&end);
    cudaEventRecord(start,0);
    
    for (double ii=0; ii<=4; ii+=0.01)
    {
      for (double jj=-1.5; jj<=1.5; jj+=0.01) 
      {
        findStd<<<N, N>>>(ii, jj, &data[0]);
        cudaDeviceSynchronize();
      }
    }
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);
    float elapsed_time;
    cudaEventElapsedTime(&elapsed_time, start, end);
    cout << "Time: " << elapsed_time/1000 << " sec!" << endl;
  }
}

Here is a brief introduction to what the code does. I have a recurrence relation involving A and B. For a given initial condition (A[0] and B[0]), I iterate it 100000 times in the way shown, which produces arrays of length 100000 for A and B. I then calculate the standard deviation (stddev) of A and, depending on the stddev, set the value of the corresponding cell in an output array. I have 600x600 initial conditions in total that need to be tested, so I have to run the function above 600x600 times. To make it parallel, I set the algorithm above up as a kernel, with both the grid size and the number of threads per block set to 600. But after I run the code, I find that the returned array (data) contains random numbers instead. There must be something wrong, but I cannot tell what is going on due to my limited experience in CUDA programming.

As a starting point I would suggest checking the return status of every CUDA API call and every kernel launch. Are any errors reported when you do that? For checking errors after a kernel launch, you could use the macro I posted in the following forum thread:

How to debug kernel throwing an exception? - CUDA Programming and Performance - NVIDIA Developer Forums

Thanks. I copied and pasted the macro into my code and invoke it after the kernel launch as follows:

findStd<<<N, N>>>(ii, jj, &data[0]);
CHECK_LAUNCH_ERROR();

and after running, it shows

Cuda error in file ‘test.cu’ in line 88: invalid argument

You would also want to add status checking to each CUDA API call. The CHECK_LAUNCH_ERROR() macro is likely reporting an error from a previous CUDA API call. You would want to find out which call is the one reporting the invalid argument, then debug from there.
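For example, each call can be wrapped in a small checking helper so that the first failing call identifies itself (a sketch; the name checkCuda is mine, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with file/line on any non-success status.
#define checkCuda(call)                                               \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage inside InitGPUSet(), for instance:
//   checkCuda(cudaSetDevice(i));
//   checkCuda(cudaGetDeviceProperties(&tCard, i));
```

Wrapping every runtime API call this way narrows the "invalid argument" down to a specific call site instead of it surfacing later at the launch check.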

OK, to simplify the problem I removed all unrelated CUDA API calls and got the following code:

#include <cuda.h>
#include <cstdio>    // puts
#include <cstring>   // strcat
#include <iostream>
#include "matio.h"

using namespace std;

#define N 600

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err));        \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err));        \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

int InitGPUSet()  
{  
  char GPU[100] = "GPU: ";  
  cudaDeviceProp tCard;  
  int num = 0;  
  if (cudaSuccess == cudaGetDeviceCount(&num))  
  {  
    for (int i = 0; i < num; ++ i)  
    {  
      cudaSetDevice(i);  
      cudaGetDeviceProperties(&tCard, i);  
      puts(strcat(GPU , tCard.name));
     }  
   }  
   else return 0;  
   return 1;  
}

__global__ void findStd(double A0, double B0, double *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  double A[100000], B[100000];
  double sum=0.0, sum2=0.0, mean=0.0, stddev=0.0;

  A[0] = A0;
  B[0] = B0;

  for (int n=0; n<100000-1; n++)
  {
    A[n+1] = A[n] + B[n];
    B[n+1] = B[n] + A[n+1];

    sum += A[n];
    sum2 += A[n]*A[n];
  }
  mean = sum/100000.0;
  stddev = (sum2 - 2*sum*mean + 100000.0*mean*mean)/(double)100000.0;
  if (stddev>=5) data[y+x*N] = 255.0;
  else data[y+x*N] = -12.0;
}

int main(void)
{
  if(!InitGPUSet())  
  {
    puts("device is not ready!");  
    cout << "error" << endl;
  }
  else  
  {  
    double data[N*N];
    findStd<<<2, 2>>>(0, 0, &data[0]);
    CHECK_LAUNCH_ERROR();
  }
}

But it still reports "Cuda error in file 'test1.cu' in line 81 : invalid argument", where line 81 is the line with CHECK_LAUNCH_ERROR().

Have you looked at the status returned by the CUDA API calls inside InitGPUSet()?

You may also want to experiment with smaller array sizes in the kernel,

double A[100000], B[100000];

although I would expect a different error message if there were a problem with those ("out of resources").
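In fact, since the recurrence only ever reads the previous element, the large per-thread arrays could be eliminated entirely; a sketch of the same kernel with scalar state (note that, as written in your code, the quantity called stddev is actually the variance, since no square root is taken; I have kept the formula as-is):

```cuda
__global__ void findStdScalar(double A0, double B0, double *data)
{
    unsigned int x = blockIdx.x;
    unsigned int y = threadIdx.x;
    double a = A0, b = B0;          // only the current A[n], B[n] are needed
    double sum = 0.0, sum2 = 0.0;

    for (int n = 0; n < 100000 - 1; n++)
    {
        sum  += a;                  // accumulate A[n] before advancing
        sum2 += a * a;
        double aNext = a + b;       // A[n+1] = A[n] + B[n]
        b = b + aNext;              // B[n+1] = B[n] + A[n+1]
        a = aNext;
    }
    double mean   = sum / 100000.0;
    double stddev = (sum2 - 2*sum*mean + 100000.0*mean*mean) / 100000.0;
    if (stddev >= 5) data[y + x*N] = 255.0;
    else             data[y + x*N] = -12.0;
}
```

This removes 1.6 MB of per-thread local storage, which otherwise far exceeds what the hardware can provide per thread.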

I checked the status of each CUDA call in InitGPUSet; they all return cudaSuccess. I then tried reducing the size of A and B to 1000, and the error from CHECK_LAUNCH_ERROR() became

Cuda error in file ‘test.cu’ in line 83: unspecified launch failure.

I also tried removing the following piece of code from the kernel, and the error then went away:

if (stddev>=5) data[y+x*N] = 255.0;
else data[y+x*N] = -12.0;

So why does this code cause a problem? Here N is 600, the size of data is initialized to 600x600, and x and y are in the range 0 to 599, so y+x*N should be in the range 0 to (600*600-1). Why does this cause a problem?

If you change the array sizes without adjusting the rest of the code, you will likely get out-of-bounds memory accesses (the cuda-memcheck tool can help you find where they occur), and just as a host program would throw a segfault in such a case, the GPU reports an unspecified launch failure (ULF).
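For example, tying the array size and the loop bounds to a single constant avoids that mismatch in the first place (a sketch; LEN is a name I made up, and you would run the binary under cuda-memcheck to confirm the out-of-bounds accesses are gone):

```cuda
#define LEN 1000   // one constant controls the arrays AND the loop bound

__global__ void findStd(double A0, double B0, double *data)
{
    unsigned int x = blockIdx.x;
    unsigned int y = threadIdx.x;
    double A[LEN], B[LEN];
    double sum = 0.0, sum2 = 0.0;

    A[0] = A0;
    B[0] = B0;
    for (int n = 0; n < LEN - 1; n++)   // bound follows the array size
    {
        A[n+1] = A[n] + B[n];
        B[n+1] = B[n] + A[n+1];
        sum  += A[n];
        sum2 += A[n]*A[n];
    }
    double mean   = sum / (double)LEN;
    double stddev = (sum2 - 2*sum*mean + (double)LEN*mean*mean) / (double)LEN;
    if (stddev >= 5) data[y + x*N] = 255.0;
    else             data[y + x*N] = -12.0;
}
```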

An alternative way to get started programming with CUDA (or any new programming environment, for that matter) may be to start with a minimal working application and then keep adding to it. With that approach you would have a working program at each step. The book "CUDA by Example" could be a good starting point.

Thanks for the reply. I think I figured out what's going on. I read an article online, and it seems that I need to allocate memory for data in device memory first, but I was just passing in a data array allocated in host memory.
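Something like the following sketch seems to be what I was missing (d_data is just my name for the device buffer; findStd is the kernel from above):

```cuda
#include <cuda_runtime.h>

#define N 600

int main(void)
{
    static double data[N * N];   // host copy of the results
    double *d_data = NULL;       // buffer the kernel is actually allowed to write

    cudaMalloc((void **)&d_data, N * N * sizeof(double));

    findStd<<<N, N>>>(0.0, 0.0, d_data);   // pass the DEVICE pointer, not &data[0]
    cudaDeviceSynchronize();

    // copy the results back to the host once the kernel has finished
    cudaMemcpy(data, d_data, N * N * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}
```

Each of these runtime calls should of course also have its status checked, as suggested earlier in the thread.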