Cannot resolve the error when running a multi-block, multi-thread kernel

Hi there,
I am running CUDA code on a Tesla C2075 card. I ran the CUDA sample that lists the hardware parameters, and it reports that the card supports up to 1024x1024x64 blocks and that each block supports up to 1024 threads (e.g. a 32x32 thread dimension). I tried the following code, which uses 600x600 blocks and 32x32 threads per block, but it fails with the error "too many resources requested for launch", and I don't know why.

#include <cuda.h>
#include <cstdio>   // fprintf, puts
#include <cstring>  // strcat, strcmp
#include <iostream>
#include "cuPrintf.cu"

using namespace std;

#define PI (4.0*atan(1.0))  // parenthesized so the macro expands safely in expressions

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
 /* Check synchronous errors, i.e. pre-launch */ \
 cudaError_t err = cudaGetLastError(); \
 if (cudaSuccess != err) { \
 fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
 __FILE__, __LINE__, cudaGetErrorString(err) ); \
 exit(EXIT_FAILURE); \
 } \
 /* Check asynchronous errors, i.e. kernel failed (ULF) */ \
 err = cudaThreadSynchronize(); \
 if (cudaSuccess != err) { \
 fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
 __FILE__, __LINE__, cudaGetErrorString( err) ); \
 exit(EXIT_FAILURE); \
 } \
} while (0)

int InitGPUSet()
{
  char GPU[100] = "GPU: ";
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      puts(strcat(GPU, tCard.name));
    }
  }
  else return 0;
  return 1;
}

bool cuPrintInit()
{
  cudaError_t err = cudaPrintfInit();
  if (cudaSuccess != err) return false;
  return true;
}

__global__ void test(void)
{
  unsigned int x = blockIdx.x;
  unsigned int y = blockIdx.y;
  unsigned int w = threadIdx.x;
  unsigned int z = threadIdx.y;
  double X[1000], Y[1000];
  double phi = 0;
  X[0] = x*0.1;
  Y[0] = y*0.1;

  cuPrintf("blockX=%d blockY=%d threadX=%d threadY=%d %d\n", x, y, w, z, (int)(sizeof(X)/sizeof(double)));
  for (int n = 0; n < 999; n++)  // n < 999 so that index n+1 stays in bounds
  {
    Y[n+1] = sin(X[n] + phi);
    X[n+1] = X[n];
  }
}

int main(void)
{
  if(!InitGPUSet())  puts("device is not ready!");  
  else if(!cuPrintInit())  puts("device is not ready!");  
  else  
  {  
    test<<<dim3(600, 600) , dim3(32, 32)>>>();
    CHECK_LAUNCH_ERROR();

    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
  }
}

What really confuses me is that if I remove the line `Y[n+1] = sin(X[n] + phi);` from the loop in the kernel, or replace it with `Y[n+1] = 1;`, the code runs without any error. Do you have any idea what causes the error? Thanks.

Are you kidding me?
double X[1000], Y[1000]???

Of course the code runs when you set that line to a constant: the compiler then optimizes the two arrays away, so they are never allocated.

X and Y would be allocated in registers, which are very few. You will be lucky to run one thread per block. For this case you have to run the loop from 0 to n in one block and do a basic reduction, computing in several stages. In the first stage you add element [0] to [500], [1] to [501], [2] to [502], and so on. Now you have 500 elements instead of 1000, and you do it again and again until you get down to 32 elements. It works best when the count is a power of 2, or close to one, because at each step you halve the number of elements.
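
To sketch what I mean (a minimal example, not tuned; the kernel name reduceSum, the 512-thread block size, and the in/out buffers are made up for illustration):

// A minimal sketch of the staged pairwise reduction described above.
// Hypothetical example: one block of 512 threads sums 1000 doubles.
__global__ void reduceSum(const double* in, double* out, int n)
{
  __shared__ double s[512];
  int tid = threadIdx.x;

  // Each thread loads up to two elements, so 512 threads cover 1000 inputs.
  double v = 0.0;
  if (tid < n)        v += in[tid];
  if (tid + 512 < n)  v += in[tid + 512];
  s[tid] = v;
  __syncthreads();

  // Halve the number of active elements at each step: 512 -> 256 -> ... -> 1.
  for (int stride = 256; stride > 0; stride /= 2)
  {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }

  if (tid == 0) *out = s[0];
}

It would be launched as reduceSum<<<1, 512>>>(d_in, d_out, 1000); with d_in and d_out allocated via cudaMalloc.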

I didn't know that arrays allocated inside a kernel end up in registers. So does that mean the problem will be solved if I allocate them in device memory instead?

Yes. And there is also shared memory. You should use all three types of memory to get maximum performance.
All variables declared in a kernel use registers. Shared-memory arrays have __shared__ in front, and global arrays are declared outside of kernels. Making your code fast depends on your problem, and the code you posted does not make sense, since X[n+1] = X[n];. In general I would use shared memory and use one block for the loop n = 0:1000.
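
To illustrate the three kinds of declarations (a rough sketch with made-up names, not specific to your problem):

__device__ double gData[256];  // global memory: one copy, visible to all blocks

__global__ void memKinds(void)
{
  double r;                    // plain kernel variable: a register (may spill to local memory)
  __shared__ double s[256];    // shared memory: one copy per block, visible to its threads

  r = gData[threadIdx.x];      // assumes blockDim.x <= 256
  s[threadIdx.x] = r;
  __syncthreads();
  gData[threadIdx.x] = s[threadIdx.x] * 2.0;
}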

I see. But using shared memory means a variable declared as shared can also be modified by other threads. About my code: I know it looks strange to have X[n+1] = X[n]; what I was trying to show is that even that simple code triggers the error (though I didn't know about the register issue back then). In the modified code shown below

#include <cuda.h>
#include <cstdio>   // fprintf, sprintf, puts
#include <cstring>  // strcat, strcmp
#include <iostream>
#include "cuPrintf.cu"

using namespace std;

#define PI (4.0*atan(1.0))
#define dx 0.1
#define dy 0.2
#define dK 0.3
#define dA 0.4
#define MS 32

__constant__ double D[2];

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
  /* Check synchronous errors, i.e. pre-launch */ \
  cudaError_t err = cudaGetLastError(); \
  if (cudaSuccess != err) { \
    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
             __FILE__, __LINE__, cudaGetErrorString(err) ); \
    exit(EXIT_FAILURE); \
  } \
  /* Check asynchronous errors, i.e. kernel failed (ULF) */ \
  err = cudaThreadSynchronize(); \
  if (cudaSuccess != err) { \
    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
             __FILE__, __LINE__, cudaGetErrorString( err) ); \
    exit(EXIT_FAILURE); \
  } \
} while (0)

int InitGPUSet()
{
  char GPU[100] = "GPU: ";
  char str[1000];
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      puts(strcat(GPU, tCard.name));
      sprintf(str, "Total registers per block: %d\n", tCard.regsPerBlock);
      puts(strcat(GPU, str));
    }
  }
  else return 0;
  return 1;
}

bool cuPrintInit()
{
  cudaError_t err = cudaPrintfInit();
  if (cudaSuccess != err) return false;
  return true;
}

__global__ void test(const int nk, const int ma)
{
  double X, Y, Z, A, K;
  double sum = 0.0, sum2 = 0.0;
  double mean = 0.0, stddev = 0.0, slope = 0;
  X = blockIdx.x*dx;
  Y = blockIdx.y*dy;
  K = (double)(nk*MS + threadIdx.x)*dK;
  A = (double)(ma*MS + threadIdx.y)*dA;

  for (int n = 0; n < 1000; n++)
  {
    Z = Y;
    Y = Y + K*(X + D[n%2]);
    X = X + Y;
    Z = Y - Z;
    sum += Z;
    sum2 += Z*Z;
    if ((n%1000 == 0) && (n != 0))
    {
      slope = (Y - (-PI + blockIdx.y*dy))/(double)n;
      mean = sum/(double)n;
      stddev = sqrt( (sum2 - 2*sum*mean + (double)(n-1)*mean*mean)/(double)(n-1) );
      if ((stddev > 1) || (slope <= 0.1)) {break;}
    }
  }
}

int main(void)
{
  if(!InitGPUSet()) puts("device is not ready!");
  else if(!cuPrintInit()) puts("device is not ready!");
  else
  {
    // __constant__ memory cannot be assigned directly from host code;
    // copy the values in with cudaMemcpyToSymbol instead.
    double hD[2] = {1, -1};
    cudaMemcpyToSymbol(D, hD, sizeof(hD));

    test<<<dim3(600, 600) , dim3(MS , MS)>>>(1,1);
    CHECK_LAUNCH_ERROR();

    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
  }
}

I modified the recurrence relation, and when I compile the code with --ptxas-options=-v it reports 42 registers used per thread. With 32x32 = 1024 threads per block, that would need 42 x 1024 = 43008 registers, more than the 32768 registers available per block. So the largest block I can use is about 32768/42 ≈ 780 threads, i.e. roughly 27x27, is that correct? I tested it, and the error no longer shows up.
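
(One way to verify that arithmetic at runtime, if useful, is to query the kernel's register count with cudaFuncGetAttributes. A small host-side sketch; it has to live in the same .cu file as the kernel, and printRegisterLimit is a made-up helper name:)

// Hypothetical helper: derive the register-limited block size for `test`.
// Must be compiled in the same .cu file that defines the kernel.
void printRegisterLimit(void)
{
  cudaDeviceProp prop;
  cudaFuncAttributes attr;
  cudaGetDeviceProperties(&prop, 0);
  cudaFuncGetAttributes(&attr, test);

  // e.g. 32768 / 42 = 780 threads, so a 27x27 block fits but 32x32 does not
  int maxThreads = prop.regsPerBlock / attr.numRegs;
  printf("regs/thread = %d, register-limited max threads/block = %d\n",
         attr.numRegs, maxThreads);
}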

As I understand from your reply, if I want to use all 32x32 threads, I should reduce the number of registers used and try to use shared or global variables instead. But in my case each thread has to keep its own local copy of all the variables in the kernel, because of the recurrence. So I should define those variables as arrays in global (device) memory and access the corresponding element by thread ID, e.g. declare X[1024] and, inside the kernel, use X[threadIdx.x*32 + threadIdx.y] instead of X?

You can use a compiler flag to specify the maximum number of registers per thread; the remaining variables will spill to local memory. These are competing optimizations: lower the occupancy by running fewer threads that each use more registers, or keep the occupancy per SM high and accept some spilling. Older versions of CUDA have the flag --maxrregcount, while with CUDA 5.5 you can use __launch_bounds__ directly in the code on a specific kernel to force a certain number of blocks per SM.
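
For example, you can cap the whole file with nvcc --ptxas-options=-v -maxrregcount=32 ..., or annotate just the kernel (a sketch; the numbers are illustrative, not tuned):

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// ptxas to limit register use so a 1024-thread block is guaranteed to launch;
// the excess variables spill to local memory.
__global__ void
__launch_bounds__(1024, 1)
test(const int nk, const int ma)
{
  // ... same body as the kernel above ...
}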

You can try to place some of the variables in shared memory, but for your code it does not look like it would change anything.
There are lines I am not sure are OK. You use the break statement; since threads execute in groups of 32 (warps), I wonder what happens to the rest of the threads in that warp.

Try to get rid of n%2; the modulo operation uses extra registers. Better to use an if, and have 32 threads use D[0] and the next 32 use D[1].
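
One way to read that suggestion, sketching just the loop skeleton (the toggle variable k is mine, and the omitted lines are unchanged):

// Sketch: alternate D[0]/D[1] with a toggled index instead of n%2.
int k = 0;
for (int n = 0; n < 1000; n++)
{
  Z = Y;
  Y = Y + K*(X + D[k]);
  X = X + Y;
  k = 1 - k;   // flips 0 -> 1 -> 0 ... without a modulo
  // ... rest of the loop body (Z, sum, sum2, the convergence check) as before ...
}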