Cannot resolve the error when running a multi-block, multi-thread kernel

Hi there,
I am running CUDA code on a Tesla C2075 card. I ran the CUDA sample that lists the hardware parameters, and it reports that the card supports up to 1024x1024x64 blocks and that each block supports up to 1024 threads (e.g. a 32x32 thread dimension). I tried the following code, which uses 600x600 blocks and 32x32 threads per block, but it fails with the error "too many resources requested for launch", and I don't know why.

#include <cuda.h>
#include <cstdio>   // fprintf, puts
#include <cstring>  // strcat, strcmp
#include <iostream>
#include "cuPrintf.cu"

using namespace std;

#define PI (4.0*atan(1.0))  // parenthesized so the macro expands safely in expressions

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
 /* Check synchronous errors, i.e. pre-launch */ \
 cudaError_t err = cudaGetLastError(); \
 if (cudaSuccess != err) { \
 fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
 __FILE__, __LINE__, cudaGetErrorString(err) ); \
 exit(EXIT_FAILURE); \
 } \
 /* Check asynchronous errors, i.e. kernel failed (ULF) */ \
 err = cudaThreadSynchronize(); \
 if (cudaSuccess != err) { \
 fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
 __FILE__, __LINE__, cudaGetErrorString( err) ); \
 exit(EXIT_FAILURE); \
 } \
} while (0)

int InitGPUSet()
{
  char GPU[100] = "GPU: ";
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      puts(strcat(GPU, tCard.name));
    }
  }
  else return 0;
  return 1;
}

bool cuPrintInit()
{
  cudaError_t err = cudaPrintfInit();
  if (cudaSuccess != err) return false;
  return true;
}

__global__ void test(void)
{
  unsigned int x = blockIdx.x;
  unsigned int y = blockIdx.y;
  unsigned int w = threadIdx.x;
  unsigned int z = threadIdx.y;
  double X[1000], Y[1000];
  double phi = 0;
  X[0] = x*0.1;
  Y[0] = y*0.1;

  cuPrintf("blockX=%d blockY=%d threadX=%d threadY=%d %d\n", x, y, w, z, (int)(sizeof(X)/sizeof(double)));
  for (int n = 0; n < 999; n++)  // n < 999 so that index n+1 stays in bounds
  {
    Y[n+1] = sin(X[n] + phi);
    X[n+1] = X[n];
  }
}

int main(void)
{
  if(!InitGPUSet())  puts("device is not ready!");  
  else if(!cuPrintInit())  puts("device is not ready!");  
  else  
  {  
    test<<<dim3(600, 600) , dim3(32, 32)>>>();
    CHECK_LAUNCH_ERROR();

    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
  }
}

What really confuses me is that if I remove the line `Y[n+1] = sin(X[n] + phi);` from the loop in the kernel, or replace it with `Y[n+1] = 1;`, the code runs without any error. Do you have any idea what causes the error? Thanks.

Are you kidding me?
double X[1000], Y[1000]???

Of course the code runs when you set that line to a constant: the compiler then optimizes the two arrays away, so they are never allocated.

X and Y would be allocated in registers, which are very few. You will be lucky to run one thread per block. For this case you have to run the loop from 0 to n in one block and do a basic reduction, computing in several stages. In the first stage you add element [0] to [500], [1] to [501], [2] to [502], and so on. Now you have 500 elements instead of 1000, and you do it again and again until you get down to 32 elements. It works best when the count is a power of 2, or close to one, because at each step you halve the number of elements.
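
To sketch what I mean (a minimal example, not tuned; the kernel name reduceSum, the 512-thread block size, and the in/out buffers are made up for illustration):

// A minimal sketch of the staged pairwise reduction described above.
// Hypothetical example: one block of 512 threads sums 1000 doubles.
__global__ void reduceSum(const double* in, double* out, int n)
{
  __shared__ double s[512];
  int tid = threadIdx.x;

  // Each thread loads up to two elements, so 512 threads cover 1000 inputs.
  double v = 0.0;
  if (tid < n)        v += in[tid];
  if (tid + 512 < n)  v += in[tid + 512];
  s[tid] = v;
  __syncthreads();

  // Halve the number of active elements at each step: 512 -> 256 -> ... -> 1.
  for (int stride = 256; stride > 0; stride /= 2)
  {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }

  if (tid == 0) *out = s[0];
}

It would be launched as reduceSum<<<1, 512>>>(d_in, d_out, 1000); with d_in and d_out allocated via cudaMalloc.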

I didn't know that arrays allocated inside a kernel end up in registers. So does that mean the problem will be solved if I allocate them in device memory instead?

Yes. And there is also shared memory. You should use all three types of memory to get maximum performance.
All variables declared in a kernel use registers. Shared-memory arrays have __shared__ in front, and global arrays are declared outside of kernels. Making your code fast depends on your problem, and the code you posted does not make sense, since X[n+1] = X[n];. In general I would use shared memory and use one block for the loop n = 0:1000.
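
To illustrate the three kinds of declarations (a rough sketch with made-up names, not specific to your problem):

__device__ double gData[256];  // global memory: one copy, visible to all blocks

__global__ void memKinds(void)
{
  double r;                    // plain kernel variable: a register (may spill to local memory)
  __shared__ double s[256];    // shared memory: one copy per block, visible to its threads

  r = gData[threadIdx.x];      // assumes blockDim.x <= 256
  s[threadIdx.x] = r;
  __syncthreads();
  gData[threadIdx.x] = s[threadIdx.x] * 2.0;
}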

I see. But using shared memory means a variable declared as shared can also be modified by other threads. About my code: I know it looks strange to have X[n+1] = X[n]; what I was trying to show is that even that simple code triggers the error (though I didn't know about the register issue back then). In the modified code shown below

#include <cuda.h>
#include <cstdio>   // fprintf, sprintf, puts
#include <cstring>  // strcat, strcmp
#include <iostream>
#include "cuPrintf.cu"

using namespace std;

#define PI (4.0*atan(1.0))
#define dx 0.1
#define dy 0.2
#define dK 0.3
#define dA 0.4
#define MS 32

__constant__ double D[2];

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
  /* Check synchronous errors, i.e. pre-launch */ \
  cudaError_t err = cudaGetLastError(); \
  if (cudaSuccess != err) { \
    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
             __FILE__, __LINE__, cudaGetErrorString(err) ); \
    exit(EXIT_FAILURE); \
  } \
  /* Check asynchronous errors, i.e. kernel failed (ULF) */ \
  err = cudaThreadSynchronize(); \
  if (cudaSuccess != err) { \
    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
             __FILE__, __LINE__, cudaGetErrorString( err) ); \
    exit(EXIT_FAILURE); \
  } \
} while (0)

int InitGPUSet()
{
  char GPU[100] = "GPU: ";
  char str[1000];
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      puts(strcat(GPU, tCard.name));
      sprintf(str, "Total registers per block: %d\n", tCard.regsPerBlock);
      puts(strcat(GPU, str));
    }
  }
  else return 0;
  return 1;
}

bool cuPrintInit()
{
  cudaError_t err = cudaPrintfInit();
  if (cudaSuccess != err) return false;
  return true;
}

__global__ void test(const int nk, const int ma)
{
  double X, Y, Z, A, K;
  double sum = 0.0, sum2 = 0.0;
  double mean = 0.0, stddev = 0.0, slope = 0;
  X = blockIdx.x*dx;
  Y = blockIdx.y*dy;
  K = (double)(nk*MS + threadIdx.x)*dK;
  A = (double)(ma*MS + threadIdx.y)*dA;

  for (int n = 0; n < 1000; n++)
  {
    Z = Y;
    Y = Y + K*(X + D[n%2]);
    X = X + Y;
    Z = Y - Z;
    sum += Z;
    sum2 += Z*Z;
    if ((n%1000 == 0) && (n != 0))
    {
      slope = (Y - (-PI + blockIdx.y*dy))/(double)n;
      mean = sum/(double)n;
      stddev = sqrt( (sum2 - 2*sum*mean + (double)(n-1)*mean*mean)/(double)(n-1) );
      if ((stddev > 1) || (slope <= 0.1)) {break;}
    }
  }
}

int main(void)
{
  if(!InitGPUSet()) puts("device is not ready!");
  else if(!cuPrintInit()) puts("device is not ready!");
  else
  {
    // __constant__ memory cannot be assigned directly from host code;
    // copy the values in with cudaMemcpyToSymbol instead.
    double hD[2] = {1, -1};
    cudaMemcpyToSymbol(D, hD, sizeof(hD));

    test<<<dim3(600, 600) , dim3(MS , MS)>>>(1,1);
    CHECK_LAUNCH_ERROR();

    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
  }
}

I modified the recurrence relation, and when I compile the code with --ptxas-options=-v it reports 42 registers used per thread. With 32x32 = 1024 threads per block, that would need 42 x 1024 = 43008 registers, more than the 32768 registers available per block. So the largest block I can use is about 32768/42 ≈ 780 threads, i.e. roughly 27x27, is that correct? I tested it, and the error no longer shows up.
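
(One way to verify that arithmetic at runtime, if useful, is to query the kernel's register count with cudaFuncGetAttributes. A small host-side sketch; it has to live in the same .cu file as the kernel, and printRegisterLimit is a made-up helper name:)

// Hypothetical helper: derive the register-limited block size for `test`.
// Must be compiled in the same .cu file that defines the kernel.
void printRegisterLimit(void)
{
  cudaDeviceProp prop;
  cudaFuncAttributes attr;
  cudaGetDeviceProperties(&prop, 0);
  cudaFuncGetAttributes(&attr, test);

  // e.g. 32768 / 42 = 780 threads, so a 27x27 block fits but 32x32 does not
  int maxThreads = prop.regsPerBlock / attr.numRegs;
  printf("regs/thread = %d, register-limited max threads/block = %d\n",
         attr.numRegs, maxThreads);
}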

As I understand from your reply, if I want to use all 32x32 threads, I should reduce the number of registers used and try to use shared or global variables instead. But in my case each thread has to keep its own local copy of all the variables in the kernel, because of the recurrence. So I should define those variables as arrays in global (device) memory and access the corresponding element by thread ID, e.g. declare X[1024] and, inside the kernel, use X[threadIdx.x*32 + threadIdx.y] instead of X?

You can use a compiler flag to specify the maximum number of registers per thread; the remaining variables will spill to local memory. These are competing optimizations: lower the occupancy by running fewer threads that each use more registers, or keep the occupancy per SM high and accept some spilling. Older versions of CUDA have the flag --maxrregcount, while with CUDA 5.5 you can use __launch_bounds__ directly in the code on a specific kernel to force a certain number of blocks per SM.
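
For example, you can cap the whole file with nvcc --ptxas-options=-v -maxrregcount=32 ..., or annotate just the kernel (a sketch; the numbers are illustrative, not tuned):

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// ptxas to limit register use so a 1024-thread block is guaranteed to launch;
// the excess variables spill to local memory.
__global__ void
__launch_bounds__(1024, 1)
test(const int nk, const int ma)
{
  // ... same body as the kernel above ...
}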

You can try to place some of the variables in shared memory, but for your code it does not look like it would change anything.
There are lines I am not sure are OK. You use the break statement; since threads execute in groups of 32 (warps), I wonder what happens to the rest of the threads in that warp.

Try to get rid of n%2; the modulo operation uses extra registers. Better to use an if, and have 32 threads use D[0] and the next 32 use D[1].
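
One way to read that suggestion, sketching just the loop skeleton (the toggle variable k is mine, and the omitted lines are unchanged):

// Sketch: alternate D[0]/D[1] with a toggled index instead of n%2.
int k = 0;
for (int n = 0; n < 1000; n++)
{
  Z = Y;
  Y = Y + K*(X + D[k]);
  X = X + Y;
  k = 1 - k;   // flips 0 -> 1 -> 0 ... without a modulo
  // ... rest of the loop body (Z, sum, sum2, the convergence check) as before ...
}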