The following test case was extracted from a port of a recursive CPU algorithm that multiplies several 4 x 4 matrices. It's surely not the best way to do this, but the point is that I'm trying to understand the CUDA status reported at the end of the program:
#include <cstdio>
#include <cstdlib>

__device__ void _rec(int N, double Lout[4][4]) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // unused in this cut-down test case
    double x1 = 1.0;
    double x2 = -0.5;
    double x3 = 0.25;
    double x4 = -1.5;
    // Leaving out some of the sin and cos -> no error
    double a1 = sin(x1);
    double a2 = sin(x2);
    double a3 = sin(x3);
    double a4 = cos(x3 * x4);
    double a5 = cos(x1 * x3);
    double a6 = cos(x1 * x4);
    double a7 = cos(x2 * x3);
    double a8 = cos(x2 * x4);
    // Replacing most "/" with "*" -> no error
    double a9 = cos(x1 / x2);
    double a10 = sin(x1 / x3);
    double a11 = sin(x1 / x4);
    double a12 = sin(x2 / x3);
    double a13 = sin(x2 / x4);
    double a14 = sin(x3 / x4);
    double a15 = x1 * x2 * x3;
    double a16 = x2 * x3 * x4;
    double L1[4][4];
    double L2[4][4];
    L1[0][0] = a1;
    L1[0][1] = a2;
    L1[0][2] = a3;
    L1[0][3] = a4;
    L1[1][0] = a5;
    L1[1][1] = a6;
    L1[1][2] = a7;
    L1[1][3] = a8;
    L1[2][0] = a9;
    L1[2][1] = a10;
    L1[2][2] = a11;
    L1[2][3] = a12;
    L1[3][0] = a13;
    L1[3][1] = a14;
    L1[3][2] = a15;
    L1[3][3] = a16;
    L2[0][0] = a1;
    L2[0][1] = a2;
    L2[0][2] = a3;
    L2[0][3] = a4;
    L2[1][0] = a5;
    L2[1][1] = a6;
    L2[1][2] = a7;
    L2[1][3] = a8;
    L2[2][0] = a9;
    L2[2][1] = a10;
    L2[2][2] = a11;
    L2[2][3] = a12;
    L2[3][0] = a13;
    L2[3][1] = a14;
    L2[3][2] = a15;
    L2[3][3] = a16;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            for (int k = 0; k < 4; k++) {
                Lout[i][j] += L1[i][k] * L2[k][j];
            }
        }
    }
    // This recursion call is necessary to trigger the error, although N == 1.
    if (N > 1) _rec(N - 1, Lout);
}
__global__ void do_rec(int N, double *L) {
    // The cast must be applied to L itself, not to its address:
    // (struct boost *)&L would reinterpret the pointer's own storage as the matrix.
    struct boost { double l[4][4]; } *LL = (struct boost *)L;
    _rec(N, LL->l);
}
int main(int argc, char *argv[]) {
    int n_blocks = argc > 1 ? atoi(argv[1]) : 1000;
    int n_threads = 1024;
    // cudaThreadSetLimit is deprecated; cudaDeviceSetLimit is the current API.
    cudaDeviceSetLimit(cudaLimitStackSize, 10 * 1024);
    double *L;
    cudaMalloc((void **)&L, 16 * sizeof(double));
    cudaMemset(L, 0, 16 * sizeof(double));
    // N == 1, only one recursion level
    do_rec<<<n_blocks, n_threads>>>(1, L);
    cudaDeviceSynchronize();
    printf("Cuda status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
The error message at the end is "too many resources requested for launch". From what I could gather online, I understand that the issue probably arises from the large number of local variables which need to be allocated on the stack, so that there are not enough registers to accommodate a full block of 1024 threads physically on the GPU.
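(One way to test this hypothesis, sketched below under the assumption that the do_rec kernel above is visible in the same translation unit, is to query the compiled resource usage with cudaFuncGetAttributes; the same numbers are also printed at compile time by nvcc -Xptxas -v:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void do_rec(int N, double *L);  // the kernel defined above

void report_resources() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, do_rec);
    // numRegs is registers per thread; localSizeBytes is per-thread
    // local memory, which is where the stack and spilled values live.
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("static shared per block: %zu bytes\n", attr.sharedSizeBytes);
}
```

If numRegs multiplied by the 1024 threads of a block exceeds the device's per-block register limit, the launch fails with exactly this error.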
The example above is very sensitive to code changes when it comes to reproducing the issue. For example, leaving out some of the trigonometric function calls and replacing them with their arguments results in no error. My guess is that this is due to the additional stack allocation by these functions. Moreover, replacing some of the divisions in the arguments (a9 to a14) with multiplications also makes the error vanish; this I cannot explain. Finally, although the recursive function call at the end is never executed because N == 1, leaving it out results in a successful run. I guess that since the compiler does not know this, it reserves additional stack space for this call.
Can you confirm that my assumptions about the origin of the error make sense? Do you have an explanation for the issue with the division? Also, which hardware resource limit is related to this problem? If I knew which quantity to look at, I could try to optimize for that resource and estimate how many threads can be used to make a given piece of code run.
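(As for which quantity to look at: the resources checked at launch time include registers per block, i.e. threads times registers per thread, and shared memory per block. Rather than computing the largest launchable block size by hand, the runtime can report it; a sketch, assuming the do_rec kernel above:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void do_rec(int N, double *L);  // the kernel defined above

int main() {
    // Hard per-block register limit of the current device.
    int regsPerBlock = 0;
    cudaDeviceGetAttribute(&regsPerBlock, cudaDevAttrMaxRegistersPerBlock, 0);

    // Largest block size this kernel can actually be launched with,
    // plus the minimal grid size needed for full occupancy.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, do_rec);

    printf("register limit per block: %d\n", regsPerBlock);
    printf("largest launchable block: %d threads\n", blockSize);
    return 0;
}
```

Launching do_rec with this blockSize instead of a hard-coded 1024 would avoid the error on any device.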