cudaErrorLaunchOutOfResources(701) when launching __global__ function

I’m using CUDA 10.2 for a project (sm_61, compute_61). Here’s the relevant part of my code.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio>
#include <cstdlib>

constexpr unsigned int CUDA_MAX_THREAD_NUM = 256;
#define Block_Count(tot_thrd) (((tot_thrd) + CUDA_MAX_THREAD_NUM - 1u) / CUDA_MAX_THREAD_NUM)

static void HandleError(cudaError_t err, const char *file, int line) {
	if (err != cudaSuccess) {
		fprintf(stderr, "Error %d: \"%s\" in %s at line %d\n", int(err), cudaGetErrorString(err), file, line);
		exit(3);
	}
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

template <typename T>
__global__ void subConstructArray(T* dsts, size_t Len) {
	size_t Idi = blockIdx.x * CUDA_MAX_THREAD_NUM + threadIdx.x;
	if (Idi >= Len) return;

	new (dsts + Idi)T;
}

template <typename T>
__global__ void subDestructArray(T* dsts, size_t Len) {
	size_t Idi = blockIdx.x * CUDA_MAX_THREAD_NUM + threadIdx.x;
	if (Idi >= Len) return;

	(dsts + Idi)->~T();
}

template <typename T>
void ConstructArray(T*& dsts, size_t Len) {
	HANDLE_ERROR(cudaMalloc(&dsts, sizeof(T) * Len));
	subConstructArray<<<Block_Count(Len), CUDA_MAX_THREAD_NUM>>>
		(dsts, Len);
	cudaDeviceSynchronize();
	HANDLE_ERROR(cudaGetLastError());
}

template <typename T>
void DestructArray(T* dsts, size_t Len) {
	subDestructArray<<<Block_Count(Len), CUDA_MAX_THREAD_NUM>>>
		(dsts, Len);
	cudaDeviceSynchronize();
	HANDLE_ERROR(cudaGetLastError());  //cudaErrorLaunchOutOfResources Here
	HANDLE_ERROR(cudaFree(dsts));
}

template<typename T>
class tensor {
	private:
		T* _elems;
		int _Ply;
		int _Height;
		int _Width;
	public:
		__device__ tensor() : _elems(nullptr), _Ply(0), _Height(0), _Width(0) {}
		__device__ ~tensor() { free(_elems); }

		__device__ void Set_size(int _ply, int _hgt, int _wid) {
			if (_ply * _hgt * _wid != _Ply * _Height * _Width) {
				free(_elems);
				_elems = (T*)malloc(sizeof(T) * _ply * _hgt * _wid);
			}
			_Ply = _ply;
			_Height = _hgt;
			_Width = _wid;
		}
};

size_t Tot_Thrd = 256;
tensor<double>* out;
tensor<double>* er;

void Set_Tot_Thrd(size_t _num_of_thrds) {
	DestructArray(out, Tot_Thrd);
	DestructArray(er, Tot_Thrd);
	ConstructArray(out, _num_of_thrds);
	ConstructArray(er, _num_of_thrds);
	Tot_Thrd = _num_of_thrds;
}

int main() {
	ConstructArray(out, Tot_Thrd);
	ConstructArray(er, Tot_Thrd);
	Set_Tot_Thrd(16);
}

But when I call the function “Set_Tot_Thrd(16)”, HANDLE_ERROR(cudaGetLastError()) in the host function DestructArray() reports cudaErrorLaunchOutOfResources (701) - “too many resources requested for launch”. My first thought was that the number of registers the __global__ function subDestructArray needs is larger than the number of registers available per block. So I ran deviceQuery:

And here’s the ptxas info:

[ptxas output was posted as a screenshot; it reports 17 registers per thread for the kernel]

256 * 17 = 4352 < 65536, so the number of registers does not exceed the per-block limit.

I also tried compiling with -maxrregcount=16, but the same “cudaErrorLaunchOutOfResources” error occurred when running the program. Neither cuda-memcheck nor ptxas gave me a helpful answer.

It seems that cudaErrorLaunchOutOfResources can mean that some resource other than registers was exhausted, but I have no idea what runs out when I call the function.

I’m desperate to solve this problem. What should I do? Many thanks in advance.

[Please post text as text, not as images. This avoids issues with technologies that assist the visually impaired]

It’s best to post a minimal self-contained reproducer code that others can compile and run. Quick sanity check: Are you building the code for the correct target architecture, i.e. sm_61?

Agreed that it would be more helpful if “cudaErrorLaunchOutOfResources” provided a sub-code that would point out which specific resource limit was exceeded. You may want to file an enhancement request with NVIDIA for this. You can do so by using the bug reporting form and prefixing the synopsis with “RFE:” to mark it as an enhancement request.

Generally speaking, the out-of-resources error is triggered by (1) exceeding maximum block or grid dimensions, (2) exceeding the shared memory size, or (3) exceeding the available registers per block.
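One way to narrow it down is to query the kernel’s actual resource usage and the device limits at runtime and compare them yourself. A minimal sketch (error checking omitted; a stand-in non-template kernel is used here since cudaFuncGetAttributes needs a concrete function, so in your code you would query the instantiation subDestructArray&lt;double&gt; instead):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the kernel under investigation; in the original code you
// would query the concrete instantiation subDestructArray<double>.
__global__ void dummyKernel(double* dsts, size_t len) {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, dummyKernel);

    printf("registers per thread   : %d (device limit per block: %d)\n",
           attr.numRegs, prop.regsPerBlock);
    printf("static shared memory   : %zu bytes (limit: %zu)\n",
           attr.sharedSizeBytes, prop.sharedMemPerBlock);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);
    return 0;
}
```

If attr.maxThreadsPerBlock comes back smaller than the 256 threads you launch with, that alone explains the 701 error.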

It should be noted that the hardware allocates registers with a granularity greater than 1, so simply multiplying the per-thread register usage by the number of threads will in most cases underestimate the registers actually allocated. Use the occupancy calculator instead:

https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html

Thanks for replying. I’ve just edited the code and description.

Unable to reproduce. I built for sm_61 using CUDA 11.1 on Windows 10. No errors are reported when I run the app, and cuda-memcheck also indicates no errors.

Are you building the code for the architecture of your GPU (e.g. -arch=sm_61)?

Yes. I built on another computer running Windows 10 with a GPU that has more memory, and there it sometimes runs successfully. I wonder whether running out of registers is the only problem that causes cudaErrorLaunchOutOfResources.

I already addressed this earlier.

I’ll try to fix it based on that. Anyway, thank you very much for replying.