"too many resources requested for launch" on only 2 new Teslas

Our business purchased two new Tesla C1060s for a new compute server. We already have three compute servers, each with two C1060s (so we now have a total of eight C1060s). We have a particular kernel that runs without any problems on all of our previous Teslas; however, on both of the new Teslas we get the following message:

"too many resources requested for launch"

Using --ptxas-options=-v, I get the following output:

ptxas info : Compiling entry function '_Z7slcprojP6float2S0_PfPdttttddffffff' for 'sm_13'
ptxas info : Used 51 registers, 192+0 bytes lmem, 16328+16 bytes smem, 168 bytes cmem[0], 176 bytes cmem[1]

My understanding is that with 128 blocks this should be fine. But regardless, the same code runs on all six of our other Tesla C1060s without fail. I even tried copying the binary executable (on Linux) and it still worked correctly. All of our other kernels work fine with the new boards, and I’ve tried running the examples inside the SDK without any problems.
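
As a sanity check, a minimal host-side sketch like the one below compares those ptxas numbers against the card's per-block limits; the block size of 256 threads is only an assumption, since the actual launch configuration isn't shown here.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int regsPerThread = 51;         // from ptxas -v
    const int smemPerBlock  = 16328 + 16; // from ptxas -v
    const int threads       = 256;        // assumed block size

    printf("registers : need %d per block, limit %d\n",
           regsPerThread * threads, prop.regsPerBlock);
    printf("shared mem: need %d bytes, limit %d bytes\n",
           smemPerBlock, (int)prop.sharedMemPerBlock);
    return 0;
}

With 51 registers per thread, 256 threads per block would need 13056 registers against a 16384-per-block limit on a C1060, and the 16344 bytes of static shared memory sit just 40 bytes under the 16 KB limit.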

Any ideas why only our two new Teslas are failing?

That’s odd that one C1060 would work but the other fails at launch because of resources.

If this is a new machine, not just new C1060s, I would first suspect it’s more likely something host-side. Is it the same OS, the same driver, the same toolkit? Both 32 bit or both 64 bit?

The quick diagnostic would of course be to physically swap a working C1060 in a working machine for a “troublesome” C1060 and see if it still fails.
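
Before pulling cards, a quick comparison program run on both machines would also confirm whether the driver and runtime really match; this is just a sketch using the standard runtime queries:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);
    cudaRuntimeGetVersion(&runtimeVer);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("driver API %d, runtime API %d\n", driverVer, runtimeVer);
    printf("device 0: %s, compute %d.%d, %d bytes smem/block, %d regs/block\n",
           prop.name, prop.major, prop.minor,
           (int)prop.sharedMemPerBlock, prop.regsPerBlock);
    return 0;
}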

I side with SPWorley on the 32bit vs. 64 bit issue. 64 bit compiled code takes more space for pointers (and ints possibly), which can exceed your available shared memory.
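
For instance (a purely hypothetical kernel, not the one in question), the same shared declaration doubles in size when pointers go from 4 to 8 bytes:

// 2048 bytes of shared memory when compiled on a 32-bit host,
// 4096 bytes on a 64-bit host, from the identical source line.
__global__ void pointerTable(float *out)
{
    __shared__ float *table[512];
    if (threadIdx.x < 512)
        table[threadIdx.x] = out + threadIdx.x;
    if (threadIdx.x == 0)
        out[0] = *table[0];
}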

All machines are 64-bit and have the same driver and toolkit (3.0). For fun I tried updating to 3.1, but I have the same problem. I like your idea of swapping C1060 boards, so I'll try that next.

I swapped a Tesla in the new computer with a known working one from a different computer, and the problem didn’t follow the Tesla. So it appears the problem is with the host, either its configuration or its hardware. Does anyone have ideas for how to track this down? The machine is a Supermicro 7046GT-TRF, which is specifically listed as supporting 4 double-width GPUs. The system is running Fedora 13 with kernel 2.6.33.6-147.2.4.fc13.x86_64. The NVIDIA driver version is 195.36.24. I have the same software configuration running on some other computers, but don’t have this problem on them.

Thanks

Since you have a working system, you have a great comparison to debug. Check and double check that each system’s software version is identical.

Are you sure, double sure, triple sure, that you’re using the exact same driver, the exact same toolkit, the exact same kernel code on both systems?

And if so, check it a fourth time anyway, maybe even reinstalling on both machines to make sure. It’s paranoid, but a version mismatch is the most obvious cause, and it’s a lot easier to debug than anything else.

Also, you’re running a stale driver; both 195.36.31 and 256.44 are newer. But this shouldn’t matter if you have your project running successfully on your other machine with the same driver.

Thanks for the reply. I’ll go through and double/triple check everything.

The thing that bothers me most is, once it gets down to the level of allocating resources on the device itself, how can that be affected by the host? It seems that would be internal to the Tesla device itself and have nothing to do with the host system. It’s also odd that only this one kernel has a problem. All of my other kernels and all of the examples in the SDK run perfectly. I tried cutting the number of blocks for this kernel in half, but it still exhibits the same problem.

All of that functionality is contained in the host driver and host support libraries. The device itself is pretty dumb and relies on the host side driver for just about everything.
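
One consequence is that the host runtime can report what it believes the kernel needs before any launch; comparing this output on the working and failing machines would show whether the driver/toolkit on the new box sees the kernel differently. The kernel below is only a stand-in for the real slcproj:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void slcproj() {}   // placeholder; substitute the real kernel

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, slcproj);
    printf("regs=%d  static smem=%d  lmem=%d  max threads/block=%d\n",
           attr.numRegs, (int)attr.sharedSizeBytes,
           (int)attr.localSizeBytes, attr.maxThreadsPerBlock);
    return 0;
}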

It’s possible that a change of OS from 32 to 64 bit, or a change in toolkit, could change register use and/or shared memory use. If you’re at the edge of the device’s capabilities, that might push you over and cause the kernel launch failure. You’re especially close to the shared memory limit of 16384 bytes now, so if something made that shared memory use grow even a little, it could push you past the limit (especially because a few hundred bytes may be needed for argument overhead). A different toolkit might behave differently here; perhaps even the driver could affect it, though that’s less likely.
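
For a rough back-of-the-envelope illustration (reading the argument types off the mangled name above, and assuming, as suggested here, that the arguments consume shared memory on top of the 16344 bytes ptxas reports):

/* Rough sketch of the argument sizes implied by the mangled name
   _Z7slcprojP6float2S0_PfPdttttddffffff; purely illustrative. */
#include <cstdio>

int main()
{
    size_t args = 4 * sizeof(void *)          /* float2*, float2*, float*, double* */
                + 4 * sizeof(unsigned short)  /* t t t t                           */
                + 2 * sizeof(double)          /* d d                               */
                + 6 * sizeof(float);          /* f f f f f f                       */

    printf("argument bytes (64-bit host): %d\n", (int)args);             /* ~80 */
    printf("shared-memory headroom      : %d\n", 16384 - (16328 + 16));  /* 40  */
    return 0;
}

If roughly 80 bytes of arguments plus any launch overhead have to fit in the remaining 40 bytes, the launch fails.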

But this analysis is incidental and hypothetical for now, you have a simpler problem of two theoretically identical systems which behave differently. It’s a lot easier to find out what is wrong first and get your system working. Then you can start asking why.

Also check the compositing settings in the desktop environment. Ideally, don’t start X at all.

This error generally means that the number of registers, the number of threads, or the amount of shared memory is too big, so occupancy is zero.

Given your description of the ptxas -v output, it is very likely that in this case it’s being caused by the amount of shared memory being too big. You don’t mention how much dynamic smem you allocate at kernel launch… any?
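
To illustrate the dynamic-smem point with a self-contained toy (not the original kernel, and sized for a 16 KB compute 1.3 part): the first launch below should fit, while requesting an extra 1 KB of dynamic shared memory at launch pushes the total past the limit and returns exactly this error.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bigSmem(char *out)
{
    __shared__ char buf[16000];     // large static allocation
    buf[threadIdx.x] = (char)threadIdx.x;
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    char *d;
    cudaMalloc((void **)&d, 128);

    bigSmem<<<1, 128>>>(d);                 // static smem only: fits
    printf("static only : %s\n", cudaGetErrorString(cudaGetLastError()));

    bigSmem<<<1, 128, 1024>>>(d);           // plus 1024 bytes dynamic: too big
    printf("plus dynamic: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d);
    return 0;
}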

Why this would be different from one machine to another is unclear, but as far as I know, for this error code, as long as you’re running the same driver, toolkit, and application code on each machine, there shouldn’t be any reason why one GPU would work and another of the same compute capability would not.

–Cliff