"too many resources requested for launch" on only 2 new Teslas

Our business purchased two new Tesla C1060s for a new compute server. We already have three compute servers, each with two C1060s (so we now have a total of eight C1060s). We have a particular kernel that runs without any problems on all of our previous Teslas; however, on both of the new Teslas we get the following message:

"too many resources requested for launch"

Using --ptxas-options=-v, I get the following output:

ptxas info : Compiling entry function '_Z7slcprojP6float2S0_PfPdttttddffffff' for 'sm_13'
ptxas info : Used 51 registers, 192+0 bytes lmem, 16328+16 bytes smem, 168 bytes cmem[0], 176 bytes cmem[1]

My understanding is that with 128 blocks this should be fine. But regardless, the same code runs on all six of our other Tesla C1060s without fail. I even tried copying the binary executable (on Linux) and it still worked correctly. All of our other kernels work fine with the new boards, and I’ve tried running the examples inside the SDK without any problems.
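
As a sanity check, a minimal host-side sketch like the one below compares those ptxas numbers against the card's per-block limits; the block size of 256 threads is only an assumption, since the actual launch configuration isn't shown here.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int regsPerThread = 51;         // from ptxas -v
    const int smemPerBlock  = 16328 + 16; // from ptxas -v
    const int threads       = 256;        // assumed block size

    printf("registers : need %d per block, limit %d\n",
           regsPerThread * threads, prop.regsPerBlock);
    printf("shared mem: need %d bytes, limit %d bytes\n",
           smemPerBlock, (int)prop.sharedMemPerBlock);
    return 0;
}

With 51 registers per thread, 256 threads per block would need 13056 registers against a 16384-per-block limit on a C1060, and the 16344 bytes of static shared memory sit just 40 bytes under the 16 KB limit.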

Any ideas why only our two new Teslas are failing?

That’s odd that one C1060 would work but the other fails at launch because of resources.

If this is a new machine, not just new C1060s, I would first suspect it’s more likely something host-side. Is it the same OS, the same driver, the same toolkit? Both 32 bit or both 64 bit?

The quick diagnostic would of course be to physically swap a working C1060 in a working machine for a “troublesome” C1060 and see if it still fails.
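
Before pulling cards, a quick comparison program run on both machines would also confirm whether the driver and runtime really match; this is just a sketch using the standard runtime queries:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);
    cudaRuntimeGetVersion(&runtimeVer);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("driver API %d, runtime API %d\n", driverVer, runtimeVer);
    printf("device 0: %s, compute %d.%d, %d bytes smem/block, %d regs/block\n",
           prop.name, prop.major, prop.minor,
           (int)prop.sharedMemPerBlock, prop.regsPerBlock);
    return 0;
}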

I side with SPWorley on the 32bit vs. 64 bit issue. 64 bit compiled code takes more space for pointers (and ints possibly), which can exceed your available shared memory.
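
For instance (a purely hypothetical kernel, not the one in question), the same shared declaration doubles in size when pointers go from 4 to 8 bytes:

// 2048 bytes of shared memory when compiled on a 32-bit host,
// 4096 bytes on a 64-bit host, from the identical source line.
__global__ void pointerTable(float *out)
{
    __shared__ float *table[512];
    if (threadIdx.x < 512)
        table[threadIdx.x] = out + threadIdx.x;
    if (threadIdx.x == 0)
        out[0] = *table[0];
}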

All machines are 64-bit and have the same driver and toolkit (3.0). For fun I tried updating to 3.1, but I have the same problem. I like your idea of swapping C1060 boards, so I'll try that next.

I swapped a Tesla in the new computer with a known working one from a different computer, and the problem didn’t follow the Tesla. So it appears the problem is with the host, either its configuration or its hardware. Does anyone have ideas for how to track this down? The machine is a Supermicro 7046GT-TRF, which is specifically listed as supporting 4 double-width GPUs. The system is running Fedora 13 with kernel 2.6.33.6-147.2.4.fc13.x86_64. The NVIDIA driver version is 195.36.24. I have the same software configuration running on some other computers, but don’t have this problem on them.

Thanks

Since you have a working system, you have a great comparison to debug. Check and double check that each system’s software version is identical.

Are you sure, double sure, triple sure, that you’re using the exact same driver, the exact same toolkit, the exact same kernel code on both systems?

And if so, check it a fourth time anyway, maybe even reinstalling on both machines to make sure. It’s paranoid, but a version mismatch is the most obvious cause, and it’s a lot easier to debug than anything else.

Also, you’re running a stale driver; both 195.36.31 and 256.44 are newer. But this shouldn’t matter if you have your project running successfully on your other machine with the same driver.

Thanks for the reply. I’ll go through and double/triple check everything.

The thing that bothers me most is, once it gets down to the level of allocating resources on the device itself, how can that be affected by the host? It seems that would be internal to the Tesla device itself and have nothing to do with the host system. It’s also odd that only this one kernel has a problem. All of my other kernels and all of the examples in the SDK run perfectly. I tried cutting the number of blocks for this kernel in half, but it still exhibits the same problem.

All of that functionality is contained in the host driver and host support libraries. The device itself is pretty dumb and relies on the host side driver for just about everything.
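
One consequence is that the host runtime can report what it believes the kernel needs before any launch; comparing this output on the working and failing machines would show whether the driver/toolkit on the new box sees the kernel differently. The kernel below is only a stand-in for the real slcproj:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void slcproj() {}   // placeholder; substitute the real kernel

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, slcproj);
    printf("regs=%d  static smem=%d  lmem=%d  max threads/block=%d\n",
           attr.numRegs, (int)attr.sharedSizeBytes,
           (int)attr.localSizeBytes, attr.maxThreadsPerBlock);
    return 0;
}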

It’s possible that a change of OS from 32 to 64 bit, or a change in toolkit, could change register use and/or shared memory use. If you’re at the edge of the device’s capabilities, that might push you over and cause the kernel launch failure. You’re especially close to the shared memory limit of 16384 bytes now, so if something made that shared memory use grow even a little, it could push you past the limit (especially because a few hundred bytes may be needed for argument overhead). A different toolkit might behave differently here; perhaps even the driver could affect it, though that’s less likely.
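
For a rough back-of-the-envelope illustration (reading the argument types off the mangled name above, and assuming, as suggested here, that the arguments consume shared memory on top of the 16344 bytes ptxas reports):

/* Rough sketch of the argument sizes implied by the mangled name
   _Z7slcprojP6float2S0_PfPdttttddffffff; purely illustrative. */
#include <cstdio>

int main()
{
    size_t args = 4 * sizeof(void *)          /* float2*, float2*, float*, double* */
                + 4 * sizeof(unsigned short)  /* t t t t                           */
                + 2 * sizeof(double)          /* d d                               */
                + 6 * sizeof(float);          /* f f f f f f                       */

    printf("argument bytes (64-bit host): %d\n", (int)args);             /* ~80 */
    printf("shared-memory headroom      : %d\n", 16384 - (16328 + 16));  /* 40  */
    return 0;
}

If roughly 80 bytes of arguments plus any launch overhead have to fit in the remaining 40 bytes, the launch fails.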

But this analysis is incidental and hypothetical for now, you have a simpler problem of two theoretically identical systems which behave differently. It’s a lot easier to find out what is wrong first and get your system working. Then you can start asking why.

Also check the compositing settings in the desktop environment. Ideally, don’t start X at all.

This error generally means that the number of registers, the number of threads, or the amount of shared memory is too big, so occupancy is zero.

Given your description of the ptxas -v output, it is very likely that in this case it’s being caused by the amount of shared memory being too big. You don’t mention how much dynamic smem you allocate at kernel launch… any?
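
To illustrate the dynamic-smem point with a self-contained toy (not the original kernel, and sized for a 16 KB compute 1.3 part): the first launch below should fit, while requesting an extra 1 KB of dynamic shared memory at launch pushes the total past the limit and returns exactly this error.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bigSmem(char *out)
{
    __shared__ char buf[16000];     // large static allocation
    buf[threadIdx.x] = (char)threadIdx.x;
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    char *d;
    cudaMalloc((void **)&d, 128);

    bigSmem<<<1, 128>>>(d);                 // static smem only: fits
    printf("static only : %s\n", cudaGetErrorString(cudaGetLastError()));

    bigSmem<<<1, 128, 1024>>>(d);           // plus 1024 bytes dynamic: too big
    printf("plus dynamic: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d);
    return 0;
}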

Why this would be different from one machine to another is unclear, but as far as I know, for this error code, as long as you’re running the same driver, toolkit, and application code on each machine, there shouldn’t be any reason why one GPU would work and another of the same compute capability would not.

–Cliff