I’m trying to launch a 24x1 grid of 42x35 blocks of a kernel with the following specs:
lmem = 0
smem = 13952
reg = 9
bar = 1
This fails with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on a Quadro FX 570. To the best of my knowledge, I am not out of memory, not by a large margin (have: 256 MB, use: ~20 MB). According to my understanding of the docs this should run OK, so I'm wondering what I'm using too much of.
I don't understand your numbers… Is this 24*1 blocks of 42*35 threads? If yes, don't forget you're limited to 512 threads/block (42*35 is 1470). If no, how many threads do you launch per block?
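To make the arithmetic concrete, here is a quick sketch; the 512 figure is the documented per-block thread cap for compute-capability 1.x parts like the FX 570:

```python
# Per-block thread limit on compute-capability 1.x devices (from the CUDA docs).
MAX_THREADS_PER_BLOCK = 512

block = (42, 35, 1)                       # the block shape being discussed
threads = block[0] * block[1] * block[2]
print(threads)                            # 1470
print(threads <= MAX_THREADS_PER_BLOCK)   # False: almost 3x over the limit
```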
Err, actually the shared-memory limit is 16 KB, so he's OK with that. He just won't be able to have two blocks running at the same time on the same multiprocessor, causing non-optimal performance.
It's 16 KB of shared memory for sure according to the docs, unless NVIDIA is adjusting it on certain cards? Not that I've ever read anything of the sort.
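Rather than trusting the docs, the card itself can be asked. A minimal sketch via pycuda's driver-API wrappers (this needs a CUDA-capable GPU to run, and assumes pycuda's device_attribute names, which mirror the CUdevice_attribute enum):

```python
import pycuda.driver as drv

drv.init()
dev = drv.Device(0)
attr = drv.device_attribute

# Wraps cuDeviceGetAttribute; reports the real per-block limits for this card.
print(dev.get_attribute(attr.MAX_SHARED_MEMORY_PER_BLOCK))
print(dev.get_attribute(attr.MAX_THREADS_PER_BLOCK))
```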
Have you tried allocating fewer blocks to start with? Which version of the CUDA Toolkit?
Also, I would double-check the memory situation; if you have a dig through these forums, there is a way of tracking actual memory usage on a CUDA device. Just to be sure.
This is what comes out of cuMemGetInfo, right before my kernel invocation:
free:221320448
total:267714560
So, I'm relatively confident that it's not lack of gmem either. From further experimentation, it seems that the magic boundary is smem == 8192. Has anybody gotten a kernel with more smem than that to run successfully? On a weak card like mine, too?
I've easily gone past 8192 bytes of smem with no problem in 1.1, although I haven't tried v2 yet.
A weaker card has fewer streaming multiprocessors, but they're all configured the same regardless of how many of them you have or which GPU you have (8 ALUs, 2 SFUs, some shared memory, access to the crossbar controller, etc.), so it's much more likely to be a software/compile issue.
Hmm, it seems this is due to my use of the driver API. MisterAnderson42, your test code does work on my card. However, a straight translation of your code to the driver API appears to fail at exactly smem>8192.
import pycuda.driver as drv
drv.init()
dev = drv.Device(0)
ctx = dev.make_context()
n = 2048-0 # works for "-4" (the int * function parameter is also in shared)
mod = drv.SourceModule("""
#include <stdio.h>
__global__ void kernel(int *d_data)
{
__shared__ int sdata[%d];
sdata[threadIdx.x] = threadIdx.x;
d_data[threadIdx.x] = sdata[threadIdx.x];
}
""" % n)
kernel = mod.get_function("kernel")
arg = drv.mem_alloc(n*4)
kernel(arg, shared=n*4, block=(1,1,1,), )
This is the trace of the sequence of driver API calls:
cuInit
cuDeviceGetCount
cuDeviceGet
cuCtxCreate
cuModuleLoadData
cuModuleGetFunction
cuMemAlloc
cuFuncSetBlockShape
cuParamSetv
cuParamSetSize
cuFuncSetSharedSize
cuLaunchGrid
Traceback (most recent call last):
File "big_smem.py", line 24, in <module>
kernel(arg, shared=n*4, block=(1,1,1,), )
File "/usr/lib/python2.5/site-packages/pycuda-0.90-py2.5-linux-x86_64.egg/pycuda/driver.py", line 122, in function_call
func.launch_grid(*grid)
RuntimeError: cuLaunchGrid failed: launch out of resources
cuMemFree
cuModuleUnload
cuCtxDetach
While I've never used the driver API, if it is similar to the runtime API, you are double-allocating your shared memory. Statically allocated shared memory such as __shared__ int sdata[4000] is memory accounted for in the cubin. Shared memory allocated dynamically at kernel launch (at least in the runtime API) is additional shared memory, usable through an extern __shared__ array. Since you are allocating 8192 bytes statically and another 8192 bytes dynamically, the total, together with the kernel parameters (which also occupy shared memory on these devices), exceeds 16K: hence the error.
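A quick back-of-the-envelope of that accounting, using the numbers from the repro script above (the 16 KB figure is the compute-capability 1.x shared-memory size):

```python
SMEM_PER_BLOCK = 16 * 1024   # 16384 bytes of shared memory per block

n = 2048
static_smem = n * 4          # __shared__ int sdata[2048], fixed in the cubin
dynamic_smem = n * 4         # the shared=n*4 argument passed at launch time
total = static_smem + dynamic_smem

print(total)                 # 16384: fills shared memory exactly, leaving no
                             # room for the kernel parameters, which on these
                             # devices are also passed in shared memory
```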
I guess the lesson from this is that the description of cuFuncSetSharedSize should be updated. Currently, it reads:
sets through bytes the amount of shared memory that will be available to each thread block when the kernel given by func is launched.
I’d propose the following wording:
sets through bytes the amount of shared memory that will be available to each thread block, in the form of an extern __shared__ data[] array, in addition to any statically allocated shared memory, when the kernel given by func is launched.
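For completeness, here is a sketch of the repro script above rewritten so the array is allocated only once, dynamically, which is the pattern the runtime-API explanation implies (untested here, and it needs a CUDA-capable GPU and pycuda):

```python
import pycuda.driver as drv

drv.init()
ctx = drv.Device(0).make_context()

n = 2048
mod = drv.SourceModule("""
__global__ void kernel(int *d_data)
{
    // Sized by the shared= launch argument instead of a static declaration,
    // so the n*4 bytes are counted only once against the 16 KB budget.
    extern __shared__ int sdata[];
    sdata[threadIdx.x] = threadIdx.x;
    d_data[threadIdx.x] = sdata[threadIdx.x];
}
""")
kernel = mod.get_function("kernel")
arg = drv.mem_alloc(n * 4)
kernel(arg, shared=n * 4, block=(1, 1, 1))

ctx.pop()
```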