Launch out of Resources: Why?

I’m trying to launch a 24x1 grid of 42x35 blocks of a kernel with the following specs:

       lmem = 0

        smem = 13952

        reg = 9

        bar = 1

This fails with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on a Quadro FX 570. To the best of my knowledge, I am not out of memory by a large margin. (have: 256M, use: ~20M) According to my understanding of the docs, I feel like this should run OK, so I’m wondering what I’m using too much of.

Thanks for any insight,

Andreas

There’s only 8K shared memory available, and you’re using about 14K…

Try to reduce your share memory use.

I don’t understand your numbers … Is this 241 blocks of 4235 threads ? If yes, don’t forget you’re limited to 512 threads/block (42*35 is 1470). If no, how many threads do you launch per block?

Err, actually the limit is 16K, so he’s OK with that. He just won’t be able to have 2 blocks running at the same time on the same multiprocessor, causing non-optimal performances.

Whoops. Sorry. That should be 14*35=490.

Its 16KBytes of shared memory for sure according to the docs, unless NVIDIA are adjusting it on certain cards? Not that I’ve ever read anything of the sort.

Have you tried allocating less blocks to start with? Which version of CUDA Toolkit?

Also I would double check the memory situation, if you have a dig in these forums there is a way of tracking actual usage of memory on a CUDA device. Just to be sure.

I’ve tried with a 1x1 grid, no difference.

I’m using CUDA 2 beta.

This is what comes out of cuMemGetInfo, right before my kernel invocation:

free:221320448
total:267714560

So, I’m relatively confident that it’s not lack of gmem, either. From further experimentation, it seems that the magic boundary is smem==8192. Has anybody gotten a kernel with more smem to run successfully? On a weak card like mine, too?

Thanks for all your help
Andreas

What platform are you running on? I’ve no problems with the CUDA 2.0 beta on linux x86_64.

nvcc -o shmem_test shmem_test.cu --ptxas-options -v

ptxas info    : Compiling entry function '_Z6kernelPi'

ptxas info    : Used 2 registers, 16024+24 bytes smem

./shmem_test

Success

shmem_test.cu.gz (310 Bytes)

Try running the code with v1.1 toolkit?

I’ve easily gone past 8192 smem no problem in 1.1, although havent tried v2 yet.

A weaker card has less streaming multiprocesors, but they’re all configured the same regardless of how many of them you have or which GPU you have (8 ALUs, 2 SFUs, some shared memory and access to the crossbar controller etc) so its much more likely to be a software compile issue.

Hmm, it seems this is due to my use of the driver API. MisterAnderson42, your test code does work on my card. However, a straight translation of your code to the driver API appears to fail at exactly smem>8192.

import pycuda.driver as drv

drv.init()

dev = drv.Device(0)

ctx = dev.make_context()

n = 2048-0 # works for "-4" (int * for funciton parameter is also in shared)

mod = drv.SourceModule("""

#include <stdio.h>

__global__ void kernel(int *d_data)

{

__shared__ int sdata[%d];

sdata[threadIdx.x] = threadIdx.x;

d_data[threadIdx.x] = sdata[threadIdx.x];

}

""" % n)

kernel = mod.get_function("kernel")

arg = drv.mem_alloc(n*4)

kernel(arg, shared=n*4, block=(1,1,1,), )

This is the trace of the sequence of driver API calls:

cuInit

cuDeviceGetCount

cuDeviceGet

cuCtxCreate

cuModuleLoadData

cuModuleGetFunction

cuMemAlloc

cuFuncSetBlockShape

cuParamSetv

cuParamSetSize

cuFuncSetSharedSize

cuLaunchGrid

Traceback (most recent call last):

  File "big_smem.py", line 24, in <module>

    kernel(arg, shared=n*4, block=(1,1,1,), )

  File "/usr/lib/python2.5/site-packages/pycuda-0.90-py2.5-linux-x86_64.egg/pycuda/driver.py", line 122, in function_call

    func.launch_grid(*grid)

RuntimeError: cuLaunchGrid failed: launch out of resources

cuMemFree

cuModuleUnload

cuCtxDetach

While I’ve never used the driver API, if it is similar to the runtime API you are double allocating your shared memory. Statically allocated shared memory such as

shared int sdata[4000] is memory accounted for in the cubin. Shared memory allocated dynamically in the kernel launch (at least in the runtime API), is additional shared memory that can be used in an “extern shared sdata” array. Since you are allocating 8192 bytes statically and 8192 bytes dynamically, the total is > 16k: hence the error.

Aaaaah!

And I was wondering why the API needed to be told something that it can easily figure from the CUBIN. :argh:

You absolutely made my day. :thumbsup:

Thanks so much!
Andreas

I guess the lesson from this is that the description of cuFuncSetSharedSize should be updated. Currently, it reads:

sets through bytes the amount of shared memory that will be available to each thread block when the kernel given by func is launched.

I’d propose the following wording:

sets through bytes the amount of shared memory that will be available to each thread block in the form of extern __shared data[] in addition to any statically allocated shared memory when the kernel given by func is launched.

NVIDIA people? You listening?

Thanks
Andreas