I’m trying to launch a 24x1 grid of 42x35 blocks of a kernel with the following specs:
lmem = 0
smem = 13952
reg = 9
bar = 1
This fails with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on a Quadro FX 570. To the best of my knowledge, I am not out of memory, not by a large margin (have: 256 MB, use: ~20 MB). According to my understanding of the docs this should run OK, so I'm wondering what I'm using too much of.
I don't understand your numbers… Is this 24*1 blocks of 42*35 threads? If yes, don't forget you're limited to 512 threads/block (42*35 is 1470). If no, how many threads do you launch per block?
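To make the arithmetic concrete, here is a quick sketch; the 512 figure is the documented per-block thread cap for compute-capability 1.x parts like the FX 570:

```python
# Per-block thread limit on compute-capability 1.x devices (from the CUDA docs).
MAX_THREADS_PER_BLOCK = 512

block = (42, 35, 1)                       # the block shape being discussed
threads = block[0] * block[1] * block[2]
print(threads)                            # 1470
print(threads <= MAX_THREADS_PER_BLOCK)   # False: almost 3x over the limit
```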
Err, actually the shared-memory limit is 16 KB, so he's OK with that. He just won't be able to have two blocks running at the same time on the same multiprocessor, causing non-optimal performance.
It's 16 KB of shared memory for sure according to the docs, unless NVIDIA is adjusting it on certain cards? Not that I've ever read anything of the sort.
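Rather than trusting the docs, the card itself can be asked. A minimal sketch via pycuda's driver-API wrappers (this needs a CUDA-capable GPU to run, and assumes pycuda's device_attribute names, which mirror the CUdevice_attribute enum):

```python
import pycuda.driver as drv

drv.init()
dev = drv.Device(0)
attr = drv.device_attribute

# Wraps cuDeviceGetAttribute; reports the real per-block limits for this card.
print(dev.get_attribute(attr.MAX_SHARED_MEMORY_PER_BLOCK))
print(dev.get_attribute(attr.MAX_THREADS_PER_BLOCK))
```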
Have you tried allocating fewer blocks to start with? Which version of the CUDA Toolkit?
Also, I would double-check the memory situation; if you have a dig through these forums, there is a way of tracking actual memory usage on a CUDA device. Just to be sure.
This is what comes out of cuMemGetInfo, right before my kernel invocation:
free:221320448
total:267714560
So, I'm relatively confident that it's not lack of gmem either. From further experimentation, it seems that the magic boundary is smem == 8192. Has anybody gotten a kernel with more smem than that to run successfully? On a weak card like mine, too?
I've easily gone past 8192 bytes of smem with no problem in 1.1, although I haven't tried v2 yet.
A weaker card has fewer streaming multiprocessors, but they're all configured the same regardless of how many of them you have or which GPU you have (8 ALUs, 2 SFUs, some shared memory, access to the crossbar controller, etc.), so it's much more likely to be a software/compile issue.
Hmm, it seems this is due to my use of the driver API. MisterAnderson42, your test code does work on my card. However, a straight translation of your code to the driver API appears to fail at exactly smem>8192.
import pycuda.driver as drv
drv.init()
dev = drv.Device(0)
ctx = dev.make_context()
n = 2048-0 # works for "-4" (the int * function parameter is also in shared)
mod = drv.SourceModule("""
#include <stdio.h>
__global__ void kernel(int *d_data)
{
__shared__ int sdata[%d];
sdata[threadIdx.x] = threadIdx.x;
d_data[threadIdx.x] = sdata[threadIdx.x];
}
""" % n)
kernel = mod.get_function("kernel")
arg = drv.mem_alloc(n*4)
kernel(arg, shared=n*4, block=(1,1,1,), )
This is the trace of the sequence of driver API calls:
cuInit
cuDeviceGetCount
cuDeviceGet
cuCtxCreate
cuModuleLoadData
cuModuleGetFunction
cuMemAlloc
cuFuncSetBlockShape
cuParamSetv
cuParamSetSize
cuFuncSetSharedSize
cuLaunchGrid
Traceback (most recent call last):
File "big_smem.py", line 24, in <module>
kernel(arg, shared=n*4, block=(1,1,1,), )
File "/usr/lib/python2.5/site-packages/pycuda-0.90-py2.5-linux-x86_64.egg/pycuda/driver.py", line 122, in function_call
func.launch_grid(*grid)
RuntimeError: cuLaunchGrid failed: launch out of resources
cuMemFree
cuModuleUnload
cuCtxDetach
While I've never used the driver API, if it is similar to the runtime API, you are double-allocating your shared memory. Statically allocated shared memory such as __shared__ int sdata[4000] is memory accounted for in the cubin. Shared memory allocated dynamically at kernel launch (at least in the runtime API) is additional shared memory, usable through an extern __shared__ array. Since you are allocating 8192 bytes statically and another 8192 bytes dynamically, the total, together with the kernel parameters (which also occupy shared memory on these devices), exceeds 16K: hence the error.
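A quick back-of-the-envelope of that accounting, using the numbers from the repro script above (the 16 KB figure is the compute-capability 1.x shared-memory size):

```python
SMEM_PER_BLOCK = 16 * 1024   # 16384 bytes of shared memory per block

n = 2048
static_smem = n * 4          # __shared__ int sdata[2048], fixed in the cubin
dynamic_smem = n * 4         # the shared=n*4 argument passed at launch time
total = static_smem + dynamic_smem

print(total)                 # 16384: fills shared memory exactly, leaving no
                             # room for the kernel parameters, which on these
                             # devices are also passed in shared memory
```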
I guess the lesson from this is that the description of cuFuncSetSharedSize should be updated. Currently, it reads:
sets through bytes the amount of shared memory that will be available to each thread block when the kernel given by func is launched.
I’d propose the following wording:
sets through bytes the amount of shared memory that will be available to each thread block, in the form of an extern __shared__ data[] array, in addition to any statically allocated shared memory, when the kernel given by func is launched.
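For completeness, here is a sketch of the repro script above rewritten so the array is allocated only once, dynamically, which is the pattern the runtime-API explanation implies (untested here, and it needs a CUDA-capable GPU and pycuda):

```python
import pycuda.driver as drv

drv.init()
ctx = drv.Device(0).make_context()

n = 2048
mod = drv.SourceModule("""
__global__ void kernel(int *d_data)
{
    // Sized by the shared= launch argument instead of a static declaration,
    // so the n*4 bytes are counted only once against the 16 KB budget.
    extern __shared__ int sdata[];
    sdata[threadIdx.x] = threadIdx.x;
    d_data[threadIdx.x] = sdata[threadIdx.x];
}
""")
kernel = mod.get_function("kernel")
arg = drv.mem_alloc(n * 4)
kernel(arg, shared=n * 4, block=(1, 1, 1))

ctx.pop()
```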