What's the limit? A problem with kernel execution failing


I am currently implementing a Finite-Difference Time-Domain (FDTD) analysis program. It essentially holds an array of electromagnetic field components and uses one thread per element to apply the update equation.

The problem is that as I increase the number of elements in the array (and thus the number of threads I’m using), the kernel fails to do anything, without an error message. This happens when the number of elements reaches 447, which is somewhat lower than I was expecting: I thought the limit would be hit either when the number of threads reaches 512 or when the 16 kB of shared memory is full.
Each thread uses 8 floating-point numbers in its processing, which are copied into shared memory when the kernel executes.

Any suggestions as to what might be going on, why it might be going on or advice from similar experiences?

Kind Regards

There are several possible scenarios that may lead to such problems.
I’d check kernel resource usage first. To do this, add --ptxas-options=-v to your nvcc command line and check the output; it will tell you how many registers and how much shared memory each of your kernels requires. Check that you’re not exceeding the limits of 8192 registers and 16 KiB of shared memory.
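
As a complement to the compiler output, here is a minimal sketch of how the limits and the launch itself could be checked at run time, so that a failing launch reports an error instead of doing nothing. It is not the original poster's code: `stagingKernel`, `sharedBytesNeeded` and `launchAndCheck` are hypothetical names, and it assumes one block whose threads each stage 8 floats of 4 bytes in dynamically sized shared memory. Under that assumption 447 threads need 447 × 8 × 4 = 14304 bytes and 512 threads need exactly 16384 bytes. Note that `cudaDeviceSynchronize` is the current runtime call; toolkits as old as the one discussed here used `cudaThreadSynchronize` instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the FDTD update kernel: each thread stages
// floatsPerThread values in dynamically sized shared memory.
__global__ void stagingKernel(float *out, int floatsPerThread)
{
    extern __shared__ float scratch[];
    float *mine = scratch + threadIdx.x * floatsPerThread;
    for (int i = 0; i < floatsPerThread; ++i)
        mine[i] = (float)i;
    out[threadIdx.x] = mine[floatsPerThread - 1];
}

// Bytes of shared memory one block needs for its per-thread staging area.
size_t sharedBytesNeeded(int threads, int floatsPerThread)
{
    return (size_t)threads * floatsPerThread * sizeof(float);
}

// Launch one block of `threads` threads and report any error.
// Returns true only if both the launch and the execution succeeded.
bool launchAndCheck(int threads, int floatsPerThread)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return false;
    }

    size_t smem = sharedBytesNeeded(threads, floatsPerThread);
    printf("device limits: %d threads/block, %d registers/block, %zu bytes shared/block\n",
           prop.maxThreadsPerBlock, prop.regsPerBlock, prop.sharedMemPerBlock);
    printf("requested:     %d threads, %zu bytes shared\n", threads, smem);

    float *dOut = NULL;
    err = cudaMalloc((void **)&dOut, threads * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        return false;
    }

    stagingKernel<<<1, threads, smem>>>(dOut, floatsPerThread);

    // cudaGetLastError reports a rejected launch configuration
    // (too many threads, too much shared memory, too many registers).
    err = cudaGetLastError();
    // Synchronizing surfaces errors that occur while the kernel runs.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    cudaFree(dOut);
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Calling `launchAndCheck(447, 8)` and then `launchAndCheck(512, 8)` from `main` should show whether the larger configuration is rejected at launch and, if so, with which error string.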

I’ve added --ptxas-options=-v to my nvcc command line, but it’s not telling me anything about the resource usage.


You should see a line for each of your kernels like:

# registers, # bytes shared mem, # bytes local mem and constant mem.


The only extra entry I get is:

1># entry  = updatinga

1>Compiling entry: updatinga

1># entry  = updating

1>Compiling entry: updating

(Where “updating” and “updatinga” are the names of the kernel functions)


Are you using CUDA 1.1? Seems more like 1.0…
In this case you need to generate .cubin files (nvcc -cubin) and examine them; kernel resource usage will be shown there.