Problem calling __device__ function

I have been trying to call a device function, but I get incorrect results even when the function does nothing. Example:

__device__ void getValue() {
// do nothing
}

It doesn’t matter what I put in that function, the result is the same.

I tried cuda-memcheck and got a lot of errors like this:

========= Program hit error 7 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/nvidia-319-updates/libcuda.so [0x274640]
========= Host Frame:./interpolateCuda [0x3cfde]
========= Host Frame:./interpolateCuda [0x4d77]
========= Host Frame:./interpolateCuda [0x4ab4]
========= Host Frame:./interpolateCuda [0x4b1d]
========= Host Frame:./interpolateCuda [0x48b6]
========= Host Frame:./interpolateCuda [0x2c32]
========= Host Frame:./interpolateCuda [0x40d2]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21de5]
========= Host Frame:./interpolateCuda [0x2aa9]

cuda-memcheck runs clean if I remove the dummy function. I checked the error codes, and error 7 appears to be cudaErrorLaunchOutOfResources. If I reduce the block size from 32x32 (1024 threads) to 24x24 (576 threads), the program runs correctly.

Does calling an empty function really use that many resources? I thought device functions were inlined anyway, so a function with no actual instructions should have no effect on the kernel's run-time resource requirements. Does anybody have an idea what is going on?
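
In case it helps anyone reproduce or diagnose this, here is a minimal sketch of how the per-kernel limit could be checked with cudaFuncGetAttributes before launching. The kernel name interpolateKernel and its body are hypothetical stand-ins, since the real kernel is not shown in this post. A kernel this trivial will most likely launch fine with 1024 threads, so this only illustrates the check and does not reproduce the failure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy device function, as in the post.
__device__ void getValue() {
    // do nothing
}

// Hypothetical kernel standing in for the real one (not shown in the post).
__global__ void interpolateKernel(float *out) {
    getValue();
    int idx = threadIdx.y * blockDim.x + threadIdx.x;
    out[idx] = 0.0f;
}

int main() {
    // Ask the runtime what this compiled kernel needs per thread and
    // how many threads per block it can support on the current device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, interpolateKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread:  %d\n", attr.numRegs);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);

    dim3 block(32, 32);  // 1024 threads, the size that fails in the post
    int threads = block.x * block.y;
    if (threads > attr.maxThreadsPerBlock) {
        printf("a %d-thread block exceeds this kernel's limit\n", threads);
    }

    float *d_out = nullptr;
    err = cudaMalloc(&d_out, threads * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Launch and check the launch error explicitly; a launch that asks for
    // too many resources is reported here, not by the kernel itself.
    interpolateKernel<<<1, block>>>(d_out);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

If maxThreadsPerBlock comes back below 1024 for the real kernel, that would be consistent with the out-of-resources error at 32x32. Compiling with nvcc -Xptxas -v should also print the register count per kernel at build time, which makes it possible to compare the build with and without the dummy function. The __launch_bounds__ qualifier may be worth trying as well, since it asks the compiler to keep register usage low enough for a given block size.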