Can a kernel method be too big to compute? Getting cudaErrorUnknown for unknown reasons

I have quite a large kernel method (more than a thousand lines of code) which fails with cudaErrorUnknown.
The code itself seems fine: it works on the CPU and passes cuda-memcheck with zero errors, yet the kernel still returns the error.
Can it be that the method is just too big for the device? I don’t see any other problems with it.

When I comment out parts of the method it works (doesn’t return the error). Uncommenting just one more line causes cudaErrorUnknown again.
The line itself is innocent, and commenting out a different part of the kernel method has the same effect.

Increasing the limits with cudaLimitMallocHeapSize and cudaLimitStackSize doesn’t make a difference, so the method does not appear to be failing because it ran out of heap or stack.
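
For reference, a minimal sketch of how these two limits can be raised and then read back to confirm they took effect. The sizes are illustrative assumptions, not values from the original code. The heap limit has to be set before launching any kernel that uses device-side malloc, and the stack limit applies per thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Illustrative sizes only (assumptions): 64 MB device heap, 16 KB stack per thread.
    cudaError_t heapErr  = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024u * 1024u);
    cudaError_t stackErr = cudaDeviceSetLimit(cudaLimitStackSize, 16u * 1024u);
    printf("set heap limit:  %s\n", cudaGetErrorString(heapErr));
    printf("set stack limit: %s\n", cudaGetErrorString(stackErr));

    // Read the limits back; the driver may adjust the requested values.
    size_t heapBytes = 0, stackBytes = 0;
    cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);
    printf("heap limit:  %zu bytes\n", heapBytes);
    printf("stack limit: %zu bytes per thread\n", stackBytes);
    return 0;
}
```

If either call returns something other than cudaSuccess, the limit was not applied and the kernel is still running with the previous setting.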

It’s not possible to split the method into several parts and run them separately, because the code and its intermediate results are highly interconnected.

Has anyone seen something like this? How did you fix it?

Are you using malloc or new within the kernel? If so, I would check whether the returned pointer is null before using it. Device-side malloc returns a null pointer when the device heap is exhausted.
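
A minimal sketch of that check, assuming a hypothetical kernel (scratchKernel) in which every thread allocates a scratch buffer. The kernel name, buffer size, and launch configuration are made up for illustration. Failed allocations are counted rather than dereferenced, so the host can see whether the heap ran out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread allocates a scratch buffer on the device heap.
__global__ void scratchKernel(int *failed, size_t bytesPerThread)
{
    char *buf = static_cast<char *>(malloc(bytesPerThread));
    if (buf == nullptr) {
        // Device heap exhausted: record the failure instead of touching a null pointer.
        atomicAdd(failed, 1);
        return;
    }
    buf[0] = 1;  // only use the allocation after the null check
    free(buf);
}

int main()
{
    int *failed = nullptr;
    cudaMalloc(&failed, sizeof(int));
    cudaMemset(failed, 0, sizeof(int));

    // Illustrative launch: 64 blocks of 128 threads, 4 KB requested per thread.
    scratchKernel<<<64, 128>>>(failed, 4096);
    cudaError_t err = cudaDeviceSynchronize();

    int hostFailed = 0;
    cudaMemcpy(&hostFailed, failed, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sync status: %s, failed allocations: %d\n",
           cudaGetErrorString(err), hostFailed);

    cudaFree(failed);
    return 0;
}
```

A non-zero count means some threads could not get memory, which points at the heap size rather than at the kernel logic.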

I would also check how you launch the kernel, as you may be exceeding the per-block resource limits, such as registers or shared memory.
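
One way to look into this, sketched under assumptions: bigKernel stands in for the real kernel, and the launch configuration is invented. cudaFuncGetAttributes reports what the compiled kernel needs per thread and per block, and checking cudaGetLastError right after the launch separates launch-configuration failures from errors that happen while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real, much larger kernel.
__global__ void bigKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main()
{
    // Resource usage of the compiled kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, bigKernel);
    printf("registers per thread:         %d\n", attr.numRegs);
    printf("static shared mem per block:  %zu bytes\n", attr.sharedSizeBytes);
    printf("local mem per thread:         %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block:        %d\n", attr.maxThreadsPerBlock);

    const int blocks = 32, threads = 256;  // illustrative launch configuration
    float *out = nullptr;
    cudaMalloc(&out, blocks * threads * sizeof(float));

    bigKernel<<<blocks, threads>>>(out);
    cudaError_t launchErr = cudaGetLastError();      // invalid configuration, too many resources
    cudaError_t execErr   = cudaDeviceSynchronize(); // errors raised during execution

    printf("launch: %s\n", cudaGetErrorString(launchErr));
    printf("exec:   %s\n", cudaGetErrorString(execErr));

    cudaFree(out);
    return 0;
}
```

If the threads-per-block value used at launch is larger than the reported maxThreadsPerBlock for the kernel, reducing the block size is the first thing to try.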