I have quite a large kernel method (more than a thousand lines of code) which fails with cudaErrorUnknown.
The code itself seems fine: it works on the CPU and passes cuda-memcheck with zero errors. Yet the kernel still returns the error.
Can it be that the method is just too big for the device? I don’t see any other problems with it.
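One way to test the "too big for the device" idea is to ask the runtime what resources the compiled kernel needs, and to check launch-time and execution-time errors separately. Below is a minimal sketch. `bigKernel`, `reportKernelResources` and `launchAndCheck` are hypothetical names, and the tiny kernel body is only a stand-in for the real one, which is not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the large kernel; the real kernel is not shown in this post.
__global__ void bigKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Prints the per-thread resources the compiler assigned to the kernel.
// Returns the largest block size the kernel can be launched with, or -1 on failure.
int reportKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, bigKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block:   %d\n", attr.maxThreadsPerBlock);
    return attr.maxThreadsPerBlock;
}

// Launches the kernel and reports launch-time and execution-time errors separately.
cudaError_t launchAndCheck(float *dOut, int blocks, int threads)
{
    bigKernel<<<blocks, threads>>>(dOut);

    // Errors detected at launch (bad configuration, not enough resources).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel is actually running.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

If the block size used for the launch is larger than the reported max threads per block, the launch itself should fail. Whether that is what happens in my case is only a guess.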
When I comment out parts of the method it works (doesn’t return the error). Uncommenting just one more line causes cudaErrorUnknown again.
The line itself is harmless, and commenting out a different part of the kernel method has the same effect.
Increasing the limits with cudaLimitMallocHeapSize and cudaLimitStackSize doesn’t make a difference, which suggests that the method is not failing because it reached the heap or stack limit.
It’s not possible to break the method into several parts and run them concurrently, because the code and its intermediate results are highly interconnected.
Has anyone seen something like this? How did you fix it?
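For reference, a minimal sketch of raising the two limits and reading back what the runtime actually applied. `setAndReadLimit` is a hypothetical helper and the sizes in the usage note are illustrative, not the values from my code. The limits are set before the kernel is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Requests a new value for a device limit and reads back the value in effect.
// Returns the applied value in bytes, or 0 if either call fails.
size_t setAndReadLimit(cudaLimit limit, size_t requestedBytes)
{
    cudaError_t err = cudaDeviceSetLimit(limit, requestedBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 0;
    }

    size_t appliedBytes = 0;
    err = cudaDeviceGetLimit(&appliedBytes, limit);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 0;
    }
    return appliedBytes;
}
```

Calling it as `setAndReadLimit(cudaLimitStackSize, 16 * 1024)` and `setAndReadLimit(cudaLimitMallocHeapSize, 64u * 1024 * 1024)` confirms whether the larger values were accepted. The runtime may round the request up.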