CUDA error 77 (0x4d) when increasing problem size

Hi All,

I’ve created CUDA-enabled software which performs some matrix operations in parallel, and it works fine. The problem is that, starting from a certain problem size, I always get the same error: code 77, which I believe is cudaErrorIllegalAddress. The code is quite complicated, so there’s no point in pasting all of it here; just tell me what you want to know and I’ll explain or paste the relevant part. The awkward thing is that the algorithm is completely generic and performs exactly the same operations, just on larger datasets. The data isn’t that big: it generates only about 4% GPU load (CPU load is quite high, around 99%) and memory usage is only about 12 MB.

I know at this stage it’s like guessing, but what would be your suggestions at this point? What should I check? I tried to analyze the CUDA performance reports in Visual Studio, but I haven’t noticed anything awkward. I’m attaching one for the largest working problem size and one for the first size the algorithm fails on.
lastWorking.rar (1.6 MB)
firstFailing.rar (988 KB)

You can use cuda-memcheck or the CUDA debugger to find the exact location of the out-of-bounds memory access. Likely causes are an incorrect allocation size or an incorrect indexing computation.
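For reference, compiling with line information (nvcc -lineinfo) and then running the application under cuda-memcheck usually points at the exact source line of the bad access. It also helps to check the status after every kernel launch so you know which launch fails. Below is a minimal, self-contained sketch of that pattern; the kernel, sizes, and the gpuErrchk/gpuAssert names are placeholders, not your code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information when a CUDA call or kernel launch fails.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Placeholder kernel; substitute your own.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    gpuErrchk(cudaMalloc(&d_data, n * sizeof(float)));

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    gpuErrchk(cudaPeekAtLastError());     // catches launch-configuration errors
    gpuErrchk(cudaDeviceSynchronize());   // debugging only: forces asynchronous errors to surface here

    gpuErrchk(cudaFree(d_data));
    return 0;
}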

Did you analyze the attached files? There must be a clue in them. I’ve used both tools extensively and I still don’t see why it’s not working - it works for 11N but fails for 12N. It makes no sense.

You don’t seem to have responded to the suggestion.

In the failing case, did you run it with cuda-memcheck?

If so, were any errors reported?

You state that you have used both cuda-memcheck and the CUDA debugger extensively, but it is not clear how you have used them. They should allow you to pinpoint exactly where in the source code the “bad” memory access occurs, and what the nature of the access issue is (e.g. address out of bounds, or misaligned access). Have you gotten this far?

Once you have the failing address and where it occurs in the code, you should be able to trace back to where it originated. For example, for a misaligned access, you might want to look for a pointer cast from a narrower type to a wider type. For an out-of-bounds access you would first check whether the relevant memory allocation has the correct size (it may have been sized too small), and if that seems correct, trace back the indexing computation or pointer manipulation that produced the out-of-bounds address. This trace-back may span a considerable distance in the call chain.
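To make the allocation-versus-indexing case concrete, here is a hypothetical fragment (made-up names, not taken from your code) of the kind of mismatch that only starts failing once the problem size crosses some threshold:

// Hypothetical illustration: a buffer allocated for rows*cols elements,
// but indexed with a larger padded pitch, reads past the end of the allocation.
__global__ void scale(float *m, int rows, int cols, int pitchInElems)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows || c >= cols) return;   // bounds guard for the excess threads

    // Out of bounds whenever pitchInElems > cols but the buffer was
    // allocated with cudaMalloc(&m, rows * cols * sizeof(float)).
    m[r * pitchInElems + c] *= 2.0f;
}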

The important thing in any debugging process is to proceed methodically, even if it may take a long time. First determine the exact point of failure (the tools will help you with that), then trace back to the root cause. Personally, I like sprinkling printf() through the code for the trace-back, thus creating a log of the activity that led up to the failure. Such logs work better for me than inspecting variables in a debugger, but debugging preferences differ widely from person to person.
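Device-side printf() works for this kind of logging as well; a minimal sketch with made-up names, printing just before an index would go out of range:

// Hypothetical sketch: log the offending thread and index before the access.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int idx = i;                        // stand-in for whatever indexing computation you use
    if (idx < 0 || idx >= n) {
        printf("block %d thread %d: idx %d outside [0, %d)\n",
               blockIdx.x, threadIdx.x, idx, n);
        return;
    }
    data[idx] += 1.0f;
}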

Ok, found an answer:

size_t newHeapSize = 1024 * 1000 * megabytesToUse;
gpuErrchk(cudaDeviceSetLimit(cudaLimitMallocHeapSize, newHeapSize));
printf("Adjusted heap size to be %d\n",(int) newHeapSize);

The heap for in-kernel memory allocation was being exhausted; the default is only about 8 MB on my setup.
As you can probably see, there was no apparent reason for the failure visible in the code itself.
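For anyone who runs into the same thing: when the device heap is exhausted, in-kernel malloc() returns NULL, and dereferencing that pointer is what surfaces as the illegal-address error. A small defensive check along these lines (a sketch with a hypothetical kernel, not my actual code) would have pointed at the real cause right away:

// Sketch: detect device-heap exhaustion instead of crashing on a NULL pointer.
__global__ void worker(int elemsPerThread)
{
    float *scratch = (float *)malloc(elemsPerThread * sizeof(float));
    if (scratch == NULL) {
        printf("thread %d: device malloc failed - heap exhausted?\n",
               blockIdx.x * blockDim.x + threadIdx.x);
        return;                         // bail out instead of triggering error 77
    }

    // ... use scratch ...

    free(scratch);
}

Note that, as far as I know, cudaLimitMallocHeapSize has to be set before launching any kernel that uses device-side malloc().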