Program hit cudaErrorIllegalAddress (error 700) [...] on CUDA API call to cudaDeviceSynchronize

Hi there

My CUDA program crashes consistently for large inputs and occasionally for small ones.
I used CUDA-MEMCHECK to look for out-of-bounds memory accesses and fixed the ones I found.
I am still getting crashes, however: CUDA-MEMCHECK reports them occurring inside cudaDeviceSynchronize, and Nsight reports the same error in cuCtxSynchronize.
I’ve run out of debugging options, so I’d be very happy for any advice on how to debug this.

Thanks,
Joel

Full CUDA-MEMCHECK output:
========= CUDA-MEMCHECK
PASSED ebs_copy_test
PASSED ebs_num_test
Allocating Memory…
Initializing Reference Sequence…
Allocating Memory (171B) for 9 Reads
Initializing Reads…
Starting Kernel…
========= Error: process didn’t terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under a host debugger to catch such errors.
========= Program hit cudaErrorIllegalAddress (error 700) due to “an illegal memory access was encountered” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvmdi.inf_amd64_b5c7e9f1cc7d29c6\nvcuda64.dll (cuProfilerStop + 0x9da58) [0x2ccdb8]
========= Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvmdi.inf_amd64_b5c7e9f1cc7d29c6\nvcuda64.dll (cuProfilerStop + 0xa011a) [0x2cf47a]
========= Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvmdi.inf_amd64_b5c7e9f1cc7d29c6\nvcuda64.dll [0x8035e]
========= Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvmdi.inf_amd64_b5c7e9f1cc7d29c6\nvcuda64.dll (cuProfilerStop + 0x1229fa) [0x351d5a]
========= Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvmdi.inf_amd64_b5c7e9f1cc7d29c6\nvcuda64.dll (cuProfilerStop + 0x13db82) [0x36cee2]
========= Host Frame:C:\Users\joel\source\repos\genasm-gpu\genasm_gpu.exe (cudart::cudaApiChooseDevice + 0x41) [0x18e1]
========= Host Frame:C:\Users\joel\source\repos\genasm-gpu\genasm_gpu.exe (cudart::cudaApiStreamEndCapture_ptsz + 0x33) [0x10703]
========= Host Frame:C:\Users\joel\source\repos\genasm-gpu\genasm_gpu.exe (cudaGetErrorName + 0x15) [0x18305]
========= Host Frame:C:\Users\joel\source\repos\genasm-gpu\genasm_gpu.exe (cudaGraphExecKernelNodeSetParams + 0x3) [0x1c2d3]
========= Host Frame:C:\Users\joel\source\repos\genasm-gpu\genasm_gpu.exe (cudaHostAlloc + 0x124) [0x20514]
========= Host Frame:C:\Windows\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17034]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x52651]
=========
========= No CUDA-MEMCHECK results found

Hi, could you provide the source code for this issue?

Hi
Thanks for the reply, I completely forgot about this post…

Through trial and error(s) I eventually figured out that it was an out-of-bounds access by a GPU kernel into a large block (2 GB) of unified memory. My best guess is that CUDA-MEMCHECK cannot deal with such large memory blocks, since the error message above is not very helpful.
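
For anyone landing here with the same error, below is a minimal sketch of a guard that can help localize such an access when the tool output points only at the synchronize call. Kernel launches are asynchronous, so an illegal access made inside a kernel is typically reported by a later call such as cudaDeviceSynchronize rather than at the launch itself. Everything in the sketch is hypothetical and not taken from the actual program: the names `guarded_write`, `run_guarded`, `kNoOob` and `CUDA_CHECK`, the one-byte-per-thread access pattern, and the sizes are assumptions made for illustration only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Sentinel meaning "no out-of-range index was recorded".
constexpr unsigned long long kNoOob = ~0ULL;

// Hypothetical kernel: thread t writes one byte at index offset + t.
// An out-of-range index is recorded in *oob instead of being dereferenced.
__global__ void guarded_write(unsigned char *buf, size_t buf_size,
                              size_t offset, size_t n,
                              unsigned long long *oob)
{
    // 64-bit arithmetic, so indices into multi-gigabyte buffers do not wrap.
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    size_t idx = offset + tid;
    if (idx >= buf_size) {
        // Keep the first bad index that arrives; later ones are ignored.
        atomicCAS(oob, kNoOob, (unsigned long long)idx);
        return;
    }
    buf[idx] = 1;
}

// Runs the kernel on a managed buffer and returns a recorded out-of-range
// index, or kNoOob if every access was inside the buffer.
unsigned long long run_guarded(size_t buf_size, size_t offset, size_t n)
{
    if (n == 0) return kNoOob;  // a launch with zero blocks is invalid

    unsigned char *buf = nullptr;
    unsigned long long *oob = nullptr;
    CUDA_CHECK(cudaMallocManaged(&buf, buf_size));
    CUDA_CHECK(cudaMallocManaged(&oob, sizeof(*oob)));
    *oob = kNoOob;

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    guarded_write<<<blocks, threads>>>(buf, buf_size, offset, n, oob);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    unsigned long long result = *oob;
    CUDA_CHECK(cudaFree(buf));
    CUDA_CHECK(cudaFree(oob));
    return result;
}
```

With these assumptions, calling `run_guarded(1024, 1016, 16)` from `main` returns one index between 1024 and 1031 (which one depends on thread scheduling), while `run_guarded(1024, 0, 1024)` returns `kNoOob`. Checking after every launch in a debug build at least narrows the failure down to a single kernel.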

If helpful, I can provide the corresponding version of the source code. Since this is part of unpublished research work, I cannot do so publicly (yet).

Thanks,
Joel

If you get a chance, please try the compute-sanitizer tool as a drop-in replacement for cuda-memcheck.

Will do, thanks!