Illegal memory access crash

Hi,
I’m using the following CUDA kernel:

But in 1 out of 5 runs of my code I’m getting:
ComputeVectorNorm: an illegal memory access was encountered(700)

As can be seen below, ComputeVectorNorm uses the above CUDA kernel:

The stack trace is as follows:
*** SIGABRT (@0x7d000004de5) received by PID 19941 (TID 0x7f13bf56a700) from PID 19941; stack trace: ***

@ 0x7f13ca8a3980 (unknown)
@ 0x7f13c7afcfb7 gsignal
@ 0x7f13c7afe921 abort
@ 0x7f13c84f1957 (unknown)
@ 0x7f13c84f7ae6 (unknown)
@ 0x7f13c84f7b21 std::terminate()
@ 0x7f13c84f7d54 __cxa_throw
@ 0x55fe5cea7bd1 pba::ProgramCU::CheckErrorCUDA()
@ 0x55fe5ceb2fc5 pba::ProgramCU::ComputeVectorNorm()
@ 0x55fe5cea28d0 pba::SparseBundleCU::SolveNormalEquationPCGB()
@ 0x55fe5cea6242 pba::SparseBundleCU::NonlinearOptimizeLM()
@ 0x55fe5cea6f3c pba::SparseBundleCU::BundleAdjustment()
@ 0x55fe5cea6fa6 pba::SparseBundleCU::RunBundleAdjustment()
@ 0x55fe5cb68256 colmap::ParallelBundleAdjuster::Solve()
@ 0x55fe5cbe8040 colmap::IncrementalMapper::AdjustParallelGlobalBundle()
@ 0x55fe5cac19cc colmap::(anonymous namespace)::AdjustGlobalBundle()
@ 0x55fe5cac1c1f colmap::(anonymous namespace)::IterativeGlobalRefinement()
@ 0x55fe5cac25d7 colmap::IncrementalMapperController::Reconstruct()
@ 0x55fe5cac47db colmap::IncrementalMapperController::Run()
@ 0x55fe5cc5dbbc colmap::Thread::RunFunc()
@ 0x7f13c85226df (unknown)
@ 0x7f13ca8986db start_thread
@ 0x7f13c7bdf71f clone

I’m using two T4 GPUs through GCP.

Any idea how to solve or debug this issue?

Thanks

Compile the code with -lineinfo, then use compute-sanitizer to locate the error.

Thanks
Do I need to add -lineinfo to CMAKE_CXX_FLAGS in the CMakeLists.txt file?

Afterwards, do I need to run my program with the prefix compute-sanitizer?

Yes, instead of ./program you would run compute-sanitizer ./program. You can find the manual here: Compute Sanitizer User Manual

-lineinfo is a compiler flag for NVCC. I do not know how to set this with CMake.

bash: compute-sanitizer: command not found

I don’t have compute-sanitizer on my system, so I ran the program with cuda-memcheck instead, but I couldn’t reproduce the crash. The program gets really slow when running with this prefix.

cuda-memcheck is fine, too, but it is deprecated in favor of compute-sanitizer.
It is expected that the program runs very slowly with those tools.

If the program always runs without error under cuda-memcheck, this indicates a race condition of some kind, as suggested by your comment “But in 1 out of 5 runs of my code I’m getting”. It might be a conflict between multiple CPU threads, or a conflict between different CUDA streams; I could not tell.

When I have this kind of problem in my code, I often add cudaDeviceSynchronize + error checking after each CUDA call. If this makes the issue go away, I remove the cudaDeviceSynchronize calls one by one until the error reappears.

Thanks again

  1. I’ve tried adding cudaDeviceSynchronize before and after the call to “vector_norm_kernel”, but the error still appears.
  2. What do you mean by adding error checking after each CUDA call? How would you do it in the case of my code?

Check the return code of each CUDA call, i.e. cudaMalloc, cudaFree, cudaMemcpy, cudaDeviceSynchronize, etc.
For kernel launches, you need to check cudaGetLastError and then check the return code of cudaDeviceSynchronize.
Be aware that with multiple CPU threads, a CUDA error caused by one thread may be observed in a different thread.
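
To make this concrete, here is a minimal sketch of what such checking could look like. CUDA_CHECK and scale_kernel are hypothetical names made up for this example, not taken from your code; the kernel is only a placeholder for the real one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the code in this thread):
// print file/line and the CUDA error string, then exit, if a call fails.
#define CUDA_CHECK(call)                                               \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      std::fprintf(stderr, "%s:%d: %s failed: %s (%d)\n", __FILE__,    \
                   __LINE__, #call, cudaGetErrorString(err_),          \
                   static_cast<int>(err_));                            \
      std::exit(EXIT_FAILURE);                                         \
    }                                                                  \
  } while (0)

// Placeholder kernel standing in for the real one.
__global__ void scale_kernel(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  float* d_data = nullptr;

  // Every runtime API call is wrapped, not only the one after the kernel.
  CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
  CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

  scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
  CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
  CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

  CUDA_CHECK(cudaFree(d_data));
  return 0;
}
```

Because kernel launches are asynchronous, an illegal memory access reported at one check may have been caused by an earlier kernel. Checking after every call narrows down which call first reports the error.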

As you can see in the code attached above, there is a call to cudaGetLastError,
which in the case of the crash returns:
cudaGetLastError: an illegal memory access was encountered(700)
I’ve also added a call to cudaDeviceSynchronize (after the call to cudaGetLastError)
and it also returns:
cudaDeviceSynchronize: an illegal memory access was encountered(700)

There are multiple unchecked API calls in your program. I was not talking only about the check after your kernel.
For example, how do you know that buf.InitTexture is successful? If it is not, of course the kernel will fail.

If you have determined that the error originates from your kernel, you could dump the input values of each invocation to file and create a minimal reproducer for your bug.
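
As a rough sketch of the dumping step, assuming the kernel input is a plain array of floats in device memory (dump_device_floats and the file name are hypothetical, chosen only for this example):

```cuda
#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copy n floats from device memory to the host and
// write them to a binary file. Returns false if the copy or the write fails.
bool dump_device_floats(const float* d_ptr, std::size_t n, const char* path) {
  std::vector<float> host(n);
  cudaError_t err = cudaMemcpy(host.data(), d_ptr, n * sizeof(float),
                               cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  std::FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  std::size_t written = std::fwrite(host.data(), sizeof(float), n, f);
  std::fclose(f);
  return written == n;
}

int main() {
  // Stand-in for a real kernel input buffer.
  const std::size_t n = 16;
  float* d_in = nullptr;
  if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 1;
  if (cudaMemset(d_in, 0, n * sizeof(float)) != cudaSuccess) return 1;

  // In the real program this would be called right before each launch,
  // with a per-invocation counter in the file name.
  bool ok = dump_device_floats(d_in, n, "kernel_input_0.bin");

  cudaFree(d_in);
  return ok ? 0 : 1;
}
```

If the copy itself fails for the pointer and size that are about to be passed to the kernel, that already suggests the buffer is not what the kernel expects.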

Regarding checking the API calls, I will try it. But if I see that there is also an invalid memory access in another location, for example in buf.InitTexture, what can I do about it? What would be the reason for this invalid memory access?
Thanks

Well, when initTexture fails, you try to access invalid memory in the kernel.

I’ve tested the return value of initTexture (which checks the cudaMalloc return value)
and there wasn’t any error before the crash in the vector_norm_kernel.

Okay. I cannot give you more suggestions. If the error is in the kernel, try to create a minimal executable reproducer which can be used for further debugging.

Hi @striker159
I didn’t manage to reproduce it using a minimal executable… any other suggestions?

Not sure if you have tried it, but cuda-memcheck can check for shared memory races with its racecheck tool:

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#racecheck-tool
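
For illustration, here is the kind of shared memory hazard racecheck looks for. This is only a sketch of a generic block-level sum of squares, not the kernel from this thread; sum_squares_kernel and kThreads are made-up names, and the kernel assumes it is launched with exactly kThreads threads per block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size; must be a power of two here

// Each block writes the sum of squares of its slice of x to block_sums.
__global__ void sum_squares_kernel(const float* x, int n, float* block_sums) {
  __shared__ float partial[kThreads];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  partial[tid] = (i < n) ? x[i] * x[i] : 0.0f;
  __syncthreads();

  // Tree reduction in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) partial[tid] += partial[tid + stride];
    __syncthreads();  // without this barrier, threads race on partial[]
  }
  if (tid == 0) block_sums[blockIdx.x] = partial[0];
}

int main() {
  const int n = 1000;
  const int blocks = (n + kThreads - 1) / kThreads;
  std::vector<float> h_x(n, 1.0f);

  float* d_x = nullptr;
  float* d_sums = nullptr;
  if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return 1;
  if (cudaMalloc(&d_sums, blocks * sizeof(float)) != cudaSuccess) return 1;
  cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  sum_squares_kernel<<<blocks, kThreads>>>(d_x, n, d_sums);

  // This blocking copy also reports errors from the kernel above.
  std::vector<float> h_sums(blocks);
  cudaError_t err = cudaMemcpy(h_sums.data(), d_sums, blocks * sizeof(float),
                               cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  float total = 0.0f;
  for (float s : h_sums) total += s;
  std::printf("sum of squares = %f\n", total);

  cudaFree(d_x);
  cudaFree(d_sums);
  return 0;
}
```

Running the binary as cuda-memcheck --tool racecheck ./program should report a hazard if the __syncthreads() inside the loop is removed. Note that racecheck only covers shared memory, so it would not catch races on global memory between streams or CPU threads.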