Illegal memory access crash

user137592 · January 17, 2022, 4:15pm

Hi,
I’m using the following CUDA kernel:

cbalint13/pba/blob/610a945fc09c9a95884a8faad5c51ba77b0d96ed/src/pba/ProgramCU.cu#L292

    __global__ void vector_norm_kernel(const float* x, int len, int blen, float* result)
    {
        __shared__ float value[256];
        int bstart = blen * blockIdx.x;
        int start  = bstart + threadIdx.x;
        int end    = min(len, bstart + blen);

        float v = 0;
        for (int i = start; i < end; i += blockDim.x)
        {
            float temp = x[i];
            v += (temp * temp);
        }
        value[threadIdx.x] = v;
        // reduce to the first two values
        WARP_REDUCTION_256(value);

        // write back
        if (threadIdx.x == 0) result[blockIdx.x] = (value[0] + value[1]);
    }

But in 1 out of 5 runs of my code I’m getting:
ComputeVectorNorm: an illegal memory access was encountered(700)

As can be seen below, ComputeVectorNorm launches the above CUDA kernel:

github.com

cbalint13/pba/blob/610a945fc09c9a95884a8faad5c51ba77b0d96ed/src/pba/ProgramCU.cu#L316

    double ProgramCU::ComputeVectorNorm(CuTexImage& vector, CuTexImage& buf)
    {
        const unsigned int nblock = REDUCTION_NBLOCK;
        unsigned int bsize = 256;
        int len  = vector.GetLength();
        int blen = ((len + nblock - 1) / nblock + bsize - 1) / bsize * bsize;

        ////////////////////////////////
        dim3 grid(nblock), block(bsize);

        /////////////////////////////////
        buf.InitTexture(nblock, 1);
        vector_norm_kernel<<<grid, block>>>(vector.data(), len, blen, buf.data());
        ProgramCU::CheckErrorCUDA("ComputeVectorNorm");

        float data[nblock];
        buf.CopyToHost(data);
        double result = 0;
        for (unsigned int i = 0; i < nblock; ++i) result += data[i];
        return result;
    }

The stack trace is as follows:
*** SIGABRT (@0x7d000004de5) received by PID 19941 (TID 0x7f13bf56a700) from PID 19941; stack trace: ***

@ 0x7f13ca8a3980 (unknown)
@ 0x7f13c7afcfb7 gsignal
@ 0x7f13c7afe921 abort
@ 0x7f13c84f1957 (unknown)
@ 0x7f13c84f7ae6 (unknown)
@ 0x7f13c84f7b21 std::terminate()
@ 0x7f13c84f7d54 __cxa_throw
@ 0x55fe5cea7bd1 pba::ProgramCU::CheckErrorCUDA()
@ 0x55fe5ceb2fc5 pba::ProgramCU::ComputeVectorNorm()
@ 0x55fe5cea28d0 pba::SparseBundleCU::SolveNormalEquationPCGB()
@ 0x55fe5cea6242 pba::SparseBundleCU::NonlinearOptimizeLM()
@ 0x55fe5cea6f3c pba::SparseBundleCU::BundleAdjustment()
@ 0x55fe5cea6fa6 pba::SparseBundleCU::RunBundleAdjustment()
@ 0x55fe5cb68256 colmap::ParallelBundleAdjuster::Solve()
@ 0x55fe5cbe8040 colmap::IncrementalMapper::AdjustParallelGlobalBundle()
@ 0x55fe5cac19cc colmap::(anonymous namespace)::AdjustGlobalBundle()
@ 0x55fe5cac1c1f colmap::(anonymous namespace)::IterativeGlobalRefinement()
@ 0x55fe5cac25d7 colmap::IncrementalMapperController::Reconstruct()
@ 0x55fe5cac47db colmap::IncrementalMapperController::Run()
@ 0x55fe5cc5dbbc colmap::Thread::RunFunc()
@ 0x7f13c85226df (unknown)
@ 0x7f13ca8986db start_thread
@ 0x7f13c7bdf71f clone

I’m using two T4 GPUs on GCP.

Any idea how to solve or debug this issue?

Thanks

striker159 · January 17, 2022, 4:57pm

Compile the code with -lineinfo, then use compute-sanitizer to locate the error.

user137592 · January 17, 2022, 5:14pm

Thanks.
Do I need to add -lineinfo to CMAKE_CXX_FLAGS in the CMakeLists.txt file?

Afterwards, do I need to run my program with the prefix compute-sanitizer?

striker159 · January 17, 2022, 5:49pm

Yes, instead of ./program you would run compute-sanitizer ./program. You can find the manual here: Compute Sanitizer User Manual

-lineinfo is a compiler flag for NVCC. I do not know how to set this with CMake.

user137592 · January 17, 2022, 7:56pm

bash: compute-sanitizer: command not found

I don’t have compute-sanitizer on my system, but I tried running the program with cuda-memcheck and couldn’t reproduce the error. The program gets really slow when run with this prefix.

striker159 · January 18, 2022, 6:43am

cuda-memcheck is fine, too, but it is deprecated in favor of compute-sanitizer.
It is expected that the program runs very slowly with those tools.

If the program always runs without error under cuda-memcheck, this indicates a race condition of some kind, as suggested by your comment “But in 1 out of 5 runs of my code I’m getting”. It might be a conflict between multiple CPU threads or a conflict between different CUDA streams; I could not tell.

When I have this kind of problem in my code, I often add cudaDeviceSynchronize + error checking after each CUDA call. If this solves the issue, I remove the cudaDeviceSynchronize calls one at a time until the error reappears.

user137592 · January 18, 2022, 8:52am

Thanks again

I’ve tried adding cudaDeviceSynchronize before and after the call to vector_norm_kernel, but the error still appears.
What do you mean by adding error checking after each CUDA call? How would you do it in the case of my code?

striker159 · January 18, 2022, 9:12am

Check the return code of each CUDA call, i.e. cudaMalloc, cudaFree, cudaMemcpy, cudaDeviceSynchronize, etc.
For kernel launches, you need a check of cudaGetLastError followed by checking the return code of cudaDeviceSynchronize.
Be aware that with multiple CPU threads, a CUDA error caused in one thread may be observed in a different thread.
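
The checking pattern described above could look roughly like the sketch below. The macro names CUDA_CHECK and CUDA_CHECK_KERNEL and the kernel scale_kernel are hypothetical placeholders, not part of PBA or of the code in the question; only the CUDA runtime calls themselves are real API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: report file, line and error string when a runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s:%d: %s failed: %s (%d)\n", __FILE__,  \
                         __LINE__, #call, cudaGetErrorString(err_),        \
                         static_cast<int>(err_));                          \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Hypothetical helper: use directly after a kernel launch.
// cudaGetLastError reports launch failures (for example a bad configuration);
// cudaDeviceSynchronize waits for the kernel and reports errors raised while it ran.
#define CUDA_CHECK_KERNEL()                  \
    do {                                     \
        CUDA_CHECK(cudaGetLastError());      \
        CUDA_CHECK(cudaDeviceSynchronize()); \
    } while (0)

// Placeholder kernel, not the kernel from the question.
__global__ void scale_kernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    CUDA_CHECK_KERNEL();

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Because kernel launches are asynchronous, an illegal access in an earlier kernel is reported by whichever checked call comes next. Placing a check after every call therefore narrows down which launch actually faulted, rather than only the one where the error happened to surface.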

user137592 · January 18, 2022, 10:12am

As you can see in the code attached above, there is a call to cudaGetLastError, which in the case of the crash returns:
cudaGetLastError: an illegal memory access was encountered(700)
I’ve also added a call to cudaDeviceSynchronize (after the call to cudaGetLastError), and it also returns:
cudaDeviceSynchronize: an illegal memory access was encountered(700)

striker159 · January 18, 2022, 10:28am

There are multiple unchecked API calls in your program. I was not talking only about the check after your kernel.
For example, how do you know that buf.InitTexture is successful? If it is not, of course the kernel will fail.

If you have determined that the error originates from your kernel, you could dump the input values of each invocation to a file and create a minimal reproducer for your bug.
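
A standalone reproducer for this kernel might be sketched as follows. Several things here are assumptions: the value 32 stands in for REDUCTION_NBLOCK, the input is synthetic (all ones) rather than dumped values, and a plain tree reduction replaces WARP_REDUCTION_256, whose definition is not shown in the thread. The names vector_norm_repro and CHECK are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Same indexing as vector_norm_kernel; the reduction is a swapped-in
// tree reduction, NOT the original WARP_REDUCTION_256 macro.
__global__ void vector_norm_repro(const float* x, int len, int blen, float* result)
{
    __shared__ float value[256];
    int bstart = blen * blockIdx.x;
    int start  = bstart + threadIdx.x;
    int end    = min(len, bstart + blen);

    float v = 0;
    for (int i = start; i < end; i += blockDim.x) {
        float temp = x[i];
        v += temp * temp;
    }
    value[threadIdx.x] = v;
    __syncthreads();

    // Requires blockDim.x to be a power of two no larger than 256.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) value[threadIdx.x] += value[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) result[blockIdx.x] = value[0];
}

int main(int argc, char** argv)
{
    const unsigned int nblock = 32;  // assumption: stand-in for REDUCTION_NBLOCK
    const unsigned int bsize  = 256;
    const int len = (argc > 1) ? std::atoi(argv[1]) : 1000003;
    if (len <= 0) { std::fprintf(stderr, "len must be positive\n"); return 1; }

    // Same per-block length computation as ComputeVectorNorm.
    const int blen = static_cast<int>(((len + nblock - 1) / nblock + bsize - 1) / bsize * bsize);

    std::vector<float> h_x(len, 1.0f);  // all ones, so the squared norm equals len
    float* d_x = nullptr;
    float* d_result = nullptr;
    CHECK(cudaMalloc(&d_x, len * sizeof(float)));
    CHECK(cudaMalloc(&d_result, nblock * sizeof(float)));
    CHECK(cudaMemcpy(d_x, h_x.data(), len * sizeof(float), cudaMemcpyHostToDevice));

    vector_norm_repro<<<nblock, bsize>>>(d_x, len, blen, d_result);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    std::vector<float> h_result(nblock);
    CHECK(cudaMemcpy(h_result.data(), d_result, nblock * sizeof(float), cudaMemcpyDeviceToHost));
    double sum = 0;
    for (unsigned int i = 0; i < nblock; ++i) sum += h_result[i];

    std::printf("len=%d blen=%d sum=%.1f\n", len, blen, sum);
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_result));
    return (sum == static_cast<double>(len)) ? 0 : 1;
}
```

Passing the failing len on the command line (and, if available, loading the dumped values instead of ones) brings the run closer to the real invocation. If a harness like this stays clean under cuda-memcheck with the same len, that would suggest looking outside the kernel, for example at whether len matches the actual allocation behind vector.data() or at an earlier asynchronous error.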

user137592 · January 18, 2022, 11:04am

Regarding checking the API calls, I will try it. But if I see that there is also an invalid memory access in another location, for example in buf.InitTexture, what can I do about it? What would be the reason for that invalid memory access?
Thanks

striker159 · January 18, 2022, 11:09am

Well, when InitTexture fails, you try to access invalid memory in the kernel.

user137592 · January 18, 2022, 11:35am

I’ve tested the return value of InitTexture (which checks the cudaMalloc return value),
and there wasn’t any error before the crash in vector_norm_kernel.

striker159 · January 18, 2022, 11:44am

Okay. I cannot give you more suggestions. If the error is in the kernel, try to create a minimal executable reproducer which can be used for further debugging.

user137592 · January 30, 2022, 8:47am

Hi @striker159
I didn’t manage to reproduce it using a minimal executable… any other suggestions?

rs277 · January 30, 2022, 6:18pm

Not sure if you have tried it, but cuda-memcheck can check for shared memory races with its racecheck tool:

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#racecheck-tool
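
To illustrate the kind of hazard racecheck is meant to report, here is a small sketch with a deliberate shared-memory race. The kernel racy_block_sum is hypothetical and written only for this illustration; it is not a claim that the kernel in the question contains such a race.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel with a deliberate shared-memory race.
__global__ void racy_block_sum(const float* in, float* out)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];

    // BUG (deliberate): there is no __syncthreads() here, so thread 0 can
    // read entries of buf[] that other warps have not written yet.
    // Adding __syncthreads(); at this point removes the hazard.

    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (unsigned int i = 0; i < blockDim.x; ++i) s += buf[i];
        out[blockIdx.x] = s;
    }
}

int main()
{
    const int nblock = 4, bsize = 256, n = nblock * bsize;
    float* d_in = nullptr;
    float* d_out = nullptr;

    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_out, nblock * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_in, 0, n * sizeof(float));

    racy_block_sum<<<nblock, bsize>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("kernel finished: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Assuming the binary is built as ./racy, it would be run as cuda-memcheck --tool racecheck ./racy (or compute-sanitizer --tool racecheck ./racy on newer toolkits). Note that racecheck looks at shared memory hazards only; out-of-bounds global accesses are the domain of the default memcheck tool.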
