UVM CPU Fault on an Empty Kernel

So I’ve stared at this for a while now.

I’ve got a fairly lengthy piece of code, so I’ll try to demonstrate what I am doing. In short, it queues up lots of kernels in streams and runs through them.

cuDoubleComplex **submesh, **submesh_t;

  for(int stream = 0; stream < total_streams; ++stream){

    int last = min; //Defined previously
    for(int wp = min; wp<=max; ++wp){

      kernel1 <<< 1, 64, 0, streams[stream] >>> (/* arguments omitted */);

      transfer_kernel <<< dimBlock, dimGrid, 0, streams[stream] >>>
        (submesh_t[stream], wtransfer, 2, subgrid_size);
    }
  }

//More post-processing after this...

However, it’s really odd: all the kernels inside the inner loop seemingly run just fine, but the outer-loop kernel always crashes. I’ve reduced that kernel to an empty body, so the CUDA runtime is launching an empty kernel with the block/grid parameters I pass in, yet it always fails with a really cryptic error message, and cuda-memcheck yields:

========= Fatal UVM CPU fault due to invalid operation                         
=========     during read access to address 0x1357c92000                       
========= ERROR SUMMARY: 1 error

Everything being passed into these kernels is allocated with unified memory; anything that isn’t is usually just a plain value like an int or double, passed as an ordinary parameter.

I’ve never seen this error before, and googling around yields nothing. I am running nowhere near the card’s maximum memory. Any suggestions would be appreciated.
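For reference, the managed buffers are allocated with something along these lines (a simplified sketch; the real sizes and names differ):

```cuda
// Illustrative only: an array of per-stream pointers, each pointing
// at its own managed buffer. This mirrors the double indirection above.
cudaMallocManaged(&submesh_t, total_streams * sizeof(cuDoubleComplex*));
for(int stream = 0; stream < total_streams; ++stream){
  cudaMallocManaged(&submesh_t[stream],
                    subgrid_size * subgrid_size * sizeof(cuDoubleComplex));
}
```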

try using proper CUDA error checking

you’ve got your block and grid dimensions reversed on your outer-loop kernel. Grid dimensions come first, then block
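i.e. the launch configuration is ordered like this (kernel and variable names here are just illustrative):

```cuda
// <<< gridDim, blockDim, sharedMemBytes, stream >>>
dim3 dimGrid(num_blocks);          // blocks in the grid — comes FIRST
dim3 dimBlock(threads_per_block);  // threads per block — comes SECOND
my_kernel <<< dimGrid, dimBlock, 0, stream >>> (/* args */);
```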

I got the names the wrong way around for the block and grid; functionally they are correct, as shown by my other code, but I guess I should have spotted that…

I did some proper checking in cuda-gdb and it’s coming up as an Invalid Managed Memory Access. I wrap all my CUDA API calls in a standard error-check macro, like most people do.
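(The macro is the usual sort of thing; roughly this, with the exact wording differing in my code:)

```cuda
// Typical CUDA API error-check macro (illustrative; mine is equivalent).
#define cudaError_check(call)                                         \
  do {                                                                \
    cudaError_t err__ = (call);                                       \
    if (err__ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
              cudaGetErrorString(err__), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)
```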

I don’t quite understand why it would be an Invalid Managed Memory Access in this case however.

for(int chunk = 0; chunk < total_chunks; ++chunk){

    for(int p = p_min; p<=p_max; ++p){

      fft_shift_kernel <<< dimBlock, dimGrid, 0, streams[chunk] >>> (subgrids[chunk], subgrid_size);
      cuFFTError_check(cufftExecZ2Z(fft_plan, subgrids[chunk], subgrids[chunk], CUFFT_INVERSE));
      fft_shift_kernel <<< dimBlock, dimGrid, 0, streams[chunk] >>> (subgrids[chunk], subgrid_size);
      cublas_mmul(handle, subgrids[chunk], transfer, subimgs[chunk], subgrid_size);
      last_p = p;
    }
  }

  //Check it actually ran...
  cudaError_t err = cudaGetLastError();
  std::cout << "Error: " << cudaGetErrorString(err) << "\n";

  for(int chunk = 0; chunk < total_chunks; ++chunk){

    transfer_kernel <<< dimBlock, dimGrid, 0, streams[chunk] >>>
      (subimgs[chunk], transfer, 2, subgrid_size);
  }

For example, this code reports no errors from the device (tested with cudaGetLastError both before and after cudaDeviceSynchronize), yet the last kernel seems to have a problem with the subimgs[chunk] pointer I am passing to it. It’s definitely allocated with cudaMallocManaged, and the kernels inside the first loop have no problems with it while I am queueing up kernels in the streams. But once execution leaves that last loop, it seems like all the cudaMallocManaged pointers get invalidated, or the CUDA context is being invalidated. I am not too sure.

subimgs and transfer are just cuDoubleComplex array pointers. I would have thought that if there were an invalid access in the first loop, it would have invalidated the context, and surely CUDA would report that?

Thanks again.

Actually, I am pretty sure I can avoid this with a better method of allocating memory, instead of multiple layers of indirection. I will report back once I’ve debugged some more.
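For what it’s worth, the plan is to replace the pointer-to-pointer layout with one flat managed allocation and index into it, roughly like this (a sketch, not the final code):

```cuda
// One contiguous managed buffer instead of an array of managed pointers.
cuDoubleComplex *subimgs_flat;
size_t chunk_elems = (size_t)subgrid_size * subgrid_size; // assumed chunk size
cudaMallocManaged(&subimgs_flat,
                  total_chunks * chunk_elems * sizeof(cuDoubleComplex));
// Chunk i then lives at subimgs_flat + i * chunk_elems,
// so no nested managed pointers are ever dereferenced on the device.
```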