Segfault in cudaMemcpy after system("") in L4T 32.3.1

The following code runs fine on L4T 32.2.2 but segfaults on L4T 32.3.1:

#include <unistd.h>
#include <cstdint>   // uint8_t
#include <cstdlib>   // system()
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

using namespace std;

int main()
{
  while(true)
  {
    sleep(2);
    uint8_t data_host;
    system("");
    uint8_t* data_device;
    if(cudaMalloc(&data_device, sizeof(data_host)) != cudaSuccess)
    {
      cout<<"cudaMalloc failed"<<endl;
    }
    if(cudaMemcpy(data_device, &data_host, sizeof(data_host), cudaMemcpyHostToDevice) != cudaSuccess)
    {
      cout<<"cudaMemcpy failed"<<endl;
    }
    if(cudaFree(data_device) != cudaSuccess)
    {
      cout<<"cudaFree failed"<<endl;
    }
  }
}

It seems the system("") call is somehow causing the problem. As far as I can tell, the command passed to system() is irrelevant; even an empty string triggers it.

Here is the stack trace:

Received SIGSEGV
./IcccSim(+0xf94)[0x5579912f94]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x7f92acb6c0]
/usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so(+0x1aa7c)[0x7f9169fa7c]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0x2a234c)[0x7f9196a34c]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0x2243e8)[0x7f918ec3e8]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0x11272c)[0x7f917da72c]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0x11281c)[0x7f917da81c]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0x1c0d70)[0x7f91888d70]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0x1c0fcc)[0x7f91888fcc]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0xf8578)[0x7f917c0578]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(+0xfbd34)[0x7f917c3d34]
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1(cuMemcpyHtoD_v2+0x60)[0x7f918dff98]
/usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart.so.10.0(+0x2f0b0)[0x7f92a3e0b0]

Update:

The code continues to segfault if I sleep up to 0.45 seconds between system("") and cudaMalloc(…), but it appears to work if I sleep 0.5 seconds or more between the two calls.

Interestingly, the code does not segfault if I run jetson_clocks before running it. This is not an acceptable solution for my end application, however.

Update:

The same code also segfaults in L4T 32.4.2.

Hi,

Please note that CUDA functions trigger GPU tasks that run asynchronously with respect to the CPU. Since the API calls do not always block, you need to add synchronization code before reusing the buffer.

It looks like no synchronization code has been added yet.
Would you mind adding it at the end of the loop and trying again?

Thanks.

I have added a call to cudaDeviceSynchronize() as shown below, but the behavior is still the same. If this is not what you meant by synchronization code, please provide more details.

#include <unistd.h>
#include <cstdint>   // uint8_t
#include <cstdlib>   // system()
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

using namespace std;

int main()
{
  while(true)
  {
    sleep(2);
    uint8_t data_host;
    system("");
    uint8_t* data_device;
    if(cudaMalloc(&data_device, sizeof(data_host)) != cudaSuccess)
    {
      cout<<"cudaMalloc failed"<<endl;
    }
    if(cudaMemcpy(data_device, &data_host, sizeof(data_host), cudaMemcpyHostToDevice) != cudaSuccess)
    {
      cout<<"cudaMemcpy failed"<<endl;
    }
    if(cudaFree(data_device) != cudaSuccess)
    {
      cout<<"cudaFree failed"<<endl;
    }
    if(cudaDeviceSynchronize() != cudaSuccess)
    {
      cout<<"cudaDeviceSynchronize failed"<<endl;
    }
  }
}

Hi,

Thanks for your testing.
Let us check this in our environment; we will update you with more information later.

Thanks.

Hi,

We tested your source (the original version) on rel-32.4.2 and it runs fine.

Are we missing anything needed to reproduce this issue?
How long does it take to reproduce?

By the way, may I know which nvpmodel mode you use?
And have you locked the clocks to the maximum with jetson_clocks?

Thanks.

I have reproduced the issue with all nvpmodel modes in L4T 32.3.1. The error typically occurs within 10 seconds.

I have not been able to reproduce the issue after calling jetson_clocks. I do not want to call jetson_clocks in my application because I want dynamic frequency scaling.

Hi,

Thanks for the update.

This makes sense to me.
Running jetson_clocks shortens the kernel response time, which somehow avoids the error.

We will check this and update you with more information later.

Thanks.

Hi,

We have made some progress on this.

First, we checked the detailed error log from dmesg:

[  491.730393] nvgpu: 17000000.gv11b gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR]  Channel 507 to be removed from TSG 4 has NEXT set!
[  491.730641] nvgpu: 17000000.gv11b          gk20a_tsg_unbind_channel:164  [ERR]  Channel 507 unbind failed, tearing down TSG 4

Based on the log, this error is caused by closing an active GPU channel.
To avoid this, you need to make sure the CUDA context is idle before it is closed.

So why does this happen?
Our guess is that the process tries to close the GPU context before making the system call, and this triggers the segmentation fault if the context is not empty.

Here is a possible fix for your reference.

The cudaDeviceSynchronize() call makes sure the pending tasks are finished, while cudaDeviceReset() cleans out the CUDA context.

diff --git a/topic_124928.cu b/topic_124928.cu
index 7aeedcd..1d5bd97 100644
--- a/topic_124928.cu
+++ b/topic_124928.cu
@@ -10,6 +10,8 @@ int main()
   {
     sleep(2);
     uint8_t data_host;
+
+    cudaDeviceReset();
     system("");
     uint8_t* data_device;
     if(cudaMalloc(&data_device, sizeof(data_host)) != cudaSuccess)
@@ -24,5 +26,6 @@ int main()
     {
       cout<<"cudaFree failed"<<endl;
     }
+    cudaDeviceSynchronize();
   }
 }

We have been running the application with the above change for around an hour already.
Thanks.