simple CUDA multi-threading crash on Nano

I am seeing what I think is odd CUDA behaviour on a Jetson Nano. This is on the latest R32.3.1 image but has the same results the previous release as well. It is a very simple application that allocates managed memory in the main thread while executing an empty CUDA function in a separate thread. These two tasks are not related and do not share any information. I feel the two tasks should be able to run in parallel.

Here is the code:

main.cpp:

#include <unistd.h>
#include <future>

#include "TestCuda.cuh"

std::mutex m;

// This works fine with a mutex, but crashes with a sigbus error when not using a mutex
//#define USE_MUTEX

struct MyThread {
	void run() {
		int threadLoop = 0;
		while(1) {
#ifdef USE_MUTEX
			m.lock();
#endif
			printf("Thread Run (loop %d)\n", threadLoop++);
			// run kernel
			testCuda();
#ifdef USE_MUTEX
			m.unlock();
#endif
			usleep(0);
		}
	}
};

int main(int argc, char** argv) {
	MyThread thread;
	auto threadFuture = std::async(std::launch::async, &MyThread::run, thread);
	int loop = 0;
	while(1){
#ifdef USE_MUTEX
		m.lock();
#endif
		int* temp = nullptr;
		printf("*** Main Allocating (loop = %d)\n", loop++);
		cudaError_t err = cudaMallocManaged(&temp, sizeof(int)); // yes, this is a memory leak, but we keep the code simple to demonstrate the issue
		if (err != cudaSuccess) {
			printf("Failed to cudaMallocManaged()\n");
			return -1;
		}
		*temp = 0;	// <-- SIGBUS occurs here if don't use a mutex
		printf("*** Main Finished Allocating value: %d\n", *temp);
#ifdef USE_MUTEX
		m.unlock();
#endif
		usleep(0);
	}
}

TestCuda.cuh:

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void testKernel();

extern "C" {
	void testCuda();
}

TestCuda.cu

#include "TestCuda.cuh"
#include <stdio.h>

__global__ void testKernel() {
	printf ("Kernel running\n");
}

void testCuda() {
	testKernel<<<1,1>>>();
	cudaError_t err = cudaDeviceSynchronize();
	if (err != cudaSuccess) {
		printf("SYNC FAILED\n\n\n");
	}
}

When we run with a mutex, i.e. in main.cpp

#define USE_MUTEX

it works fine, as one would expect and runs forever:

*** Main Allocating (loop = 0)
*** Main Finished Allocating value: 0
Thread Run (loop 0)
Thread Kernel running
*** Main Allocating (loop = 1)
*** Main Finished Allocating value: 0
Thread Run (loop 1)
Thread Kernel running
*** Main Allocating (loop = 2)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 3)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 4)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 5)
*** Main Finished Allocating value: 0
Thread Run (loop 2)
Thread Kernel running
*** Main Allocating (loop = 6)
*** Main Finished Allocating value: 0
Thread Run (loop 3)
Thread Kernel running
*** Main Allocating (loop = 7)
*** Main Finished Allocating value: 0
...

If we comment out the mutex:

//#define USE_MUTEX

and run in parallel we get an arbitrarily timed failure (generally quickly) when trying to dereference this in the main thread:

*temp = 0;	// <-- SIGBUS occurs here if don't use a mutex

i.e.

*** Main Allocating (loop = 0)
Thread Run (loop 0)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 1)
Bus error

Perhaps I missed something simple? Is there a reason why I cannot allocate memory in one thread while executing an independent kernel (that does not touch this memory) in another?

Hi,

Thanks for your reporting.
We will check this and get back to you soon.

Thanks,

For anyone else wanting to see the issue, the code can be found here (consolidated into a single file).
https://github.com/rspruyt/JetsonNanoCudaCrashTest

e.g. (works fine)
make NVCCFLAGS=-DUSE_MUTEX
./TestCuda

or (crashes)
make
./TestCuda

I’ve tested this on multiple Nano’s with both the current (R32.3.1) and previous release. I’ve also happens on the TX2 (R32.2.X), the behavior is the same on all devices. I will continue to investigate further.

Hi,

May I know how do you execute this test?

I try to reproduce this issue with two console, each one run the a ./TestCuda program.
But both app can work correctly without crash. (loop over 30,000)

Do I miss anything?
Thanks

Hi AastaLLL,

Hmmm… it sounds like there is a delta somewhere.

I repeated these steps this morning with a brand new out-of-the box Jetson Nano and SD card, with the same result. For fun, I had a 2nd developer on another team repeat this a 3rd time (with these instructions, but shared nothing else). They used yet another brand new, never used, Jetson Dev Kit and SD card. Their results were the same. This behavior is consistent. We’ve now tested a total of 4 Nano’s and a TX2.

To help close the gap, I’ve added the binary that crashes to the GitHub repo. Perhaps you can try running the checked in ./CudaTest binary I added today? Maybe another community member can verify if they see this issue on a fresh Nano?

I hope I am just making a simple mistake in the code, but I don’t see anything obvious.

To be thorough, here are the exact steps:
On a Windows 10 laptop:

https://www.sdcard.org/downloads/formatter_4/eula_windows/
https://www.balena.io/etcher/
http://code.kliu.org/hashcheck/downloads/HashCheckInstall-latest.exe

1.) Format an SD card with SD Card Formatter (64 GB card)
2.) re-download nv-jetson-nano-sd-card-image-r32.3.1.zip from https://developer.nvidia.com/embedded/downloads (using the text link https://developer.nvidia.com/jetson-nano-sd-card-image-r3231)
4.) Verify MD5 sum is 46b9738a4eabf2e1857f65b1c638275a using free version of Windows HashCheck Shell Extension utility. The MD5 sum matches.
5.) unzip to C:\sd-blob-b01.img
6.) Flash card using Balena Etcher v1.5.71 (latest) with this .img file.
7.) Insert card into the Jetson, power on, go through Jetson setup steps.
8.) Name the device testnano and u/p nvidia/nvidia
9.) SSH into the box:

login as: nvidia
nvidia@testnano's password:
Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 4.9.140-tegra aarch64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

223 packages can be updated.
118 updates are security updates.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

nvidia@testnano:~$ git clone https://github.com/rspruyt/JetsonNanoCudaCrashTest.git
Cloning into 'JetsonNanoCudaCrashTest'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 46 (delta 19), reused 22 (delta 7), pack-reused 0
Unpacking objects: 100% (46/46), done.
nvidia@testnano:~$ cd JetsonNanoCudaCrashTest/
nvidia@testnano:~/JetsonNanoCudaCrashTest$ make
/usr/local/cuda-10.0/bin/nvcc TestCuda.cu -gencode arch=compute_53,code=sm_53 -o TestCuda
nvidia@testnano:~/JetsonNanoCudaCrashTest$ ls
Makefile  README.md  TestCuda  TestCuda.cu
nvidia@testnano:~/JetsonNanoCudaCrashTest$ ./TestCuda
*** Main Allocating (loop = 0)
Thread Run (loop 0)
Bus error (core dumped)
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git add TestCuda
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git commit -am "Adding binary to show nvidia this crashes"

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'nvidia@testnano.(none)')
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git config --global user.email "randy.spruyt@gmail.com"
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git config --global user.name "Randy Spruyt"
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git commit -am "Adding binary to show nvidia this crashes"
[master 0dfbc05] Adding binary to show nvidia this crashes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100755 TestCuda
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git push
Username for 'https://github.com': randy.spruyt@gmail.com
Password for 'https://randy.spruyt@gmail.com@github.com':
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 213.59 KiB | 1.92 MiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/rspruyt/JetsonNanoCudaCrashTest.git
   a6f2a8e..0dfbc05  master -> master
nvidia@testnano:~/JetsonNanoCudaCrashTest$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 3.1, GCID: 18186506, BOARD: t210ref, EABI: aarch64, DATE: Tue Dec 10 06:58:34 UTC 2019

Hi,

Thanks for your feedback.
Let me check this issue with another device to see the result.

Thanks.

Hi,

Just a quick thought.

Is there any possibility that the buffer for CUDA kernel is rewritten by the other thread at the same time?
The bus error usually occurs when the two processors access the buffer simultaneously.

Thanks.

Hi AastaLLL,

As you can see from the code, the main thread and child thread do not share any data. The buffer allocated in the main thread is not touched by the child thread launching the kernel. For clarity, the main thread allocates an integer repeatedly, and nothing else. The child thread that executes the CUDA kernel simply prints out a line of text to the console (to show us it ran), and nothing else. The kernel run from the child thread does not touch the memory allocated by the main thread. In other words, both threads are completely independent operations.

If we force the threads to do their work on the GPU serially (using a mutex), this runs fine. If we allow this to happen in parallel (without a mutex, and thus each thread to run as it wishes), even though the main thread (allocation) and child thread (kernel) share no information, we see the SIGBUS error when trying to de-reference the allocated variable on the main thread. Perhaps one of the developers could have a peek at the code?

I hope there is simply a mistake in the code and I have done something wrong, but I do not see one. Otherwise, this looks like a pretty serious issue.

I didn’t start with a fresh install but the binary does crash for me as well.

If it helps, I’ve attached a binary with full debugging turned on, the corefile and a “thread apply all bt full” dump.
TestCuda.zip (1.25 MB)

Thanks for taking the time to verify this, I appreciate it :)

Any thoughts AastaLL? Looks like the polling in libcuda.so judging by the stack trace

Could you kindly provide an update acknowledging that this issue is a problem and is being looked into?
Thanks

Hi RS64, Aasta is looking into the case, however support team is currently on holiday. Sorry for the delay and thanks for your patience.

In the meantime, it would be interesting to know if this issue is specific to cudaMallocManaged() (i.e. does it occur if you use cudaMalloc() or cudaHostAlloc/cudaHostGetDevicePtr instead)

Thanks for have a look dusty_nv,

For the community, I spoke offline with dusty. I can show that the issue seems specific to dereferencing a cudaMallocManaged() allocated pointer. I’ve created a branch in the link below that uses cudaHostAlloc()/cudaHostGetDevicePtr() instead of cudaMallocManaged(), which works as expected. Alternatively old school device/host memory allocation w/ memcpy’s seems fine too.

https://github.com/rspruyt/JetsonNanoCudaCrashTest/blob/HostAllocTest/TestCuda.cu

This is the expected behavior. This platform does not support access of managed memory from the host while ANY kernel is running. This is indicated by the concurrentManagedAccess device property which is queryable via cudaDeviceGetAttribute.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

On other platforma, we implement managed memory by faulting pages between host and device, or sending traffic over a memory fabric like NVLink. This enables concurrent access.

Unfortunately this is not available on Windows or the mobile platforms today.

  • Bryce Adelstein Lelbach, CUDA C++ Core Libraries Lead