simple CUDA multi-threading crash on Nano

RS64 · January 14, 2020, 9:04pm

I am seeing what I think is odd CUDA behaviour on a Jetson Nano. This is on the latest R32.3.1 image but has the same results the previous release as well. It is a very simple application that allocates managed memory in the main thread while executing an empty CUDA function in a separate thread. These two tasks are not related and do not share any information. I feel the two tasks should be able to run in parallel.

Here is the code:

main.cpp:

#include <unistd.h>
#include <future>

#include "TestCuda.cuh"

std::mutex m;

// This works fine with a mutex, but crashes with a sigbus error when not using a mutex
//#define USE_MUTEX

struct MyThread {
	void run() {
		int threadLoop = 0;
		while(1) {
#ifdef USE_MUTEX
			m.lock();
#endif
			printf("Thread Run (loop %d)\n", threadLoop++);
			// run kernel
			testCuda();
#ifdef USE_MUTEX
			m.unlock();
#endif
			usleep(0);
		}
	}
};

int main(int argc, char** argv) {
	MyThread thread;
	auto threadFuture = std::async(std::launch::async, &MyThread::run, thread);
	int loop = 0;
	while(1){
#ifdef USE_MUTEX
		m.lock();
#endif
		int* temp = nullptr;
		printf("*** Main Allocating (loop = %d)\n", loop++);
		cudaError_t err = cudaMallocManaged(&temp, sizeof(int)); // yes, this is a memory leak, but we keep the code simple to demonstrate the issue
		if (err != cudaSuccess) {
			printf("Failed to cudaMallocManaged()\n");
			return -1;
		}
		*temp = 0;	// <-- SIGBUS occurs here if don't use a mutex
		printf("*** Main Finished Allocating value: %d\n", *temp);
#ifdef USE_MUTEX
		m.unlock();
#endif
		usleep(0);
	}
}

TestCuda.cuh:

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void testKernel();

extern "C" {
	void testCuda();
}

TestCuda.cu

#include "TestCuda.cuh"
#include <stdio.h>

__global__ void testKernel() {
	printf ("Kernel running\n");
}

void testCuda() {
	testKernel<<<1,1>>>();
	cudaError_t err = cudaDeviceSynchronize();
	if (err != cudaSuccess) {
		printf("SYNC FAILED\n\n\n");
	}
}

When we run with a mutex, i.e. in main.cpp

#define USE_MUTEX

it works fine, as one would expect and runs forever:

*** Main Allocating (loop = 0)
*** Main Finished Allocating value: 0
Thread Run (loop 0)
Thread Kernel running
*** Main Allocating (loop = 1)
*** Main Finished Allocating value: 0
Thread Run (loop 1)
Thread Kernel running
*** Main Allocating (loop = 2)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 3)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 4)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 5)
*** Main Finished Allocating value: 0
Thread Run (loop 2)
Thread Kernel running
*** Main Allocating (loop = 6)
*** Main Finished Allocating value: 0
Thread Run (loop 3)
Thread Kernel running
*** Main Allocating (loop = 7)
*** Main Finished Allocating value: 0
...

If we comment out the mutex:

//#define USE_MUTEX

and run in parallel we get an arbitrarily timed failure (generally quickly) when trying to dereference this in the main thread:

*temp = 0;	// <-- SIGBUS occurs here if don't use a mutex

i.e.

*** Main Allocating (loop = 0)
Thread Run (loop 0)
*** Main Finished Allocating value: 0
*** Main Allocating (loop = 1)
Bus error

Perhaps I missed something simple? Is there a reason why I cannot allocate memory in one thread while executing an independent kernel (that does not touch this memory) in another?

AastaLLL · January 15, 2020, 2:44am

Hi,

Thanks for your reporting.
We will check this and get back to you soon.

RS64 · January 15, 2020, 2:17pm

Thanks,

For anyone else wanting to see the issue, the code can be found here (consolidated into a single file).

e.g. (works fine)
make NVCCFLAGS=-DUSE_MUTEX
./TestCuda

or (crashes)
make
./TestCuda

I’ve tested this on multiple Nano’s with both the current (R32.3.1) and previous release. I’ve also happens on the TX2 (R32.2.X), the behavior is the same on all devices. I will continue to investigate further.

AastaLLL · January 16, 2020, 7:18am

Hi,

May I know how do you execute this test?

I try to reproduce this issue with two console, each one run the a ./TestCuda program.
But both app can work correctly without crash. (loop over 30,000)

Do I miss anything?
Thanks

RS64 · January 16, 2020, 4:18pm

Hi AastaLLL,

Hmmm… it sounds like there is a delta somewhere.

I repeated these steps this morning with a brand new out-of-the box Jetson Nano and SD card, with the same result. For fun, I had a 2nd developer on another team repeat this a 3rd time (with these instructions, but shared nothing else). They used yet another brand new, never used, Jetson Dev Kit and SD card. Their results were the same. This behavior is consistent. We’ve now tested a total of 4 Nano’s and a TX2.

To help close the gap, I’ve added the binary that crashes to the GitHub repo. Perhaps you can try running the checked in ./CudaTest binary I added today? Maybe another community member can verify if they see this issue on a fresh Nano?

I hope I am just making a simple mistake in the code, but I don’t see anything obvious.

To be thorough, here are the exact steps:
On a Windows 10 laptop:

http://code.kliu.org/hashcheck/downloads/HashCheckInstall-latest.exe

1.) Format an SD card with SD Card Formatter (64 GB card)
2.) re-download nv-jetson-nano-sd-card-image-r32.3.1.zip from Jetson Download Center | NVIDIA Developer (using the text link https://developer.nvidia.com/jetson-nano-sd-card-image-r3231)
4.) Verify MD5 sum is 46b9738a4eabf2e1857f65b1c638275a using free version of Windows HashCheck Shell Extension utility. The MD5 sum matches.
5.) unzip to C:\sd-blob-b01.img
6.) Flash card using Balena Etcher v1.5.71 (latest) with this .img file.
7.) Insert card into the Jetson, power on, go through Jetson setup steps.
8.) Name the device testnano and u/p nvidia/nvidia
9.) SSH into the box:

login as: nvidia
nvidia@testnano's password:
Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 4.9.140-tegra aarch64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

223 packages can be updated.
118 updates are security updates.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

nvidia@testnano:~$ git clone https://github.com/rspruyt/JetsonNanoCudaCrashTest.git
Cloning into 'JetsonNanoCudaCrashTest'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 46 (delta 19), reused 22 (delta 7), pack-reused 0
Unpacking objects: 100% (46/46), done.
nvidia@testnano:~$ cd JetsonNanoCudaCrashTest/
nvidia@testnano:~/JetsonNanoCudaCrashTest$ make
/usr/local/cuda-10.0/bin/nvcc TestCuda.cu -gencode arch=compute_53,code=sm_53 -o TestCuda
nvidia@testnano:~/JetsonNanoCudaCrashTest$ ls
Makefile  README.md  TestCuda  TestCuda.cu
nvidia@testnano:~/JetsonNanoCudaCrashTest$ ./TestCuda
*** Main Allocating (loop = 0)
Thread Run (loop 0)
Bus error (core dumped)
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git add TestCuda
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git commit -am "Adding binary to show nvidia this crashes"

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'nvidia@testnano.(none)')
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git config --global user.email "randy.spruyt@gmail.com"
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git config --global user.name "Randy Spruyt"
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git commit -am "Adding binary to show nvidia this crashes"
[master 0dfbc05] Adding binary to show nvidia this crashes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100755 TestCuda
nvidia@testnano:~/JetsonNanoCudaCrashTest$ git push
Username for 'https://github.com': randy.spruyt@gmail.com
Password for 'https://randy.spruyt@gmail.com@github.com':
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 213.59 KiB | 1.92 MiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/rspruyt/JetsonNanoCudaCrashTest.git
   a6f2a8e..0dfbc05  master -> master
nvidia@testnano:~/JetsonNanoCudaCrashTest$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 3.1, GCID: 18186506, BOARD: t210ref, EABI: aarch64, DATE: Tue Dec 10 06:58:34 UTC 2019

AastaLLL · January 17, 2020, 4:40am

Hi,

Thanks for your feedback.
Let me check this issue with another device to see the result.

Thanks.

AastaLLL · January 17, 2020, 4:51am

Hi,

Just a quick thought.

Is there any possibility that the buffer for CUDA kernel is rewritten by the other thread at the same time?
The bus error usually occurs when the two processors access the buffer simultaneously.

Thanks.

RS64 · January 17, 2020, 1:59pm

Hi AastaLLL,

As you can see from the code, the main thread and child thread do not share any data. The buffer allocated in the main thread is not touched by the child thread launching the kernel. For clarity, the main thread allocates an integer repeatedly, and nothing else. The child thread that executes the CUDA kernel simply prints out a line of text to the console (to show us it ran), and nothing else. The kernel run from the child thread does not touch the memory allocated by the main thread. In other words, both threads are completely independent operations.

If we force the threads to do their work on the GPU serially (using a mutex), this runs fine. If we allow this to happen in parallel (without a mutex, and thus each thread to run as it wishes), even though the main thread (allocation) and child thread (kernel) share no information, we see the SIGBUS error when trying to de-reference the allocated variable on the main thread. Perhaps one of the developers could have a peek at the code?

I hope there is simply a mistake in the code and I have done something wrong, but I do not see one. Otherwise, this looks like a pretty serious issue.

gtj · January 17, 2020, 2:34pm

I didn’t start with a fresh install but the binary does crash for me as well.

If it helps, I’ve attached a binary with full debugging turned on, the corefile and a “thread apply all bt full” dump.
TestCuda.zip (1.25 MB)

RS64 · January 17, 2020, 2:52pm

Thanks for taking the time to verify this, I appreciate it :)

RS64 · January 21, 2020, 2:03pm

Any thoughts AastaLL? Looks like the polling in libcuda.so judging by the stack trace

RS64 · January 27, 2020, 1:31pm

Could you kindly provide an update acknowledging that this issue is a problem and is being looked into?
Thanks

dusty_nv · January 27, 2020, 5:16pm

Hi RS64, Aasta is looking into the case, however support team is currently on holiday. Sorry for the delay and thanks for your patience.

In the meantime, it would be interesting to know if this issue is specific to cudaMallocManaged() (i.e. does it occur if you use cudaMalloc() or cudaHostAlloc/cudaHostGetDevicePtr instead)

RS64 · January 28, 2020, 12:07am

Thanks for have a look dusty_nv,

For the community, I spoke offline with dusty. I can show that the issue seems specific to dereferencing a cudaMallocManaged() allocated pointer. I’ve created a branch in the link below that uses cudaHostAlloc()/cudaHostGetDevicePtr() instead of cudaMallocManaged(), which works as expected. Alternatively old school device/host memory allocation w/ memcpy’s seems fine too.

github.com

rspruyt/JetsonNanoCudaCrashTest/blob/HostAllocTest/TestCuda.cu

#include <cuda.h>
#include <cuda_runtime.h>
#include <unistd.h>
#include <future>
#include <mutex>
#include <stdio.h>

// This works fine with a mutex, but crashes with a sigbus error when not using a mutex
// #define USE_MUTEX

#ifdef USE_MUTEX
std::mutex m;
#endif

__global__ void testKernel() {
	printf("Thread Kernel running\n");
}

void testCuda() {
	testKernel<<<1,1>>>();

This file has been truncated. show original

blelbach · January 29, 2020, 10:46pm

This is the expected behavior. This platform does not support access of managed memory from the host while ANY kernel is running. This is indicated by the concurrentManagedAccess device property which is queryable via cudaDeviceGetAttribute.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

On other platforma, we implement managed memory by faulting pages between host and device, or sending traffic over a memory fabric like NVLink. This enables concurrent access.

Unfortunately this is not available on Windows or the mobile platforms today.

Bryce Adelstein Lelbach, CUDA C++ Core Libraries Lead

Topic		Replies	Views
Using unified memory causes system crash CUDA Programming and Performance	28	5887	February 4, 2019
Install Problem CUDA Programming and Performance	32	12708	December 17, 2009
Peculiar Shared Memory Behavior - NEED HELP! A test program appears to have two arrays sharing v CUDA Programming and Performance	21	2159	June 5, 2011
Bad performance or bad coding? CUDA Programming and Performance	21	768	October 12, 2021
Cannot passthrough GPU to Kubernetes pod on the Jetson AGX Orin dev kit Jetson AGX Orin gpu , kubernetes	15	194	April 20, 2025
All CUDA-capable devices busy or unavailable Jetson TX2 cuda	9	4008	December 28, 2021
Jetson Orin Nano Dev Board Pods Stuck in ContainersCreating State Jetson Orin Nano docker , kubernetes	7	284	July 30, 2024
why "all CUDA-capable devices are busy or unavailable" ? CUDA Programming and Performance	34	64364	April 20, 2011
Can a Kernel be too big?? CUDA_ERROR_NO_BINARY_FOR_GPU error 209 CUDA Programming and Performance	11	3045	November 13, 2017
Segmentation Fault when using UMA and pthreads CUDA Programming and Performance cuda , ubuntu	10	1180	March 29, 2023

simple CUDA multi-threading crash on Nano

Related topics