Managed Memory Access crash on Tegra

I have been experiencing a segfault or ‘bus error’ on a Jetson Xavier AGX. I have a pretty complicated application, but I managed to distill the crash down to the following access pattern:
There are at least two threads, each with its own private managed memory buffer. They write to their respective buffers from both CPU and GPU code. The code below reproduces the error. It does not crash on my desktop (GTX 1070 Ti), but it does on my Xavier AGX:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <mutex>
#include <unistd.h>

#define getLastCudaError(msg) __getLastCudaError(msg, __FILE__, __LINE__)
inline void __getLastCudaError(const char *errorMessage, const char *file, const int line) {
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "%s(%i) : getLastCudaError() CUDA error : %s : (%d) %s.\n",
                file, line, errorMessage, (int)err, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

namespace {
const size_t bufLen = 2048;

// If this mutex is locked, there is no crash!
std::mutex mtx;

// The only difference between loop1 and loop2 is that the second
// uses cudaMemset.

void loop1() {
	float *dataBlock2 = nullptr;
	cudaMallocManaged(&dataBlock2, sizeof(float)*bufLen);

	for (int i=0; ; i++) {
		{
			//std::lock_guard<std::mutex> lck(mtx);
			for (size_t j=0; j<bufLen; j++) dataBlock2[j] += 1.0f; // CPU write to managed buffer
			cudaDeviceSynchronize();
			getLastCudaError("post warp");
		}

		usleep(10'000);
		if (i % 10 == 0) printf(" - loop1 : %d\n", i);
	}
	cudaFree(dataBlock2);
}

void loop2() {
	float *dataBlock2 = nullptr;
	cudaMallocManaged(&dataBlock2, sizeof(float)*bufLen);

	for (int i=0; ; i++) {
		{
			//std::lock_guard<std::mutex> lck(mtx);
			for (size_t j=0; j<bufLen; j++) dataBlock2[j] += 1.0f; // CPU write to managed buffer
			cudaMemset(dataBlock2, 0, sizeof(float)*bufLen/2);
			cudaDeviceSynchronize();
			getLastCudaError("post warp");
		}

		usleep(10'000);
		if (i % 10 == 0) printf(" - loop2 : %d\n", i);
	}
	cudaFree(dataBlock2);
}
}

int main() {
	//
	// This combination does not crash; the crash requires cudaMemset in at least one thread.
	// std::thread thread1(&loop1), thread2(&loop1);

	std::thread thread1(&loop1), thread2(&loop2);

	thread1.join();
	thread2.join();

	return 0;
}

As noted in the code comments, the crash only happens when cudaMemset is called in one or both of the threads. Sometimes the exception is a segfault, sometimes a bus error.

Also, if I serialize the writes by locking a mutex, there is no crash. Uncomment the lock_guard lines to see the program work.

I expected that any memory allocated with cudaMallocManaged() would be paged to whichever device needs it, but the issue seems to stem from exactly such a transfer, and only when the data is accessed from different threads. Am I missing a piece of documentation that tells me not to use managed memory from different threads, even if the buffers don’t overlap?

I think if you query the property concurrentManagedAccess on your Jetson you will find that it is not supported.
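
For reference, a minimal way to check this (a sketch; device 0 assumed, error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0
    // 1 = host and device may access managed memory concurrently; 0 = not supported
    printf("concurrentManagedAccess: %d\n", prop.concurrentManagedAccess);
    return 0;
}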

As a result, as soon as a kernel is launched (or any other operation causes code to run on the device), host access to any and all managed allocations is suspended - it becomes illegal to touch them until a subsequent cudaDeviceSynchronize() is encountered.

cudaMemset may launch a kernel under the hood in order to set device memory. The cudaMemset call in loop2 is presumably launching such a kernel, and for the duration of that kernel’s execution it is illegal for any host code to touch a managed allocation. Without the mutex, loop1 may do exactly that in its CPU for-loop.
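
To make the rule concrete, here is a minimal sketch (with a hypothetical kernel standing in for whatever cudaMemset launches) of the access pattern that is illegal without concurrentManagedAccess:

#include <cuda_runtime.h>

__global__ void touch(float *d) { d[0] = 1.0f; }

int main() {
    float *a = nullptr, *b = nullptr; // two unrelated managed allocations
    cudaMallocManaged(&a, sizeof(float));
    cudaMallocManaged(&b, sizeof(float));

    touch<<<1, 1>>>(a); // device code is now (possibly) in flight
    // b[0] = 1.0f;     // ILLEGAL here: while the device is active, host access
                        // to ANY managed allocation is suspended, even though
                        // the kernel never touches b
    cudaDeviceSynchronize();
    b[0] = 1.0f;        // legal again: no device work is in flight

    cudaFree(a);
    cudaFree(b);
    return 0;
}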

The article I linked has one possible suggestion to work around this on Jetson - use pinned memory instead of managed memory. This document may also be of interest.
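
For illustration, a minimal sketch of that pinned-memory approach (error checking omitted; the flags shown are one common choice, not necessarily the linked article’s exact recipe):

#include <cuda_runtime.h>

int main() {
    const size_t bufLen = 2048;
    float *data = nullptr;

    // Pinned, mapped host memory instead of managed memory. Host access to it
    // remains legal even while device work is in flight (ordinary data-race
    // rules still apply).
    cudaHostAlloc(&data, sizeof(float) * bufLen, cudaHostAllocMapped);

    float *dData = nullptr;
    cudaHostGetDevicePointer(&dData, data, 0); // pointer to pass to kernels

    for (size_t i = 0; i < bufLen; i++) data[i] += 1.0f; // CPU writes

    cudaFreeHost(data);
    return 0;
}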

So what you are attempting to do with managed memory on your Jetson is not allowed.

For Jetson AGX Xavier specific questions, you may wish to post those on the dedicated forum.
