Segmentation Fault when using UMA and pthreads

Hi Nvidia:
I am trying to use pthread and streams to do parallel calculation.
I used UMA (cudaMallocManaged) to get memories, then sent them into CUDA kernels.
After finishing kernels, when CPU is trying to access UMA allocated memories, segmentation fault is rasied.
But if I run the program in single thread (with or without pthread were the same), the program will successfully executed.
Please help, many thanks.

Jetson AGX Orin Dev. Kit
Jetpack 5.0.2

CUDA version: cuda-11.4
CUDA-GDB Version (/usr/local/cuda-11.4/bin/cuda-gdb --version):
11.4 release
GNU gdb(GDB) 10.4

Code flow is shown below:

typedef struct{
        double *imgSrc;
        double *rslt;
        int streamNum;
}calculateData_t ;

int main()
    // Init memories
    cudaMallocManaged(imgSrc1 , size_for_img );
    cudaMallocManaged(imgSrc2 , size_for_img );
    cudaMallocManaged(rslt1   , size_for_rslt );
    cudaMallocManaged(rslt2   , size_for_rslt );

    // Create Streams
    int nstreams = 2;
    streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
    for (int i = 0; i < nstreams; i++ )
        cudaStreamCreate(& (streams[i]) );

    calculateData_t data1, data2;
    data1.imgSrc = imgSrc1;
    data1.rslt   = rslt1;
    data1.streamNum = 0;

    data2.imgSrc = imxSrc2;
    data2.rslt   = rslt2;
    data2.streamNum = 1;

    // Create Threads
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    // Free memories and distroy streams

void *calFunctions(void *arg)
    calculateData_t *d = (calculateData_t *) arg;
    double *img   = d->imgSrc;
    double *rslt  = d->rslt;
    int streamNum = d->streamNum;

    // To 
    calFunctionsDispatch(imgSrc, rslt, streamNum);



extern "C" void calFunctionsDispatch(double *img, double *rslt, int streamNum)
    // Calculate grid dim and block dim
    // Start kernel in
    ColSum_16d_parallel<<< griddim, blkdim , 0 ,streams[streamNum]>>>(img, rslt);

    // Wait stream finishing its job

    // Do CPU calculation

    double rslt_value[8] = {};
    for( int i = 0 ; i < 8 ; i ++)
        // Segmentation Fault Rasied
        rslt_value[0] += rslt[i];
    // print out rslt_value

__global__ void ColSum_16d_parallel(const double *img, double *rslt)
	int bid = blockIdx.x;
	int tid = threadIdx.x;
	double rslt_value = 0.0;

	int rsltIdx = bid * colLength + tid;

    // if data_within_range --> add cols value into rslt;
		for (int i = 0 ; i < colLength ; i++ )
			rslt_value += src1[i];

		rslt[rsltIdx] = rslt_value;

Compiling Code:

nvcc -G -g -Xcompiler "-fPIC" -c -o build/taskDispatcher.o
g++ -c -g main.cpp -o build/main.o
g++ -o colSum build/main.o build/taskDispatcher.o -lcudart -L/usr/local/cuda/lib64 -lpthread 

Hi @DaWuLun
Does your program crash when you run it normally or when you run it with under debugger (cuda-gdb) only?

In either case, could you please try running the program with cuda-gdb and share the output?

Hi AKravets
No, program also crashed without debugger.
I also use printf to check where the segmentation fault is rasied. Print out showed the result are the same as debugger, crushed at same place.
Further more, in for loop, segmentation fault is rasied in the first place ( i = 0 )

Print out simply said " Segmentation Fault (core dumped)", but I am not able to allocate the core file after following the instructions on the internet. That’s why I use debugger to check where the fault is rasied.
And debugger said:

Thread 4 "colSum" received signal SIGSEGV, Segmentation fault.

Thank you.

Thank you for the details, looks like the question might be better suited for general CUDA programming forum branch - I have moved your topic there.

Thank you very much!

referring to the jetson memory documentation:

unified memory with concurrent access is not yet supported on iGPU.

You have an instance of concurrent access. To prevent that, use a cudaDeviceSynchronize() call after each kernel call, before using unified memory on the host.

Furthermore, in a multithreaded environment, this is complicated by the fact that each kernel call places all UM allocations in an “untouchable state” from host (CPU) code. A kernel call in thread 0 places UM allocations in a non-accessible state from host code. Other threads attempting to access their UM allocation from host CPU code will hit a seg fault at that point. Therefore you will require pthread synchronization to carefully control the visibility.

Otherwise, you can try using the attachment flags for each stream, to prevent these kinds of issues.

I don’t have a jetson device to work with. I suggest if you need help with Jetson, ask on the Jetson AGX Orin forum.

Hi Robert_Crovella:
Thank you for your reply and documentation.
I modified the UMA to traditional cudaMalloc() and malloc(), seg fault was not rasied, program executed susccessfully. So, root cause should be “UMA with concurrent access is not yet supported on iGPU”.

Also, it is mentioned that:

... each kernel call places ALL UM allocations into "untouchable state" from host code...

As a result, I should use cudaDeviceSynchronize() instread of cudaStreamSynchronize(), to make sure all the streams finished their tasks and UM allocations were “touchable state” from host code. But after changing synchronizing function, the seg fault still raised.
If cudaDeviceSynchronize() is used, host codes should be blocked until all the streams were executed successfully, host code should be able to use UM allcations safely?

Thank you.

Yes, that alone is not sufficient in a multithreaded environment, where you are launching kernels in each thread.

In this UM regime for Jetson (similar to the windows UM regime), when you do a kernel launch, anywhere, in any pthread, all (ordinary) UM allocations at that point become untouchable (from host code). Given the asynchronous nature of pthread execution, this is basically impossible to control. A thread 0 could do a cudaDeviceSynchronize(), and then right after that another thread 1 could do a kernel launch. At that point, even though thread 0 just did a cudaDeviceSynchronize(), the kernel launch in the other thread has made UM allocations unaccessible, again. I view this as impossible to control in the general case, although as I mentioned you could use a complex pthread synchronization scheme, to guarantee access validity (basically: have a thread barrier after cudaDeviceSynchronize() in each thread, so that all threads stop at that point, then have host code access, then another barrier, then go back to kernel launching). That’s beyond the scope of anything I would casually try or would give you a roadmap for.

Instead, my suggestion was to investigate stream-attached memory, as I already indicated and linked. The next entry has a simple recipe/example.

Here is an example of using stream attach to solve the seg fault problem. Note that this simple example only demonstrates the case where each pthread is using an independent set of managed allocations. If managed allocations are shared between two or more pthreads, you cannot use this method.

In this case, I am running on a cc3.5 device, which is in the same UM regime as windows and jetson: concurrent managed access is not supported.

$ cat
#include <pthread.h>
#include <iostream>

using mt = int;
const int num_loops = 1024;
const int nBLK = 2;
const int nTPB = 128;
const int num_pthreads = 4;
const int ds = 1048576;

__global__ void k(mt *data, size_t N){

  for (size_t idx = blockIdx.x*blockDim.x+threadIdx.x; idx < N; idx += gridDim.x*blockDim.x)

struct threadInfo
    mt *data;
    size_t my_N;
    cudaStream_t s;
    mt test;

void *threadFunc(void* arg)
    struct threadInfo* threadInfoStruct;
    threadInfoStruct = (struct threadInfo*) arg;
    for (int i = 0; i < num_loops; i++){
      k<<<nBLK, nTPB, 0, threadInfoStruct->s>>>(threadInfoStruct->data, threadInfoStruct->my_N);
      threadInfoStruct->test = (threadInfoStruct->data)[0];}

int main() {

  int *data[num_pthreads];
  cudaStream_t str[num_pthreads];
  for (int i = 0; i < num_pthreads; i++){
    cudaMallocManaged(data+i, sizeof(mt)*ds);
    for (int j = 0; j < ds; j++) data[i][j] = 0;
    cudaStreamAttachMemAsync(str[i], data[i], 0, cudaMemAttachSingle);
  threadInfo ti[num_pthreads];
  pthread_t threads[num_pthreads];
  for (int i = 0; i < num_pthreads; i++){
    ti[i].data = data[i];
    ti[i].my_N = ds;
    ti[i].s = str[i];
    int rs = pthread_create(threads+i, NULL, threadFunc, (void *) (ti+i));
    if (rs != 0) std::cout << "pthread_create error: " << rs << std::endl;}
  for (int i = 0; i < num_pthreads; i++){
    int rs = pthread_join(threads[i], NULL);
    if (rs != 0) std::cout << "pthread_join error: " << rs << std::endl;}
  for (int i = 0; i < num_pthreads; i++)
    std::cout << "thread: " << i << " expected value: " << num_loops << " final value: " << ti[i].test << std::endl;
  return 0;

$ nvcc -o t2219 -arch=sm_35
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Bus error (core dumped)
$ nvcc -o t2219 -arch=sm_35 -DUSE_STREAM_ATTACH
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
thread: 0 expected value: 1024 final value: 1024
thread: 1 expected value: 1024 final value: 1024
thread: 2 expected value: 1024 final value: 1024
thread: 3 expected value: 1024 final value: 1024

You may wish to refer to the programming guide to learn more about stream memory attachment.

Hi Robert_Crovella:
Thank you for your example and explaination. The example code run perfectly on Jetson AGX Orin. i.e., without -DUSE_STREAM_ATTACH, seg fault will be rasied; on the other hand, with -DUSE_STREAM_ATTACH, seg fault will not be rasied and executed perfectly.
This helps a lot.

Thank you very much!