Unified Memory not reachable by GPU?

Hello everyone. I am a reasonably capable self-taught programmer, but I am just learning how to write for GPUs. I have a Jetson Nano 2GB developer board running the stock image on a 128GB SD card. I have updated it, added libraries, etc., but not modified it much. deviceQuery shows CUDA version 10.2, CUDA capability 5.3.

I have spent the last few days reading documentation, watching videos, etc., but cannot get unified memory to function as expected. I have tried multiple tutorials, as well as writing from scratch, and am still stuck. The simplest example I can post here comes from CoffeeBeforeArch on YouTube. Here is his code, from his GitHub:

// This program computes the sum of two N-element vectors using unified memory
// By: Nick from CoffeeBeforeArch

#include <stdio.h>
#include <cassert>
#include <iostream>

using std::cout;

// CUDA kernel for vector addition
// No change when using CUDA unified memory
__global__ void vectorAdd(int *a, int *b, int *c, int N) {
  // Calculate global thread ID
  int tid = (blockDim.x * blockIdx.x) + threadIdx.x;

  // Boundary check
  if (tid < N) {
    c[tid] = a[tid] + b[tid];
  }
}

int main() {
  // Array size of 2^16 (65536 elements)
  const int N = 1 << 16;
  size_t bytes = N * sizeof(int);

  // Declare unified memory pointers
  int *a, *b, *c;

  // Allocate memory for these pointers
  cudaMallocManaged(&a, bytes);
  cudaMallocManaged(&b, bytes);
  cudaMallocManaged(&c, bytes);
  
  // Initialize vectors
  for (int i = 0; i < N; i++) {
    a[i] = rand() % 100;
    b[i] = rand() % 100;
  }
  
  // Threads per CTA (1024 threads per CTA)
  int BLOCK_SIZE = 1 << 10;

  // CTAs per Grid
  int GRID_SIZE = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;

  // Call CUDA kernel
  vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);

  // Wait for all previous operations before using values
  // We need this because we don't get the implicit synchronization of
  // cudaMemcpy like in the original example
  cudaDeviceSynchronize();

  // Verify the result on the CPU
  for (int i = 0; i < N; i++) {
    assert(c[i] == a[i] + b[i]);
  }
  
  // Free unified memory (same as memory allocated with cudaMalloc)
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);

  cout << "COMPLETED SUCCESSFULLY!\n";

  return 0;
}

All this does is allocate some shared (managed) arrays and use the GPU to add them together. The code compiles and runs, but fails the assert() check. If I comment out the assert and print the array contents, I see that a and b are filled in fine, but c is all zeros. Any ideas what might be going on?

I've no idea whether this is related, since I don't know what the trouble is, but I am also hitting a permission error when I try to use nvprof, and running it under sudo doesn't resolve it ("command not found").
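(My guess, which I haven't confirmed: sudo resets PATH and drops the CUDA bin directory, so giving the full path may work, assuming the default JetPack install location:

sudo /usr/local/cuda/bin/nvprof ./test

I'd welcome a correction if that's wrong.)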

At this point, all I'm trying to do is get a successful proof of concept for unified memory before I start on my real code.

Thanks for any help.

Hi,

We compiled your source in a Nano + JetPack 4.6 environment, and it works correctly.

Could you double-check it again?

nvcc test.cu -o test
./test
COMPLETED SUCCESSFULLY!

Thanks.

Thank you for checking it. I think I have found my problem. I’ll describe it here for those who might have a similar issue. (Please let me know if my diagnosis looks incorrect.)

I was using a longer nvcc command line, based on what I saw the makefile doing, and had set the --gpu-architecture and --gpu-code flags to the wrong compute capability. The code compiled and ran, but the resulting binary contained no kernel image compatible with my GPU, so the kernel launch silently failed and c was never written.

On my hardware (described in the first post), this compile command yields working code:

nvcc vecAdd_1.cu --gpu-architecture=compute_53 --gpu-code=sm_53 -o test

whereas this is what I had been doing:

nvcc vecAdd_1.cu --gpu-architecture=compute_75 --gpu-code=sm_75 -o test

which will compile and run but leave c all zeros, since an sm_75 binary has no kernel image that can launch on an sm_53 device.
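As an aside, if I'm reading the nvcc documentation right, the single flag -arch=sm_53 is shorthand for roughly the same thing (it also embeds compute_53 PTX for forward compatibility), so this should be equivalent:

nvcc vecAdd_1.cu -arch=sm_53 -o test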

I’ll also note that

nvcc vecAdd_1.cu -o test

yields working code of essentially the same file size, so I was getting nothing for specifying the flags (presumably the default build embeds PTX that the driver can JIT-compile for whatever GPU is present). I wasn't sure whether specifying the architecture would let the compiler/linker leave unneeded things out and produce a smaller file, but apparently not. It seems strange that no error was produced and the program just plowed along as if the GPU were returning good results, but then error checking is usually at least half of the coder's job; a check like the sketch below would have caught it.
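For anyone else who lands here, this is a minimal sketch (my own variable names and messages, using only standard CUDA runtime calls) of how the launch in the program above could be checked. With the mismatched sm_75 build, the first check should report "no kernel image is available for execution on the device" instead of silently continuing:

// Launch the kernel as before
vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);

// Launch-time problems (including a binary with no compatible
// kernel image for this GPU) are reported by cudaGetLastError()
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
  printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
  return 1;
}

// Errors raised while the kernel is running are returned by the sync call
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
  printf("Kernel execution failed: %s\n", cudaGetErrorString(err));
  return 1;
}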

Thanks again for the help.
