Hello everyone. I am a reasonably experienced self-taught programmer, but I am trying to learn how to write for GPUs. I have a Jetson Nano 2GB developer board running the stock image on a 128GB SD card. I have updated it, added libraries, etc., but not modified it much. deviceQuery shows CUDA version 10.2, CUDA capability 5.3.
I have spent the last few days reading documentation, watching videos, etc., but cannot get unified memory to function as expected. I have tried multiple tutorials and writing from scratch, and I am still stuck. The simplest example I can post here comes from CoffeeBeforeArch on YouTube. Here is his code, from his GitHub:
// This program computes the sum of two N-element vectors using unified memory
// By: Nick from CoffeeBeforeArch
#include <cstdlib>   // for rand()
#include <stdio.h>
#include <cassert>
#include <iostream>

using std::cout;

// CUDA kernel for vector addition
// No change when using CUDA unified memory
__global__ void vectorAdd(int *a, int *b, int *c, int N) {
  // Calculate global thread ID
  int tid = (blockDim.x * blockIdx.x) + threadIdx.x;

  // Boundary check
  if (tid < N) {
    c[tid] = a[tid] + b[tid];
  }
}

int main() {
  // Array size of 2^16 (65536 elements)
  const int N = 1 << 16;
  size_t bytes = N * sizeof(int);

  // Declare unified memory pointers
  int *a, *b, *c;

  // Allocate memory for these pointers
  cudaMallocManaged(&a, bytes);
  cudaMallocManaged(&b, bytes);
  cudaMallocManaged(&c, bytes);

  // Initialize vectors
  for (int i = 0; i < N; i++) {
    a[i] = rand() % 100;
    b[i] = rand() % 100;
  }

  // Threads per CTA (1024 threads per CTA)
  int BLOCK_SIZE = 1 << 10;

  // CTAs per Grid
  int GRID_SIZE = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;

  // Call CUDA kernel
  vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);

  // Wait for all previous operations before using values
  // We need this because we don't get the implicit synchronization of
  // cudaMemcpy like in the original example
  cudaDeviceSynchronize();

  // Verify the result on the CPU
  for (int i = 0; i < N; i++) {
    assert(c[i] == a[i] + b[i]);
  }

  // Free unified memory (same as memory allocated with cudaMalloc)
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);

  cout << "COMPLETED SUCCESSFULLY!\n";

  return 0;
}
All this tries to do is establish some shared variables and use the GPU to add them together. The code compiles and runs, but fails at the assert() check. If I comment out the assert and print out the array contents, I see that a and b are filled fine, but c is all zeros. Any ideas what might be going on?
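(One thing I have since learned I should be doing, based on general CUDA documentation, is checking the launch for errors. Kernel launch failures are silent unless you query them, so if the kernel never actually ran, c would stay all zeros with no visible error. A sketch of the check, assuming the same vectorAdd kernel and launch parameters as above:)

```cpp
// Sketch: replace the bare kernel launch with an error-checked version.
// cudaGetLastError() reports launch failures, e.g. "no kernel image is
// available for execution on the device" if nvcc targeted the wrong
// architecture; cudaDeviceSynchronize() returns errors that occur while
// the kernel is running.
vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);

cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess) {
  printf("Kernel launch failed: %s\n", cudaGetErrorString(launchErr));
}

cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess) {
  printf("Kernel execution failed: %s\n", cudaGetErrorString(syncErr));
}
```

(Since the Nano reports compute capability 5.3, I believe compiling with nvcc -arch=sm_53 should guarantee the kernel image matches the device, but I am not certain that is my problem.)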
I’ve no idea if this relates, since I don’t know what the trouble is, but I am also dealing with a permission error when I try to use nvprof. Sudo doesn’t resolve it (command not found).
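(I suspect, from reading around, that the "command not found" under sudo is because sudo resets PATH via secure_path, and the CUDA tools live in /usr/local/cuda/bin, which is on my user's PATH but not root's. If that's right, something like the following should work, though I haven't confirmed it on my board:)

```shell
# Sketch, assuming nvprof is installed at the default CUDA location.
# Either pass the current PATH through sudo:
sudo env PATH=$PATH nvprof ./my_app

# or invoke nvprof by its full path:
sudo /usr/local/cuda/bin/nvprof ./my_app
```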
At this point, all I’m trying to do is get a successful proof of concept of unified memory before I start on my real code.
Thanks for any help.