Thanks for your response, saulocpp! Sorry for this long post. I’m really hoping to get to the bottom of this issue. Also note that this code runs fine on Linux systems (specifically the Xavier dev kit).
Here is my effort to track down the errors.
This works great; it runs to the end with no problems:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h> // used for rand()
#include "cuda_err_check.h"
//============================= GPU Kernels ====================================
/*
The "__global__" tag tells nvcc that the function will execute on the device
but will be called from the host. Notice that we must use pointers!
*/
/*
* function: add_vec
* purpose: add two vectors on GPU
* PARAMETERS:
* a - first array
* b - second array
* c - output array
*/
__global__ void cuda_add(float *a, float *b, float *c, unsigned long long N) {
// assign tid by using block id, block dimension, and thread id
unsigned long long tid = blockIdx.x * blockDim.x + threadIdx.x;
// stride is for big arrays, i.e. bigger than threads we have
unsigned long long stride = blockDim.x * gridDim.x;
// do the operations
while (tid < N) {
c[tid] = a[tid] + b[tid];
tid += stride;
}
}
int main() {
// need to "wake up" the api. This adsorbs the startup overhead that was
// biasing my results
cudaFree(0);
// size of the array
const unsigned long long N = 10000;
printf("Array Size: %d\n", N);
// DEVICE PROPERTIES
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// print a couple of many properties
printf("Max Threads per block: %d\n", prop.maxThreadsPerBlock);
// using max number of threads in the x dim possible
int nThreads = prop.maxThreadsDim[0];
printf("nThreads: %d\n", nThreads);
// calculate number of blocks based on the number of threads
int nBlocks = (N + nThreads - 1) / nThreads;
printf("nBlocks: %d\n", nBlocks);
float *a, *b, *c;
// allocate memory once before the iterations
CudaSafeCall( cudaMallocManaged(&a, sizeof(float) * N) );
CudaSafeCall( cudaMallocManaged(&b, sizeof(float) * N) );
CudaSafeCall( cudaMallocManaged(&c, sizeof(float) * N) );
CudaSafeCall(cudaDeviceSynchronize());
printf("Running on GPU\n");
// run the kernel
cuda_add <<< nBlocks, nThreads >>> (a, b, c, N);
CudaCheckError();
// wait for device to finish
CudaSafeCall(cudaDeviceSynchronize());
CudaSafeCall( cudaFree(a) );
CudaSafeCall( cudaFree(b) );
CudaSafeCall( cudaFree(c) );
// ============== END ==================
return 0;
}
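(For reference, CudaSafeCall and CudaCheckError come from cuda_err_check.h and are just the standard error-checking wrappers; my header looks roughly like this:)

#pragma once
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

#define CudaSafeCall(err) __cudaSafeCall(err, __FILE__, __LINE__)
#define CudaCheckError()  __cudaCheckError(__FILE__, __LINE__)

// abort with file/line info if a runtime API call returns an error
inline void __cudaSafeCall(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CudaSafeCall() failed at %s:%d : %s\n",
                file, line, cudaGetErrorString(err));
        exit(-1);
    }
}

// check for errors from a preceding kernel launch
inline void __cudaCheckError(const char *file, int line) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CudaCheckError() failed at %s:%d : %s\n",
                file, line, cudaGetErrorString(err));
        exit(-1);
    }
}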
However, when I add code to assign values to a and b, the system crashes. This chunk of code is added after the cudaMallocManaged() calls and the cudaDeviceSynchronize() call:
// create the vectors
for (unsigned long long i = 0; i < N; i++) {
    // actual values don't matter, as long as they're floats
    a[i] = 1.0f;
    b[i] = 2.0f;
}
CudaSafeCall( cudaDeviceSynchronize() );
So, as suggested, I replaced the assignments with calls to cudaMemset(). This allows the program to run again:
CudaSafeCall( cudaMemset(a, 0, N * sizeof(float)) );
CudaSafeCall( cudaMemset(b, 0, N * sizeof(float)) );
CudaSafeCall( cudaDeviceSynchronize() );
Now I’d like to check for errors in the adding kernel. Back in the main function, after the kernel call and a subsequent call to cudaDeviceSynchronize(), I added a simple check of the first element, which causes the system to crash once again:
printf("Error: %f\n", (c[0] - a[0] + b[0]));
I then became curious about the pointers themselves. Here is my attempt to print the pointer addresses for nThreads, nBlocks, a, b, and c. It is very possible the rest of this post is irrelevant to diagnosing the problem.
printf("Pointer to nThreads: 0x%p\n", &nThreads);
printf("Pointer to nBlocks: 0x%p\n", &nBlocks);
printf("Pointer to a: 0x%p\n", a);
printf("Pointer to b: 0x%p\n", b);
printf("Pointer to c: 0x%p\n", c);
This yields the following output:
Pointer to nThreads: 0x0000009821EFF944
Pointer to nBlocks: 0x0000009821EFF964
Pointer to a: 0x0000000701400000
Pointer to b: 0x0000000701410000
Pointer to c: 0x0000000701420000
This made me curious about the pointers, so I retrieved the pointer attributes with the CUDA API function cudaPointerGetAttributes(), which fills a cudaPointerAttributes struct: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaPointerAttributes.html#structcudaPointerAttributes
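The query itself is just one call per pointer, roughly:

cudaPointerAttributes attr;
CudaSafeCall( cudaPointerGetAttributes(&attr, a) );
// attr.memoryType, attr.device, attr.devicePointer, attr.hostPointer and
// attr.isManaged hold the values reported below (in newer CUDA toolkits the
// memoryType/isManaged fields are replaced by a single attr.type field)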
This gives me the following information:
memoryType cudaMemoryTypeDevice (2) cudaMemoryType
device 0 int
devicePointer 0x0000000705600000 void *
hostPointer 0x0000000705600000 void *
isManaged 1 int
What this tells me is that the host pointer and the device pointer are the same. From the Unified Addressing documentation: “For these devices there is no distinction between a device pointer and a host pointer – the same pointer value may be used to access memory from the host program and from a kernel running on the device.” I have verified that unified memory is enabled on my device. The same documentation also states: “All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing.”

Using this information, I allocated memory with cudaMallocHost() instead of cudaMallocManaged() (the change is sketched a bit further down), and my program then ran fine. Here is the pointer information from cudaPointerGetAttributes():
memoryType cudaMemoryTypeHost (1) cudaMemoryType
device 0 int
devicePointer 0x0000000204a00000 void *
hostPointer 0x0000000204a00000 void *
isManaged 0 int
This is interesting: the pointer attributes now report isManaged = 0, yet my program runs fine with zero computation errors.
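For clarity, the only change from the managed version is the allocation (and the matching free), roughly:

// pinned host allocations instead of managed allocations
CudaSafeCall( cudaMallocHost(&a, sizeof(float) * N) );
CudaSafeCall( cudaMallocHost(&b, sizeof(float) * N) );
CudaSafeCall( cudaMallocHost(&c, sizeof(float) * N) );

// ... same initialization, kernel launch, and synchronization as before ...

// pinned host memory is released with cudaFreeHost() rather than cudaFree()
CudaSafeCall( cudaFreeHost(a) );
CudaSafeCall( cudaFreeHost(b) );
CudaSafeCall( cudaFreeHost(c) );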
Finally, for completeness, I used cudaMalloc() to get plain device pointers. cudaPointerGetAttributes() yields:
memoryType cudaMemoryTypeDevice (2) cudaMemoryType
device 0 int
devicePointer 0x0000000705600000 void *
hostPointer 0x0000000000000000 void *
isManaged 0 int
This is how I would expect it to look. The hostPointer is null and “isManaged=0.”
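That variant just swaps the allocation once more, roughly:

// plain device allocations; a, b and c are now device-only pointers
CudaSafeCall( cudaMalloc(&a, sizeof(float) * N) );
CudaSafeCall( cudaMalloc(&b, sizeof(float) * N) );
CudaSafeCall( cudaMalloc(&c, sizeof(float) * N) );
// host code must not dereference these (use cudaMemset()/cudaMemcpy() instead)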
So now my questions are the following:
- Why does cudaMallocManaged() return significantly different host pointers than cudaMallocHost()?
- Why would accessing a pointer location cause my system to crash?
- Any idea how I can diagnose this problem further? I can't really do much debugging when my computer freezes.
- Does using cudaMallocHost() on a device with unified memory enabled work the same as cudaMallocManaged()? I.e., can I bypass cudaMallocManaged() by using cudaMallocHost()?
One more piece of information: the code will run to completion after a system reboot and fresh build. However, it will not run to completion a second time. It just freezes with the same symptoms as mentioned before.
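For reference, the unified memory / unified addressing support I mentioned above can be double-checked with cudaDeviceGetAttribute(); a minimal sketch of that check (assuming device 0):

int managedMemory = 0, concurrentAccess = 0, unifiedAddressing = 0;
CudaSafeCall( cudaDeviceGetAttribute(&managedMemory, cudaDevAttrManagedMemory, 0) );
CudaSafeCall( cudaDeviceGetAttribute(&concurrentAccess, cudaDevAttrConcurrentManagedAccess, 0) );
CudaSafeCall( cudaDeviceGetAttribute(&unifiedAddressing, cudaDevAttrUnifiedAddressing, 0) );
printf("managedMemory: %d, concurrentManagedAccess: %d, unifiedAddressing: %d\n",
       managedMemory, concurrentAccess, unifiedAddressing);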