CUDA basic tutorial segmentation fault in WSL2 Ubuntu

Hello,

I am trying to run the basic CUDA example from here, https://developer.nvidia.com/blog/even-easier-introduction-cuda/, on an Ubuntu subsystem with the following:

Windows 10 Home build 20270.fe_release
GeForce GTX 1060 6GB
NVIDIA driver version 465.12
WSL kernel version 5.4.72
Ubuntu 20.04.1 LTS

Output from deviceQuery is:
CUDA Driver Version / Runtime Version 11.2 / 11.0
CUDA Capability Major/Minor version number: 6.1

Output from wsl cat /proc/version:
Linux version 5.4.72-microsoft-standard-WSL2 (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Oct 28 23:40:43 UTC 2020

However, the tutorial program hits a segmentation fault after cudaDeviceSynchronize() returns, when it reads y back on the host to do the error checking.

For completeness I copy the tutorial code below:

#include <iostream>
#include <math.h>

__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Have checked for errors here
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Program runs up to here fine
  float maxError = 0.0f;
  for (int i = 0; i < N; i++) {
    maxError = fmax(maxError, fabs(y[i]-3.0f)); // SEGMENTATION FAULT HERE
  }
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

I have done extensive error checking throughout the program, in particular wrapping checkCudaErrors around the cudaMallocManaged calls, and no errors were reported. I have also reproduced the problem with entirely different examples that use cudaMallocManaged: the computation runs fine, but accessing the memory back on the host is not possible.
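
For readers who want to reproduce the checks, here is a minimal sketch of the kind of error checking described above. It assumes a hypothetical CHECK_CUDA macro standing in for the samples' checkCudaErrors helper, and it reads a single element of y on the host instead of running the full error loop. It is an illustration under those assumptions, not the exact code that was run.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the tutorial): report file/line and exit
// if a CUDA runtime call returns anything other than cudaSuccess.
#define CHECK_CUDA(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__,  \
                   #call, cudaGetErrorString(err_));                      \
      std::exit(EXIT_FAILURE);                                            \
    }                                                                     \
  } while (0)

// Same single-thread kernel as in the tutorial.
__global__ void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1 << 20;
  float *x = nullptr, *y = nullptr;

  // Check both managed allocations.
  CHECK_CUDA(cudaMallocManaged(&x, N * sizeof(float)));
  CHECK_CUDA(cudaMallocManaged(&y, N * sizeof(float)));

  // Initialize on the host before the kernel runs.
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  add<<<1, 1>>>(N, x, y);
  CHECK_CUDA(cudaGetLastError());       // errors from the launch itself
  CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

  // First host read of managed memory after the kernel.
  std::printf("y[0] = %f\n", y[0]);

  CHECK_CUDA(cudaFree(x));
  CHECK_CUDA(cudaFree(y));
  return 0;
}
```

Checking cudaGetLastError() right after the launch and the return value of cudaDeviceSynchronize() separates launch failures from failures during execution, which helps narrow down whether the fault is on the device side or only in the host access.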

I am also able to successfully compile and run the UnifiedMemory sample program in the CUDA Toolkit. However, on inspection I find that program doesn’t appear to access the memory back on the host after the computation, so its success is consistent with the behaviour above.

It’s also worth noting that I can run this example fine on Windows through Visual Studio; it’s only within the Ubuntu subsystem that it fails.

Please let me know if there are other options I can explore to find the cause of this issue.
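
One option that could be explored is querying what the runtime reports about managed memory on the device. The sketch below is a hedged illustration rather than a diagnosis: it only prints two device attributes from the CUDA runtime API (cudaDevAttrManagedMemory and cudaDevAttrConcurrentManagedAccess), and whether their values relate to the fault is an assumption that would need confirming.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
  // Use whichever device the runtime has selected as current.
  int device = 0;
  cudaError_t err = cudaGetDevice(&device);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDevice failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // 1 if the device can allocate managed memory at all.
  int managed = 0;
  err = cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // 1 if the device can access managed memory concurrently with the CPU.
  int concurrent = 0;
  err = cudaDeviceGetAttribute(&concurrent,
                               cudaDevAttrConcurrentManagedAccess, device);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  std::printf("device %d: managedMemory=%d concurrentManagedAccess=%d\n",
              device, managed, concurrent);
  return 0;
}
```

Running the same program natively on Windows and inside WSL2 and comparing the two lines of output would show whether the two environments report managed memory support differently.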

Thanks,
James