SOLVED (sort of): cudaMalloc fails where cudaMallocManaged succeeds

Trying to wrap my head around this one.

Short example. This is running on the Jetson TX2 (256 Pascal cores). I still can’t get remote debugging to step past any GPU/CUDA calls (Eclipse just hangs and I have to ssh in and kill things), so I’m doing the caveman thing.

Greatly abbreviated code below; it may have transcription errors since I’m typing it into this post. It is not my actual running code, but it shows what I think I’m seeing. I am using a lot of memory on the device.

The reason for wanting to switch from cudaMallocManaged to cudaMalloc is performance: the logic inside my device function takes 2.5 seconds when it performs the memory writes, but if it does some other computation instead it finishes in a couple hundred milliseconds. I am writing to random addresses across the whole array (image processing), and I believe something in the managed-memory machinery is adversely impacting performance. I’m still 100x faster than the desktop, but my nose tells me I need to be explicit about memory placement…

I am not accessing array1 or array2 outside of device code yet. They are intermediate steps that then get merged in a later device operation.
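For reference, the explicit pattern I’m moving toward looks roughly like this. Sketch only, not my real code; d_tmp and someKernel are made-up names: intermediates get plain cudaMalloc and stay in device memory, and nothing is copied to the host until a result is actually needed.

// Sketch of the explicit pattern: device-only intermediate, one explicit copy back.
float *d_tmp; // hypothetical device-only intermediate
CUDA_CHECK_RETURN(cudaMalloc((void**)&d_tmp, WIDTH*HEIGHT*sizeof(float)));

dim3 block(32, 16);
dim3 grid((WIDTH + block.x - 1)/block.x, (HEIGHT + block.y - 1)/block.y);
someKernel<<<grid, block>>>(d_tmp); // hypothetical kernel writes d_tmp on the device
CUDA_CHECK_RETURN(cudaDeviceSynchronize());

float *h_result = (float*)malloc(WIDTH*HEIGHT*sizeof(float));
CUDA_CHECK_RETURN(cudaMemcpy(h_result, d_tmp, WIDTH*HEIGHT*sizeof(float),
                             cudaMemcpyDeviceToHost)); // explicit, one-shot copy

CUDA_CHECK_RETURN(cudaFree(d_tmp));
free(h_result);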

#include <cstdint> // for uint32_t

#define WIDTH  4000
#define HEIGHT 2000

__global__ void deviceDoSomething(float *array1, uint32_t *array2) // __global__, since it's launched with <<<>>>
{
    unsigned x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned y = blockIdx.y*blockDim.y + threadIdx.y;

    if (x >= WIDTH || y >= HEIGHT) // >=, not >: a thread with x == WIDTH is already out of bounds
        return;

    array1[y*WIDTH+x] = x; // filler code for the example
    array2[y*WIDTH+x] = y;
}

void doSomething()
{
    float *array1;
    uint32_t *array2;

    const size_t arraySize = WIDTH*HEIGHT*sizeof(float);

    CUDA_CHECK_RETURN(cudaMalloc((void**)&array1, arraySize));
    CUDA_CHECK_RETURN(cudaMalloc((void**)&array2, WIDTH*HEIGHT*sizeof(uint32_t))); // If I use cudaMallocManaged here, I get expected results and no error. cudaMalloc gives the exception later in the code.

    dim3 blockDimensions(32, 16); // 512 threads per block: a multiple of 32 (the warp size), under the 1024-threads-per-block limit
    dim3 blockCount((WIDTH  + blockDimensions.x - 1)/blockDimensions.x,  // ceiling division, so a
                    (HEIGHT + blockDimensions.y - 1)/blockDimensions.y); // partial tile still gets a block
 
    CUDA_CHECK_RETURN(cudaDeviceSynchronize()); //Possibly not necessary... testing
    deviceFunctionA<<<blockCount, blockDimensions>>>(array1, array2);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    deviceDoSomething<<<blockCount, blockDimensions>>>(array1, array2);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize()); // Here's where I get my exception.
    // If array1 is initialized using cudaMalloc, there are no issues.
    // However, if I initialize array2 using cudaMalloc and not cudaMallocManaged,
    // and assign a value to array2 in my __device__ function, I get the following
    // error: cudaDeviceSynchronize() returned unspecified launch failure(4) at...
}
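Since I’m using a lot of device memory, here is a minimal sketch (not from my actual code) of how one could rule out plain memory exhaustion before suspecting anything subtler. cudaMemGetInfo reports free and total device-visible memory:

// Sketch only: query free/total memory around the big allocations.
// On a Jetson the GPU shares DRAM with the CPU, so (as I understand it)
// this reflects overall system memory pressure, not a dedicated VRAM pool.
size_t freeBytes = 0, totalBytes = 0;
CUDA_CHECK_RETURN(cudaMemGetInfo(&freeBytes, &totalBytes));
printf("device memory: %zu MB free of %zu MB total\n",
       freeBytes >> 20, totalBytes >> 20);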

Update: in my original code I had array2[x*WIDTH+y] as the indexing expression, which writes past the end of the allocation, since max(x)*WIDTH is far larger than the array. Interestingly enough, the array created with cudaMallocManaged "worked"; I’m guessing there happened to be enough valid mapped memory beyond it for the out-of-bounds writes not to trip a fault, while the cudaMalloc allocation faulted immediately.
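To put numbers on it, using the WIDTH/HEIGHT from the example above (cuda-memcheck is good at flagging exactly this kind of out-of-bounds global write):

// Allocation:    WIDTH*HEIGHT          = 4000*2000       =  8,000,000 elements
// Swapped index: max(x)*WIDTH + max(y) = 3999*4000+1999  = 15,997,999 (~2x past the end)
// Correct index: max(y)*WIDTH + max(x) = 1999*4000+3999  =  7,999,999 (last valid element)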

Next step is to figure out a faster way to do non-uniform memory operations.