CUDA->Vulkan interop shows (uninitialized memory?) artifacts depending on the value written by CUDA

I’m seeing odd artifacts when doing Vulkan interop that look like uninitialized memory, and whether they appear depends on the exact values I write in my CUDA kernel. I’m probably doing something wrong, but I have no idea what it might be. This is a laptop A5500 on Ubuntu 22.04, and the behaviour is the same with R535, R780 and R580.

I’m doing interop by ([0] indicates main thread, [1] indicates render thread):

  1. [0] Create an image in Vulkan with TILING_LINEAR and import it into CUDA as a buffer
  2. [0] Create a timeline semaphore in Vulkan and import it into CUDA
  3. [0] Transition the image from UNDEFINED to GENERAL layout (submit and wait on queue completion)
  4. [1] Write the buffer using the kernel below (simple u/v gradient). Note: the kernel launch is not from the main thread. The CUDA context is created on the main thread, and cuCtxSetCurrent() is called on the render thread before any other CUDA calls. All importing of the memory from Vulkan happens on the main thread, after cuCtxSetCurrent() has first been called from the render thread.
  5. [1] cuSignalExternalSemaphoreAsync() the timeline semaphore with a value of 1
  6. [1] cuStreamSynchronize()
  7. [1] cuMemcpyDtoH() the buffer and write it to EXR (first image shown below).
  8. [0] Create a transfer buffer, transition the image from GENERAL to TRANSFER_SRC_OPTIMAL, vkCmdCopyImageToBuffer, vkMapMemory, and write an EXR from the resulting pointer. The queue submission waits for the semaphore to reach the value 1 signaled in step 5.
  9. [0] Transition the image to SHADER_READ_ONLY_OPTIMAL and use it as a texture on a quad to draw and present to the screen.
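For reference, the import in steps 1–2 can be sketched with the CUDA driver API roughly as below. This is a sketch only: memFd/semFd are assumed to come from vkGetMemoryFdKHR/vkGetSemaphoreFdKHR, allocSize from the VkMemoryAllocateInfo, and error checking is omitted.

```cpp
// Import the Vulkan device memory backing the linearly tiled image.
CUDA_EXTERNAL_MEMORY_HANDLE_DESC memDesc = {};
memDesc.type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD;
memDesc.handle.fd = memFd;   // CUDA takes ownership of this fd on success
memDesc.size = allocSize;
CUexternalMemory extMem;
cuImportExternalMemory(&extMem, &memDesc);

// Map the whole allocation as a plain device buffer.
CUDA_EXTERNAL_MEMORY_BUFFER_DESC bufDesc = {};
bufDesc.offset = 0;
bufDesc.size = allocSize;
CUdeviceptr dptr;
cuExternalMemoryGetMappedBuffer(&dptr, extMem, &bufDesc);

// Import the timeline semaphore.
CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC semDesc = {};
semDesc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_FD;
semDesc.handle.fd = semFd;
CUexternalSemaphore extSem;
cuImportExternalSemaphore(&extSem, &semDesc);
```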

The image copied to the host from CUDA always looks correct. The Vulkan image (whether copied to a buffer and mapped, or visually inspected in the presented frame) only looks correct when a value other than 0 is written to the fourth component of c. I think I once caught it showing some corruption when writing 0.1, but I have been unable to reproduce that since.

The kernel:

extern "C" __global__
void write_gradient(float4* base, int width, int height, size_t rowPitchBytes)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
        return;

    float fx = (float)x / (float)(width  ? width  : 1);
    float fy = (float)y / (float)(height ? height : 1);

    float4 c;
    c.x = fx;      // red = x gradient
    c.y = fy;      // green = y gradient
    c.z = 0.0f;
    c.w = 0.0f;

    // address rows via the image's row pitch; a linearly tiled image's
    // rows may be padded beyond width * sizeof(float4)
    float4* row = (float4*)((char*)base + (size_t)y * rowPitchBytes);
    row[x] = c;
}
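Since the image is linearly tiled, the row pitch passed to the kernel should come from vkGetImageSubresourceLayout rather than being assumed equal to width * sizeof(float4). A minimal host-side sketch of the launch (driver API; device, image, dptr, stream, and a write_gradient_fn CUfunction are assumed to exist):

```cpp
// Query the actual row pitch of the linear image (rows may be padded).
VkImageSubresource sub = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0 };
VkSubresourceLayout layout;
vkGetImageSubresourceLayout(device, image, &sub, &layout);
size_t rowPitchBytes = (size_t)layout.rowPitch;

// Launch write_gradient over the image.
unsigned int bx = 16, by = 16;
unsigned int gx = (width  + bx - 1) / bx;
unsigned int gy = (height + by - 1) / by;
void* args[] = { &dptr, &width, &height, &rowPitchBytes };
cuLaunchKernel(write_gradient_fn,
               gx, gy, 1,     // grid
               bx, by, 1,     // block
               0 /*sharedMemBytes*/, stream, args, nullptr);
```

Note that a cuMemcpyDtoH readback of a pitched image then also has to honor rowPitchBytes when interpreting rows.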

output from cuMemcpyDtoH when writing alpha either 0 or 1, and from vkCmdCopyImageToBuffer/present when alpha is 1:

output from vkCmdCopyImageToBuffer when alpha is 0:

This does not look like uninitialized memory; it looks like undefined behavior due to missing or incorrect synchronization between CUDA and Vulkan. Without an explicit ownership transfer and memory-visibility guarantees on the Vulkan side, the observed values are not meaningful.
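Concretely, "visibility" here means the Vulkan submission that consumes the CUDA-written image must actually wait on the value CUDA signals before any of its transfer work starts. A hedged sketch of such a submit, assuming importedTimelineSem, cmdBuf and queue exist elsewhere:

```cpp
// The value signaled by cuSignalExternalSemaphoresAsync on the CUDA side.
uint64_t waitValue = 1;

VkTimelineSemaphoreSubmitInfo timeline = {};
timeline.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timeline.waitSemaphoreValueCount = 1;
timeline.pWaitSemaphoreValues = &waitValue;

// Block the transfer stage until the semaphore reaches waitValue.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;

VkSubmitInfo submit = {};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.pNext = &timeline;
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores = &importedTimelineSem;
submit.pWaitDstStageMask = &waitStage;
submit.commandBufferCount = 1;
submit.pCommandBuffers = &cmdBuf; // holds the GENERAL -> TRANSFER_SRC_OPTIMAL barrier + copy
vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
```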

Thanks for the reply! Could you explain a bit more what you mean by ownership transfer? I’m just using an external (timeline) semaphore to signal to Vulkan that CUDA is done, which, from the only example I’ve been able to find, looks like all that is required. Is there another step I’m missing?

I share memory between Vulkan and CUDA and I followed the CUDA samples GitHub repo. I use two external semaphores, not one: a Vulkan-to-CUDA semaphore that CUDA waits on, and a CUDA-to-Vulkan semaphore that Vulkan waits on.

    checkCudaErrors(cudaWaitExternalSemaphoresAsync(cuda_v2c_semaphore, &params, 1, cuda_stream));
    // ... launch optix/kernels
    checkCudaErrors(cudaSignalExternalSemaphoresAsync(cuda_c2v_semaphore, &params, 1, cuda_stream));

On the Vulkan side I do the same. I’m personally not using a timeline semaphore, but those are probably better. There are a lot of Vulkan-CUDA interop examples in the CUDA samples repo.

On the Vulkan side I sync as normal. On the CUDA side I also synchronize my CUDA stream prior to signaling the semaphore. I also have these pairs of semaphores per frame in flight, as well as per-frame-in-flight shared memory and per-frame-in-flight CUDA streams.
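The per-frame-in-flight arrangement described above might look roughly like this (runtime API; all names are hypothetical):

```cpp
// One bundle of interop resources per frame in flight.
struct FrameInterop {
    cudaExternalSemaphore_t v2c;       // Vulkan signals, CUDA waits
    cudaExternalSemaphore_t c2v;       // CUDA signals, Vulkan waits
    cudaStream_t            stream;    // per-frame CUDA stream
    void*                   sharedMem; // this frame's mapped external memory
};

void cudaRenderFrame(FrameInterop& f)
{
    cudaExternalSemaphoreWaitParams   wait = {};
    cudaExternalSemaphoreSignalParams sig  = {}; // for timeline semaphores,
                                                 // set sig.params.fence.value
    cudaWaitExternalSemaphoresAsync(&f.v2c, &wait, 1, f.stream);
    // ... launch kernels writing f.sharedMem on f.stream ...
    cudaSignalExternalSemaphoresAsync(&f.c2v, &sig, 1, f.stream);
}
```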