nvEncRegisterResource fails when using cuMemAllocAsync

I’m currently trying to move to stream-ordered memory allocations (i.e. cuMemAllocAsync/cuMemFreeAsync) in some code that uses NvEnc, but I ran into an issue. When I change the allocation of the NvEnc input buffers from cuMemAllocPitch to cuMemAllocAsync (there is no pitched variant of the async allocator), nvEncRegisterResource fails with NV_ENC_ERR_RESOURCE_REGISTER_FAILED.

My first guess was that the missing pitch padding was the problem, but allocating the same unpadded layout with plain cuMemAlloc works fine.

My second guess was that the allocation has to have actually completed before nvEncRegisterResource is called, so I inserted a cuStreamSynchronize between the allocation and the register call, but without any luck.

The relevant parts of the code:

CUdeviceptr dEncoderInputBuffer;

// Old, working way:
// size_t pitch = 0;
// CUresult cuStatus = cuMemAllocPitch(&dEncoderInputBuffer, &pitch, getWidthInBytes(width, NV_ENC_BUFFER_FORMAT_NV12), lumaHeight + chromaHeight, 16);

// This also works, but is not async:
// size_t pitch = getWidthInBytes(width, NV_ENC_BUFFER_FORMAT_NV12);
// CUresult cuStatus = cuMemAlloc(&dEncoderInputBuffer, pitch * (lumaHeight + chromaHeight));

// This is what I want to work:
size_t pitch = getWidthInBytes(width, NV_ENC_BUFFER_FORMAT_NV12);
CUresult cuStatus = cuMemAllocAsync(&dEncoderInputBuffer, pitch * (lumaHeight + chromaHeight), cudaStream);

if (cuStatus != CUDA_SUCCESS) {
    std::cout << "Failed to allocate encoder memory, error code: " << cuStatus << "\n";
    return false;
}

// Adding this does not seem to help
// cuStatus = cuStreamSynchronize(cudaStream);
// if (cuStatus != CUDA_SUCCESS) {
//     std::cout << "Failed to synchronize with CUDA stream, error code: " << cuStatus << "\n";
//     return false;
// }

NV_ENC_REGISTER_RESOURCE registerResource{};
registerResource.version = NV_ENC_REGISTER_RESOURCE_VER;
registerResource.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
registerResource.resourceToRegister = reinterpret_cast<void*>(dEncoderInputBuffer);
registerResource.width = width;
registerResource.height = height;
registerResource.pitch = pitch;
registerResource.bufferFormat = NV_ENC_BUFFER_FORMAT_NV12;
registerResource.bufferUsage = NV_ENC_INPUT_IMAGE;
registerResource.pInputFencePoint = nullptr;
NVENCSTATUS status = apiInstance.nvEncRegisterResource(encoderHandle, &registerResource);
if (status != NV_ENC_SUCCESS) {
    std::cout << "Failed to register CUDA buffer, error code: " << status << "\n";
    return false;
}

My system: Ubuntu 22.04 with an RTX A2000, CUDA toolkit 12.2.

I was under the impression that memory allocated with cuMemAllocAsync can be used exactly like synchronously allocated memory, as long as the allocation and all uses are correctly ordered on the stream. Isn’t that true?
Is this supposed to work, or is there a limitation in the NvEnc API that only synchronously allocated memory can be registered?
If so, how can I avoid synchronizing the device every time I start a new encoder instance and have to allocate input buffers for it?