You call cudaAllocMapped(), but you never release the memory.
Either re-use the previous buffer, or release the memory once you’re done with it.
cudaAllocMapped() is a wrapper for cudaHostAlloc() and cudaHostGetDevicePointer().
This allocation is slow, and you’re not supposed to do it every frame. Instead, keep the buffer you receive from the first call to cudaAllocMapped(), and re-use it for each frame.
In the end you may want to allocate two or three buffers and double-buffer or triple-buffer, depending on how your code and workload are structured.
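As a rough sketch of the allocate-once idea, something like the following could work. FramePool, allocFrame(), freeFrame() and runFrameLoop() are hypothetical names made up for this example, and std::malloc()/std::free() are only stand-ins so it builds without CUDA. In real code the allocation would presumably be cudaAllocMapped() and the release cudaFreeHost(), and the frame size is an assumption too.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins so the sketch builds without CUDA (assumption). In a real program,
// allocFrame() would call cudaAllocMapped() and freeFrame() would call
// cudaFreeHost(), since the memory comes from cudaHostAlloc().
static int g_allocCalls = 0;  // counts allocations, only for demonstration

static void* allocFrame(std::size_t bytes) {
    ++g_allocCalls;
    return std::malloc(bytes);
}

static void freeFrame(void* ptr) { std::free(ptr); }

// Hypothetical helper: allocates `count` buffers once and hands them out in
// rotation. count = 1 is plain re-use, 2 is double-buffering, 3 is triple.
class FramePool {
public:
    FramePool(std::size_t count, std::size_t bytes) {
        assert(count > 0);
        buffers_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            void* ptr = allocFrame(bytes);
            if (ptr == nullptr) {
                releaseAll();  // do not leak the buffers already allocated
                throw std::bad_alloc();
            }
            buffers_.push_back(ptr);
        }
    }
    ~FramePool() { releaseAll(); }

    FramePool(const FramePool&) = delete;
    FramePool& operator=(const FramePool&) = delete;

    // Returns the next buffer in rotation. Never allocates.
    void* next() {
        void* ptr = buffers_[index_];
        index_ = (index_ + 1) % buffers_.size();
        return ptr;
    }

private:
    void releaseAll() {
        for (void* ptr : buffers_) freeFrame(ptr);
        buffers_.clear();
    }

    std::vector<void*> buffers_;
    std::size_t index_ = 0;
};

// Simulates a frame loop and returns how many allocations it caused.
int runFrameLoop(int frameCount) {
    const int before = g_allocCalls;
    const std::size_t frameBytes = 1280 * 720 * 4;  // assumed RGBA8 720p frame
    FramePool pool(2, frameBytes);                  // allocate once, up front

    for (int frame = 0; frame < frameCount; ++frame) {
        void* buffer = pool.next();  // re-used, no per-frame allocation
        (void)buffer;                // capture and process the frame here
    }
    return g_allocCalls - before;    // buffers are freed when pool goes out of scope
}
```

Calling runFrameLoop(100) allocates twice in total, however many frames go through the loop. With mapped memory you would also keep the device pointer next to each host pointer.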
I suppose @snarky meant that you should allocate the CUDA buffer when your program initializes, before the image processing loop, and free it once everything is done. That way the buffer is always available to the function you call in the loop, and there is no need to allocate and free it for each image, which will speed up your code.
Two notes, however:
If your images have different sizes, allocate the buffer for the biggest size. A smaller image fits in a bigger buffer without any problem.
If you are multi-threading, you may have to put a lock on the buffer, or have each thread allocate its own buffer.
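To make the two notes concrete, here is a minimal sketch, not a definitive implementation. ImageSize, largestImageBytes() and SharedFrameBuffer are hypothetical names invented for the example, and std::malloc()/std::free() again stand in for cudaAllocMapped()/cudaFreeHost() so it compiles without CUDA.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <vector>

struct ImageSize {
    std::size_t width;
    std::size_t height;
    std::size_t bytesPerPixel;
};

// Byte size of the largest expected image, so one buffer fits all of them.
std::size_t largestImageBytes(const std::vector<ImageSize>& sizes) {
    std::size_t maxBytes = 0;
    for (const ImageSize& s : sizes) {
        maxBytes = std::max(maxBytes, s.width * s.height * s.bytesPerPixel);
    }
    return maxBytes;
}

// Hypothetical wrapper: one buffer, allocated once, guarded by a mutex.
// std::malloc/std::free are stand-ins (assumption) for cudaAllocMapped() and
// cudaFreeHost() so the sketch builds without CUDA.
class SharedFrameBuffer {
public:
    explicit SharedFrameBuffer(std::size_t capacity)
        : data_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(capacity) {
        assert(data_ != nullptr);
    }
    ~SharedFrameBuffer() { std::free(data_); }

    SharedFrameBuffer(const SharedFrameBuffer&) = delete;
    SharedFrameBuffer& operator=(const SharedFrameBuffer&) = delete;

    // Copies an image into the buffer while holding the lock, so only one
    // thread uses the buffer at a time. Returns false if the image is larger
    // than the buffer.
    bool load(const unsigned char* image, std::size_t bytes) {
        if (bytes > capacity_) return false;
        std::lock_guard<std::mutex> lock(mutex_);
        std::memcpy(data_, image, bytes);
        // ... run the processing on data_ here, still under the lock ...
        ++framesProcessed_;
        return true;
    }

    std::size_t framesProcessed() {
        std::lock_guard<std::mutex> lock(mutex_);
        return framesProcessed_;
    }

private:
    unsigned char* data_;
    std::size_t capacity_;
    std::mutex mutex_;
    std::size_t framesProcessed_ = 0;
};
```

Each worker thread would call load() on the same SharedFrameBuffer, created once with largestImageBytes() as its capacity. If the lock becomes a bottleneck, giving each thread its own buffer and dropping the mutex is the other option.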