cudaMemcpy OpenCV GpuMat to memory of Triton Inference Server

I am struggling with a GpuMat conversion to the Triton Inference Server. I want to copy data of a GpuMat to the shared memory of the inference server.

The image of this example is a 600 * 600 * 3 floating point image.

I first tried with a cv::Mat, and that works well.

cv::Mat image;
// Do things with image
size_t inputSize = image.total() * image.elemSize();  // 360000 * 12 = 4320000
cudaError error = cudaMemcpy((void*)inferenceSharedMemoryPtr, image.ptr(0), inputSize, cudaMemcpyHostToDevice);

Now, we process the image on the GPU. We first allocate memory on the GPU with cudaMalloc3D and then use the pointer to the allocated memory to create a GpuMat. The cudaExtent we use for cudaMalloc3D has a depth of 1, a height equal to the image height, and a width of image width * 3 * sizeof(float) (three ‘colors’, and the values are converted to floats because that’s what the inference server expects).
The GpuMat is created like:

cv::cuda::GpuMat gpuImage(height, width * 3, CV_32FC1, sharedMemoryPointer, pitch);

Where the pitch is retrieved from the cudaMalloc3D call.
Height is 600, the row width in bytes is 7200 (600 * 3 * sizeof(float)), and the pitch is 7680. The shared memory pointer is the pointer returned from the cudaMalloc3D call.
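To make these numbers concrete, here is a minimal host-side sketch of the byte arithmetic behind a pitched allocation. The helper names (rowBytes, rowPadding, pitchedOffset) are made up for illustration and are not part of CUDA or OpenCV; the element size of 4 bytes assumes a 32-bit float.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of real pixel data in one row, without any padding.
constexpr std::size_t rowBytes(std::size_t widthPx, std::size_t channels,
                               std::size_t elemBytes) {
    return widthPx * channels * elemBytes;
}

// Padding bytes that sit at the end of each row of a pitched allocation.
constexpr std::size_t rowPadding(std::size_t pitch, std::size_t dataBytes) {
    return pitch - dataBytes;
}

// Byte offset of a given byte of a given row inside a pitched buffer.
inline std::size_t pitchedOffset(std::size_t row, std::size_t byteInRow,
                                 std::size_t pitch, std::size_t dataBytes) {
    assert(byteInRow < dataBytes && dataBytes <= pitch);
    return row * pitch + byteInRow;
}
```

With the values from this post, rowBytes(600, 3, 4) is 7200 and rowPadding(7680, 7200) is 480, so a pitched buffer holds 600 * 7680 bytes while the contiguous image the server expects holds 600 * 7200 = 4320000 bytes.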

Then, we want to memcpy the data from the GpuMat to the shared memory of the Triton Inference Server. The Inference Server expects continuous data and of course, the GpuMat is not. So the question is: how to memcpy the data so it becomes continuous and the Inference Server can use it?

I tried with

cudaError error = cudaMemcpy2D((void*)inferenceSharedMemoryPtr, width, sharedMemoryPointer, pitch, width, height, cudaMemcpyDeviceToDevice);

Where width is 7200 (600 * 3 * sizeof(float)), height is 600, pitch is 7680, inferenceSharedMemoryPtr is the pointer on the inference server to the shared memory, and sharedMemoryPointer is the pointer from the cudaMalloc3D call.
I get results from the inference server, but they are far from correct. When doing a memcpy of the same image from ‘normal memory’ to the inference server (so from a cv::Mat as in the first example), everything works well.

So I think the cudaMemcpy2D is not going well. It might just copy the wrong data. After doing the copy, I created a GpuMat pointing to the just-copied data on the Inference Server and downloaded it, but it is all black. I am not sure whether that check should work anyway.

What also does not work (but seems logical to do because of the pointer to the image data):

cudaError error = cudaMemcpy2D((void*)inferenceSharedMemoryPtr, width, gpuImage.data, pitch, width, height, cudaMemcpyDeviceToDevice);

Where width is 7200 (600 * 3 * sizeof(float)), height is 600, pitch is 7680, and inferenceSharedMemoryPtr is the pointer on the inference server to the shared memory.

The cudaMemcpy2D call results in an error: cudaErrorInvalidValue. Same with gpuImage.ptr(0).

Does anyone know how to do this?

Thanks in advance!

PS I don’t know if it’s a useful hint but when I download and upload the image with:

cv::cuda::GpuMat test(600, 600, CV_32FC3, (char*)gpuImage.data);  // I know this is a bit strange but at that point in the code I only have a pointer to the shared memory of the gpumat and not the gpu mat itself.
cv::Mat downloaded;
test.download(downloaded);
cudaMemcpy((void*)inferenceSharedMemoryPtr, downloaded.ptr(0), inputSize, cudaMemcpyHostToDevice); // inputSize = 360000 * 12 = 4320000

then it works as expected. But that’s of course not what I want: I want to keep everything on the GPU.

What kind of data structure is GpuMat? If it is one of these pointer-to-array-of-row-pointers constructs you will need to gather the rows into a contiguous 2D array, because you cannot make any assumptions about the starting address of each row of the source matrix. That means performing numberOfRows 1D copies, one for each row.

Yes, it is not continuous.

I tried to memcpy the data row by row as well, like this:

size_t sharedMemPtr = 0;  // byte offset into the destination
size_t imagePtr = 0;      // byte offset into the source
for (int i = 0; i < gpuImage.rows; ++i)
{
    cudaMemcpy((char*)inferenceSharedMemoryPtr + sharedMemPtr, (char*)gpuImage.data + imagePtr, width, cudaMemcpyDeviceToDevice);
    imagePtr += pitch;
    sharedMemPtr += width;
}

(I did not copy-paste this, so there may be a minor mistake.)

But that did not help either. When I download the data at that memory address into a GpuMat, I get a grey image with black stripes.

I just don’t know how to convert / copy the image on the gpu to the correct format (so from not continuous to continuous, like what happens when you download / upload the image from / to the gpu memory with opencv). I thought with cudaMemcpy2D it would work if you give another pitch to the destination (which is ‘width’ in the above example), but that does not seem to work.

Your code seems to assume that the rows of both source and destination matrices are stored at fixed-sized increments. In other words, it assumes strided storage (pitch is a very similar concept, except the distance between rows is expressed in bytes rather than elements). If GpuMat uses a pointer-to-array-of-row-pointers representation (note I am not saying it does; this is something you need to find out from documentation or source code), this will not work in general, because each row could reside at an arbitrary address. You would need to extract each row pointer, then copy the row it points to. E.g. for a 2D matrix implemented as a pointer to array of row pointers, you’d have something like this:

    float **matrix;
    for (int i = 0; i < rowCount; i++) {
        float *rowPtr = matrix[i];
        ... copy row ...
    }

It may be helpful to draw diagrams of how the data is arranged in memory, e.g. boxes for contiguous chunks of memory, arrows for pointers, etc.

If both source and destination storage are strided/pitched, with different row spacing for source and destination, then cudaMemcpy2D() is the correct tool, as that is what it is designed for.
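As an illustration of what such a pitched copy does, here is a host-side analog written with std::memcpy. It is only a sketch of the semantics, not the CUDA implementation; the function name copy2DHost is made up for this example, and all sizes are in bytes, mirroring the argument order of cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `height` rows of `widthBytes` bytes each. Consecutive rows start
// `srcPitch` bytes apart in the source and `dstPitch` bytes apart in the
// destination, so padding at the end of source rows is skipped.
inline void copy2DHost(void* dst, std::size_t dstPitch,
                       const void* src, std::size_t srcPitch,
                       std::size_t widthBytes, std::size_t height) {
    // A row cannot be wider than the spacing between row starts.
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
    }
}
```

Passing dstPitch equal to widthBytes packs the rows back to back, which is the continuous layout the destination is expected to have in this thread.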

Thanks for answering, I really appreciate that.

I think my assumption was indeed wrong, and I am afraid I have to find out what GpuMat does under the hood, because your code did not work either. There is not really any documentation about it. I would say that if I can download and upload the image and it works, there must be a way to copy it while keeping it on the GPU as well.

I’ll dive into the GpuMat. Thanks.

Since your matrix is a 3D matrix, it would likely use a three-level tree instead of the two-level tree I described for the 2D matrix. Or maybe the pointers refer to columns instead of rows.

Ah, it turned out that somewhere in the preprocessing the image was made contiguous. So we had a cv::cuda::GpuMat whose data was contiguous, stored in GPU memory that was allocated as pitched (created with cudaMalloc3D). Of course, that is a bad combination.

So it turns out that copying a cv::cuda::GpuMat with cudaMemcpy2D works fine. Use the step of the GpuMat as the source pitch, or the pitch returned by the cudaMalloc3D / cudaMallocPitch call. The destination pitch should be the row width in bytes (here 600 * 3 * sizeof(float) = 7200), because a continuous image has no padding between rows. The width argument is that same row width in bytes, and the height is the number of rows. src can be gpuMat.data or gpuMat.ptr(0), as far as I can see.

About cudaMalloc3D and cudaMemcpy2D: I found out that the memory could also have been allocated with cudaMallocPitch. Since we used a depth of 1, it works with cudaMemcpy2D.
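The argument choice described here can be sketched as a small helper that derives the byte values from the GpuMat fields. The structs PitchedImage and Copy2DArgs and the function toContiguousArgs are hypothetical names for this example; only rows, cols, step and elemSize() correspond to real cv::cuda::GpuMat members, and the CUDA call is shown as a comment because the sketch is plain C++.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical plain-data mirror of the GpuMat fields that matter here.
struct PitchedImage {
    std::size_t rows;      // gpuMat.rows
    std::size_t cols;      // gpuMat.cols
    std::size_t elemSize;  // gpuMat.elemSize(): bytes per element, all channels
    std::size_t step;      // gpuMat.step: bytes between consecutive row starts
};

// Byte-based arguments for a pitched-to-contiguous 2D copy.
struct Copy2DArgs {
    std::size_t dstPitch;
    std::size_t srcPitch;
    std::size_t widthBytes;
    std::size_t height;
};

inline Copy2DArgs toContiguousArgs(const PitchedImage& img) {
    const std::size_t widthBytes = img.cols * img.elemSize;
    assert(widthBytes <= img.step);  // a row must fit inside its pitch
    // Destination is packed, so its pitch equals the row width in bytes.
    return Copy2DArgs{widthBytes, img.step, widthBytes, img.rows};
}

// With the CUDA runtime, the values would be passed like this (not compiled here):
//   cudaMemcpy2D(dst, a.dstPitch, src, a.srcPitch, a.widthBytes, a.height,
//                cudaMemcpyDeviceToDevice);
```

For the GpuMat in this thread (600 rows, 1800 columns of CV_32FC1, step 7680) this gives a destination pitch and width of 7200 bytes, a source pitch of 7680 bytes, and a height of 600.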

Thanks for your help anyway!!

Ideally all matrices would be stored in one contiguous block on both host and device, as that allows the use of plain cudaMemcpy() which is typically significantly faster than cudaMemcpy2D().
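One way to act on this advice is to check whether the source is already contiguous and only fall back to the 2D copy when it is not. The sketch below shows that decision on plain numbers; CopyPlan, planCopy and linearCopyBytes are made-up names for illustration. In OpenCV, GpuMat::isContinuous() can be queried for the same purpose.

```cpp
#include <cassert>
#include <cstddef>

enum class CopyPlan { SingleLinearCopy, Pitched2DCopy };

// Decides which kind of copy fits a buffer with `rows` rows of `rowBytes`
// data bytes each, spaced `step` bytes apart.
inline CopyPlan planCopy(std::size_t rows, std::size_t rowBytes,
                         std::size_t step) {
    assert(rowBytes <= step);
    // A single row is always one contiguous run, whatever the step is.
    if (rows <= 1 || step == rowBytes) {
        return CopyPlan::SingleLinearCopy;  // one cudaMemcpy of rows * rowBytes
    }
    return CopyPlan::Pitched2DCopy;         // cudaMemcpy2D with spitch = step
}

// Number of bytes to pass to the single linear copy.
inline std::size_t linearCopyBytes(std::size_t rows, std::size_t rowBytes) {
    return rows * rowBytes;
}
```

With a step of 7200 the 600-row image from this thread would take the single-copy path with 4320000 bytes, while a step of 7680 would take the 2D path.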

Ah ok, I will check whether that is possible.
I thought some things were processed faster on the GPU with non-contiguous memory, but I will look into it. Thanks for your advice!