I am struggling to get a GpuMat into the Triton Inference Server. I want to copy the data of a GpuMat to the shared memory of the inference server.
The image in this example is a 600 * 600 * 3 floating point image.
I first tried with a cv::Mat, that works well.
cv::Mat image;
// Do things with image
size_t inputSize = image.total() * image.elemSize(); // 360000 * 12 = 4320000
cudaError error = cudaMemcpy((void*)inferenceSharedMemoryPtr, image.ptr(0), inputSize, cudaMemcpyHostToDevice);
Now, we process the image on the GPU. We first allocate memory on the GPU with cudaMalloc3D and then use the pointer to the allocated memory to create a GpuMat. The cudaExtent we use for cudaMalloc3D has a depth of 1, a height of the image height, and a width (in bytes) of image width * 3 * sizeof(float)
(three ‘colors’, and the values are converted to floats because that’s what the inference server expects).
The GpuMat is created like:
cv::cuda::GpuMat gpuImage(height, width * 3, CV_32FC1, sharedMemoryPointer, pitch)
Where the pitch is retrieved from the cudaMalloc3D call.
Height is 600, the GpuMat has 1800 columns (600 * 3) of CV_32FC1, the row width in bytes is 7200 (600 * 3 * sizeof(float)), and the pitch is 7680. The shared memory pointer is the pointer returned by the cudaMalloc3D call.
Then, we want to memcpy the data from the GpuMat to the shared memory of the Triton Inference Server. The Inference Server expects continuous data and of course, the GpuMat is not. So the question is: how to memcpy the data so it becomes continuous and the Inference Server can use it?
I tried with
cudaError error = cudaMemcpy2D((void*)inferenceSharedMemoryPtr, width, sharedMemoryPointer, pitch, width, height, cudaMemcpyDeviceToDevice);
Where width is 7200 (600 * 3 * sizeof(float)), height is 600, pitch is 7680, inferenceSharedMemoryPtr is the pointer to the inference server’s shared memory, and sharedMemoryPointer is the pointer from the cudaMalloc3D call.
I get results from the inference server, but they are far from correct. When doing a memcpy of the same image from ‘normal memory’ to the inference server (so from a cv::Mat as in the first example), everything works well.
So I think the cudaMemcpy2D is not going well; it might just be copying the wrong data. I don’t know whether this is a valid check, but after the cudaMemcpy2D I created a GpuMat pointing to the just-copied data on the inference server and downloaded it: it is all black. I am not sure that should work anyway, though.
What also does not work (but seems logical to try, because gpuImage.data points to the image data):
cudaError error = cudaMemcpy2D((void*)inferenceSharedMemoryPtr, width, gpuImage.data, pitch, width, height, cudaMemcpyDeviceToDevice);
Where width is 7200 (600 * 3 * sizeof(float)), height is 600, pitch is 7680, and inferenceSharedMemoryPtr is the pointer to the inference server’s shared memory.
The cudaMemcpy2D call results in an error: cudaErrorInvalidValue. The same happens with gpuImage.ptr(0).
Does anyone know how to do this?
Thanks in advance!
PS I don’t know if it’s a useful hint, but when I download and upload the image with:
cv::cuda::GpuMat test(600, 600, CV_32FC3, (char*)gpuImage.data); // a bit strange, but at this point in the code I only have the pointer to the GpuMat's shared memory, not the GpuMat itself
cv::Mat downloaded;
test.download(downloaded);
cudaMemcpy((void*)inferenceSharedMemoryPtr, downloaded.ptr(0), inputSize, cudaMemcpyHostToDevice); // inputSize = 360000 * 12 = 4320000
then it works like expected. But that’s of course not what I want: I want to keep everything on the GPU.