Optical Flow: Why don't UploadData and DownloadData handle case where source is GPU?

My application gives me a cudaArray already on the device, so the source is CU_MEMORYTYPE_DEVICE (not CU_MEMORYTYPE_HOST), and I want to upload it to a buffer for optical flow processing.

I looked at the implementations of NvOFBufferCudaDevicePtr::UploadData and NvOFBufferCudaArray::UploadData but neither handles the case where cuCopy2d.srcMemoryType = CU_MEMORYTYPE_DEVICE;.

Why is this? I added my own variation of these methods, but I am wondering why it’s not already a part of the SDK. Thanks.

@mandar_godse Could you help me out with converting S10.5 to float2 but on the device?

I modified DownloadData like I mentioned into CustomDownloadData:

void NvOFBufferCudaArray::CustomDownloadData(void* pData) {
CUstream stream = m_NvOFAPI->GetCudaStream(getBufferUsage());
CUDA_DRVAPI_CALL(cuCtxPushCurrent(m_cuContext));
CUDA_MEMCPY2D cuCopy2d;
memset(&cuCopy2d, 0, sizeof(cuCopy2d));
cuCopy2d.WidthInBytes = getWidth() * getElementSize();

cuCopy2d.dstMemoryType = CU_MEMORYTYPE_ARRAY; // this used to be CU_MEMORYTYPE_HOST !
cuCopy2d.dstArray = (CUarray) pData; // added this line instead of cuCopy2d.dstHost
cuCopy2d.dstPitch = cuCopy2d.WidthInBytes;
cuCopy2d.srcMemoryType = CU_MEMORYTYPE_ARRAY;
cuCopy2d.srcArray = getCudaArray();
cuCopy2d.Height = getBufferFormat() == NV_OF_BUFFER_FORMAT_NV12 ? (getHeight() + getHeight() / 2) : getHeight();
CUDA_DRVAPI_CALL(cuMemcpy2DAsync(&cuCopy2d, stream));
if (getBufferFormat() == NV_OF_BUFFER_FORMAT_NV12)
{
    cuCopy2d.Height = (getHeight() + 1) / 2;
    cuCopy2d.dstHost = ((uint8_t*)pData + (cuCopy2d.dstPitch * cuCopy2d.Height));
    cuCopy2d.srcY = m_strideInfo.strideInfo[0].strideYInBytes;
    CUDA_DRVAPI_CALL(cuMemcpy2DAsync(&cuCopy2d, stream));
}
CUDA_DRVAPI_CALL(cuStreamSynchronize(stream));
CUDA_DRVAPI_CALL(cuCtxPopCurrent(&m_cuContext));
}

Then:

cudaArray* intermediateArray;	
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(16, 16, 0, 0, cudaChannelFormatKindUnsigned);
cudaError_t cudaStatus = cudaMallocArray(&intermediateArray, &channelDesc, Width, Height, cudaArraySurfaceLoadStore);
upsampleBuffers[0]->CustomDownloadData(intermediateArray);
doCUDAOperation(outputFormat->width, outputFormat->height, intermediateArray, outputArray);

But now, no matter kernel I come up with, I keep observing something weird about the data directly read from intermediateArray whether through tex2D or surf2Dread. 16-bits should be the flow_x channel and the following 16-bits should be the flow_y channel. For each of these channels, I’m seeing that the MSB is correct but is propagated to the upper 8 bits. The lower 8-bits of each channel might be correct. What could explain this?

I’m following Generates flow vectors in X and Y direction in S10.5 format at quarter pixel precision for 8-bit contents. The vectors are provided for 4x4 grid size.