Input/output buffer formatting

I’m trying to run inference on a YOLOv2 network with input shape (1, 608, 608, 3) and output shape (1, 19, 19, 5, 6). The network expects NHWC formatting; there are transpose layers on the input/output to handle this.

I am preparing the image buffer as follows, where `deviceBuffer` is a `void*[2]` passed to the function as `void* (&deviceBuffer)[2]`. Note the input mat is RGB, not BGR.

    cv::cuda::GpuMat flMat;
    cv::cuda::resize(irImg, irImg, cv::Size(608, 608), 0, 0, cv::INTER_CUBIC);
    irImg.convertTo(flMat, CV_32FC3);
    cv::cuda::divide(flMat, 255.0, flMat);

    float* devicePtr;
    cudaMalloc(&devicePtr, bufferSize[0]);
    // Wrap the raw allocation so copyTo() packs the (possibly pitched) flMat
    // into a continuous buffer.
    cv::cuda::GpuMat deviceMat(flMat.rows, flMat.cols, CV_32FC3, devicePtr);
    flMat.copyTo(deviceMat);
    cudaError_t err = cudaMemcpy(deviceBuffer[0], devicePtr, bufferSize[0], cudaMemcpyDeviceToDevice);
    cudaFree(devicePtr);
    if (err != cudaSuccess) {
        DLOG(INFO) << "Buffer load failed: " << err;
        return false;
    }

I then execute inference using the `void*[2]` in/out buffer and attempt to decode the output with a triple-nested loop over the 19, 19, 5 dims.

The indexing for the output is calculated as follows, where `i` indexes the last (6-element) dimension of the output shape above:

    dim.d[1] * (dim.d[2] * (dim.d[3] * r + c) + b) + i

With the loop looking like:

    for (int r = 0; r < dataDim.d[1]; r++) {
        for (int c = 0; c < dataDim.d[2]; c++) {
            for (int b = 0; b < dataDim.d[3]; b++) {
                float tp = data[calculateIdx(r, c, b, 4, dataDim)];
                float prob = sigmoid(tp);
                if (prob < 0.5f) { continue; }
                // ...
            }
        }
    }

The decoded output is completely incorrect: almost every grid cell returns a sufficiently high detection probability.
Any thoughts would be appreciated.