OpenCL vs CUDA: Pinned memory

Dear community,

For my graduation project, I am comparing the performance of the RabbitCT benchmark between CUDA and OpenCL on a GPU, and OpenCL on an FPGA. At the moment, I have a CUDA RabbitCT implementation that uses pinned memory, and I am trying to do the same in OpenCL. However, the OpenCL implementation uses twice as much device memory as CUDA, and I have been unable to resolve this without introducing other issues.

For CUDA I create a host buffer holding the input images to be processed. This allows the host to issue all kernel calls asynchronously and to load the next set of input images while the kernels are still processing the images already in the buffer. A mockup can be seen in the following code:

// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];

// initiate streams
for( uint s = 0; s < STREAMS_MAX; s++ ){
    gpuErrchk(cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking));
    gpuErrchk(cudaMalloc( (void**)&devProjection[s], imgSize));
}

// initiate buffers
for( uint b = 0; b < BUFFER_SIZE; b++ ){
    cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
}

// main loop
for( all input images ){
    const uint projNr = imgNr % BUFFER_SIZE;
    const uint streamNr = imgNr % STREAMS_MAX;
 	
    // When buffer is filled, wait until work in current stream has finished before refilling buffer
    if(projNr == 0)	{
        gpuErrchk(cudaStreamSynchronize(stream[streamNr]));
    }       
        
    // copy received image data to buffer
    std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);
        
    // copy image and matrix to device
    cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );

    // call kernel
    backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devData, devProjection[streamNr], devMatrix[streamNr], devVolume);
}

So, for CUDA, I create a pinned host buffer for each buffer slot and copy its data to the device before launching the kernel in the corresponding stream.
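
As a side note, gpuErrchk in the mockup above is just the usual CUDA error-checking wrapper; its definition is not shown, but it is roughly the following:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// rough equivalent of the gpuErrchk wrapper used in the mockup:
// print the CUDA error string and abort if a runtime call fails
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}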

For OpenCL I do something similar following the Nvidia OpenCL Best Practices Guide:

// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX];
cl_mem devProjectionPinned[BUFFER_SIZE];
cl_command_queue queue[STREAMS_MAX];

// initiate streams
for( uint s = 0; s < STREAMS_MAX; s++ ){
    queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
    devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY, imgSize, NULL, &status);
}

// initiate buffers
for( uint b = 0; b < BUFFER_SIZE; b++ ){
    devProjectionPinned[b] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
    hostProjBuffer[b] = (float*) clEnqueueMapBuffer(queue[0], devProjectionPinned[b], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);
}

// main loop
for( all input images ){
    const uint projNr = imgNr % BUFFER_SIZE;
    const uint streamNr = imgNr % STREAMS_MAX;

    // When buffer is filled, wait until work in current stream has finished before refilling buffer
    if(projNr == 0) {
       status = clFinish(queue[streamNr]);
    }

    // copy received image data to hostbuffers
    std::copy(imgPtr, imgPtr + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // copy image and matrix to device
    clEnqueueWriteBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, 0, imgSize, hostProjBuffer[projNr], 0, NULL, NULL);

    // set stream specific arguments
    clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);

    // launch kernel
    clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    clFlush(queue[streamNr]);
}

So, with OpenCL a pinned cl_mem buffer is created and mapped to a host address. This host address is used as the staging buffer, and its contents are copied to the kernel's input buffer before the kernel is executed.
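
One difference from the CUDA mockup: the OpenCL error handling is omitted above, so status is collected but never checked. A small helper in the spirit of gpuErrchk could look like the following sketch (clErrchk and clAssert are just illustrative names, not part of the OpenCL API):

#include <cstdio>
#include <cstdlib>
#include <CL/cl.h>

// illustrative OpenCL error-check helper (clErrchk/clAssert are made-up names):
// abort with the numeric error code if an OpenCL call does not return CL_SUCCESS
#define clErrchk(ans) { clAssert((ans), __FILE__, __LINE__); }
inline void clAssert(cl_int code, const char *file, int line)
{
    if (code != CL_SUCCESS)
    {
        fprintf(stderr, "OpenCL error %d at %s:%d\n", code, file, line);
        exit(EXIT_FAILURE);
    }
}

It would be used as clErrchk(clFinish(queue[streamNr])) for calls that return the error code directly, or as clErrchk(status) after calls such as clCreateBuffer that report the error through their last parameter.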

Both versions work without issues and at a similar execution speed. However, the OpenCL implementation uses twice the device memory according to nvidia-smi. This seems to be caused by the pinned device buffers: because of the memory mapping, the entire pinned buffer apparently resides on the device as well. This does not happen with CUDA, which presumably copies straight from the pinned host memory to the destination buffer without keeping a device-side copy of the staging data.
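
For reference, the device memory numbers come from watching nvidia-smi while the benchmark runs, for example with something like:

nvidia-smi --query-gpu=memory.used --format=csv -l 1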

An OpenCL version that comes closer to the CUDA behaviour is as follows:

// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];

// initiate streams
for( uint s = 0; s < STREAMS_MAX; s++ ){
    queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
    devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
}

// main loop
for( all input images ){
    const uint projNr = imgNr % BUFFER_SIZE;
    const uint streamNr = imgNr % STREAMS_MAX;

    // when buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
       status = clFinish(queue[streamNr]);
    }

    // map host memory region to device buffer
    hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);

    // copy received image data to hostbuffers
    std::copy(imgPtr, imgPtr + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // unmap the allocated pinned host memory
    clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);	

    // set stream specific arguments
    clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);

    // launch kernel
    clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    clFlush(queue[streamNr]);
    clFinish(queue[streamNr]);   // should not be needed, but required to get correct results
}

This implementation does use a similar amount of device memory as the CUDA implementation. However, I have been unable to get this last code example working without a clFinish at the end of each loop iteration, which significantly hampers the performance of the application. Without it, the host seems to get too far ahead of the kernels, which was not the case with the second example. I also tried allocating as many host buffers as there are input images, but this did not work either. So somehow the hostProjBuffer data gets lost during execution.
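
For what it is worth, the suspicion that the host runs ahead could be checked by attaching an event to the kernel enqueue and polling its execution status from the host; the sketch below is only an illustration (the kernelDone event is not part of the code above):

// illustration only: attach an event to the kernel launch...
cl_event kernelDone;
clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &kernelDone);
clFlush(queue[streamNr]);

// ...and later poll it to see whether that kernel has actually finished yet
cl_int execStatus;
clGetEventInfo(kernelDone, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &execStatus, NULL);
if (execStatus != CL_COMPLETE) {
    // the host has run ahead of this kernel
}
clReleaseEvent(kernelDone);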

So, with these three examples, and the goal of keeping the implementations as similar to each other as possible for my graduation project, I have several questions:

  1. What is the recommended implementation for OpenCL pinned memory?
  2. For the first OpenCL implementation: why does it use double the device memory compared to CUDA?
  3. For the second OpenCL implementation: is this similar to how CUDA handles pinned memory?
  4. For the second OpenCL implementation: what causes wrong data to be used in the kernel?

If you need more information to answer my questions, just ask ;)

Thanks in advance!

Kind regards,
Remy