Image2D objects in OpenCL and OpenCL kernel performance

There are two ways to create and populate an image object (texture) in OpenCL: (a) setting the CL_MEM_COPY_HOST_PTR flag in clCreateImage2D(), or (b) using the clEnqueueWriteImage() API.
texImage = clCreateImage2D(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &imageFormat, imageWidth, imageHeight, 0, inputData, &err);

or

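texImage = clCreateImage2D(GPUContext, CL_MEM_READ_ONLY, &imageFormat, imageWidth, imageHeight, 0, NULL, &err);
size_t size3D[3] = {imageWidth, imageHeight, 1};  /* region to write; depth is 1 for a 2D image */
size_t size3DOrig[3] = {0, 0, 0};                 /* origin of the write */
/* Blocking write; row pitch and slice pitch of 0 mean tightly packed data. */
err = clEnqueueWriteImage(commandQueue, texImage, CL_TRUE, size3DOrig, size3D, 0, 0, inputData, 0, NULL, NULL);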
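
For reference, the two zeros passed after the region in the write call above are the row pitch and the slice pitch. A row pitch of 0 tells the runtime to assume tightly packed rows, i.e. the image width times the size of one pixel. Below is a minimal sketch of the host buffer size this implies for inputData. The helper names are hypothetical, and the pixel size is left as a parameter because the post does not show imageFormat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: row pitch in bytes that is assumed when the row
 * pitch argument is passed as 0 (rows tightly packed, no padding). */
static size_t packed_row_pitch(size_t width, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    /* Report 0 on overflow rather than silently wrapping around. */
    if (width > SIZE_MAX / bytes_per_pixel)
        return 0;
    return width * bytes_per_pixel;
}

/* Hypothetical helper: minimum size in bytes of the host buffer that a
 * full write of a width x height 2D image with packed rows reads from. */
static size_t packed_image_bytes(size_t width, size_t height,
                                 size_t bytes_per_pixel)
{
    size_t pitch = packed_row_pitch(width, bytes_per_pixel);
    if (pitch != 0 && height > SIZE_MAX / pitch)
        return 0;
    return pitch * height;
}
```

As an assumed example, with 4 channels of 1 byte each (such as CL_RGBA with CL_UNSIGNED_INT8), packed_image_bytes(512, 512, 4) gives 1048576 bytes.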

Using NVIDIA’s OpenCL implementation, the time to create and populate the texture with the second alternative is similar to that of CUDA, while the first alternative is about 6 times slower. The second alternative also gives at least 3 times faster access to the texture data within the OpenCL kernel than the first. Any idea why? Is there a difference in locality?

Also, in general, OpenCL kernels run 2-3 times slower than the equivalent CUDA kernels. Is this due to some overhead?
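
When comparing numbers like these, it may help to separate device execution time from host-side overhead. One option, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE, is to read CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END through clGetEventProfilingInfo, which reports device timestamps in nanoseconds. The sketch below covers only the arithmetic on those two values. The function names are hypothetical, and uint64_t stands in for cl_ulong so that the snippet builds without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: elapsed device time in milliseconds, computed from
 * two profiling timestamps given in nanoseconds. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Hypothetical helper: how many times slower measurement a is than b. */
static double slowdown(double a_ms, double b_ms)
{
    assert(b_ms > 0.0);
    return a_ms / b_ms;
}
```

A ratio computed this way excludes enqueue and launch latency on the host, so it can differ from a ratio based on wall-clock timing.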

I’m not sure about this one, but don’t you still need to call clEnqueueWriteImage even after creating the image with CL_MEM_COPY_HOST_PTR? I thought CL_MEM_COPY_HOST_PTR only meant that OpenCL makes a local copy of the data on the host, so that the original pointer can be reused or modified immediately.

The spec is not clear on this: it doesn’t say whether it allocates on the device or on the host.

EDIT:

I just tested it on the AMD implementation (I don’t have an NVIDIA GPU on my laptop), and a buffer created with CL_MEM_COPY_HOST_PTR indeed doesn’t require an explicit clEnqueueWrite* call. I’m at a loss as to what CL_MEM_ALLOC_HOST_PTR is supposed to do, though.

No idea why it would be slower.

Thanks for the heads up. I have only used the second method. Fortunately, the results are not mixed: it is not the case that one method loads faster while the other gives faster access.

The difference might be that, when the data is specified at creation time, the runtime has to take into account that there may be multiple devices to load the data to. That might inadvertently slow things down, even when there is only one device. When the data is written through a specific command queue, no such check is needed.

Hopefully someone from NVIDIA picks up on this and has a look.