Padding in OpenCL


I am a beginner in GPU programming. I tried CUDA for a few weeks and now I am trying OpenCL.
In CUDA I used the functions cudaMemcpy2D, cudaMemset2D and cudaMallocPitch to have data aligned, and it signifcantly improved performance on 8800GT, less on Fermi (because of cache ?).

My problem is that I cannot find any similar functions in OpenCL.
Would you have any idea of how I can do the same thing ?

Any help would be appreciated.
Best regards.

I would bet(hope) that when you allocate a buffer or an image in OpenCL that the driver will align the buffer to the best possible boundry. Internal to buffers you are free to align data how you want, so just allocate a larger buffer and add some bytes to the end of each row etc?


I forgot to add that my arrays are multidimensional arrays in my algorithms (signal processing).

So if I allocate using clCreateBuffer (dim1xdim2x…xdimn), when I do dim2++, as dim1 is not padded, I will jumd to an address which is not coalesced.

For example, if I want to allocate an array - float ar[10][5]:

  1. I use cudaMalloc([…]105sizeof(float)) - same with clCreateBuffer:

=> this command will allocate 200 bytes.

Each time I do “ar[i++]”, I do an address jump of 20 bytes (5 floats), which is not coalesced.

  1. I use cudaMallocPitch([…] 5*sizeof(float), 10)

=> on Fermi, it would create a buffer of 5120 bytes, corresponding to ar[10][5*sizeof(float) + pad = 512 bytes].

In this case, each time I do “ar[i++]”, I do an address jump of 512 bytes, which is coalesced.

The second method uses more memory space, but is a lot more efficient on 8800GT.

So you need to pad each dimension yourself? (depending on how the kernel is executed)

Ok for array allocation.

But what about communications ? (cudaMemcpy2D for example)

How can I only transfer the part of the array I need, and not the complete array with useless padding ?

Have you measured any significant difference due to transfering the padding as well? Shouldnt the padding be relativly small and so just issuing one large memcpy, instead of lots of smaller ones, be more efficient?

(maybe GPUs have something smart for handling this case, since it is perhaps quite common for images and vertex data, but I would be a little bit surprised).

[Actually looking at the above posts, the amount of padding seems very largein the example. Is it possible to use less padding and still get good performance?]


If NVIDIA had released their OpenCL 1.1 conformant drivers publically, I’d tell you to take a look at the clEnqueue*BufferRect() set of commands, which addresses the issue of pitched array regions (which is what you want from what I understood from your posts). However, this wont work with the latest released public drivers, as they are OpenCL 1.0, and not 1.1.

This is exactly what I need, thank you !

As they added it to the 1.1 specifications, I think I am not the only one with this need…

So I have to wait for the OpenCL 1.1 from NVidia.

I will have a try on AMD cards until they come out.

Thanks again.