I am a beginner in GPU programming. I tried CUDA for a few weeks and now I am trying OpenCL.
In CUDA I used the functions cudaMemcpy2D, cudaMemset2D and cudaMallocPitch to get aligned data, and it significantly improved performance on an 8800GT, less so on Fermi (because of the cache?).
My problem is that I cannot find any similar functions in OpenCL.
Would you have any idea how I can do the same thing?
I would bet (hope) that when you allocate a buffer or an image in OpenCL, the driver will align it to the best possible boundary. Inside a buffer you are free to lay out data however you want, so why not just allocate a larger buffer and add some padding bytes to the end of each row?
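The padding suggestion above can be sketched on the host side roughly like this. It is only an illustration: `padded_pitch` and `pack_pitched` are made-up helper names, and the alignment value you pass in is an assumption you would have to tune for your device (nothing in OpenCL 1.0 reports an ideal pitch the way cudaMallocPitch does).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: round a row width (in bytes) up to the next
 * multiple of `align`. `align` must be a power of two. */
size_t padded_pitch(size_t width_bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (width_bytes + align - 1) & ~(align - 1);
}

/* Hypothetical helper: copy a tightly packed 2D array into a pitched
 * layout, zero-filling the padding at the end of each row.
 * `dst` must hold at least dst_pitch * rows bytes. */
void pack_pitched(unsigned char *dst, size_t dst_pitch,
                  const unsigned char *src, size_t width_bytes, size_t rows)
{
    assert(dst_pitch >= width_bytes);
    for (size_t r = 0; r < rows; ++r) {
        unsigned char *row = dst + r * dst_pitch;
        memcpy(row, src + r * width_bytes, width_bytes);
        memset(row + width_bytes, 0, dst_pitch - width_bytes);
    }
}
```

You would then create the buffer with a size of pitch * rows bytes (clCreateBuffer), upload the packed staging array with a single clEnqueueWriteBuffer, and have the kernel index elements as row * pitch + column offset, with the pitch passed in as a kernel argument.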
Have you measured any significant difference due to transferring the padding as well? Shouldn't the padding be relatively small, so that issuing one large memcpy is more efficient than lots of smaller ones?
(maybe GPUs have something smart for handling this case, since it is perhaps quite common for images and vertex data, but I would be a little bit surprised).
[Actually, looking at the above posts, the amount of padding seems very large in the example. Is it possible to use less padding and still get good performance?]
If NVIDIA had released their OpenCL 1.1 conformant drivers publicly, I'd tell you to take a look at the clEnqueue*BufferRect() set of commands, which address pitched array regions (which is what you want, from what I understood from your posts). However, this won't work with the latest public drivers, as they are OpenCL 1.0, not 1.1.
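For anyone stuck on 1.0 drivers, here is a rough host-side sketch of the addressing that the rect commands describe, as I understand the 1.1 spec: an origin is (x in bytes, y in rows, z in slices), and its byte offset is origin[2] * slice_pitch + origin[1] * row_pitch + origin[0]. `rect_offset` and `write_rect_host` are hypothetical names, this runs on plain host memory rather than on a cl_mem, and it assumes all pitches are given explicitly (the real commands also accept 0 to mean "tightly packed").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of an origin given as (x in bytes, y in rows, z in slices). */
size_t rect_offset(const size_t origin[3], size_t row_pitch, size_t slice_pitch)
{
    return origin[2] * slice_pitch + origin[1] * row_pitch + origin[0];
}

/* Hypothetical host-side emulation of a pitched rectangular write:
 * copies region[0] bytes per row, region[1] rows, region[2] slices
 * from `host` into `buf`, each with its own pitches. */
void write_rect_host(unsigned char *buf, const unsigned char *host,
                     const size_t buf_origin[3], const size_t host_origin[3],
                     const size_t region[3],
                     size_t buf_row_pitch, size_t buf_slice_pitch,
                     size_t host_row_pitch, size_t host_slice_pitch)
{
    assert(buf_row_pitch >= region[0] && host_row_pitch >= region[0]);
    size_t b0 = rect_offset(buf_origin, buf_row_pitch, buf_slice_pitch);
    size_t h0 = rect_offset(host_origin, host_row_pitch, host_slice_pitch);
    for (size_t z = 0; z < region[2]; ++z) {
        for (size_t y = 0; y < region[1]; ++y) {
            memcpy(buf + b0 + z * buf_slice_pitch + y * buf_row_pitch,
                   host + h0 + z * host_slice_pitch + y * host_row_pitch,
                   region[0]);
        }
    }
}
```

With a 1.1 driver the same origin, region and pitch values would go to clEnqueueWriteBufferRect instead, and the runtime does the strided copy for you. On 1.0 you can run something like this into a staging array and upload it with one clEnqueueWriteBuffer.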