I have a simple question; it’s only about performance.
I have a buffer b0 that I have to read (with bank conflicts) to generate another buffer b1 (with coalesced writes).
Then I read b1 (again with bank conflicts) to generate b2 (with coalesced writes).
I’m trying to get the maximum performance (like everybody, I guess), so I use image2d objects, which are supposed to be penalized less when bank conflicts appear.
What I do is:
Put b0 in a “read only” image2d.
Generate b1 (buffer) from b0.
Copy the buffer b1 into a “read only” image2d b1_bis.
Generate b2 (buffer) from b1_bis.
I don’t like the copy part and I was thinking about this pseudo-code:
Put b0 in a “read only” image2d.
Generate b1 (read-write image2d) from b0.
Generate b2 (buffer) from b1.
Do you think a read-write image2d is slower or faster than “read-only image2d” + “memory copy on the device” + “extra allocation for the intermediate buffer”?
I guess I should benchmark it to be sure.
But if somebody can tell me that a read-write image performs as well as a buffer for coalesced writes and as well as a read-only image for reads, I already have my answer :)
Note that “a kernel cannot read from and write to the same image object”, which means you’ll need separate kernel invocations for 1 → 2 and 2 → 3. If you need separate kernel invocations anyway, I’d just swap the kernel arguments to the image objects. This is basically how I’m doing it (both output_image0 and output_image1 are created with CL_MEM_READ_WRITE, obviously):
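To illustrate the argument swap, here is a rough plain-C sketch of the ping-pong pattern. It is not my actual host code and makes no OpenCL calls: the float arrays stand in for the image objects, and run_pass, run_pipeline and the reversed-index gather are hypothetical names and logic chosen only for illustration. In real host code the swap would amount to re-binding the two cl_mem objects with clSetKernelArg before each enqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one kernel launch: reads only from `in`, writes only to `out`.
   The reversed-index gather is an illustrative scattered read, nothing more. */
static void run_pass(const float *in, float *out, size_t n)
{
    /* Mirrors the rule that one kernel may not read and write the same image. */
    assert(in != out);
    for (size_t i = 0; i < n; ++i)
        out[i] = in[n - 1 - i] + 1.0f;  /* scattered read, linear write */
}

/* Runs `passes` passes, alternating between two scratch arrays
   (hypothetical stand-ins for output_image0 and output_image1).
   Returns a pointer to whichever array holds the final result. */
static const float *run_pipeline(const float *b0, float *img0, float *img1,
                                 size_t n, int passes)
{
    const float *src = b0;   /* input of the next pass */
    float *dst = img0;       /* output of the next pass */
    float *spare = img1;     /* the other scratch array */

    for (int p = 0; p < passes; ++p) {
        run_pass(src, dst, n);
        /* Swap roles: what was just written becomes the next input. */
        float *written = dst;
        dst = spare;
        spare = written;
        src = written;
    }
    return src;
}
```

With two passes, the first one reads b0 and writes img0, the second reads img0 and writes img1, so there is no intermediate copy and no extra allocation beyond the two scratch arrays.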
And to answer your question: yes, I believe this is more efficient than “read-only image2d” + “memory copy on the device” + “extra allocation for the intermediate buffer”. I haven’t done any benchmarking though, probably because this solution seemed to me the most natural approach in terms of performance.