Question on performance of image structures

Hi,

I have a simple question for you; it’s only about performance.

I have a buffer b0 that I have to read (with bank conflicts) to generate another buffer b1 (coalesced writes).
Then I read the buffer b1 (again with bank conflicts) to generate b2 (coalesced writes).
I am trying to achieve maximum performance (like everybody, I guess), so I use image2d, which is supposed to be penalized less when bank conflicts appear.
What I do is:

  1. Put b0 in a “read only” image2d.
  2. Generate b1 (buffer) from b0.
  3. Copy the buffer b1 into a “read only” image2d b1_bis.
  4. Generate b2 (buffer) from b1_bis.

I don’t like the copy part and I was thinking about this pseudo-code:

  1. Put b0 in a “read only” image2d.
  2. Generate b1 (read-write image2d) from b0.
  3. Generate b2 (buffer) from b1.

Do you think the performance of a read-write image2d is lower or higher than that of “read-only image2d” + “memory copy on the device” + “extra allocation for the intermediate buffer”?
I guess I should try it to be sure.
But if somebody tells me that a read-write image has the same performance as a buffer for coalesced writes and the same performance as a read-only image for reads, I already have my answer :)
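
A minimal way to try both variants is to time them on the host. This is only a sketch: the helper name `average_ms` is hypothetical, and it assumes each variant is wrapped in a callable that blocks until the device has finished (for example by calling `finish()` on the command queue), otherwise only the enqueue cost is measured.

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time in milliseconds over `reps` calls to `run`.
// Assumption: `run` does not return until the device work is complete.
double average_ms(const std::function<void()>& run, int reps) {
    if (reps <= 0) {
        return 0.0;  // nothing to measure
    }
    using clock = std::chrono::steady_clock;
    run();  // warm-up call, not included in the timing
    const auto start = clock::now();
    for (int i = 0; i < reps; ++i) {
        run();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / reps;
}
```

Calling it once with the “copy into b1_bis” pipeline and once with the read-write image pipeline, with the same `reps`, gives two numbers that can be compared directly.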

Thanks,

Vincent

Note that “a kernel cannot read from and write to the same image object”, which means you’ll need separate kernel invocations for 1 -> 2 and 2 -> 3. If you need separate kernel invocations anyway, I’d just swap the image objects passed as kernel arguments between invocations. This is basically how I’m doing it (both output_image0 and output_image1 are created with CL_MEM_READ_WRITE, obviously):

cl::Image2D* accum_image=NULL;
cl::Image2D* output_image=NULL;
cl::NDRange range_global(input_dims.i,input_dims.j);
cl::NDRange range_local(portWorkSize.getValue(0),portWorkSize.getValue(1));
for (int slice=0;slice<input_dims.k;++slice) {
    // Swap roles every slice: the image written in the previous pass is read in this one.
    if ((slice&1)==0) {
        accum_image=&output_image0;
        output_image=&output_image1;
    }
    else {
        accum_image=&output_image1;
        output_image=&output_image0;
    }
    kernel.setArg(0,slice);
    kernel.setArg(2,*accum_image);
    kernel.setArg(3,*output_image);
    // Enqueue the kernel and block until this slice has finished.
    cl::KernelFunctor func=kernel.bind(*m_queue,range_global,range_local);
    result=func().wait();
}

And to answer your question: yes, I believe this is more efficient than “read-only image2d” + “memory copy on the device” + “extra allocation for the intermediate buffer”. I haven’t done any benchmarking though, probably because this seemed to me the most natural approach in terms of performance.
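
The alternation in the loop above can be checked without a device. The following is a minimal host-only sketch, assuming plain `std::vector<float>` arrays stand in for the two image objects and a trivial function stands in for the kernel; `Image`, `run_slice` and `ping_pong` are hypothetical names, not part of OpenCL.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an image object: a flat array of floats.
using Image = std::vector<float>;

// Stand-in for one kernel launch: reads only from `accum` and writes only
// to `output`, so no "image" is both read and written in the same pass.
void run_slice(int slice, const Image& accum, Image& output) {
    assert(accum.size() == output.size());
    for (std::size_t i = 0; i < accum.size(); ++i) {
        output[i] = accum[i] + static_cast<float>(slice);
    }
}

// Runs `slices` passes, alternating the two images like the loop above:
// even slices read image 0 and write image 1, odd slices do the reverse.
// Returns the index (0 or 1) of the image that holds the final result.
int ping_pong(std::array<Image, 2>& images, int slices) {
    int last_written = 0;  // with zero slices, image 0 still holds the input
    for (int slice = 0; slice < slices; ++slice) {
        const int read_idx = slice & 1;
        const int write_idx = read_idx ^ 1;
        run_slice(slice, images[read_idx], images[write_idx]);
        last_written = write_idx;
    }
    return last_written;
}
```

One thing the sketch makes visible is that the image holding the final result depends on the parity of the slice count: an odd number of slices ends in image 1, an even number in image 0, so the host has to read back from the right one.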