performance question

Hi, I have some questions about the performance of OpenCL kernels.
I’m trying to implement a kernel which operates on RGB 16-bit images.
RGB16 can’t be stored in a single image (RGBA16 is the only option) and I don’t want to waste memory on the A channel.
Hence I tried 2 variants for the input image:

  1. use 3 single-channel 16-bit images (clCreateImage2D) for R, G and B
  2. use 3 buffers (clCreateBuffer) with 16-bit values for R, G and B

The output buffer is a PBO.

My hardware is a Palit GeForce GTX 250 E-Green (512 MB GDDR3, core clock 675 MHz, memory clock 900 MHz). I’m using Gentoo Linux (nvidia-drivers version 256.35).
I’m testing my kernels with a 14 Mpx image (4608x3072).

I’m getting the following numbers (measured as the difference between 2 getlocaltime outputs, according to the pseudocode: clFinish(); getlocaltime(…); clEnqueueNDRange(…); clFinish(); getlocaltime(…);):

  1. 3 images variant
    local work size   time (sec):
    16x1 0.00864
    32x1 0.00661
    64x1 0.00429
    128x1 0.00586
    256x1 0.00591
    512x1 0.00603
    16x2 0.00476
    16x4 0.00370
    16x8 0.00350
    16x16 0.00375
    16x32 0.00528

  2. 3 buffers variant (1d buffers):
    local work size   time (sec):
    16 0.03752
    32 0.03556
    64 0.03097
    128 0.03215
    256 0.03336
    512 0.03556

Questions (see OpenCL code sources below):

  1. Why is 16x8 the fastest option?
  2. Why does the buffer variant run almost 10 times slower?
  3. Are these times normal?
  4. Can per-pixel computations for a 10+ Mpx image be done faster?

OpenCL sources:

helper function:

unsigned int convert_output(float r, float g, float b)
{
	return ((unsigned int)(r * 255.0) & 0x000000FF) |
	       (((unsigned int)(g * 255.0) << 8) & 0x0000FF00) |
	       (((unsigned int)(b * 255.0) << 16) & 0x00FF0000);
}

image variant:

__kernel void gammaCorrect(
	__read_only image2d_t srcR,
	__read_only image2d_t srcG,
	__read_only image2d_t srcB,
	__global unsigned int* dst, sampler_t sampler, float gamma)
{
	int2 pos = { get_global_id(0), get_global_id(1) };
	float4 r1 = read_imagef(srcR, sampler, pos);
	float4 g1 = read_imagef(srcG, sampler, pos);
	float4 b1 = read_imagef(srcB, sampler, pos);
	float r = native_powr(r1.x, gamma);
	float g = native_powr(g1.x, gamma);
	float b = native_powr(b1.x, gamma);
	dst[mul24(pos.y, get_global_size(0)) + pos.x] = convert_output(r, g, b);
}
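The host-side setup isn’t included above; roughly, the single-channel 16-bit images and the sampler are created like this (a sketch, not the exact code - the CL_R / CL_UNORM_INT16 format and the variable names here are illustrative assumptions):

cl_int err;
cl_image_format fmt = { CL_R, CL_UNORM_INT16 };  /* one normalized 16-bit channel */

/* one image per colour channel; read_imagef then returns values in [0, 1] */
cl_mem imgR = clCreateImage2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              &fmt, image_width, image_height, 0, inputR, &err);

/* sampler matching the kernel: unnormalized integer coordinates, nearest filtering */
cl_sampler sampler = clCreateSampler(context, CL_FALSE,
                                     CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST, &err);

/* the sampler is passed as a kernel argument like any other (index 4 in gammaCorrect) */
clSetKernelArg(opencl_kernel, 4, sizeof(cl_sampler), &sampler);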

buffer variant:

__kernel void gamma(
	__global ushort* srcR,
	__global ushort* srcG,
	__global ushort* srcB,
	__global unsigned int* dst, float gamma)
{
	float r = native_powr((float)srcR[get_global_id(0)] / 65536.0, gamma);
	float g = native_powr((float)srcG[get_global_id(0)] / 65536.0, gamma);
	float b = native_powr((float)srcB[get_global_id(0)] / 65536.0, gamma);
	dst[get_global_id(0)] = convert_output(r, g, b);
}

Well… I’ll try my best with these…

  1. The 16x8 relates to the internal architecture of your NVIDIA card; it is tied to the number of SP units ready to be used. This configuration actually creates about 4 warps, and I guess the cache is hit a lot here, which is why you see the performance gain. Keep in mind that the best caching in a gfx card is done in 2D…

  2. Buffers are linear, so they have to be traversed as such. Remember the cache issue from the last point? It also applies here. Traversing a 2D picture through a 1D buffer costs a lot of cache misses…

  3. Beats me… it seems like 16x8 gives a great result for a picture of this magnitude.

  4. There are always optimizations you can make. I can’t dive into your code right now, but you can certainly consider how your kernels map onto the architecture, local memory sharing, better cache utilization and much more. There is a paper by NVIDIA called “Fast_Texture_Transfers” which gives great tips for working with textures, plus other side tips. Look it up.

Good Luck.

E.

Thank you very much for the reply.
Hmm. So this means that even the best coalesced access to global memory is much slower than access to cached texture memory, right?
Am I correct in the following:
when processing those images, all the data is in GPU global memory and is cached in very fast texture memory?
(I use CL_MEM_COPY_HOST_PTR when creating buffers and images.)

Division by double is also expensive.

Where is there a division by double? It is just plain float division, am I wrong?

I tried to test without the division and the computation time with buffers was basically the same (the difference was within statistical error).

get_global_id(0)] / 65536.0

That’s division by a double. 65536.0 is a double, 65536.0f would be a float.

Then again, your card doesn’t support doubles and I’m pretty sure the compiler just transforms such declarations into floats.

Not that simple. If the data happens to be in the cache, it’s much faster. But the cache is so tiny, this will rarely be the case. Where the texture cache is really useful is making some “almost coalesced” access patterns behave as coalesced. Example: you would access global memory by global_array[get_global_id(0)+1] which is almost coalesced (work-items read consecutive addresses, nicely ordered, etc.) but breaks the alignment rule. When you use texture instead of a global array, the cache will be able to hide this little misalignment.
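To make that pattern concrete, here is a tiny illustrative kernel (not from the code above, and assuming src holds one extra element so the last read stays in bounds):

__kernel void shifted_copy(__global const float* src, __global float* dst)
{
	int i = get_global_id(0);
	/* Consecutive work-items read consecutive addresses, nicely ordered,
	   but shifted by one element, so the load starts off-alignment and
	   can't be fully coalesced on this hardware. The same reads done
	   through an image go via the texture cache, which hides the
	   misalignment. */
	dst[i] = src[i + 1];
}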

However, this won’t be faster than a coalesced access to global memory. Making a coalesced access is the fastest way to get lots of stuff to/from the card’s RAM. Images/textures are also physically stored in RAM and the cache won’t magically increase the bandwidth, as I’m sure you realize. It would decrease memory workload if enough requests could be handled within the cache, but it’s just too small. What you’re doing in these kernels is completely bandwidth bound anyway with no re-use of image memory. You should be seeing no benefit from images there.

Here’s an idea. Your image is 4608x3072, which gives 14155776 elements. If you use 1d NDRanges, as you seem to since your addressing is by get_global_id(0), you launch either

16 threads per block, 884736 blocks
32 threads, 442368 blocks
64 threads, 221184 blocks
128 threads, 110592 blocks
256 threads, 55296 blocks
512 threads, 27648 blocks

The maximum number of blocks you can have in any single dimension is 65535, meaning anything less than 256 threads per block is technically illegal (shouldn’t launch). You’re busting the dimension range.
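One way around that, sketched here with assumed variable names and not tested against your code, is to launch the buffer kernel over a 2D NDRange so no single dimension needs more than 65535 blocks:

size_t globalWorkSize2D[2] = { 4608, 3072 };  /* image width x height */
size_t localWorkSize2D[2]  = { 16, 8 };       /* 288 x 384 = 110592 blocks total, but only 288 and 384 per dimension */
err = clEnqueueNDRangeKernel(opencl_context->command_queue, opencl_kernel, 2, NULL,
                             globalWorkSize2D, localWorkSize2D, 0, NULL, NULL);
/* The kernel would then have to index with
   get_global_id(1) * get_global_size(0) + get_global_id(0)
   instead of plain get_global_id(0), like the image variant already does. */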

Do you check for errors?

Do you launch the kernels in blocking mode?

Hmmm, I thought the compiler should be clever enough not to use doubles here. The result is a float and the other operand is also a float.

Anyway, it’s not the reason. I tried both removing the division and adding f suffixes, and the time was the same.

Yes, I execute the kernel in blocking mode.

I put the image into the buffer like this:

clEnqueueWriteBuffer(opencl_context->command_queue, opencl_inputR, CL_TRUE, 0, image_width * image_height * sizeof(uint16_t), inputR, 0, NULL, NULL);

This is done once.

Then I launch the kernel many times (gamma changes each time via keyboard +/-).

This is how I launch the kernel:

clFinish(opencl_context->command_queue);

usec_timer_reset(compute_timer);

if ((err = clEnqueueNDRangeKernel(opencl_context->command_queue, opencl_kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}

clFinish(opencl_context->command_queue);

compute_time = usec_timer_get(compute_timer);
Also, I check for errors and it works.

Also, I can see the output displayed correctly on the screen (it lightens and darkens when gamma changes).

Hence I believe it works.

You might also consider calling clCreateBuffer with CL_MEM_COPY_HOST_PTR or CL_MEM_USE_HOST_PTR so you don’t need to enqueue the write buffer / read buffer. This should be faster. See: http://www.khronos.org/registry/cl/sdk/1.0…eateBuffer.html
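A minimal sketch of what that could look like for one channel (assuming a context handle named context and the same sizes as above):

cl_int err;
cl_mem opencl_inputR = clCreateBuffer(context,
                                      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      image_width * image_height * sizeof(uint16_t),
                                      inputR, &err);
/* The buffer is initialized from inputR at creation time, so the separate
   clEnqueueWriteBuffer call is no longer needed. */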

I recommend using an OpenCL event here and asking OpenCL for the profiling info; that gives more precise information on the time the GPU actually took to calculate your results.

http://www.khronos.org/registry/cl/sdk/1.0…filingInfo.html and http://www.khronos.org/registry/cl/sdk/1.0…angeKernel.html have more information.

Basically, you want something more like:

compute_time = -1; // error condition

cl_event event; // Don't forget to release the event later, or bad things _will_ happen...

cl_ulong start_time = 0, end_time = 0;

if ((err = clEnqueueNDRangeKernel(opencl_context->command_queue, opencl_kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, &event)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}

// The kernel must have completed before the profiling counters are valid.
clFinish(opencl_context->command_queue);

if ((err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start_time, NULL)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}
else if ((err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end_time, NULL)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}

compute_time = end_time - start_time; // in nanoseconds

Note that you will need to add the flag CL_QUEUE_PROFILING_ENABLE to your call to clCreateCommandQueue ( http://www.khronos.org/registry/cl/sdk/1.0…mmandQueue.html ).
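For example (a sketch, assuming handles named context and device):

cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, &err);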

(side note)

Some of the OpenCL APIs that I am suggesting you use do not work equally across all OpenCL vendors’ implementations, platforms, or device drivers. The list of OpenCL vendors is growing, and it is not cheap or feasible for everyone to test their code on all combinations of OS, device, and driver. I have personally found NVIDIA’s OpenCL implementation to be the best of those I have tested, but there are still differences in the bugs across driver versions and devices. Make sure that your drivers are up to date and that you test on older drivers, too. Sometimes driver updates break code - using beta drivers or the latest drivers before your end users / “customers” do is the best way to safeguard your program from being broken.

(/side note)

Good luck!

-Mike

I tried this too. I think it shouldn’t matter, since I copy the data to the buffer once and then use it many times.

What I don’t understand is why bandwidth can be limiting here, if all the data is on the GPU.

And if it’s not the bandwidth - why are buffers 10 times slower than images?

This measurement produces the same numbers.

I used the latest beta drivers available at the moment.

Now there is an update, so I’ll try it to see if it helps.

I’m implementing a proof-of-concept prototype now.

Probably I’m going to stick with the images approach, but I’m not sure it will be the same on other platforms.

Currently I only have access to the NVIDIA card I mentioned in the first message.

Anyway, thanks for the help.

It looks like the NVIDIA OpenCL implementation transfers all the data from host to GPU and back.

With a measured bandwidth of around 2.2 GB/s, copying the data from host to GPU should take roughly the time I got with buffers.
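(As a rough check: 4608 x 3072 pixels x 3 channels x 2 bytes is about 85 MB of input; at ~2.2 GB/s that is roughly 39 ms, which is in the same range as the ~31 to 37 ms I measured with buffers.)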

I also tried the following:

  1. create input buffers with CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR
  2. create output buffers with CL_MEM_READ_WRITE
  3. run a kernel which just copies data from the input buffers to the output buffers
  4. run the old kernel which takes data from the output buffers, applies the filter to it and copies it to the PBO as before

The kernel which copies from input to output:

__kernel void copy(
	__global ushort* srcR,
	__global ushort* srcG,
	__global ushort* srcB,
	__global ushort* dstR,
	__global ushort* dstG,
	__global ushort* dstB)
{
	dstR[get_global_id(0)] = srcR[get_global_id(0)];
	dstG[get_global_id(0)] = srcG[get_global_id(0)];
	dstB[get_global_id(0)] = srcB[get_global_id(0)];
}

and the result was the same, as if the output buffers were synced back to the host after the first kernel and then synced back to the GPU for the second kernel.

Is there any way to find out whether this copy happens?