performance question

Hi, I have some questions about the performance of OpenCL kernels.
I’m trying to implement a kernel which operates on RGB 16-bit images.
RGB16 can’t be stored in a single image (RGBA16 is the only option) and I don’t want to waste memory on the A channel.
Hence I tried 2 variants for the input image:

  1. use 3 single-channel 16-bit images (clCreateImage2D) for R, G and B
  2. use 3 buffers (clCreateBuffer) with 16-bit values for R, G and B

The output buffer is a PBO.

My hardware is a Palit GeForce GTX 250 E-Green (512 MB GDDR3, core clock 675 MHz, memory clock 900 MHz). I’m using Gentoo Linux (nvidia-drivers version 256.35).
I’m testing my kernels with a 14 Mpx image (4608x3072).

I’m getting the following numbers (measured as the difference between 2 getlocaltime outputs, according to the pseudocode: clFinish(); getlocaltime(…); clEnqueueNDRange(…); clFinish(); getlocaltime(…);):

  1. 3 images variant
    local work size   time (sec):
    16x1 0.00864
    32x1 0.00661
    64x1 0.00429
    128x1 0.00586
    256x1 0.00591
    512x1 0.00603
    16x2 0.00476
    16x4 0.00370
    16x8 0.00350
    16x16 0.00375
    16x32 0.00528

  2. 3 buffers variant (1d buffers):
    local work size   time (sec):
    16 0.03752
    32 0.03556
    64 0.03097
    128 0.03215
    256 0.03336
    512 0.03556

Questions (see OpenCL code sources below):

  1. Why is 16x8 the fastest option?
  2. Why does the buffer variant run almost 10 times slower?
  3. Are these times normal?
  4. Can per-pixel computations for a 10+ Mpx image be done faster?

OpenCL sources:

helper function:

unsigned int convert_output(float r, float g, float b)
{
	return ((unsigned int)(r * 255.0) & 0x000000FF) |
	       (((unsigned int)(g * 255.0) << 8) & 0x0000FF00) |
	       (((unsigned int)(b * 255.0) << 16) & 0x00FF0000);
}

image variant:

__kernel void gammaCorrect(
	__read_only image2d_t srcR,
	__read_only image2d_t srcG,
	__read_only image2d_t srcB,
	__global unsigned int* dst, sampler_t sampler, float gamma)
{
	int2 pos = { get_global_id(0), get_global_id(1) };
	float4 r1 = read_imagef(srcR, sampler, pos);
	float4 g1 = read_imagef(srcG, sampler, pos);
	float4 b1 = read_imagef(srcB, sampler, pos);
	float r = native_powr(r1.x, gamma);
	float g = native_powr(g1.x, gamma);
	float b = native_powr(b1.x, gamma);
	dst[mul24(pos.y, get_global_size(0)) + pos.x] = convert_output(r, g, b);
}
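The host-side setup isn’t included above; roughly, the single-channel 16-bit images and the sampler are created like this (a sketch, not the exact code - the CL_R / CL_UNORM_INT16 format and the variable names here are illustrative assumptions):

cl_int err;
cl_image_format fmt = { CL_R, CL_UNORM_INT16 };  /* one normalized 16-bit channel */

/* one image per colour channel; read_imagef then returns values in [0, 1] */
cl_mem imgR = clCreateImage2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              &fmt, image_width, image_height, 0, inputR, &err);

/* sampler matching the kernel: unnormalized integer coordinates, nearest filtering */
cl_sampler sampler = clCreateSampler(context, CL_FALSE,
                                     CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST, &err);

/* the sampler is passed as a kernel argument like any other (index 4 in gammaCorrect) */
clSetKernelArg(opencl_kernel, 4, sizeof(cl_sampler), &sampler);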

buffer variant:

__kernel void gamma(
	__global ushort* srcR,
	__global ushort* srcG,
	__global ushort* srcB,
	__global unsigned int* dst, float gamma)
{
	float r = native_powr((float)srcR[get_global_id(0)] / 65536.0, gamma);
	float g = native_powr((float)srcG[get_global_id(0)] / 65536.0, gamma);
	float b = native_powr((float)srcB[get_global_id(0)] / 65536.0, gamma);
	dst[get_global_id(0)] = convert_output(r, g, b);
}

Well… I’ll try my best with these…

  1. The 16x8 relates to the internal architecture of your NVIDIA card; it is tied to the number of SP units ready to be used. This configuration actually creates about 4 warps, and I guess the cache is hit a lot here, which is why you see the performance gain. Keep in mind that the best caching in a gfx card is done in 2D…

  2. Buffers are linear, so they have to be traversed as such. Remember the cache issue from the last point? It also applies here. Traversing a 2D picture through a 1D buffer costs a lot of cache misses…

  3. Beats me… it seems like 16x8 gives a great result for a picture of this magnitude.

  4. There are always optimizations you can make. I can’t dive into your code right now, but you can certainly consider how your kernels map onto the architecture, local memory sharing, better cache utilization and much more. There is a paper by NVIDIA called “Fast_Texture_Transfers” which gives great tips for working with textures, plus other side tips. Look it up.

Good Luck.

E.

Thank you very much for the reply.
Hmm. So this means that even the best coalesced access to global memory is much slower than access to cached texture memory, right?
Am I correct in the following:
when processing those images, all the data is in GPU global memory and is cached in very fast texture memory?
(I use CL_MEM_COPY_HOST_PTR when creating buffers and images.)

Division by double is also expensive.

Where is there a division by double? It is just plain float division, am I wrong?

I tried to test without the division and the computation time with buffers was basically the same (the difference was within statistical error).

get_global_id(0)] / 65536.0

That’s division by a double. 65536.0 is a double, 65536.0f would be a float.

Then again, your card doesn’t support doubles and I’m pretty sure the compiler just transforms such declarations into floats.

Not that simple. If the data happens to be in the cache, it’s much faster. But the cache is so tiny, this will rarely be the case. Where the texture cache is really useful is making some “almost coalesced” access patterns behave as coalesced. Example: you would access global memory by global_array[get_global_id(0)+1] which is almost coalesced (work-items read consecutive addresses, nicely ordered, etc.) but breaks the alignment rule. When you use texture instead of a global array, the cache will be able to hide this little misalignment.
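To make that pattern concrete, here is a tiny illustrative kernel (not from the code above, and assuming src holds one extra element so the last read stays in bounds):

__kernel void shifted_copy(__global const float* src, __global float* dst)
{
	int i = get_global_id(0);
	/* Consecutive work-items read consecutive addresses, nicely ordered,
	   but shifted by one element, so the load starts off-alignment and
	   can't be fully coalesced on this hardware. The same reads done
	   through an image go via the texture cache, which hides the
	   misalignment. */
	dst[i] = src[i + 1];
}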

However, this won’t be faster than a coalesced access to global memory. Making a coalesced access is the fastest way to get lots of stuff to/from the card’s RAM. Images/textures are also physically stored in RAM and the cache won’t magically increase the bandwidth, as I’m sure you realize. It would decrease memory workload if enough requests could be handled within the cache, but it’s just too small. What you’re doing in these kernels is completely bandwidth bound anyway with no re-use of image memory. You should be seeing no benefit from images there.

Here’s an idea. Your image is 4608x3072, which gives 14155776 elements. If you use 1d NDRanges, as you seem to since your addressing is by get_global_id(0), you launch either

16 threads per block, 884736 blocks
32 threads, 442368 blocks
64 threads, 221184 blocks
128 threads, 110592 blocks
256 threads, 55296 blocks
512 threads, 27648 blocks

The maximum number of blocks you can have in any single dimension is 65535, meaning anything less than 256 threads per block is technically illegal (shouldn’t launch). You’re busting the dimension range.
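One way around that, sketched here with assumed variable names and not tested against your code, is to launch the buffer kernel over a 2D NDRange so no single dimension needs more than 65535 blocks:

size_t globalWorkSize2D[2] = { 4608, 3072 };  /* image width x height */
size_t localWorkSize2D[2]  = { 16, 8 };       /* 288 x 384 = 110592 blocks total, but only 288 and 384 per dimension */
err = clEnqueueNDRangeKernel(opencl_context->command_queue, opencl_kernel, 2, NULL,
                             globalWorkSize2D, localWorkSize2D, 0, NULL, NULL);
/* The kernel would then have to index with
   get_global_id(1) * get_global_size(0) + get_global_id(0)
   instead of plain get_global_id(0), like the image variant already does. */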

Do you check for errors?

Do you launch the kernels in blocking mode?

Hmmm, I thought the compiler should be clever enough not to use doubles here. The result is a float and the other operand is also a float.

Anyway, it’s not the reason. I tried both removing the division and adding f suffixes, and the time was the same.

Yes, I execute the kernel in blocking mode.

I put the image into the buffer like this:

clEnqueueWriteBuffer(opencl_context->command_queue, opencl_inputR, CL_TRUE, 0, image_width * image_height * sizeof(uint16_t), inputR, 0, NULL, NULL);

This is done once.

Then I launch the kernel many times (gamma changes each time via keyboard +/-).

This is how I launch the kernel:

clFinish(opencl_context->command_queue);

usec_timer_reset(compute_timer);

if ((err = clEnqueueNDRangeKernel(opencl_context->command_queue, opencl_kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}

clFinish(opencl_context->command_queue);

compute_time = usec_timer_get(compute_timer);
Also, I check for errors and it works.

Also, I can see the output displayed correctly on the screen (it lightens and darkens when gamma changes).

Hence I believe it works.

You might also consider calling clCreateBuffer with CL_MEM_COPY_HOST_PTR or CL_MEM_USE_HOST_PTR so you don’t need to enqueue the write buffer / read buffer. This should be faster. See: http://www.khronos.org/registry/cl/sdk/1.0…eateBuffer.html
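A minimal sketch of what that could look like for one channel (assuming a context handle named context and the same sizes as above):

cl_int err;
cl_mem opencl_inputR = clCreateBuffer(context,
                                      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      image_width * image_height * sizeof(uint16_t),
                                      inputR, &err);
/* The buffer is initialized from inputR at creation time, so the separate
   clEnqueueWriteBuffer call is no longer needed. */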

I recommend using an OpenCL event here and asking OpenCL for the profiling info; that gives more precise information on the time the GPU actually took to calculate your results.

http://www.khronos.org/registry/cl/sdk/1.0…filingInfo.html and http://www.khronos.org/registry/cl/sdk/1.0…angeKernel.html have more information.

Basically, you want something more like:

compute_time = -1; // error condition

cl_event event; // Don't forget to release the event later, or bad things _will_ happen...

cl_ulong start_time = 0, end_time = 0;

if ((err = clEnqueueNDRangeKernel(opencl_context->command_queue, opencl_kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, &event)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}

// The kernel must have completed before the profiling counters are valid.
clFinish(opencl_context->command_queue);

if ((err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start_time, NULL)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}
else if ((err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end_time, NULL)) != CL_SUCCESS)
{
	printf("ERROR: %d\n", err);
	cleanup();
}

compute_time = end_time - start_time; // in nanoseconds

Note that you will need to add the flag CL_QUEUE_PROFILING_ENABLE to your call to clCreateCommandQueue ( http://www.khronos.org/registry/cl/sdk/1.0…mmandQueue.html ).
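For example (a sketch, assuming handles named context and device):

cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, &err);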

(side note)

Some of the OpenCL APIs that I am suggesting you use do not work equally across all OpenCL vendors’ implementations, platforms, or device drivers. The list of OpenCL vendors is growing, and it is not cheap or feasible for everyone to test their code on all combinations of OS, device, and driver. I have personally found NVIDIA’s OpenCL implementation to be the best of those I have tested, but there are still differences in the bugs across driver versions and devices. Make sure that your drivers are up to date and that you test on older drivers, too. Sometimes driver updates break code - using beta drivers or the latest drivers before your end users / “customers” do is the best way to safeguard your program from being broken.

(/side note)

Good luck!

-Mike

I tried this too. I think it shouldn’t matter, since I copy the data to the buffer once and then use it many times.

What I don’t understand is why bandwidth can be limiting here, if all the data is on the GPU.

And if it’s not the bandwidth - why are buffers 10 times slower than images?

This measurement produces the same numbers.

I used the latest beta drivers available at the moment.

Now there is an update, so I’ll try it to see if it helps.

I’m implementing a proof-of-concept prototype now.

Probably I’m going to stick with the images approach, but I’m not sure it will be the same on other platforms.

Currently I only have access to the NVIDIA card I mentioned in the first message.

Anyway, thanks for the help.

It looks like the NVIDIA OpenCL implementation transfers all the data from host to GPU and back.

With a measured bandwidth of around 2.2 GB/s, copying the data from host to GPU should take roughly the time I got with buffers.
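(As a rough check: 4608 x 3072 pixels x 3 channels x 2 bytes is about 85 MB of input; at ~2.2 GB/s that is roughly 39 ms, which is in the same range as the ~31 to 37 ms I measured with buffers.)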

I also tried the following:

  1. create input buffers with CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR
  2. create output buffers with CL_MEM_READ_WRITE
  3. run a kernel which just copies data from the input buffers to the output buffers
  4. run the old kernel which takes data from the output buffers, applies the filter to it and copies it to the PBO as before

The kernel which copies from input to output:

__kernel void copy(
	__global ushort* srcR,
	__global ushort* srcG,
	__global ushort* srcB,
	__global ushort* dstR,
	__global ushort* dstG,
	__global ushort* dstB)
{
	dstR[get_global_id(0)] = srcR[get_global_id(0)];
	dstG[get_global_id(0)] = srcG[get_global_id(0)];
	dstB[get_global_id(0)] = srcB[get_global_id(0)];
}

and the result was the same, as if the output buffers were synced back to the host after the first kernel and then synced back to the GPU for the second kernel.

Is there any way to find out whether this copy happens?