Hi, I have some questions about the performance of OpenCL kernels.
I’m trying to implement a kernel which operates on 16-bit RGB images.
RGB16 can’t be stored in a single image (RGBA16 is the only option), and I don’t want to waste memory on an alpha channel.
Hence I tried 2 variants for the input image (a rough host-side setup sketch follows the list):
- use 3 images (clCreateImage2D), each with a single 16-bit channel, for R, G and B
- use 3 buffers (clCreateBuffer) with 16-bit values for R, G and B
The output buffer is a PBO.
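Roughly, the host-side setup looks like this (simplified sketch, not my exact code: ctx, width, height, hostR and pbo are placeholders, the CL_R/CL_UNORM_INT16 format is what I assume gives normalized floats from read_imagef, and error checking is omitted):

/* image variant: one single-channel 16-bit normalized image per color plane */
cl_image_format fmt = { CL_R, CL_UNORM_INT16 };
cl_mem srcR = clCreateImage2D(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              &fmt, width, height, 0, hostR, &err);
/* srcG and srcB are created the same way */

/* buffer variant: one plain ushort buffer per color plane */
cl_mem bufR = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             width * height * sizeof(cl_ushort), hostR, &err);
/* bufG and bufB are created the same way */

/* output: the GL PBO wrapped as a CL buffer
   (clEnqueueAcquireGLObjects is called before the kernel runs) */
cl_mem dst = clCreateFromGLBuffer(ctx, CL_MEM_WRITE_ONLY, pbo, &err);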
My hardware is a Palit GeForce GTX 250 E-Green (512 MB GDDR3, core clock 675 MHz, memory clock 900 MHz). I’m using Gentoo Linux (nvidia-drivers version 256.35).
I’m testing my kernels with a 14 MPx image (4608x3072).
I’m getting the following numbers, measured as the difference between two getlocaltime outputs, according to this pseudocode: "clFinish(); getlocaltime(…); clEnqueueNDRangeKernel(…); clFinish(); getlocaltime(…);" (a fuller sketch of this timing loop is below).
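A minimal sketch of the timing loop, using gettimeofday() as the wall-clock source in place of my "getlocaltime" above (queue and kernel are placeholders, the local size shown is just one of the tested values, and error checking is omitted):

#include <sys/time.h>

size_t global[2] = { 4608, 3072 };   /* image dimensions */
size_t local[2]  = { 16, 8 };        /* one of the tested local work sizes */
struct timeval t0, t1;

clFinish(queue);                     /* drain the queue before timing */
gettimeofday(&t0, NULL);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
clFinish(queue);                     /* wait for the kernel to complete */
gettimeofday(&t1, NULL);

double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;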
3 images variant:
local work size   time (sec)
16x1 0.00864
32x1 0.00661
64x1 0.00429
128x1 0.00586
256x1 0.00591
512x1 0.00603
16x2 0.00476
16x4 0.00370
16x8 0.00350
16x16 0.00375
16x32 0.00528
3 buffers variant (1D buffers):
local work size   time (sec)
16 0.03752
32 0.03556
64 0.03097
128 0.03215
256 0.03336
512 0.03556
Questions (see the OpenCL sources below):
- Why is 16x8 the fastest local work size?
- Why does the buffer variant run almost 10 times slower?
- Are these times normal?
- Can per-pixel computations on a 10+ MPx image be done faster?
OpenCL sources:
Helper function:
/* Pack three normalized [0,1] floats into bits 0-7, 8-15 and 16-23
   of a 32-bit word (the remaining byte is left at 0). */
unsigned int convert_output(float r, float g, float b)
{
    return ((unsigned int)(r * 255.0f) & 0x000000FF) |
           (((unsigned int)(g * 255.0f) << 8) & 0x0000FF00) |
           (((unsigned int)(b * 255.0f) << 16) & 0x00FF0000);
}
Image variant:
__kernel void gammaCorrect(
    __read_only image2d_t srcR,
    __read_only image2d_t srcG,
    __read_only image2d_t srcB,
    __global unsigned int* dst, sampler_t sampler, float gamma)
{
    /* one work-item per pixel; global size = image size */
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    /* single-channel images: only the .x component carries data */
    float4 r1 = read_imagef(srcR, sampler, pos);
    float4 g1 = read_imagef(srcG, sampler, pos);
    float4 b1 = read_imagef(srcB, sampler, pos);
    float r = native_powr(r1.x, gamma);
    float g = native_powr(g1.x, gamma);
    float b = native_powr(b1.x, gamma);
    /* write the packed 8-bit result into the PBO */
    dst[mul24(pos.y, get_global_size(0)) + pos.x] = convert_output(r, g, b);
}
Buffer variant:
__kernel void gamma(
    __global ushort* srcR,
    __global ushort* srcG,
    __global ushort* srcB,
    __global unsigned int* dst, float gamma)
{
    /* one work-item per pixel; 1D indexing */
    int i = get_global_id(0);
    /* normalize 16-bit values to [0,1]; 65535 is the ushort maximum */
    float r = native_powr((float)srcR[i] / 65535.0f, gamma);
    float g = native_powr((float)srcG[i] / 65535.0f, gamma);
    float b = native_powr((float)srcB[i] / 65535.0f, gamma);
    dst[i] = convert_output(r, g, b);
}