SSE in OpenCL: where can I get SSE commands for OpenCL?

Hi, I recently wrote an OpenCL kernel that made a large number of global memory accesses. After I used local memory to reduce the global memory accesses, the speed increased a lot.
Is there any other technique that can increase the speed? Using private memory actually slowed my example down considerably.

I am planning to speed it up much more. Can I get NVIDIA loop-unrolling examples anywhere?

And how do I handle float4 values in if conditions? For example:

float4 fillter4 = (float4)(0.0f, 1.0f, 2.0f, 3.0f);
if(fillter?? <5)
{
//operation
}

Please help me with this.
Thanks in advance.
Best Regards
Megharaj.

I think you have a few misconceptions, which I will try to clear up (but please read an invisible ‘AFAIK’ before every line):

  • SSE is not a GPU instruction set; it is an extension of the x86 instruction set. Although GPUs are SIMD-like architectures at the multiprocessor level, vector operations (those that operate on float4, for example) are not to be confused with SIMD instructions. In reality, vector operations are serialized at the ALU level, as NV cards have scalar ALUs. The reason vectors can still bring a speedup in your application is that you might get a better FETCH/ALU ratio by moving more data into registers at once, making it easier to hide latencies, or that your data accesses become coalesced by using vector types.

  • Using a vector directly in a conditional is not possible, and for a very good reason: you cannot compare apples and oranges. How do you imagine a comparison between a scalar and a vector would be done? Based on the first element? On the length of the vector? On the sum of its elements? Or would it compare all elements separately and branch 4 times according to the 4 comparisons? In reality, when the kernel encounters such a comparison between a scalar and a vector, the scalar is widened to the length of the vector by filling all of x, y, z, w with the original value, and the comparison is then done component-wise, conceptually result = (a.x < 5, a.y < 5, a.z < 5, a.w < 5), where the result of comparing a float4 is an int4 rather than a float4. (Keep in mind that a true comparison in a vector operation yields -1, as opposed to 1 for scalar logical operators!) This vector result, however, cannot be used for branching as it is. Only scalar values can be used inside if( value ){…}, so the result first has to be reduced to a scalar, for example with the built-in any() or all().
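
To make the component-wise behaviour concrete, here is a minimal plain-C sketch that emulates it on the host. The types float4_t and int4_t and the helpers less_than, any4, all4, select4 and demo_compare are hypothetical stand-ins written for this illustration; they are not OpenCL API. In a real kernel the comparison would be written fillter4 < 5.0f, and the built-ins any(), all() and select() would play the roles emulated here.

```c
#include <assert.h>

/* Hypothetical host-side stand-ins for OpenCL's float4 and int4. */
typedef struct { float x, y, z, w; } float4_t;
typedef struct { int x, y, z, w; } int4_t;

/* Emulates `v < s` in OpenCL C: the scalar is widened to all four lanes
   and each lane yields -1 (all bits set) if true, 0 if false. */
static int4_t less_than(float4_t v, float s)
{
    int4_t r = { v.x < s ? -1 : 0, v.y < s ? -1 : 0,
                 v.z < s ? -1 : 0, v.w < s ? -1 : 0 };
    return r;
}

/* Emulates any(): 1 if the sign bit of at least one lane is set. */
static int any4(int4_t m)
{
    return (m.x < 0) || (m.y < 0) || (m.z < 0) || (m.w < 0);
}

/* Emulates all(): 1 only if the sign bit of every lane is set. */
static int all4(int4_t m)
{
    return (m.x < 0) && (m.y < 0) && (m.z < 0) && (m.w < 0);
}

/* Emulates select(a, b, mask): lane-wise, takes b where the mask lane's
   sign bit is set and a elsewhere, so no branch is needed. */
static float4_t select4(float4_t a, float4_t b, int4_t m)
{
    float4_t r = { m.x < 0 ? b.x : a.x, m.y < 0 ? b.y : a.y,
                   m.z < 0 ? b.z : a.z, m.w < 0 ? b.w : a.w };
    return r;
}

/* Usage sketch with the values from the question: (0,1,2,3) against 5. */
static void demo_compare(void)
{
    float4_t f = { 0.0f, 1.0f, 2.0f, 3.0f };
    assert(all4(less_than(f, 5.0f)));   /* every lane is below 5 */
    assert(!any4(less_than(f, 0.0f)));  /* no lane is below 0 */
}
```

Which of the two reductions you want depends on the intent: any() runs the if-body when at least one lane passes, all() only when every lane passes. When each lane needs its own outcome, a select()-style blend avoids the branch altogether.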

Hope that clears it up.

Thanks a lot for the reply.

As discussed in our previous conversation, we planned to try float4 to increase the speed, because:

• One thread calculates 4 pixels at a time.

• Coalesced access through vectors.

So I tried a simple example, a vector add:


Case 1:

Global_worksize = 788432;

__kernel void trying(__global float *trying_vector_a, __global float *trying_vector_b, __global float *trying_vector_c)
{
    size_t n = get_global_id(0);
    trying_vector_c[n] = trying_vector_a[n] + trying_vector_b[n];
}

Running the kernel 100 times takes 0.2 seconds on my device.

Case 2:

Global_worksize = 788432 / 4;

__kernel void trying(__global float *trying_vector_a, __global float *trying_vector_b, __global float *trying_vector_c)
{
    float4 result;
    size_t n = get_global_id(0);
    result = vload4(n, trying_vector_a) + vload4(n, trying_vector_b);
    vstore4(result, n, trying_vector_c);
}

Running the kernel 100 times takes 0.5 seconds on my device.

The output is correct, but it takes more time. What might be the reason?

Best Regards

Megharaj.

First of all, these two kernels are heavily dominated by __global latency, not even bandwidth. The amount of memory moved is minuscule, and so is the number of operations, so the runtime depends on kernel launch overhead and on fetching data from __global. Second, you lose all the benefits of using float4 by not grouping the fetch operations together, and the ALU operations likewise. You fetch a float4, and when you go to add it to something you issue another fetch, whereas you could have grouped the fetches at the beginning of the kernel, causing a stall only once.

Most of these things are covered in NV OpenCL Programming Guide and AMD APP OpenCL Programming Guide. I suggest reading some of those as they are very useful for getting the bigger picture.
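
One rough way to check a claim like this yourself is to turn the measured times into an effective bandwidth and compare it with the peak figure in your device's data sheet. The plain-C sketch below does that arithmetic for the two cases in this thread. It assumes that each launch reads two arrays and writes one array of 788432 floats, and that the reported times cover all 100 launches. The function name effective_bandwidth_gbs is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes)
   for `launches` kernel runs that each touch `arrays` buffers of
   `elements` floats, taking `seconds` in total. */
static double effective_bandwidth_gbs(size_t elements, int arrays,
                                      int launches, double seconds)
{
    assert(seconds > 0.0);
    double bytes = (double)elements * sizeof(float) * arrays * launches;
    return bytes / seconds / 1e9;
}
```

With the numbers from the post, effective_bandwidth_gbs(788432, 3, 100, 0.2) comes to roughly 4.7 GB/s for case 1 and effective_bandwidth_gbs(788432, 3, 100, 0.5) to roughly 1.9 GB/s for case 2. If your device's peak memory bandwidth is well above those figures (something to check against its data sheet, not something I know), the kernels are not limited by bandwidth. Note that times measured on the host may also include launch and synchronization overhead.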

Thanks a lot for the reply.

Thanks for the previous reply. Recently I ran into a situation that I would like to explain with the example below:

int previous_pixel;
fs = get_global_id(0);

us = fs % frame_width;

if (us == 0)
    previous_pixel = 0;

if (input_buffer[fs] == 12)
    previous_pixel = fs;
else if (previous_pixel != 0)
{
    // operation being done:
    // calculate some value h here, then
    previous_pixel = h;
}

That is my situation. How can I solve this dependency problem for previous_pixel? It takes the value zero in all threads.

Thanks in advance
Best regards
Megharaj