CUDA optimization

Hi guys, I'm in trouble here… Can the code below be optimized any further?

typedef struct {
    float* x1;   // min-corner x of each MBR
    float* y1;   // min-corner y
    float* x2;   // max-corner x
    float* y2;   // max-corner y
} MBR2;

__global__ static void ParallelSearch(MBR2 idata, char* odata,
                                      float x_min, float y_min,
                                      float x_max, float y_max,
                                      int numElements)
{
    const int tid        = blockDim.x * blockIdx.x + threadIdx.x;
    const int numThreads = blockDim.x * gridDim.x;

    float a, b, c, d;
    for (int pos = tid; pos < numElements; pos += numThreads)
    {
        a = idata.x1[pos];
        b = idata.y1[pos];
        c = idata.x2[pos];
        d = idata.y2[pos];

        if (x_min > c || a > x_max)          // no overlap on the x axis
            odata[pos] = '0';
        else if (y_min > d || y_max < b)     // no overlap on the y axis
            odata[pos] = '0';
        else
            odata[pos] = '1';                // this MBR intersects the query box
    }
}

The code above handles a range query: it finds the MBRs (minimum bounding rectangles) that intersect the query rectangle…

I compared the code above with just a plain ‘for(;;)’ loop…

And its performance is much worse.

The CUDA profiler doesn’t show any gld_uncoalesced… but it does show many gld_coalesced.

My question is whether the code above has problems, and if so, whether they can be fixed.

And another question…

cudaMemcpy spends a lot of time transferring the data…

If you know any way to reduce that time, please let me know…
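
(One common lever here is page-locked host memory: buffers from cudaMallocHost typically copy to the device noticeably faster than ordinary pageable malloc buffers, and they also enable cudaMemcpyAsync. A minimal sketch for one of the four arrays, using the hypothetical names h_x1/d_x1:)

float *h_x1 = NULL, *d_x1 = NULL;
size_t bytes = numElements * sizeof(float);

cudaMallocHost((void**)&h_x1, bytes);   // pinned (page-locked) host buffer
cudaMalloc((void**)&d_x1, bytes);

/* ... fill h_x1[0..numElements-1] on the host ... */

// transfers from pinned memory are typically much faster than from pageable memory
cudaMemcpy(d_x1, h_x1, bytes, cudaMemcpyHostToDevice);

/* ... run the kernel, copy results back ... */

cudaFree(d_x1);
cudaFreeHost(h_x1);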

What do "a, b, c, d" do? You have no FLOPs in your algorithm?

Yes… the code above doesn’t need any FLOPs…

Turn off smileys in your post to make your code into non-smiley code. :-)

Two optimizations, both easy.

First, if your data is somewhat coherent, notice that you load four different values, but you often end up ignoring 3 of them!
This depends a bit on nearby data having similar behavior (and thus no divergence), but if for example the x_min test hits, there’s no need to load the rest.
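
Roughly like this, keeping the rest of the kernel unchanged (untested sketch):

for (int pos = tid; pos < numElements; pos += numThreads)
{
    float a = idata.x1[pos];
    float c = idata.x2[pos];
    if (x_min > c || a > x_max)
    {
        odata[pos] = '0';     // rejected on x: y1/y2 are never loaded
        continue;
    }
    float b = idata.y1[pos];
    float d = idata.y2[pos];
    odata[pos] = (y_min > d || y_max < b) ? '0' : '1';
}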

Second, your output is written back from THREE different places in the code. If your warp is diverged, you’ll use three times the write bandwidth.
It’s better to decide your answer in a register, then just write that one answer.
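
Untested sketch of that: compute the verdict into a register, so every thread issues exactly one store, diverged or not:

for (int pos = tid; pos < numElements; pos += numThreads)
{
    float a = idata.x1[pos];
    float b = idata.y1[pos];
    float c = idata.x2[pos];
    float d = idata.y2[pos];
    // one verdict decided in a register...
    char result = (x_min > c || a > x_max || y_min > d || y_max < b) ? '0' : '1';
    odata[pos] = result;      // ...then a single unconditional store
}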

Finally, your writes aren’t coalesced on 1.0 and 1.1 hardware. That may not be a big deal if you are using 1.3 hardware. If you do want older cards to be faster, stage your values in shared memory and then write out the results from there. Even on 1.3 hardware this may be a bandwidth win, since you’re computing only one byte per thread and you get the best bandwidth with 64-byte writes.
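
A rough, untested sketch of that staging idea (it assumes blockDim.x is 256, numElements is a multiple of blockDim.x so boundary handling can be omitted, and odata comes from cudaMalloc so it is word-aligned): each thread drops its one byte into shared memory, then a quarter of the threads write the block's results back as 32-bit words, which does coalesce on 1.0/1.1 hardware:

__global__ static void ParallelSearchStaged(MBR2 idata, char* odata,
                                            float x_min, float y_min,
                                            float x_max, float y_max,
                                            int numElements)
{
    __shared__ int s_words[64];          // 256 bytes of staging, int-aligned
    char* s_out = (char*)s_words;

    const int tid        = blockDim.x * blockIdx.x + threadIdx.x;
    const int numThreads = blockDim.x * gridDim.x;

    for (int pos = tid; pos < numElements; pos += numThreads)
    {
        float a = idata.x1[pos];
        float b = idata.y1[pos];
        float c = idata.x2[pos];
        float d = idata.y2[pos];
        s_out[threadIdx.x] =
            (x_min > c || a > x_max || y_min > d || y_max < b) ? '0' : '1';
        __syncthreads();

        // The first quarter of the threads flush the block's 256 bytes as
        // 64 int stores: consecutive 32-bit stores coalesce on 1.0/1.1
        // hardware, where per-thread char stores never do.
        if (threadIdx.x < blockDim.x / 4)
        {
            const int blockStart = pos - threadIdx.x;   // block's first index this pass
            ((int*)(odata + blockStart))[threadIdx.x] = s_words[threadIdx.x];
        }
        __syncthreads();                 // s_out is reused on the next pass
    }
}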