Hi guys, I am in huge trouble… Can the code below be optimized any further?

```
typedef struct {
    float* x1;
    float* y1;
    float* x2;
    float* y2;
} MBR2;

__global__ static void ParallelSearch(MBR2 idata, char* odata, float x_min, float y_min, float x_max, float y_max, int numElements)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int numThreads = blockDim.x * gridDim.x;

    float a, b, c, d;
    // Grid-stride loop: each thread handles every numThreads-th MBR.
    for (int pos = tid; pos < numElements; pos += numThreads)
    {
        a = idata.x1[pos];
        b = idata.y1[pos];
        c = idata.x2[pos];
        d = idata.y2[pos];
        // '0' = no overlap with the query window, '1' = overlap.
        if (x_min > c || a > x_max)
            odata[pos] = '0';
        else if (y_min > d || y_max < b)
            odata[pos] = '0';
        else
            odata[pos] = '1';
    }
}
```

The code above handles a range query: it finds the MBRs that intersect the query rectangle…
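To make the test itself explicit, here is a minimal sketch of the same overlap check written as a plain host-side loop. It assumes the same structure-of-arrays layout as MBR2; the names `mbrOverlaps` and `serialSearch` are made up for illustration and this is not necessarily the exact loop used in the comparison.

```cpp
// Hypothetical helper (not in the original kernel): returns '1' when the
// MBR [x1, x2] x [y1, y2] overlaps the query window, '0' otherwise.
static inline char mbrOverlaps(float x1, float y1, float x2, float y2,
                               float x_min, float y_min,
                               float x_max, float y_max)
{
    // Two rectangles are disjoint if they are separated along x or along y.
    const bool disjoint = (x_min > x2) || (x1 > x_max) ||
                          (y_min > y2) || (y1 > y_max);
    return disjoint ? '0' : '1';
}

// Hypothetical host-side reference loop over the same four arrays
// that MBR2 points to; writes one result character per element.
static void serialSearch(const float* x1, const float* y1,
                         const float* x2, const float* y2,
                         char* odata,
                         float x_min, float y_min,
                         float x_max, float y_max,
                         int numElements)
{
    for (int pos = 0; pos < numElements; ++pos)
        odata[pos] = mbrOverlaps(x1[pos], y1[pos], x2[pos], y2[pos],
                                 x_min, y_min, x_max, y_max);
}
```

A loop like this can also be used to check the kernel's output element by element.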

I compared the code above with just a plain ‘for(;;)’ loop…

And its performance is much worse.

The CUDA profiler doesn’t show any gld uncoalesced… but it does show many gld coalesced.

My question is whether the code above has a problem or not… and if it does, can it be fixed?

And another question…

cudaMemcpy spends a lot of time transferring data…

If you know any way to reduce that time… please let me know…