Problems when writing local memory -> global memory

Hi all,
I’m investigating the efficiency of CUDA. In my program, I created 16384 threads (block size = 256 threads, grid size = 64 blocks). The kernel is as follows:
__global__ void ray_gpu(float* d_prj, float* d_img, float* d_ax, setang* d_ang)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    struct rayt ray;
    float x1, x2, y1, y2;
    float _proj = 0;

    intersect_gpu(d_ax, d_ang, iy, ix, &x1, &y1, &x2, &y2);

    if (x1 < 0 || x2 < 0 || y1 < 0 || y2 < 0)
    {
        d_prj[IMGSIZ * iy + ix] = 0;
    }
    else
    {
        _proj = find_gpu(x1, y1, x2, y2, 0, 0, Nx, Ny, 1, 1, &ray, d_img);
        d_prj[IMGSIZ * iy + ix] = _proj;
    }
}
(Here, find_gpu and intersect_gpu are two __device__ functions.)
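For reference, I launch the kernel roughly like this. The exact 2D block shape is an assumption on my part; any split with blockDim.x * blockDim.y == 256 matches the numbers above, and NPROJ is just a placeholder name for the number of rows:

```
// Host-side launch sketch: 16384 threads total, 256 per block, 64 blocks.
// block(16, 16) is assumed; NPROJ is a placeholder for the row count.
dim3 block(16, 16);
dim3 grid(IMGSIZ / block.x, NPROJ / block.y);
ray_gpu<<<grid, block>>>(d_prj, d_img, d_ax, d_ang);
cudaDeviceSynchronize();  // wait for the kernel before copying results back
```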
It works fine when I run it on my GPU (a GeForce 8600 GT). Compared with the sequential CPU program, the GPU version runs about 5 times faster: 13 ms on the GPU vs. 60 ms on the CPU.
What puzzles me is the following. Instead of using

d_prj[IMGSIZ * iy + ix] = _proj; // 13 ms version

I use:

d_prj[IMGSIZ * iy + ix] = sqrtf(ray.size); // 5 ms version

and the GPU program then completes in only 5 ms.
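The 13 ms / 5 ms figures come from timing the kernel launch on the host. A cudaEvent-based sketch of how the measurement can be done (variable names are illustrative; grid and block are the launch dimensions from above):

```
// Time the kernel with CUDA events, which measure elapsed GPU time.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
ray_gpu<<<grid, block>>>(d_prj, d_img, d_ax, d_ang);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // kernel launches are asynchronous

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```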
Is there a better way to understand this behavior?
Thanks in advance.