Optimizing a loop in a kernel

Hey guys, I am a newbie. Could you give me some ideas about this kernel? I want to optimize it but don't know where to start.

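__global__ void kernel(… projection p, float step_size,
                       float sample_spacing, float3 pos) {

    // One thread per output element (y, z).
    int y = UMAD(blockIdx.x, blockDim.x, threadIdx.x);
    int z = UMAD(blockIdx.y, blockDim.y, threadIdx.y);

    // Linear index into the output buffer.
    int i = z * p.det.projydim + y;

    int num_steps = …;
    float sum = 0.0f;

    // Accumulate texture samples over num_steps iterations.
    for (int j = 0; j < num_steps; j++) {
        sum += tex3D(project_tex, pos.x, pos.y, pos.z);
    }

    d_proj[i] = sum * step_size * sample_spacing;
}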

Thanks a lot.

pos gets a new value in each thread, so it is hard to reuse the tex3D content.

I thought about using several threads to do the for loop, but it is not clear to me how to optimize it.
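For what it's worth, here is a minimal sketch of what "using threads to do the for loop" could look like: one warp per output element, each lane taking every 32nd step, and a warp shuffle reduction to combine the partial sums. Everything not in the kernel above is an assumption made for illustration: sample_volume is a hypothetical stand-in for the tex3D fetch (it returns a constant so the sketch runs without a bound texture), and integrate_rays, start, delta and num_rays are made-up names. The sketch also gives every ray the same start position, whereas the real kernel computes pos per thread. Whether this beats one thread per output element depends on the problem size; with many output elements the original layout may already keep the GPU busy, so it is worth profiling both.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for tex3D(project_tex, pos.x, pos.y, pos.z).
// It returns a constant so the sketch runs without a bound texture.
__device__ float sample_volume(float3 pos) {
    (void)pos;
    return 1.0f;
}

// One warp (32 threads) per output element: lane k takes steps k, k+32, ...
// Assumes blockDim.x is a multiple of 32, so a warp never spans two rays.
__global__ void integrate_rays(float *d_proj, int num_rays, int num_steps,
                               float3 start, float3 delta,
                               float step_size, float sample_spacing) {
    int lane = threadIdx.x % 32;
    int ray  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (ray >= num_rays) return;  // the whole warp exits together

    // Each lane accumulates its strided share of the steps.
    float sum = 0.0f;
    for (int j = lane; j < num_steps; j += 32) {
        float3 pos = make_float3(start.x + j * delta.x,
                                 start.y + j * delta.y,
                                 start.z + j * delta.z);
        sum += sample_volume(pos);
    }

    // Warp-level tree reduction: afterwards lane 0 holds the full sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0)
        d_proj[ray] = sum * step_size * sample_spacing;
}

int main() {
    const int num_rays = 64, num_steps = 100;
    float *d_proj = nullptr;
    if (cudaMalloc(&d_proj, num_rays * sizeof(float)) != cudaSuccess) return 1;

    const int threads = 128;  // 4 warps per block
    const int blocks = (num_rays * 32 + threads - 1) / threads;
    integrate_rays<<<blocks, threads>>>(d_proj, num_rays, num_steps,
                                        make_float3(0.f, 0.f, 0.f),
                                        make_float3(1.f, 0.f, 0.f),
                                        0.5f, 2.0f);

    std::vector<float> h_proj(num_rays);
    cudaError_t err = cudaMemcpy(h_proj.data(), d_proj,
                                 num_rays * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_proj);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("d_proj[0] = %f\n", h_proj[0]);
    return 0;
}
```

The trade-off is that each output element now costs 32 threads plus a reduction, so this layout probably only pays off when num_steps is large and there are too few output elements to fill the GPU with one thread each.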