Question about the __shared__

Hi all, I am working on a GPU programming project, but my program produces wrong results at runtime.
I want a kernel that runs multiple threads, with one temp array that is shared between these threads.
But the values it returns are never what I need.
Could anyone help me solve it?

__shared__ float* temp;
__global__ void Moller(float* tr, float* triangle, float* ray, float* view, int imageW, int imageH, int nbTri)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    for(int i = 0; i < imageH*imageW; i++) { temp[i] = 1.E+30f; }

    if(ix*iy < nbTri){

        float v0[4];
        float v1[4];
        float v2[4];
        float c_ray[4];
        float intersect;
        int   index = 0;

        v0[0] = triangle[12*(ix+iy*imageW)  ]; v0[1] = triangle[12*(ix+iy*imageW)+1]; v0[2] = triangle[12*(ix+iy*imageW)+2 ]; v0[3] = triangle[12*(ix+iy*imageW)+3 ];
        v1[0] = triangle[12*(ix+iy*imageW)+4]; v1[1] = triangle[12*(ix+iy*imageW)+5]; v1[2] = triangle[12*(ix+iy*imageW)+6 ]; v1[3] = triangle[12*(ix+iy*imageW)+7 ];
        v2[0] = triangle[12*(ix+iy*imageW)+8]; v2[1] = triangle[12*(ix+iy*imageW)+9]; v2[2] = triangle[12*(ix+iy*imageW)+10]; v2[3] = triangle[12*(ix+iy*imageW)+11];

        float e1[4];
        float e2[4];

        //assign two edges
        e1[0] = v1[0]-v0[0]; e1[1] = v1[1]-v0[1]; e1[2] = v1[2]-v0[2];
        e2[0] = v2[0]-v0[0]; e2[1] = v2[1]-v0[1]; e2[2] = v2[2]-v0[2];

        for(int j = 0; j < imageW*imageH-1; j++)
        {
            float t = 0;
            c_ray[0] = ray[4*j  ];
            c_ray[1] = ray[4*j+1];
            c_ray[2] = ray[4*j+2];
            intersect = interMoller(e1,e2,v0,c_ray,view); //return a result
            if(intersect != 0 && intersect < temp[j])
            {
                temp[j] = intersect;
                tr[j] = 0;
            }
        }//end for j loop
    }//end if loop

    //Stored back into tr
    for(int i = 0; i < imageH*imageW; i++) { tr[i] = temp[i]; }
}


Move __syncthreads() out of the conditional. The effect of a __syncthreads() call that is not reached by all threads of a block is undefined.
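
A minimal sketch of that rule, in case an example of sharing data within a block helps. The kernel name reverseTile, the TILE constant and the host helper reverseTilesOnGpu are made up for illustration and are not part of the code posted above. The guarded load sits inside the if, while the barrier sits outside it so that every thread of the block reaches it. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 128  // hypothetical block size; the kernel must be launched with blockDim.x == TILE

// Reverses each TILE-sized chunk of `in` into `out` through shared memory.
__global__ void reverseTile(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];  // a real array in shared memory, one per block
    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];  // the conditional part: only in-range threads load
    }
    __syncthreads();                // outside the if: reached by ALL threads of the block

    const int src = TILE - 1 - threadIdx.x;       // slot written by another thread
    const int j = blockDim.x * blockIdx.x + src;  // global index that slot was loaded from
    if (i < n && j < n) {
        out[i] = tile[src];
    }
}

// Hypothetical host helper: copies the data over, runs the kernel, copies the result back.
std::vector<float> reverseTilesOnGpu(const std::vector<float>& h)
{
    const int n = static_cast<int>(h.size());
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, n * sizeof(float));  // elements of a partial last block may stay 0
    reverseTile<<<(n + TILE - 1) / TILE, TILE>>>(dIn, dOut, n);
    std::vector<float> r(n);
    cudaMemcpy(r.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return r;
}
```

If the barrier were moved inside the if (i < n) branch, the threads of a partial last block that fail the test would never reach it, which is the undefined case described above.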

Thanks for the reply.

I have moved it out, but it still does not return what it is supposed to.

Is there any example of sharing data between threads?



Looking a bit closer, I see some more problems:

    You have only allocated a pointer in shared memory, not an array. Declare temp as __shared__ float temp[imageH*imageW]. If imageH and imageW aren’t compile-time constants, declare temp as extern __shared__ float temp[] and call your kernel as Moller<<<…, …, imageH*imageW*sizeof(float)>>>(…).

    The way you try to use shared memory to find a minimum for all threads of a block (or even kernel?) does not work (whether with __syncthreads() or without), as the comparisons are not atomic. You probably want to exchange (ix,iy) and j, so that the minimum operation is between the results within each thread, not for each iteration between the results of all threads.

    You don’t need an array in shared memory at all, as at any time only one of the elements is in use. Not sure though if this is because you shortened the kernel for presentation in the forum. Anyway, for any reasonable size this array probably is not going to fit into shared memory.

    In the initialization and writeback loops each thread in the block does the same, resulting in needless duplication of work.
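
The exchange of (ix,iy) and j suggested above could look roughly like the sketch below: one thread per ray, a loop over the triangles, and the running minimum kept in a register, so no shared array is needed. This is only a sketch under assumptions. interMoller() was not posted, so interMollerSketch is a hypothetical stand-in (a plain Moller-Trumbore test returning the hit distance, or 0 for a miss), view is assumed to hold the ray origin, and the names MollerPerRay and traceOnGpu are invented here. The data layout (12 floats per triangle, 4 floats per ray) follows the posted kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for interMoller(): returns hit distance t, or 0 on a miss.
__device__ float interMollerSketch(const float* e1, const float* e2, const float* v0,
                                   const float* dir, const float* orig)
{
    const float p[3] = { dir[1]*e2[2] - dir[2]*e2[1],     // p = dir x e2
                         dir[2]*e2[0] - dir[0]*e2[2],
                         dir[0]*e2[1] - dir[1]*e2[0] };
    const float det = e1[0]*p[0] + e1[1]*p[1] + e1[2]*p[2];
    if (fabsf(det) < 1e-8f) return 0.0f;                  // ray parallel to the triangle
    const float inv = 1.0f / det;
    const float s[3] = { orig[0]-v0[0], orig[1]-v0[1], orig[2]-v0[2] };
    const float u = (s[0]*p[0] + s[1]*p[1] + s[2]*p[2]) * inv;
    if (u < 0.0f || u > 1.0f) return 0.0f;
    const float q[3] = { s[1]*e1[2] - s[2]*e1[1],         // q = s x e1
                         s[2]*e1[0] - s[0]*e1[2],
                         s[0]*e1[1] - s[1]*e1[0] };
    const float v = (dir[0]*q[0] + dir[1]*q[1] + dir[2]*q[2]) * inv;
    if (v < 0.0f || u + v > 1.0f) return 0.0f;
    const float t = (e2[0]*q[0] + e2[1]*q[1] + e2[2]*q[2]) * inv;
    return t > 0.0f ? t : 0.0f;
}

// One thread per ray (pixel); each thread scans all triangles on its own.
__global__ void MollerPerRay(float* tr, const float* triangle, const float* ray,
                             const float* view, int imageW, int imageH, int nbTri)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= imageW || iy >= imageH) return;  // no barrier below, so an early return is safe
    const int j = ix + iy * imageW;

    const float dir[3] = { ray[4*j], ray[4*j+1], ray[4*j+2] };
    float nearest = 1.0e30f;                   // per-thread minimum, lives in a register

    for (int k = 0; k < nbTri; ++k) {
        const float* v0 = &triangle[12*k];
        const float* v1 = v0 + 4;
        const float* v2 = v0 + 8;
        const float e1[3] = { v1[0]-v0[0], v1[1]-v0[1], v1[2]-v0[2] };
        const float e2[3] = { v2[0]-v0[0], v2[1]-v0[1], v2[2]-v0[2] };
        const float t = interMollerSketch(e1, e2, v0, dir, view);
        if (t != 0.0f && t < nearest) nearest = t;
    }
    tr[j] = nearest;                           // each thread writes only its own pixel
}

// Hypothetical host helper (no error checking): runs the kernel and returns tr.
std::vector<float> traceOnGpu(const std::vector<float>& tri, const std::vector<float>& ray,
                              const std::vector<float>& view, int w, int h, int nbTri)
{
    float *dTri, *dRay, *dView, *dTr;
    cudaMalloc(&dTri, tri.size() * sizeof(float));
    cudaMalloc(&dRay, ray.size() * sizeof(float));
    cudaMalloc(&dView, view.size() * sizeof(float));
    cudaMalloc(&dTr, w * h * sizeof(float));
    cudaMemcpy(dTri, tri.data(), tri.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dRay, ray.data(), ray.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dView, view.data(), view.size() * sizeof(float), cudaMemcpyHostToDevice);
    const dim3 block(16, 16);
    const dim3 grid((w + 15) / 16, (h + 15) / 16);
    MollerPerRay<<<grid, block>>>(dTr, dTri, dRay, dView, w, h, nbTri);
    std::vector<float> out(w * h);
    cudaMemcpy(out.data(), dTr, out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dTri); cudaFree(dRay); cudaFree(dView); cudaFree(dTr);
    return out;
}
```

Because no two threads ever touch the same element of tr, the non-atomic compare-and-store is no longer a problem, and the initialization and writeback loops disappear as well.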

Many thanks for the help.

I have tried your suggestions and am still working on them.

Some errors occur. When I change temp into an array, I get this error:

Error 13 error : Entry function ‘_Z6MollerPfS_S_S_iii’ uses too much shared data (0x4002c bytes + 0x10 bytes system, 0x4000 max) D:\Project for Kit\rayTracing\CUDACOMPILE

Changing it to extern __shared__ gives this error:

Error 6 error : local and shared variables cannot have external linkage D:\Project for Kit\rayTracing\ 71

I am still working on the other suggestions.
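
On the two errors: the first message itself reports a limit of 0x4000 bytes (16384 bytes, so somewhat fewer than 4096 floats) against a request of 0x4002c bytes, so an array of imageH*imageW floats cannot fit for that image size, whichever way it is declared. For the second, it is not clear from the message alone what was written on line 71; the form known to compile is an unsized array, extern __shared__ float temp[];, with its size passed as the third launch parameter. The sketch below shows that form on a small per-block minimum. The names blockMin and blockMinOnGpu are hypothetical, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Writes the minimum of each block's slice of `in` to out[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void blockMin(const float* in, float* out, int n)
{
    extern __shared__ float temp[];  // unsized; the launch supplies the byte count
    const int tid = threadIdx.x;
    const int i = blockDim.x * blockIdx.x + tid;

    // Each thread initializes exactly one slot, so no work is duplicated.
    temp[tid] = (i < n) ? in[i] : 1.0e30f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            temp[tid] = fminf(temp[tid], temp[tid + s]);
        }
        __syncthreads();             // outside the if, reached by every thread
    }
    if (tid == 0) out[blockIdx.x] = temp[0];
}

// Hypothetical host helper: one minimum per block of `threads` elements.
std::vector<float> blockMinOnGpu(const std::vector<float>& h, int threads)
{
    const int n = static_cast<int>(h.size());
    const int blocks = (n + threads - 1) / threads;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Third launch parameter: bytes of dynamic shared memory per block.
    blockMin<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);
    std::vector<float> r(blocks);
    cudaMemcpy(r.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return r;
}
```

With 256 threads per block this uses 1024 bytes of shared memory, well under the 16384-byte limit quoted in the error.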