CUDA Project output problem

Hi all!

I have recently “finished” my project(thesis) in CUDA. When I say “finished” I am saying that I transformed an 1h30m code to 10-12 minutes and I have obtained the expected output(a breast mamography). However my image has a non-wanted grid that matches the iterations I am doing: The image is 352x896 and my dimGrid and dimBlock are, respectively, (11,7) and (8,8) so I am reconstructing image blocks of 88x56 at a time , which accounts for 4 X-axis iterations and 16 Y-axis iterations, which is the size of the grid that appears in the image.

I have lost count at how many times I have reviewed my code and I am not expecting anyone to waste their time as I know I am probably the only one who can correct this but I would like some advice. And I will post some parts of my code which I think are critical to this error

The iterations for the image blocks(88x56)

for(int i=0;i<352;i=i+88)

		for(int j=0;j<(896);j=j+56){	

			iteration(i, j, object, scale, xbinsize,ybinsize,zbinsize,detectorXDim,detectorYDim,NSlices,xfocus,yfocus,zfocus,matprojections);

The only program that uses the variables i and j(there are three similar to this one, as it corresponds to the intersections of the ray(x-ray source to bin detector) with X,Y and Z axis)

__global__ void sysmaty(float*intersectionsx,float*intersectionsy,float*intersectionsz,int*biny,float xfocus,float yfocus, float zfocus, float xbinsize,float ybinsize, int zbinsize,int detectorXDim,int detectorYDim, int NSlices, int iiterationsu,int jiterationsu)


	int tx=threadIdx.x, ty=threadIdx.y,bx=blockIdx.x, by=blockIdx.y;

	float x,y,z,t;

	int idy=(ty+by*BLOCKSIZE)+jiterationsu;

	int idx=(tx+bx*BLOCKSIZE)+iiterationsu;

	float slopeVectorx=xfocus-((float)idx+0.5)*xbinsize;

	float slopeVectory=yfocus-((float)idy+0.5)*ybinsize;

	float slopeVectorz=zfocus+17.0f;





	int yint=idy+1;

		for(yint=yint; yint<=yfocus/(float)SCALE && yint<=detectorYDim;yint++){ 





			z=-17.0f+t*slopeVectorz;//supostamente onde ta zero, sera uma coordenada de z	







	else if(idy*ybinsize>yfocus){

		int yint=idy-1;

		int counter=idy+1;

		for(yint=yint,counter=counter; yint>=yfocus/(float)SCALE && yint>=0;yint--, counter++){














The initialization of the arrays used in the previous function

float *intersectionsYx_h=(float*)malloc(dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*detectorYDim*sizeof(float));

	float *intersectionsYy_h=(float*)malloc(dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*detectorYDim*sizeof(float));

	float *intersectionsYz_h=(float*)malloc(dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*detectorYDim*sizeof(float));

	int *binY_h=(int*)malloc(dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*detectorYDim*sizeof(int));

The output image(the image in reality is 704x896x60) but that is just a slice of the 60 Slices, and it is just the half of the detector that contains the breast. The other half is obtained by simmetry in one of the reconstruction steps.

Uploaded with

Thank you in advance ;)

While I don’t understand any of the detail of what you are doing, I notice that your conditionals and loop counts involve floating point arithmetics with apparently no provisions for rounding errors:

Apparently you want the kernel to take no action at all in the case [font=“Courier New”]idy*ybinsize==yfocus[/font]. That expression however isn’t reliable if [font=“Courier New”]ybinsize[/font] and [font=“Courier New”]yfocus[/font] aren’t integer multiples of a (potentially negative) power of two. [font=“Courier New”]ybinsize=0.1[/font] would be a simple example for such a troublesome value.

Basing the number of loop iterations on a comparison of floating point quantities has a similar problem - you might end up with one iteration more or less than expected.

Thank you for the tip althought that does not seem to be the problem because if yfocus=152,5 I just want the intersection with 152, so the loop will contain 152 but not 153. Also I have the grid when yfocus is greater/lesser than every idy

This code looks suspicious:

float slopeVectorx=xfocus-((float)idx+0.5)*xbinsize;
float slopeVectory=yfocus-((float)idy+0.5)*ybinsize;

The 0.5 seems to be an offset to focus on the center.

However the multiplication is done after, which multiplies the offset as well, which seems strange to me.

This should be done vice versa.

Try multiplieing first, and when calculation is done add +0.5

Also why is this done:

int yint=idy+1;

^ Seems suspicious as well.

Possible reasons for the grid:

  1. One-off mistakes.
  2. Rounding mistakes.
  3. Need more data need edges.

And again the same odd formula is used:

slopeVectorz;//supostamente onde ta zero, sera uma coordenada de z

The offset is added before the multiplication.

Be aware that floating points have imprecision in them, when you start multiplieing floats any imprecision is also exagerated/increased.

Therefore try to do as much of the calculations as possible with integers only, and add any floating point values at the very end.

So again, try flipping this code around:

  1. First do multiplication
  2. Then add +0.5 offset.

This code also looks suspicious in combination with:

int yint=idy+1; // why +1 ?

// below: idy is not +1 ?
// and: yint-1

There seems to be a lot of -1 and +1 compensations, which could be an indication that you’re having troubles with compensating for zero-based arrays or one-based arrays or formula problems which need to compensate for it.


My suggestion for you if you want further help or help yourself:

Try (re-)writing your code in simple pseudo form and present all data structures here in pseudo form… especially the ranges of the arrays and such.
Perhaps this will help as to spot the one-off problem. Or simply puts some code from the data structures/arrays.

Well thanks for all the replys Skybuck. Indeed it looks strange but it must be that way. I am working with a 704x896 Detector. The +0.5 is to work with the center of the bin, so for instance, when working with the ray that starts on the X-Ray Source and goes to the (0,0) bin, the Straight Line equation is obtained with the centre of the bin (0+0.5,0+0.5), thats why the 0.5 needs to be multiplied as well because if not, I would be summing two different scales.

The other point you mentioned(the inty=idy+1): That comes from the fact that I want to obtain the intersections of the ray with the X,Y,Z axis. So thread(bin) (0,0) will have intersections with x=[1:DetectorXDim], y=[1:DetectorYdim] and z=[0:NSlices]

The way I am putting the values in the array is this:

I have 88x56 threads in each iteration, I allocate memory so that each thread(bin) has the maximum number possible of intersections with the axes([1:DetectorAxisDim]) so…


detectorYDim=896 which means that the first thread, idx=0,idy=0 will put its intersections in the range [0:896] but thread idx=1,idy=0 will put in the range [896:896*2]. I will write some pseude code of my program and I will post it here when I am finished.

I am really grateful for the help, as this is what is keeping me from finishing the thesis and become an Engineer lol

I finished writing my pseudo code:

//only half of the detector is obtained, the other is symmetrical(around x=DetectorXDim/2)

for each X-Ray Source position

	determine X-Ray Source coordinates

		for each xbin(0: detectorXDim/2-1)

			for each ybin(0: detectorYDim-1){


				Slopevectorx= XRayPosX - (xbin+0.5)*xbinsize;

				Slopevectory= XRayPosY - (ybin+0.5)*ybinsize;

				Slopevectorz= XRayPosZ - (-17);


				//straigh line equation: (x,y,z) = (xbin,ybin,zbin) + t*(Slopevectorx,Slopevectory,Slopevectorz)

				// x=xbin+t*Slopevectorx

				for each Xinteger(xbin+1:DetectorXDim/2){




				add to intersectionsx array;

				//do the same to YInteger(ybin+1:detectorYDim) and ZInteger(0:NSlices)


				for each intersection{

					if( xcoord < 0 OR xcoord > detectorXDim/2*xbinsize OR ycoord < 0 OR ycoord > detectorYDim*ybinsize OR zcoord < 0 OR zcoord > NSlices)

					remove intersection


				group the intersections(each one of the 3 kernels(for xinteger,yinteger,zinteger) has 4 arrays: bin,xcoord,ycoord,zcoord, group all the resulting bins,xcoord,ycoord and zcoord in 4 arrays with the size equal to the sum of the size of the arrays of each kernel)


				sort intersections(using xcoord or zcoord)

				for each intersections(1:Numberintersections-1):{

					distance=||intersection - previousintersections||

					//find the corresponding image voxel




						distBinXcoord= integer(xcoord)




						distBinZcoord= integer(zcoord)


						distBinYcoord= integer(y)





							distBinYcoord= integer(ycoord)




I will try to make the array indexing a little bit less confusing though I dont think there are many ways to work around it. If there is any more information you need in order to help me just ask Skybuck ;)

A general description of what the algorithm is trying to do, and what the algorithm is for would help too.

Some further questions: what is ment with “bin” ? Does it mean an array index (/element) ?

From what I understand this seems to be some kind of 3D algorithm concerning rays.

Is the algorithm trying to shoot rays through a 3D volume ?

The code also mentions “intersections”, what is it trying to intersect ? Is it trying to detect the first voxel that is set ?

If so then this is a grid traversing algorithm ?

I know a grid traverse algorithm which is real fast and probably flawless but ok… there is a document/pdf of it too on the internet… it did require some inventing of formula’s but I managed to do so…

The code also seems to mention “detectors” what is ment with that ? Is this the “voxel” which is hit “detector” ?

There seem to be multiple images/slices… which seem to be a “picture-ized” version of a volume.

Perhaps the volume is so large that it needs to be split up in slices… hence the pictures.

The algorithm seems to use a simple line linear algorithm to do the stepping, perhaps it’s trying to detect the edges of the voxels in a way… so it does seem like a voxel traversal algorithm. Which is probably what rays are all about anyway…

Perhaps the intersection array is used to store the ray’s traveling… it can be done directly because there is no 3D volume, so all the rays must be split up and processed in little fragments, this could mean the kernel tries to advance each ray’s step/position at a time… then multiple kernel launches or the main loop tries to advance the rays and see if more hits occur with the new pictures/slices.

A question which remains is:

Why are the pictures split up into further blocks ? Why are they simply not processed in it’s entirety ?

A picture of 1920x1200x3 bytes or so… would take something like 6 megabytes… so it should easily fit… so why the split up ?

In case this is a voxel traversal algoritm then this paper might give you some idea how to seek the edges of the voxels better and quicker then compensate for the center or so:

“A Fast Voxel Traversal Algorithm for Ray Tracing”

The document is a bit thin on details but gives a general idea…

Little typo which happens:

“it can be done directly because there is no 3D volume”

should be:

“it can’t be done directly because there is no 3D volume”

So basically the steps I posted in the pseudo code are part of the one of the two parts of the larger code.

I will try to make a short intro to my thesis:

I am working with a tomosynthesis scanner that “shoots” X-rays through the 3D Volume(the breast which is 70489660) and there is a detector(704*896) that measures how the rays are “absorbed” in the 3D Volume. The first step in my work is to obtain the System Matrix, which is independent of the result from the scanner, it just depends on the position of the X-Ray Source: It is the group of all the rays that go from each bin of the detector(In my case I am treating each bin as each thread of my kernel) to the X-Ray source. After that I obtain the intersections of each ray with each voxel of the 3D Volume to obtain the probability that one ray has to intersect one voxel(this part is not in the pseude code, basically it is just a normalization, I sum all the distances of one ray and I divide each distance by that sum. So in the end, my system matrix consists of 3 arrays: The bin of the detector(in other words, the ray), the voxel array(which contains all the voxels that each ray intersects) and the probability array(the probability that each voxel has of being intercepted by the ray)

So a very small example would be

Bin     Voxel     Probability

0        10          0,3

0        14          0,7

1        14          0,5

1        15          0,5

With the System Matrix and the Scanner input I then go on to the second step of my code: the OSEM algorithm, which, frankly, I think it is right for a couple of reasons: Ihe Breast image I obtain is the same as the CPU-obtained one(with the exception of the grid) and I never use the Iterations variables in the OSEM code.

Also I cannot do everything at the same time for one reason: If I allocate memory for all the threads intersections I would have this:


Translating to numbers and only for the Y Axis intersections I would have: 704*896*3584*4*4 = 36 Gb. It would be a little bit lower because I have 3 float and 1 int array instead of 4 float arrays, but it would still be very high.


From your explanation it is becoming a bit more clear to me.

There is a 3D array (which I hope you declared as global/device) which contains the voxel. (What exactly is a voxel in this case ?)

There is a 2D array which is the “detector”, the detector shoots out rays for each index.

For this 2D array there seems to be a thread per index. The thread traverses the ray and wants to know per voxel which it hits what it’s true hit probability is.

There seems to be some summing going on, for some reason this cannot happen on the fly ? So instead you want to store each collision/hit/intersection in a special intersection array per ray.

So each ray has an intersection array <- which more or less creates another 3D volume: 2D detector * 1D ray = 3D.

So this should still fit in memory.

Perhaps you are using the wrong data structure, since your numbers/formula’s seem to explode memory-wise.

So here is my tip:

Store the x,y,z per ray intersection, instead of trying to store it per x,y,z axis ?

Or perhaps there is a reason why you need to store it per x,y,z axis ? <- which would be a bit weird but ok.

So my last real question:

Why do you need to store all the ray intersections, why not process it on the fly ? For what reason is this not possible ?

So for example each ray is limited to 1000 intersections at most.

float = single; // 32 bit floating point.

TRayIntersection = record
x : float;
y : float;
z : float;

TRayIntersectionArray = array[0…999] of TRayIntersection;

Each ray also needs a counter to keep track of the number of intersections

TRayIntersectionCount = integer;

Then your detector can be something like:
TDetectorRayIntersection = array[0…703, 0…895] of TRayIntersectionArray; // 3d array
TDetectorRayIntersectionCount = array[0…703, 0…895] of TRayIntersectionCount; // 2d array

I hope you know pascal, then you can understand this idea, and if it’s useable then you can translate it to C ;)

Or if idea is not useable then please explain why not ;) :)

I have only one 3D array that is the Object to be reconstructed, and it is always in the GPU, and I only copy it to the CPU at the end of each iteration, to free space for the other operations. This object is composed of 70489660 voxels

The detector itself is not an array. The threads of the kernel “simulate” the detector, and those threads indexes are saved in the array Bin(which is a 1D array containing the detector bin)

Also each ray does not have an intersection array, before I obtain the probabilities of each voxel I have 4 1D arrays, like this:

Bin(Ray)   X        Y      Z

0        0.34     0.14     0

0        0.3451   0.1453   0.124

1        0.68     0.141    0

1        0.6842   0.1473   0.248

So I have 4 arrays, and this way I can associate each ray to each set of intersections and then I go on to calculate the distance between them as long as bin[idx]=bin[idx+1] so in this case, the first distance would be ((0.3451-0.34)² + (0.1453.0.14)² + (0.124-0)²)^0.5 and then I find the voxel like this(0.34 is the scale I am working, in the ZAxis the scale is 1):

0.3451/0.34=1.015 -> Xvoxel= 1      0.1453/0.34=0.42735 -> Yvoxel=0       0.124/1=0.124 -> ZVoxel=0    

Voxel = Xvoxel + Yvoxel*DetectorXdim + ZVoxel*DetectorXDim*DetectorYDim=1

I am not processing it on the fly because to obtain the distances between each voxel I need the coordinates of the n and n+1 intersections.

I am trying to think of someway to this parte of the System Matrix in Thrust, because when I made this I did not know thrust, but the OSEM segment of my code is 100% Thrust. I am thinking of creating an array which contains 0:351 Xbins and another with the Ybins

Ehehe I know zero of Pascal :P The problem with your idea is that is a very different approach from what I have done and that ought to make me change all the second part of my code :/