CUDA Floyd algorithm optimization: parallel Floyd algorithm optimization giving incorrect results.

Hi guys! I was trying to implement an optimized version of the Floyd algorithm in CUDA, because my previous [topic=“95970”]implementation[/topic] was slow, so I intended to apply the usual optimization strategies to my case. With this implementation I don’t get the correct results, and even though I thought I was making coalesced memory accesses, the tests show me the opposite. A wrong indexing is probably the problem, but I couldn’t find it, and I have been scrutinizing the code for days, so any help will be greatly appreciated.

This is the kernel launch configuration:

[codebox]dim3 dimBlock;
dim3 dimGrid;

dimBlock.x = 8;
dimBlock.y = 16;

dimGrid.x = (int)ceil((float)N/dimBlock.x);
dimGrid.y = 1;
[/codebox]
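The kernel itself is launched once per intermediate vertex u, so the host drives it with an outer loop. Roughly like this (a sketch only; `d_weightMatrix` and `d_predMatrix` are placeholder names for the device pointers, not the names from my actual code):

[codebox]// Outer Floyd-Warshall loop on the host (sketch).
// Launches on the same stream are implicitly ordered,
// so no synchronization is needed between iterations.
for (unsigned int u = 0; u < N; ++u)
    floydKernel_Loop128<<<dimGrid, dimBlock>>>(d_weightMatrix, d_predMatrix, N, u);
cudaDeviceSynchronize(); // wait for the last iteration before copying back
[/codebox]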

And this is the kernel:

[codebox]__global__ void floydKernel_Loop128(float *weightMatrix, int *predMatrix, unsigned int N, unsigned int u)
{
__shared__ float pred[128];
__shared__ float weightVW[128];
__shared__ float columnVU[8];//blockDim.x
__shared__ float rowUW[16];//blockDim.y

volatile float additionVUW;
volatile unsigned int index;
volatile unsigned int indexVW;
volatile unsigned int indexAux;

volatile unsigned int vStart;
volatile unsigned int w;
volatile unsigned int gV;
volatile unsigned int gU;
volatile unsigned int maxW;

vStart = __mul24(blockDim.x,blockIdx.x);
if(vStart + threadIdx.x < N)
{
	gV = __mul24(vStart,N) + __mul24(threadIdx.x,N);
	//global row u
	gU = __mul24(u,N);
	w = threadIdx.y;
	maxW = N;
	//gmem: globalRow + localColumn
	indexVW = gV + w;
	//smem: localBlockRow + localColumn
	indexAux = __mul24(blockDim.y,threadIdx.x);
	index = indexAux + threadIdx.y;
	__syncthreads();
	for(indexAux = gU + w; w < maxW && index < 128; w+=blockDim.y,indexVW = gV + w,indexAux = gU + w)
	{
		//removing code branching
		//if(v != w && w != u && u != v)

		//copy to smem
		pred[index] = predMatrix[indexVW];
		weightVW[index] = weightMatrix[indexVW];
		rowUW[threadIdx.y] = weightMatrix[indexAux];
		//only 1 transaction per row
		//this is not a coalesced gmem access
		if(threadIdx.x == 0)
			columnVU[threadIdx.x] = weightMatrix[gV + u];
		__syncthreads();
		//make comparison with all
		//the destinations copied by the half warp
		for(volatile unsigned int i = 0; i < blockDim.y; i++)
		{
			additionVUW = columnVU[threadIdx.x] + rowUW[i];
			if(weightVW[index] > additionVUW)
			{
				weightVW[index] = additionVUW;
				pred[index] = u+1;
			}
			__syncthreads();
		}
		__syncthreads();
		//write results back to gmem
		weightMatrix[indexVW] = weightVW[index];
		predMatrix[indexVW] = pred[index];
		__syncthreads();
	}
}
}[/codebox]

The intention was to access global memory with a coalesced pattern in order to fully use the transaction bandwidth; that is why every half warp (the width of the block) accesses contiguous elements in the array, with the exception of one transaction that is the same for the entire row of the block.

Best regards!

Lermy

So far I have found one problem, but there appears to be at least one other thing wrong.

if(threadIdx.x == 0)
	columnVU[threadIdx.x] = weightMatrix[gV + u];

When attempting to copy the column of 8 values (column u for each of the block’s 8 rows), this will copy a single value 16 times, and only into the slot for the first row of the block. And it will do so every time through the loop.

Change to this:

if (w == 0) {
	columnVU[threadIdx.x] = weightMatrix[gV + u];
}

Or better yet, move it outside the loop and make it like this:

if (threadIdx.y == 0) {
	columnVU[threadIdx.x] = weightMatrix[gV + u];
}

As for not coalescing, I think it has to do with having x and y “backwards”, because “adjacent” threads have adjacent threadIdx.x values (except when wrapping to the next threadIdx.y). In your implementation x is the row while y is the column, and you want to load multiple columns in one transaction for the rowUW values. These will never coalesce because, for example, (x=0, y=0) is not adjacent to (x=0, y=1) for coalescing purposes.
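To see it concretely, here is a tiny host-side program (my own illustration, not code from your kernel; N = 1024 is just an assumed matrix width) that prints the element offset each thread of one half warp touches, with x as the row versus y as the row:

[codebox]#include <stdio.h>

/* With dimBlock = (8, 16) the hardware numbers threads as
   tid = threadIdx.x + threadIdx.y * blockDim.x, so one half warp
   covers x = 0..7 for two consecutive y values. */
int main(void)
{
	const int Dx = 8, N = 1024; /* N: assumed matrix width */
	for (int tid = 0; tid < 16; ++tid) {
		int x = tid % Dx, y = tid / Dx;
		printf("tid %2d  x-as-row: %5d   y-as-row: %5d\n",
		       tid,
		       x * N + y,  /* current kernel: neighbours are N apart */
		       y * N + x); /* swapped: neighbours are contiguous */
	}
	return 0;
}
[/codebox]

The first column jumps by N between consecutive threads, a stride-N pattern that cannot coalesce; the second advances by 1, which is exactly what coalescing needs.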

First I’m going to try to figure out where the other bugs are to get correct output; then swapping x for y should help with coalescing. In general I believe it’s best to get something working first, no matter how slow (reloading multiple times, etc.), and then make incremental changes that preserve correctness and improve performance. If correctness breaks, only the latest increment needs to be examined, instead of the whole algorithm.

The other problem comes from the inner loop over i: weightVW[index] holds the distance for this thread’s own column w, but it gets compared against rowUW[i] for every column in the chunk, so d(v,w) is relaxed with d(v,u) + d(u,w') for the wrong columns. Here is a reworked version of the kernel with that loop removed:

[codebox]__global__ void floydKernel_Loop128(float *weightMatrix, int *predMatrix, unsigned int N, unsigned int u)
{
__shared__ float pred[128];
__shared__ float weightVW[128];
__shared__ float columnVU[8];//blockDim.x
__shared__ float rowUW[16];//blockDim.y

const int v = blockDim.x * blockIdx.x + threadIdx.x;
if (v < N) {
	const int gV = v*N;
	const int gU = u*N;
	// index within the block, ranges from 0 to 127
	const int index = blockDim.y * threadIdx.x + threadIdx.y;
	__syncthreads();
	//only 1 transaction per row
	//this is not a coalesced gmem access
	if (threadIdx.y == 0) {
		columnVU[threadIdx.x] = weightMatrix[gV + u];
	}
	for(int w = threadIdx.y; w < N; w+=blockDim.y) {
		const int indexVW = gV + w;
		//copy to smem
		pred[index] = predMatrix[indexVW];
		weightVW[index] = weightMatrix[indexVW];
		if (threadIdx.x == 0) {
			rowUW[threadIdx.y] = weightMatrix[gU + w];
		}
		__syncthreads();

		float additionVUW = columnVU[threadIdx.x] + rowUW[threadIdx.y];
		if (weightVW[index] > additionVUW) {
			weightVW[index] = additionVUW;
			pred[index] = u;
		}
		__syncthreads();
		//write results back to gmem
		weightMatrix[indexVW] = weightVW[index];
		predMatrix[indexVW] = pred[index];
		__syncthreads();
	}
}
}[/codebox]

I don’t have my profiler set up at the moment, so I can’t help with that, but I believe the algorithm is structured to coalesce properly if you swap x for y. You may also need to pad your array so N is a multiple of 16.
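For the padding, a rough host-side sketch (untested; `hWeight` and `dWeight` are placeholder names) could look like this:

[codebox]// Round N up to a multiple of 16 and use that as the row pitch.
// Padding cells should be set to INFINITY (e.g. by a small init
// kernel) so they can never shorten a path.
unsigned int Npad = (N + 15) / 16 * 16;
float *dWeight;
cudaMalloc((void**)&dWeight, (size_t)Npad * Npad * sizeof(float));
// copy the N x N host matrix into the padded device layout
cudaMemcpy2D(dWeight, Npad * sizeof(float),
             hWeight, N * sizeof(float),
             N * sizeof(float), N, cudaMemcpyHostToDevice);
// the kernel would then take Npad as the row stride and N as the vertex count
[/codebox]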

Jamie K, thanks for looking at the code. You were right in pointing out that I should move the copy outside of the loop; in a previous version of the code I was copying another value that incremented with the loop, and when I changed it I didn’t realize that I could do the copy just once. I also fixed the condition so that only the first thread in every row of the thread block makes the copy.

I also agree with you that it’s better to have something working first and then try to optimize it later. Thanks a lot for your time.

JK, don’t worry about the u+1; that comes from the index I use to access the intermediary in the array. In the old implementation it coincided with the start of the loop, but now I use a list in which the first element is accessed by 1, so I add 1 since the loop starts at 0.

About the kernel code, I made some changes to the code you posted. First, the variable rowUW is now 128 elements; in it I keep the length of the path from vertex u to w for every element that each row of the block accesses. Second, I fixed it as you advised and switched the x and y values, so that y is the row and x is the column for coalescing purposes. I don’t know how I got it backwards, my bad.

And well, the code still gives me wrong results, but it is running 7x faster than the CPU for thousands of vertices in the graph. Thanks a lot for correcting me; it’s already much better than my previous code, which produced correct results but was only 2x faster. In this one I also had to invert x and y.

Tonight I won’t sleep until I find the little thing that is messing up the results of this implementation.

Here I leave you the code:

[codebox]
__global__ void floydKernel_Loop128(float *weightMatrix, int *predMatrix, unsigned int N, unsigned int u)
{
__shared__ float pred[128];
__shared__ float weightVW[128];
__shared__ float columnVU[8];
__shared__ float rowUW[128];

const int v = __mul24(blockDim.y,blockIdx.x) + threadIdx.y;
if (v < N)
{
	const int gV = __mul24(v,N);
	const int gU = __mul24(u,N);
	// index within the block, ranges from 0 to 127
	const int index = __mul24(blockDim.x,threadIdx.y) + threadIdx.x;
	//only 1 transaction per row
	//this is not a coalesced gmem access
	if (threadIdx.x == 0)
		columnVU[threadIdx.y] = weightMatrix[gV + u];
	__syncthreads();
	for(volatile int w = threadIdx.x; w < N; w+=blockDim.x)
	{
		const int indexVW = gV + w;
		//copy to smem
		pred[index] = predMatrix[indexVW];
		weightVW[index] = weightMatrix[indexVW];
		rowUW[index] = weightMatrix[gU + w];
		__syncthreads();
		float additionVUW = columnVU[threadIdx.y] + rowUW[index];
		if (weightVW[index] > additionVUW)
		{
			weightVW[index] = additionVUW;
			pred[index] = u+1;
		}
		__syncthreads();
		//write results back to gmem
		weightMatrix[indexVW] = weightVW[index];
		predMatrix[indexVW] = pred[index];
		__syncthreads();
	}
}
}[/codebox]

I also attached an image, just for a better understanding, showing the type of indexing I’m aiming for.

Best regards!

Lermy

Another thing you may want to try is keeping a shared variable that indicates whether any of the cells have been updated. If none have, you can skip writing back to memory; if that happens most of the time, it will save bandwidth and give a corresponding boost in speed. You can also skip loading and saving the pred matrix if none of the cells need updating.

In other words:

[codebox]__shared__ int thisChunkUpdated;

...

for(volatile int w = threadIdx.x; w < N; w+=blockDim.x)
{
	const int indexVW = gV + w;
	thisChunkUpdated = 0;
	//copy to smem
	weightVW[index] = weightMatrix[indexVW];
	rowUW[index] = weightMatrix[gU + w];
	__syncthreads();

	int thisThreadUpdated = 0;
	float additionVUW = columnVU[threadIdx.y] + rowUW[index];
	if (weightVW[index] > additionVUW)
	{
		thisThreadUpdated = 1;
		thisChunkUpdated = 1;
		weightVW[index] = additionVUW;
	}
	__syncthreads();

	if (thisChunkUpdated) {
		pred[index] = predMatrix[indexVW];
		if (thisThreadUpdated) {
			pred[index] = u+1;
		}
		//write results back to gmem
		weightMatrix[indexVW] = weightVW[index];
		predMatrix[indexVW] = pred[index];
	}
	__syncthreads();
}
[/codebox]

(I haven’t tested this; I’m just writing it out to demonstrate the idea.)

If the cells that need updating are sparse, this could theoretically improve the speed by up to 4x. If you wanted to get really fancy, you could save or skip on a row-by-row basis, which would increase the likelihood of being able to skip.
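For example, the loop body might become something like this (again untested, same caveat as above; it assumes the swapped layout where blockDim.y == 8, and that pred[index] has already been loaded from predMatrix earlier in the iteration, as in the kernel above):

[codebox]// One flag per block row, so each row of 16 threads decides
// independently whether its write-back can be skipped.
__shared__ int rowUpdated[8]; // blockDim.y assumed to be 8

// inside the w-loop, after loading weightVW, rowUW and pred:
rowUpdated[threadIdx.y] = 0;
__syncthreads();

float additionVUW = columnVU[threadIdx.y] + rowUW[index];
if (weightVW[index] > additionVUW)
{
	weightVW[index] = additionVUW;
	pred[index] = u+1;
	rowUpdated[threadIdx.y] = 1; // benign race: every writer stores 1
}
__syncthreads();

if (rowUpdated[threadIdx.y])
{
	// only rows that actually changed pay for the global write-back
	weightMatrix[indexVW] = weightVW[index];
	predMatrix[indexVW] = pred[index];
}
__syncthreads();
[/codebox]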