Hi guys! I was trying to implement an optimized version of the Floyd algorithm in CUDA due to the slow of my previous [topic=“95970”]implementation[/topic], so I intended to use the optimization strategies and apply it to my case, but with this implementation I don’t get the correct results and even when I thought I was making coalesce memory access the test show me the opposite, probably a wrong indexation could be the problem but I couldn’t find it, and have being scrutinizing the code for days, so any help will be greatly appreciated.

This is the kernel launch configuration:

[codebox]dim3 dimBlock;

dim3 dimGrid;

dimBlock.x = 8;

dimBlock.y = 16;

dimGrid.x = (int)ceil((float)N/dimBlock.x);

dimGrid.y = 1;

[/codebox]

And this is the kernel:

[codebox]**global** void floydKernel_Loop128(float *weightMatrix, int *predMatrix,unsigned int N,unsigned int u)

{

```
__shared__ float pred[128];
__shared__ float weightVW[128];
__shared__ float columnVU[8];//blockDim.x
__shared__ float rowUW[16];//blockDim.y
volatile float additionVUW;
volatile unsigned int index;
volatile unsigned int indexVW;
volatile unsigned int indexAux;
volatile unsigned int vStart;
volatile unsigned int w;
volatile unsigned int gV;
volatile unsigned int gU;
volatile unsigned int maxW;
vStart = __mul24(blockDim.x,blockIdx.x);
if(vStart + threadIdx.x < N)
{
gV = __mul24(vStart,N) + __mul24(threadIdx.x,N);
//global row u
gU = __mul24(u,N);
w = threadIdx.y;
maxW = N;
//gmem: globalRow + localColumn
indexVW = gV + w;
//smem: localBlockRow + localColumn
indexAux = __mul24(blockDim.y,threadIdx.x);
index = indexAux + threadIdx.y;
__syncthreads();
for(indexAux = gU + w; w < maxW && index < 128; w+=blockDim.y,indexVW = gV + w,indexAux = gU + w)
{
//removing code branching
//if(v != w && w != u && u != v)
//copy to smem
pred[index] = predMatrix[indexVW];
weightVW[index] = weightMatrix[indexVW];
rowUW[threadIdx.y] = weightMatrix[indexAux];
//only 1 transaction per row
//this is not a coalesced gmem access
if(threadIdx.x == 0)
columnVU[threadIdx.x] = weightMatrix[gV + u];
__syncthreads();
//make comparison with all
//the destinations copied by the half warp
for(volatile unsigned int i = 0; i < blockDim.y; i++)
{
additionVUW = columnVU[threadIdx.x] + rowUW[i];
if(weightVW[index] > additionVUW)
{
weightVW[index] = additionVUW;
pred[index] = u+1;
}
__syncthreads();
}
__syncthreads();
//write results back to gmem
weightMatrix[indexVW] = weightVW[index];
predMatrix[indexVW] = pred[index];
__syncthreads();
}
}
```

}[/codebox]

The intention was to access global memory with a coalesced pattern in order to fully use the transaction bandwidth, that is why every half warp(the width of the block) access a contiguous element in the array, with the exception of 1 transaction that is the same for the entire row in the block.

Best regards!

Lermy