Floyd on CUDA -- why so slow?

I have written a program to implement the Floyd-Warshall algorithm.
The serial CPU version costs O(n^3) time; code below:
for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            type temp = m[i * N + k] + m[k * N + j];
            if (m[i * N + j] > temp)
                m[i * N + j] = temp;
        }

my CUDA code below:
//Upper is a 2048*2048 int matrix in global memory
//k is the loop counter; this kernel must be launched 2048 times
//N = 2048
__global__ void UpperGPUMB(type *Upper, int k)
{
    int r, c;
    type temp;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N * N;
         i += gridDim.x * blockDim.x)
    {
        r = i / N;
        c = i - r * N;
        temp = Upper[r * N + k] + Upper[k * N + c];
        if (Upper[i] > temp)
            Upper[i] = temp;
    }
}
With 6 blocks and 512 threads per block on a 6800GT, it takes about 31 s; however, a 4-core 2.0 GHz CPU running the serial algorithm needs only about 12 s. The matrix is 2048*2048, int.

I discovered that if I disable "Upper[i] = temp" (the write to global memory), the kernel is quite fast, taking only 16 ms. That suggests CUDA spends most of its time writing to global memory.

Okay, my question is: how can I improve this program?
Thanks

Actually, if you comment out that line, the dead code analysis in the compiler probably optimized away the entire loop, leaving an empty kernel.

Just glancing at the indexing, I imagine all of the time is being spent on this line:

temp = Upper[r * N + k] + Upper[k * N + c];

These accesses will be uncoalesced, which will reduce your total memory bandwidth substantially. (And your entire kernel is memory bound, so this is the main bottleneck.)

Matrix algorithms that read data in a non-linear fashion are a little tricky to implement fast, unfortunately. You need to find a way to read chunks of data from Upper in a coalesced way, probably into shared memory. As a starter example of how this can be done, check out the matrixMul project in the CUDA SDK.
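For illustration, here is a minimal sketch of the kind of coalesced staging seibert describes: cooperatively loading row k into shared memory (a sketch, not the matrixMul code; it reuses the `type` alias and `N` from the kernel above):

__shared__ type row[N];
//consecutive threads read consecutive addresses of row k,
//so each warp's 32 loads coalesce into a few wide transactions
for (int j = threadIdx.x; j < N; j += blockDim.x)
    row[j] = Upper[k * N + j];
__syncthreads(); //make row[] visible to the whole block before use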

When I comment out "Upper[i] = temp" (yes, maybe the compiler eliminated the whole if), it only took 16 ms. Does that mean "temp = Upper[r * N + k] + Upper[k * N + c]" doesn't cost much?

If you have implemented Floyd, you know temp needs row k and column k, so in my opinion it is bound to access global memory uncoalescedly. Maybe I can read row k and column k into shared memory first.

The Floyd algorithm is quite interesting to implement on CUDA. If you are interested, have a try.

Thanks for your advice!

As seibert pointed out, coalescing your memory accesses will be the first step. Coalescing can improve performance by a factor of ~20. If you can’t coalesce reads, you can use a texture for reading, but you still need to coalesce writes.

Additionally, 6 blocks won't utilize the full potential of your device. If possible, try to find a way to break up the calculation into 100s or 1000s of blocks in each kernel call.
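For example, a launch sized so that each thread handles one matrix element might look like this (a sketch; 256 threads per block is an assumed choice, and the grid-stride loop in the kernel above works unchanged):

int threadsPerBlock = 256;
//one element per thread: 16384 blocks for N = 2048, well under the 65535-block grid limit
int numBlocks = (N * N + threadsPerBlock - 1) / threadsPerBlock;
UpperGPUMB<<<numBlocks, threadsPerBlock>>>(Upper, k);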

Based on my practical experience, too many blocks reduce efficiency. For example, the 8600GT has 4 multiprocessors, so 8 or more blocks can't improve the program (unless there is a lot of read-write waiting or __syncthreads() cost). For this program, it slowed down with more than 10 blocks.

The code below is a version designed for coalesced writes; however, it is even slower. Can you tell me why? Thanks!

__global__ void UpperGPUMBAdvanced(type *Upper, int k)
{
    int r, count, i;
    type temp = 0;
    //each thread writes N consecutive Upper[i] values, as you can see from i++ and count++
    for (i = blockIdx.x * N * 256 + threadIdx.x * N, count = 0; count < N; i++, count++)
    {
        r = i / N;
        temp = Upper[r * N + k] + Upper[k * N + count];
        if (Upper[i] > temp)
            Upper[i] = temp;
    }
}

Actually, I’m thinking that commenting out that line disables the entire for loop, since it no longer has any side effects. You can use the --ptx flag to nvcc to see the PTX assembly code generated in both cases.
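For example, assuming the kernel lives in a file named floyd.cu (the filename is hypothetical):

nvcc --ptx floyd.cu -o floyd.ptx

Diffing the PTX generated with and without the write will show whether the loop body survived.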

I think there are some problems hiding here, too. Calling an empty kernel takes 0.16 ms, so 2048 launches would need about 327 ms; however, when I comment out that line, it only took 16 ms. Thanks for your opinion, I will check it carefully.

Still a question: how can I write to global memory faster, besides coalescing?

Well, I was assuming a 16-multiprocessor board in that advice. With a reasonable ~4 blocks per multiprocessor you need at least 64 blocks to fully saturate such a device. As you increase above this, performance scales linearly, so there is never a reason to run fewer blocks that each do more work.

You can’t. Memory is uncached and coalescing is the only way to get decent memory write performance.
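To make the pattern concrete, here is a sketch of the difference (illustrative only; data and value are placeholder names):

//coalesced: consecutive threads in a warp touch consecutive addresses,
//so the hardware combines them into a few wide transactions
data[blockIdx.x * blockDim.x + threadIdx.x] = value;

//uncoalesced: consecutive threads are N elements apart,
//so each write becomes its own transaction
data[threadIdx.x * N] = value;

Note that the "advanced" kernel above has the second pattern: each thread walks its own row, so neighboring threads in a warp are always a full row apart, which is likely why it ran even slower.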

Okay, I have written a high-efficiency program on CUDA.

Given a 2048*2048 int matrix, it runs 2.27x faster than the CPU (8600GT vs. a 4-core 2.0 GHz CPU) (3.9 s vs. 13 s).

Thanks to seibert, you are quite cool. You rapidly pointed out the key problem.

temp = Upper[r * N + k] + Upper[k * N + c];

Yes, this line cost the most time, so I moved those reads into shared memory.

Upper[i] = temp;

Besides, when I comment out this line, NVCC really does seem to make the whole loop empty.

Here is my Floyd code; it is a successful version on CUDA.

If someone can make it better, please tell me.

//64 blocks, 512 threads per block
//each thread iterates 128 times
//a block processes 32 rows, 2048 ints per row
__global__ void UpperMBWRSharedwithoutWR(type *Upper1, int k)
{
    int r, c, i;
    int limit = (blockIdx.x + 1) * 32 * N;
    type temp;
    __shared__ type row[N];   //row k of the matrix
    __shared__ type col[32];  //the 32 entries of column k for this block's rows

    //thread 0 loads this block's slice of column k
    if (threadIdx.x == 0)
    {
        for (i = 0; i < 32; i++)
        {
            col[i] = Upper1[(blockIdx.x * 32 + i) * N + k];
        }
    }
    __syncthreads();

    //all threads cooperatively load row k with coalesced reads
    for (i = threadIdx.x; i < N; i += blockDim.x)
    {
        row[i] = Upper1[k * N + i];
    }
    __syncthreads();

    //relax every element of this block's 32 rows using shared memory
    for (int i = blockIdx.x * N * 32 + threadIdx.x; i < limit; i += blockDim.x)
    {
        r = i / N;
        c = i - r * N;
        temp = row[c] + col[r & 31];
        if (Upper1[i] > temp)
            Upper1[i] = temp;
    }
}

PS: this is just the kernel; it needs a host-side "shell function" to launch it 2048 times.
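A minimal sketch of such a wrapper, assuming d_Upper is already allocated and initialized in device memory (the function name is hypothetical):

void floydGPU(type *d_Upper)
{
    //one launch per k-iteration; the gap between kernel launches
    //acts as the global barrier that Floyd-Warshall requires
    for (int k = 0; k < N; k++)
        UpperMBWRSharedwithoutWR<<<64, 512>>>(d_Upper, k);
    cudaThreadSynchronize(); //wait for the final launch to complete
}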

A quick improvement is to use 32 rather than 1 thread in the first if-statement. This will get rid of the serial for-loop and issue the 32 reads concurrently.
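Concretely, the first if-block in the kernel above could become something like this (a sketch of that suggestion; the existing __syncthreads() after it stays in place):

//threads 0..31 each fetch one element of column k concurrently
if (threadIdx.x < 32)
    col[threadIdx.x] = Upper1[(blockIdx.x * 32 + threadIdx.x) * N + k];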

Also, you can fetch data through a 1D texture bound to the gmem pointer, writing back to gmem. This is safe in this case, even though the texture cache does not see global-memory writes, since the Floyd-Warshall algorithm is correct whether you read the ik and kj distances from the previous step or ones updated by the current one. Since the cache is invalidated on every kernel launch, you will only get one of these two values either way.
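A sketch of that texture variant using the (legacy, era-appropriate) texture reference API; upperTex and d_Upper are placeholder names, and type is assumed to be int:

texture<int, 1, cudaReadModeElementType> upperTex; //file-scope texture reference

//host side, once after allocating d_Upper:
//  cudaBindTexture(0, upperTex, d_Upper, N * N * sizeof(int));

//device side: reads go through the texture cache,
//writes still go through the plain global-memory pointer
temp = tex1Dfetch(upperTex, r * N + k) + tex1Dfetch(upperTex, k * N + c);
if (Upper[i] > temp)
    Upper[i] = temp;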

Are you computing just the upper triangle on the CPU, or the entire matrix (directed graph)? Also, is the CPU time for a multi-threaded version or just one thread?

Do you have CUDA time on a G80? If not, post your code (preferably a single self-contained .cu file).

Paulius

Your sequential code is of order N^3, but your parallel code only handles N^2.

First of all, are you getting correct results?

The kernel is called O(N) times with O(N) threads, and each thread has an O(N) loop. (Roughly speaking)

I see… the kernel is called N times. I missed that. Thanks for pointing that out.

Yes, the kernel needs to be launched N times; you can put it in a for loop over N. If you have any problems, I can send you my complete code.

As paulius said, when I used 32 threads to read in parallel, it sped up a little, from 3.24x to 3.39x. However, it does not improve the program significantly.

I want to modify it in this way:

for (i = threadIdx.x; i < N; i += blockDim.x)
{
    row[i] = Upper1[k * N + i];
}

Currently each thread reads one int at a time. Okay, I will try to let each thread read 32 bytes at a time; maybe it will speed things up a little. When I have results, I will post them immediately. :)
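One way to widen the reads is CUDA's built-in int4 vector type, which loads 16 bytes per thread per instruction. A sketch (assumes type is int, N is divisible by 4, and the row start is 16-byte aligned, which holds for N = 2048):

//each thread copies four ints at a time from row k into shared memory
int4 *src = (int4 *)&Upper1[k * N];
int4 *dst = (int4 *)row;
for (i = threadIdx.x; i < N / 4; i += blockDim.x)
    dst[i] = src[i];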

I used a simpler version with textures.

cpu

int d_dim = 1234; // square matrix row size

...

dim3 dimBlock;
dim3 dimGrid;
dimBlock.x = ( d_dim < 16 ? d_dim : 16 );
dimBlock.y = dimBlock.x;
dimGrid.x  = (int)ceil( d_dim / (float)dimBlock.x );
dimGrid.y  = dimGrid.x;
int percent = 0; // (unused in this excerpt)

for( int k = 0; k < d_dim; k++ )
{
    floyd<<< dimGrid, dimBlock >>>( k );
}

gpu

__global__ void floyd( int k )
{
    unsigned int i = __mul24( blockDim.x, blockIdx.x ) + threadIdx.x;
    unsigned int j = __mul24( blockDim.y, blockIdx.y ) + threadIdx.y;
    if( i != j && i < d_dim_c && j < d_dim_c ) // d_dim_c is in constant memory
    {
        D_d_cPtr[ getDevDistPos( i, j ) ] =
            devMin( tex1Dfetch( D_t, getDevDistPos( i, j ) ),
                    ( i == k ? 0 : tex1Dfetch( D_t, getDevDistPos( i, k ) ) ),
                    ( k == j ? 0 : tex1Dfetch( D_t, getDevDistPos( k, j ) ) ) );
    }
}

D_d_cPtr is a pointer in constant memory to the distance matrix in global memory. D_t is the corresponding texture reference.

getDevDistPos is only needed in my project because I have to use a special data organization. devMin is an alternative to the normal min function, which treats negative values as positive infinities.
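For reference, here is a guess at what such a devMin could look like, reconstructed from that description (the poster's actual helper may differ); it computes the Floyd update min(dij, dik + dkj) with negative values standing in for positive infinity:

//returns min(dij, dik + dkj), where a negative distance means unreachable
__device__ int devMin(int dij, int dik, int dkj)
{
    if (dik < 0 || dkj < 0)  //no path through k
        return dij;
    int viaK = dik + dkj;    //candidate path i -> k -> j
    if (dij < 0)             //i -> j previously unreachable
        return viaK;
    return min(dij, viaK);
}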

IMHO the speed is satisfying.

The only disadvantage is that the overall system becomes very slow.

I noticed some loop-iteration slowdown when k reached 1500 with a 4500-node network matrix, but I think that is a problem of uncoalesced memory accesses caused by my data organization.

Can somebody explain the overall system performance loss?

Question moved to this [topic=“95970”]Topic[/topic].