Slow Performance

Hello,

I am a newbie trying to add some vectors, which I assumed would be too simple to require any optimization. I think I was wrong about that: I am only getting about 5 GFLOPS on a GTX 280. Here is my kernel code:

#define BLOCK_DIM 512

__global__ void vectorAdd(float* iA,float* iB,float* oC){

	int idx=threadIdx.x+blockDim.x*blockIdx.x;

	oC[idx]=iA[idx]+iB[idx];
}

It is called in the main program by:

int gridSize=size/BLOCK_DIM;

dim3 grid(gridSize, 1, 1);

dim3 threads(BLOCK_DIM, 1, 1);

vectorAdd<<< grid, threads >>>(d_array1, d_array2, d_array3);

Any advice on how to get more speed would be appreciated; even half of the theoretical max would cause me to do cartwheels.

Thanks in advance.

What you wrote isn't a test of floating-point throughput; it's a test of global memory bandwidth and latency.
You're entirely dominated by the time it takes to read from and write to global memory.
High FLOPS usually means you're doing most of your work in registers.

Steve is right. I'll add that if you want to see how optimally you are utilizing the device for that kernel, count up the total number of bytes read and written and divide by the elapsed time to get a GiB/s transfer rate. On a GTX 280, you should be able to sustain roughly 100 GiB/s.
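For what it's worth, here is a minimal host-side sketch of that measurement, assuming the same d_array1/d_array2/d_array3 buffers and an element count N from your host code (the names ms, bytes and gib are just illustrative):

// Time the kernel with CUDA events and convert to an effective GiB/s figure.
// Assumes N elements and N divisible by BLOCK_DIM.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
vectorAdd<<< N / BLOCK_DIM, BLOCK_DIM >>>(d_array1, d_array2, d_array3);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

// Two 4-byte loads and one 4-byte store per element.
double bytes = 3.0 * 4.0 * (double)N;
double gib = bytes / (ms * 1.0e-3) / (1024.0 * 1024.0 * 1024.0);
printf("effective bandwidth: %.1f GiB/s\n", gib);

cudaEventElapsedTime reports milliseconds, hence the 1.0e-3 factor.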

Thanks very much for the replies, guys. Assuming each term in the kernel expression is 4 bytes, I figure I am getting roughly 60 GB/s: 4 bytes * 3 terms * the 5 GFLOPS I get from the speed test. Would that be an accurate account of memory throughput?

Is there a different way to code this problem to achieve high flops, rather than being dominated by the global memory read/writes?

Thanks again.

Actually, you weren't wrong. Your first assumption was right: you got pretty close to maxing out your memory bandwidth (which for larger kernels is by no means trivial).

If you want to max out the FLOPS instead, then you’ll need to think of a different problem than adding vectors.

Huh, well that’s not great news. Can you be any more specific about the types of problems that can go fast? In layman’s terms?

Could I not somehow dump the array to individual shared memory on the individual blocks to avoid the global bottleneck?

Thanks again.

It is all a matter of how much calculation you do per read/write. If you do only an addition, you have 1 FLOP per read/write, which is not a lot. If you do 50 FLOPs for every read/write, you reach a high FLOP count, but the kernel will likely take longer than one that does only an addition. FLOP count is not so interesting; running time is.

Well, first of all, 60 GB/s throughput is fast. Many algorithms running on the CPU are also bandwidth limited, and the corresponding memory bandwidths for the CPU are an order of magnitude lower than for the GPU. For example, system RAM bandwidth is usually around 5 GB/s (about the bandwidth of PCIe 2.0), while the GPU's RAM is about as fast as a modern CPU's L2 cache, and in the case of the GTX 280, as fast as L1 cache (!). So by all means, that's fast. FLOPS aren't the only way of measuring performance.

But if you want to go really fast, the algorithms that will benefit most from GPGPU are the ones that do a lot of computations on data in-between memory accesses. That is:

  1. Read data from global memory to either shared memory or registers,
  2. Perform dozens or hundreds of arithmetic operations on data in shared memory or registers
  3. Store the output in global memory

Vector addition is basically one arithmetic operation and 3 memory operations (load A, load B, store C).

If you want to know how many arithmetic operations you need to really max out the GPU’s FLOPS capability, check this post [url=“http://forums.nvidia.com/index.php?s=&showtopic=45752&view=findpost&p=250179”]http://forums.nvidia.com/index.php?s=&...st&p=250179[/url]
This benchmark is (like most benchmarks) utterly silly and over the top, but you can get the general idea: if you want to top out the FLOPS, do a lot of floating point operations and as few memory operations as possible.
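To make the idea concrete, a hypothetical toy kernel along these lines (not the benchmark from that post; the names flopsToy, seed and iters are made up) keeps the ALUs busy with multiply-adds in registers and touches global memory only once per thread:

// Hypothetical FLOPS toy: lots of multiply-adds in registers, one store per thread.
__global__ void flopsToy(float* out, float seed, int iters){
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	float a = seed + idx;
	for (int i = 0; i < iters; ++i){
		a = a * 1.000001f + 0.5f;   // 2 FLOPs per iteration, no memory traffic
	}
	out[idx] = a;                   // single global write keeps the loop from being optimized away
}

With a few thousand iterations the single store is noise, so the measured rate approaches the arithmetic peak rather than the memory bandwidth.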

The little secret is: GPUs are also exceedingly good at tasks that get bounded by memory bandwidth and most real kernels are such. All those projects on [url=“http://www.nvidia.com/object/cuda_home.html”]http://www.nvidia.com/object/cuda_home.html[/url] reporting 50x better performance than CPU - they are probably all bandwidth capped. The bottom line is always running time, as E.D. Riedijk mentioned.

Thanks guys that makes a lot of sense. I’m guessing using

iA[idx]+=iB[idx];

is also technically 3 memory ops (load A, load B, store A), as it yielded no improvement and was maybe even slower.

I guess I’ll have to work on a cluster system to get my speeds up from here, unless there is something else someone can suggest.

Thanks again.

I was able to get about 50% more speed on kernel execution using short rather than float; luckily, my application doesn't require great accuracy. I would still like to squeeze out more speed if anyone has any more pointers.

Thanks,

What is your application actually trying to do? Vector-vector operations are always slow. Vector-matrix and matrix-matrix are much faster.

P.S. if you use short2, you’ll get another boost, and char4 will be even better.
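In case it helps, a short2 version of the kernel might look roughly like this (just a sketch: it assumes the number of elements is even, the device pointers are reinterpreted as short2*, and the grid is sized for half as many elements):

// Sketch: each thread adds one short2, i.e. two shorts per memory transaction.
__global__ void vectorAddShort2(short2* iA, short2* iB, short2* oC){
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	short2 a = iA[idx];
	short2 b = iB[idx];
	short2 c;
	c.x = (short)(a.x + b.x);
	c.y = (short)(a.y + b.y);
	oC[idx] = c;
}

The win comes from moving 4 bytes per load instead of 2, so each thread does more useful work per memory transaction; char4 pushes the same idea to 4 values per load.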

Thanks again Alex,
It's mainly time history data analysis, so my intuitive inclination was towards 1D arrays. Are you suggesting that if I have a routine that takes my data from 1D to 2D before crunching, and then back out on the other side, I can achieve much greater speed? That would be pretty interesting. While I don't need float-type accuracy, short2 may be too coarse, but it's perhaps worth a shot.

Well, I should say that if you just add two matrices, it will be just as slow. But many other operations give more opportunity to optimize.

In essence, look for memory locality. See if you can read a block of data (e.g. into shared memory) and do several things with it. "Time history data analysis" sounds like it could be amenable to that.

How can I effectively load a block of 1D data into shared memory? I understand the concept: if some global data of BLOCK_DIM size can be loaded in just one read into shared memory to be accessed by all the threads, then the ratio of operations to global reads goes up by the block size. I'm a bit confused as to how this gets implemented at the kernel level. If this read and the associated load into shared memory is placed in the kernel, how does the compiler know not to do it in each and every thread, and to just read it once when a new block launches? My understanding thus far is that the kernel executes for each and every thread. My attempt at looping these reads in the kernel, in similar fashion to the matrixMul example in the programming guide, slowed things way down.

Because one thread doesn't load the whole shared memory, just a piece. When all the threads run, they each load their piece. When they're done, they call __syncthreads().

Make sure you’re still doing coalescing.

So each thread will not overwrite the shared block-sized array, even though it is defined from scratch in the kernel each time? I have tried the following two methods, and I'm not really sure how the operations are being scheduled.

Some are on the block level and some on the thread level?

I thought it might require an if clause that would only fill the shared memory array when a new block started (when the block ID changed), while using it as already loaded the rest of the time, but it sounds like that is not the case. I'm assuming this is still coalesced, as the memory locations should be sequential, though clearly I'm still a bit in the dark on the entire coalescing concept beyond sequentially addressed memory and block sizes in multiples of half-warps.

Very slow method, adapted from the matrix multiply example:

#define BLOCK_DIM 512

__global__ void vectorAdd(short* iA,short* iB,short* oC){

	int start=blockIdx.x*BLOCK_DIM;

	int finish=start+BLOCK_DIM-1;

	__shared__ short iAs[BLOCK_DIM];

	__shared__ short iBs[BLOCK_DIM];

	int idx=threadIdx.x+blockDim.x*blockIdx.x;

	//__shared__ short iCs[BLOCK_DIM]

	for (int i=start;i<finish+1;++i){

		iAs[threadIdx.x]=iA[i];

		iBs[threadIdx.x]=iB[i];

	}

	__syncthreads();

	//oC[idx]=iA[idx]+iB[idx];

	oC[idx]=iAs[threadIdx.x]+iBs[threadIdx.x];

}

Second attempt, about 70% of the speed of the original non-shared version:

__global__ void vectorAdd2(short* iA,short* iB,short* oC){

	__shared__ short iAs[BLOCK_DIM];

	__shared__ short iBs[BLOCK_DIM];

	int idx=threadIdx.x+blockDim.x*blockIdx.x;

	iAs[threadIdx.x]=iA[idx];

	iBs[threadIdx.x]=iB[idx];

	__syncthreads();

	//oC[idx]=iA[idx]+iB[idx];

	oC[idx]=iAs[threadIdx.x]+iBs[threadIdx.x];

}

If only the C vector changes, you could try using a texture for A and B; reading through the texture cache can give a great improvement in speed.

The texture cache isn't very useful here. Textures are good versus uncoalesced global loads, but give no improvement versus coalesced global loads.

iAs[threadIdx.x]=iA[threadIdx.x+blockDim.x*blockIdx.x];

is better than

iAs[threadIdx.x]=iA[i];

because in the first case, the read from iA is coalesced.

However, the data you’re loading into shared mem is being used only by the thread that loads it. There’s no sharing of data, and so using shared memory doesn’t help (just adds a little overhead). (You can’t accelerate a simple vector add. But you can look at your algorithm and figure out a better optimization.)
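As an illustration of a kernel where shared memory does pay off, here is a rough sketch of a 3-point moving average, the kind of thing time-history analysis might plausibly use (hypothetical example; it reuses the existing BLOCK_DIM define, simplifies the edge handling, and assumes the grid covers all n elements):

// Hypothetical 3-point moving average: each element staged in shared memory
// is read by up to three different threads, unlike the vector add.
__global__ void movingAvg3(const float* in, float* out, int n){
	__shared__ float s[BLOCK_DIM + 2];               // one halo element on each side

	int idx = threadIdx.x + blockDim.x * blockIdx.x;

	// Each thread stages its own element (coalesced), shifted by one for the halo.
	s[threadIdx.x + 1] = (idx < n) ? in[idx] : 0.0f;

	// The first and last threads of the block also fetch the halo elements.
	if (threadIdx.x == 0)
		s[0] = (idx > 0) ? in[idx - 1] : 0.0f;
	if (threadIdx.x == blockDim.x - 1)
		s[BLOCK_DIM + 1] = (idx + 1 < n) ? in[idx + 1] : 0.0f;

	__syncthreads();

	if (idx < n)
		out[idx] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
}

Each global element is read roughly once but used up to three times, so the ratio of arithmetic to memory traffic goes up; with longer filters the reuse factor grows accordingly.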

I guess it’s a bit confusing if you want to look at it 100% logically, but simply writing:

__global__ void myKernel(){

__shared__ int smemarray[10];

}

guarantees that smemarray is allocated just once per block and is shared by everyone.

So [i] and [threadIdx.x+blockDim.x*blockIdx.x] don't represent the same int? It seems like they would, if threadIdx.x started at zero and incremented by 1 for every new block, though I've missed the obvious before.

So shared memory only helps out when the data is used more than once, that makes perfect sense; was just hopeful that it wasn’t the case.

Oh, I see what you're doing. No, actually, your code is wrong:

int start=blockIdx.x*BLOCK_DIM;

	int finish=start+BLOCK_DIM-1;

	for (int i=start;i<finish+1;++i){

		iAs[threadIdx.x]=iA[i];

		iBs[threadIdx.x]=iB[i];

	}

As you can see, you're writing BLOCK_DIM different values to the same location, constantly overwriting the old with the new. It's equivalent to this code:

int start=blockIdx.x*BLOCK_DIM;

	int finish=start+BLOCK_DIM-1;

	iAs[threadIdx.x]=iA[finish];

	iBs[threadIdx.x]=iB[finish];

Which I’m sure wasn’t your intent.