Slow Performance

Hello,

I am a newbie trying to add some vectors, which I assumed would be too simple to require any optimization. I think I was wrong about that: I am only getting about 5 GFLOPS on a GTX 280. Here is my kernel code:

#define BLOCK_DIM 512

__global__ void vectorAdd(float* iA,float* iB,float* oC){

	int idx=threadIdx.x+blockDim.x*blockIdx.x;

	oC[idx]=iA[idx]+iB[idx];
}

It is called in the main program by:

int gridSize=size/BLOCK_DIM;

dim3 grid(gridSize, 1, 1);

dim3 threads(BLOCK_DIM, 1, 1);

vectorAdd<<< grid, threads >>>(d_array1, d_array2, d_array3);

Any advice on how to get more speed would be appreciated; even half of the theoretical max would cause me to do cartwheels.

Thanks in advance.

What you wrote isn't a test of floating-point throughput; it's a test of global memory bandwidth and latency.
You're entirely dominated by the time it takes to read from and write to global memory.
High FLOPS usually means you're doing most of your work in registers.

Steve is right. I'll add that if you want to see how optimally you are utilizing the device for that kernel, count up the total number of bytes read and written and divide by the elapsed time to get a GiB/s transfer rate. On a GTX 280, you should be able to sustain roughly 100 GiB/s.
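For what it's worth, here is a minimal host-side sketch of that measurement, assuming the same d_array1/d_array2/d_array3 buffers and an element count N from your host code (the names ms, bytes and gib are just illustrative):

// Time the kernel with CUDA events and convert to an effective GiB/s figure.
// Assumes N elements and N divisible by BLOCK_DIM.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
vectorAdd<<< N / BLOCK_DIM, BLOCK_DIM >>>(d_array1, d_array2, d_array3);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

// Two 4-byte loads and one 4-byte store per element.
double bytes = 3.0 * 4.0 * (double)N;
double gib = bytes / (ms * 1.0e-3) / (1024.0 * 1024.0 * 1024.0);
printf("effective bandwidth: %.1f GiB/s\n", gib);

cudaEventElapsedTime reports milliseconds, hence the 1.0e-3 factor.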

Thanks very much for the replies, guys. Assuming each term in the kernel expression is 4 bytes, I figure I am getting roughly 60 GB/s: 4 bytes * 3 terms * the 5 GFLOPS I get from the speed test. Would that be an accurate account of memory throughput?

Is there a different way to code this problem to achieve high flops, rather than being dominated by the global memory read/writes?

Thanks again.

Actually, you weren't wrong. Your first assumption was right: you got pretty close to maxing out your memory bandwidth (which for larger kernels is by no means trivial).

If you want to max out the FLOPS instead, then you’ll need to think of a different problem than adding vectors.

Huh, well that’s not great news. Can you be any more specific about the types of problems that can go fast? In layman’s terms?

Could I not somehow dump the array to individual shared memory on the individual blocks to avoid the global bottleneck?

Thanks again.

It is all a matter of how much calculation you do per read/write. If you do only an addition, you have 1 FLOP per read/write, which is not a lot. If you do 50 FLOPs for every read/write, you reach a high FLOP count, but the kernel will likely take longer than one that does only an addition. FLOP count is not so interesting; running time is.

Well, first of all, 60 GB/s throughput is fast. Many algorithms running on the CPU are also bandwidth limited, and the corresponding memory bandwidths for the CPU are an order of magnitude lower than for the GPU. For example, system RAM bandwidth is usually around 5 GB/s (about the bandwidth of PCIe 2.0), while the GPU's RAM is about as fast as a modern CPU's L2 cache, and in the case of the GTX 280, as fast as L1 cache (!). So by all means, that's fast. FLOPS aren't the only way of measuring performance.

But if you want to go really fast, the algorithms that will benefit most from GPGPU are the ones that do a lot of computations on data in-between memory accesses. That is:

  1. Read data from global memory to either shared memory or registers,
  2. Perform dozens or hundreds of arithmetic operations on data in shared memory or registers
  3. Store the output in global memory

Vector addition is basically one arithmetic operation and 3 memory operations (load A, load B, store C).

If you want to know how many arithmetic operations you need to really max out the GPU’s FLOPS capability, check this post [url=“http://forums.nvidia.com/index.php?s=&showtopic=45752&view=findpost&p=250179”]http://forums.nvidia.com/index.php?s=&...st&p=250179[/url]
This benchmark is (like most benchmarks) utterly silly and over the top, but you can get the general idea: if you want to top out the FLOPS, do a lot of floating point operations and as few memory operations as possible.
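To make the idea concrete, a hypothetical toy kernel along these lines (not the benchmark from that post; the names flopsToy, seed and iters are made up) keeps the ALUs busy with multiply-adds in registers and touches global memory only once per thread:

// Hypothetical FLOPS toy: lots of multiply-adds in registers, one store per thread.
__global__ void flopsToy(float* out, float seed, int iters){
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	float a = seed + idx;
	for (int i = 0; i < iters; ++i){
		a = a * 1.000001f + 0.5f;   // 2 FLOPs per iteration, no memory traffic
	}
	out[idx] = a;                   // single global write keeps the loop from being optimized away
}

With a few thousand iterations the single store is noise, so the measured rate approaches the arithmetic peak rather than the memory bandwidth.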

The little secret is: GPUs are also exceedingly good at tasks that get bounded by memory bandwidth and most real kernels are such. All those projects on [url=“http://www.nvidia.com/object/cuda_home.html”]http://www.nvidia.com/object/cuda_home.html[/url] reporting 50x better performance than CPU - they are probably all bandwidth capped. The bottom line is always running time, as E.D. Riedijk mentioned.

Thanks guys that makes a lot of sense. I’m guessing using

iA[idx]+=iB[idx];

is also technically 3 memory ops (load A, load B, store A), as it yielded no improvement and was maybe even slower.

I guess I’ll have to work on a cluster system to get my speeds up from here, unless there is something else someone can suggest.

Thanks again.

I was able to get about 50% more speed on kernel execution using short rather than float; luckily, my application doesn't require great accuracy. I would still like to squeeze out more speed if anyone has any more pointers.

Thanks,

What is your application actually trying to do? Vector-vector operations are always slow. Vector-matrix and matrix-matrix are much faster.

P.S. if you use short2, you’ll get another boost, and char4 will be even better.
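In case it helps, a short2 version of the kernel might look roughly like this (just a sketch: it assumes the number of elements is even, the device pointers are reinterpreted as short2*, and the grid is sized for half as many elements):

// Sketch: each thread adds one short2, i.e. two shorts per memory transaction.
__global__ void vectorAddShort2(short2* iA, short2* iB, short2* oC){
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	short2 a = iA[idx];
	short2 b = iB[idx];
	short2 c;
	c.x = (short)(a.x + b.x);
	c.y = (short)(a.y + b.y);
	oC[idx] = c;
}

The win comes from moving 4 bytes per load instead of 2, so each thread does more useful work per memory transaction; char4 pushes the same idea to 4 values per load.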

Thanks again Alex,
It's mainly time history data analysis, so my intuitive inclination was towards 1D arrays. Are you suggesting that if I have a routine that takes my data from 1D to 2D before crunching, and then back out on the other side, I can achieve much greater speed? That would be pretty interesting. While I don't need float-type accuracy, short2 may be too coarse, but it's perhaps worth a shot.

Well, I should say that if you just add two matrices, it will be just as slow. But many other operations give more opportunity to optimize.

In essence, look for memory locality. See if you can read a block of data (e.g. into shared memory) and do several things with it. "Time history data analysis" sounds like it could be amenable to that.

How can I effectively load a block of 1D data into shared memory? I understand the concept: if some global data of BLOCK_DIM size can be loaded in just one read into shared memory to be accessed by all the threads, then the ratio of operations to global reads goes up by the block size. I'm a bit confused as to how this gets implemented at the kernel level. If this read and the associated load into shared memory is placed in the kernel, how does the compiler know not to do it in each and every thread, and to just read it once when a new block launches? My understanding thus far is that the kernel executes for each and every thread. My attempt at looping these reads in the kernel, in similar fashion to the matrixMul example in the programming guide, slowed things way down.

Because one thread doesn't load the whole shared memory, just a piece. When all the threads run, they each load their piece. When they're done, they call __syncthreads().

Make sure you’re still doing coalescing.

So each thread will not overwrite the shared block-sized array, even though it is defined from scratch in the kernel each time? I have tried the following two methods, and I'm not really sure how the operations are being scheduled.

Some are on the block level and some on the thread level?

I thought it might require an if clause that would only fill the shared memory array when a new block started (when the block ID changed), while using it as already loaded the rest of the time, but it sounds like that is not the case. I'm assuming this is still coalesced, as the memory locations should be sequential, though clearly I'm still a bit in the dark on the entire coalescing concept beyond sequentially addressed memory and block sizes in multiples of half-warps.

Very slow method, adapted from the matrix multiply example:

#define BLOCK_DIM 512

__global__ void vectorAdd(short* iA,short* iB,short* oC){

	int start=blockIdx.x*BLOCK_DIM;

	int finish=start+BLOCK_DIM-1;

	__shared__ short iAs[BLOCK_DIM];

	__shared__ short iBs[BLOCK_DIM];

	int idx=threadIdx.x+blockDim.x*blockIdx.x;

	//__shared__ short iCs[BLOCK_DIM]

	for (int i=start;i<finish+1;++i){

		iAs[threadIdx.x]=iA[i];

		iBs[threadIdx.x]=iB[i];

	}

	__syncthreads();

	//oC[idx]=iA[idx]+iB[idx];

	oC[idx]=iAs[threadIdx.x]+iBs[threadIdx.x];

}

Second attempt, about 70% of the speed of the original non-shared version:

__global__ void vectorAdd2(short* iA,short* iB,short* oC){

	__shared__ short iAs[BLOCK_DIM];

	__shared__ short iBs[BLOCK_DIM];

	int idx=threadIdx.x+blockDim.x*blockIdx.x;

	iAs[threadIdx.x]=iA[idx];

	iBs[threadIdx.x]=iB[idx];

	__syncthreads();

	//oC[idx]=iA[idx]+iB[idx];

	oC[idx]=iAs[threadIdx.x]+iBs[threadIdx.x];

}

If only the C vector changes, you could try using a texture for A and B; reading through the texture cache can give a great improvement in speed.

The texture cache isn't very useful here. Textures are good versus uncoalesced global loads, but give no improvement versus coalesced global loads.

iAs[threadIdx.x]=iA[threadIdx.x+blockDim.x*blockIdx.x];

is better than

iAs[threadIdx.x]=iA[i];

because in the first case, the read from iA is coalesced.

However, the data you’re loading into shared mem is being used only by the thread that loads it. There’s no sharing of data, and so using shared memory doesn’t help (just adds a little overhead). (You can’t accelerate a simple vector add. But you can look at your algorithm and figure out a better optimization.)
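As an illustration of a kernel where shared memory does pay off, here is a rough sketch of a 3-point moving average, the kind of thing time-history analysis might plausibly use (hypothetical example; it reuses the existing BLOCK_DIM define, simplifies the edge handling, and assumes the grid covers all n elements):

// Hypothetical 3-point moving average: each element staged in shared memory
// is read by up to three different threads, unlike the vector add.
__global__ void movingAvg3(const float* in, float* out, int n){
	__shared__ float s[BLOCK_DIM + 2];               // one halo element on each side

	int idx = threadIdx.x + blockDim.x * blockIdx.x;

	// Each thread stages its own element (coalesced), shifted by one for the halo.
	s[threadIdx.x + 1] = (idx < n) ? in[idx] : 0.0f;

	// The first and last threads of the block also fetch the halo elements.
	if (threadIdx.x == 0)
		s[0] = (idx > 0) ? in[idx - 1] : 0.0f;
	if (threadIdx.x == blockDim.x - 1)
		s[BLOCK_DIM + 1] = (idx + 1 < n) ? in[idx + 1] : 0.0f;

	__syncthreads();

	if (idx < n)
		out[idx] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
}

Each global element is read roughly once but used up to three times, so the ratio of arithmetic to memory traffic goes up; with longer filters the reuse factor grows accordingly.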

I guess it’s a bit confusing if you want to look at it 100% logically, but simply writing:

__global__ void myKernel(){

__shared__ int smemarray[10];

}

guarantees that smemarray is allocated just once per block and is shared by everyone.

So [i] and [threadIdx.x+blockDim.x*blockIdx.x] don't represent the same int? It seems like they would, if threadIdx.x started at zero and incremented by 1 for every new block, though I've missed the obvious before.

So shared memory only helps out when the data is used more than once, that makes perfect sense; was just hopeful that it wasn’t the case.

Oh, I see what you're doing. No, actually, your code is wrong:

int start=blockIdx.x*BLOCK_DIM;

	int finish=start+BLOCK_DIM-1;

	for (int i=start;i<finish+1;++i){

		iAs[threadIdx.x]=iA[i];

		iBs[threadIdx.x]=iB[i];

	}

As you can see, you're writing BLOCK_DIM different values to the same location, constantly overwriting the old with the new. It's equivalent to this code:

int start=blockIdx.x*BLOCK_DIM;

	int finish=start+BLOCK_DIM-1;

	iAs[threadIdx.x]=iA[finish];

	iBs[threadIdx.x]=iB[finish];

Which I’m sure wasn’t your intent.