My simple but speedy reduction code (runs at 106.4GB/s on a GTX 295; 106.4/111.9 = 95.1% of the peak bandwidth)

Attached is my reduction code. It's quite simple and not fully optimized, but the measured performance is already fast and surprised me quite a bit.

The obtained result on my machine (GTX 295) is:

Use device 0: "GeForce GTX 295"

Measured bandwidth=106.4GB/s

Reduce size=135782400, use time=5.105375s, GPU_result=611004689, CPU_result=611004689

Test PASSED

and we know that the theoretical peak of the GTX 295 is 999×2×448/8 = 111.9GB/s per GPU. So the absolute efficiency is about 95%. I don't know what the fastest reduction code in the world is, but this score is clearly faster than the SDK reduction example, which measures about 98-99GB/s on the same machine (only 88% of the theoretical peak).

I don't know whether this code is as efficient on other GPUs as it is on the GTX 295; any comments are appreciated.

Note: The default parameters in the code need about 500MB of global memory, so if you do not have enough memory you should reduce the value of the macro K. Also, the default parameters are tuned for the GTX 295; on other GPUs they might not be optimal and should be tuned manually for the best performance.

(a more efficient code is posted on #13 of this thread)

#include "cutil_inline.h"

#define M		(240*20)

#define N		64			

#define K		442			// Please reduce this value if out of memory

#define SIZE		(M*N*K)

__global__ void reduce1(int *data, int *res) {

	__shared__ int shm[N];

	int inx=blockIdx.x*blockDim.x+threadIdx.x;

	int s=0;

// #pragma unroll 1000						 

	for(int j=inx; j<SIZE; j+=M*N) s+=data[j];

	shm[threadIdx.x]=s;

	__syncthreads();

	if(threadIdx.x==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		res[blockIdx.x]=s;

	}

}

__global__ void reduce2(int *data) {

	__shared__ int shm[N];

	int inx=threadIdx.x;

	int s=0;

	for(int j=inx; j<M; j+=N) s+=data[j];

	shm[threadIdx.x]=s;

	__syncthreads();

	if(inx==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		data[0]=s;

	}

}

void do_reduction(int *data, int *res) {

	reduce1<<<M,N>>>(data, res);

	reduce2<<<1,N>>>(res);

}

int A[SIZE];		// host input buffer

int main() {

	int GPU_result, CPU_result=0, dev=0;

	int *data, *res;

	unsigned int t1;

	srand(2010);

	cudaSetDevice(dev);

	cudaDeviceProp deviceProp;

	cutilSafeCall(cudaGetDeviceProperties(&deviceProp, dev));

	printf("Use device %d: \"%s\"\n", dev, deviceProp.name);

	cutCreateTimer(&t1);

	cutilSafeCall(cudaMalloc((void**)&data, SIZE*sizeof(int)));

	cutilSafeCall(cudaMalloc((void**)&res, SIZE*sizeof(int)/K));

	for(int i=0; i<SIZE; i++) A[i]=rand()%10, CPU_result+=A[i];

	cudaMemcpy(data, A, SIZE*sizeof(int), cudaMemcpyHostToDevice);

	do_reduction(data, res);

	cudaThreadSynchronize();

	cutStartTimer(t1);

	for(int ii=0; ii<1000; ii++) do_reduction(data, res);

	cudaThreadSynchronize();

	cutStopTimer(t1);

	float dt=cutGetTimerValue(t1)/1000.0f;

	cudaMemcpy(&GPU_result, res, sizeof(int), cudaMemcpyDeviceToHost);

	printf("Measured bandwidth=%.1fGB/s\n", 1E-9*SIZE*sizeof(int)*1000/dt);

	printf("Reduce size=%d, use time=%fs, GPU_result=%d, CPU_result=%d\n",

		SIZE, dt, GPU_result, CPU_result);

	printf("Test %s\n", (GPU_result==CPU_result) ? "PASSED" : "FAILED");

}

#pragma unroll 1000 in production code can be a bad idea - what if the dataset is smaller than 1000? Or is not evenly divisible by 1000?

Have you tried removing it? You are definitely memory bound so unrolling loops might not be so necessary.

What is the bandwidth given by the bandwidth tester in SDK? 95% of theoretical peak would be a lot even for a simple 100% coalesced memcopy.

Thanks, you are right. After I removed the "#pragma unroll 1000" statement, the performance of this code rises to 106.5GB/s, which is 0.1GB/s faster.

The bandwidth reported by the SDK bandwidth tester is only about 94.5GB/s on my machine; that is to say, the effective bandwidth of the reduction code above is about 12% higher than it.

Hi, Big Mac. I would like to make several additional points about this topic. (Sorry, my English is not good, but I am trying my best to express my points.)

First, perhaps I cannot agree that reaching 95% of the theoretical peak bandwidth is trivial (on platforms similar to mine), or that a simple coalesced memcopy can get an application to that level. If that were true, why can neither the SDK reduction example nor the SDK bandwidth tester even reach 90%?

As we know, the reduction example in the CUDA SDK is designed to serve as a good optimization example for memory-bound applications (and has been included as an example in many CUDA books). There is an associated document named reduction.pdf provided together with the source in the SDK. I don't know what your first impression of it was, but I was a little shocked when I read it for the first time. From kernel1 to kernel7, after taking advantage of non-divergent branching, bank-conflict-free shared memory access, and so on, the effective bandwidth increases from 2GB/s to above 60GB/s, a speedup of more than 30x. And there is a note "Kernel 7 on 32M elements: 73 GB/s!" on page 34, which is the final result after all optimizations. The theoretical peak of the GPU used in that test is 86.4GB/s, so the best score achieved there is about 84% of the peak. Readers would feel as if this were an impressive performance and that anything higher might be hard to reach.

After reading this example I also supposed the reachable peak bandwidth in a real application was around 85%, because it seemed that every kind of optimization had already been applied. But I thought the code was a little complicated and came with too many restrictions, so I tried to write a simpler version myself. I did not care about efficiency at first, and the code is not fully optimized at all. However, I was surprised by the benchmark result: it easily achieved >90% efficiency while the SDK example could not. I felt a little fooled when I saw this.

Even now, I am not very clear about why the SDK reduction's best performance is so far from the theoretical peak. On my platform, the SDK reduction reaches its best score at size=16M or 32M, and performance suffers if the size is increased further, whereas my code reaches its best score when the size is larger than 100M. Perhaps it is because the SDK reduction uses too much shared memory. We should reduce the use of shared memory and instead use registers as much as possible, shouldn't we? Maybe this is the right way to approach the peak of both bandwidth and Gflops.

Hi

I never said reaching 95% of peak is trivial, not at all :) I meant that the simple memcopy example, which should logically reach as much of the peak as possible, performs worse than your code. But actually it also performs worse than their reduction so what the hell.

By the way, results of my 8800 GTS 512 (compute 1.1 so res[blockIdx.x]=s; is uncoalesced):
bandwidthTest: 50.6 GB/s
SDK reduction (default params): 52 GB/s
Your reduction (K=200): 54.3 GB/s
Theoretical peak: 61 GB/s

Your reduction gives me ~90% of peak, very good result and the code’s pretty.

If I don’t forget I’ll try to run it on a S1070 later.

I'm sorry that I misunderstood the meaning of your sentences due to my poor English. :sweat:

As for the relatively low performance of the SDK bandwidth tester, I think the main reason is that it uses a memory copy to measure bandwidth, while the reduction program only needs memory reads; a copy has to both read and write every byte, so it is perhaps not too strange that pure reads come out faster than a copy.

Very interesting result. I can see that your approach has a more efficient second reduction and is friendlier to the execution pipeline, as it does not require copying data back to the host and so does not need to stop completely to synchronize with the host.
Testing with different input sizes, I can see that your code is significantly slower when the size is less than 16M, about the same speed at 32M, and faster above 32M on the GTX 260.

Thanks for your experiments.

But are you testing different sizes by changing the value of K? When K becomes small, the code loses performance significantly. A better way to test small sizes is to reduce the value of M, which usually gives much better performance.

Thank YOU, cuda210. For interest, I spliced your main() routine into our worker

thread func and ran it simultaneously on the 4 GPUs inside my two GTX 295’s:

bandwidth on device 2 =106.3GB/s (PASSED)

bandwidth on device 3 =106.3GB/s (PASSED)

bandwidth on device 0 =106.3GB/s (PASSED)

bandwidth on device 1 =104.4GB/s (PASSED)

PS: device 1 runs the sci.viz GUI.

PPS: this was after removing the #pragma unroll 1000

Fine cards, these 295’s.

Nigel

I can see your point: by reducing K you reduce the workload of each thread, but that is not what you want for performance; you want to reduce M to keep the workload per thread as high as possible. Still, I can see that the SDK version is faster for small input sizes, but when the input is large, somehow the synchronization becomes costly, and of course shared memory is slower than registers. When the load is high, the output time becomes negligible and is mostly hidden by the time spent reading the input; it does not matter whether you output 128 numbers as the SDK does or M numbers as your approach does, it costs about the same amount of time.

Hi, Linh Ha, just now I have also done a test on different input sizes for my code, however, the result is quite encouraging.

It seems my code can beat the SDK reduction at every input size, provided that the parameters M and K are properly chosen.

Here is a detailed result for different M and K on different input sizes, and the performance for the SDK reduction with the same sizes are also listed.

Note:

  1. These parameters are only optimized for the GTX 295 and may need further tuning on other GPUs (especially the parameter M).

  2. "#pragma unroll 1000" is removed in the following tests.

Result on different input sizes (on a GTX 295):


Input size   SDK reduction   My code (M, N, K)            Speedup

1<<19        48.9GB/s         60.3GB/s  (240, 64, 34)     23.3%
1<<20        65.6GB/s         76.0GB/s  (240, 64, 69)     15.8%
1<<21        79.2GB/s         86.6GB/s  (240, 64, 137)     9.3%
1<<22        89.0GB/s         94.2GB/s  (240, 64, 273)     5.8%
1<<23        94.8GB/s         99.5GB/s  (240, 64, 546)     5.0%
1<<24        97.7GB/s        103.1GB/s  (240, 64, 1092)    5.5%
1<<25        98.7GB/s        104.9GB/s  (240, 64, 2184)    6.3%
1<<26        98.6GB/s        105.8GB/s  (480, 64, 2184)    7.3%
1<<27        97.5GB/s        106.4GB/s  (720, 64, 2912)    9.1%

I don't like the fact that you use one thread to do the final adds, unless simplicity was your intention. Using one thread leaves 7 ALUs unused. I was about to say the 24-cycle read-after-write latency would be significant, but since your thread blocks are small, the scheduler will just switch to another block.

Hi, this is a good idea, thank you very much.

Following your comment, I applied the "multiple adds / thread" optimization described in the SDK reduction documentation to my code, and in the end a surprising performance of 108.0GB/s is obtained (on my GTX 295).

Now the result is 108.0/111.9 = 96.5% of the theoretical peak, 1.6GB/s faster than the original version. The performance for small input sizes is also improved by this revision. Perhaps this value is now very close to the 'real' reachable peak bandwidth.

Revised code:

/* A performance of 108.0GB/s is measured on GTX 295 with M=240*16, N=64, K=810 */

#include "cutil_inline.h"

#define M	(240*16)

#define N	(64)	

#define K	(810)			// Please reduce this value if out of memory

#define SIZE	(M*N*K)

__global__ void reduce1(int *data, int *res) {

	__shared__ int shm[N];

	int inx=blockIdx.x*blockDim.x+threadIdx.x;

	int s=0;

#if K%2==0

	for(int j=inx; j<SIZE; j+=M*N*2) s+=data[j]+data[j+M*N];

#else

	for(int j=inx; j<SIZE; j+=M*N) s+=data[j];

#endif

	shm[threadIdx.x]=s;

	__syncthreads();

	if(threadIdx.x==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		res[blockIdx.x]=s;

	}

}

__global__ void reduce2(int *data) {

	__shared__ int shm[N];

	int inx=threadIdx.x;

	int s=0;

#if M%(2*N)==0

	for(int j=inx; j<M; j+=N*2) s+=data[j]+data[j+N];

#else

	for(int j=inx; j<M; j+=N) s+=data[j];

#endif

	shm[threadIdx.x]=s;

	__syncthreads();

	if(inx==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		data[0]=s;

	}

}

void do_reduction(int *data, int *res) {

	reduce1<<<M,N>>>(data, res);

	reduce2<<<1,N>>>(res);

}

int A[SIZE];		// host input buffer

int main() {

	int GPU_result, CPU_result=0, dev=0;

	int *data, *res;

	unsigned int t1;

	srand(2010);

	cudaSetDevice(dev);

	cudaDeviceProp deviceProp;

	cutilSafeCall(cudaGetDeviceProperties(&deviceProp, dev));

	printf("Use device %d: \"%s\"\n", dev, deviceProp.name);

	cutCreateTimer(&t1);

	cutilSafeCall(cudaMalloc((void**)&data, SIZE*sizeof(int)));

	cutilSafeCall(cudaMalloc((void**)&res, SIZE*sizeof(int)/K));

	for(int i=0; i<SIZE; i++) A[i]=rand()%10, CPU_result+=A[i];

	cudaMemcpy(data, A, SIZE*sizeof(int), cudaMemcpyHostToDevice);

	do_reduction(data, res);

	cudaThreadSynchronize();

	cutStartTimer(t1);

	for(int ii=0; ii<1000; ii++) do_reduction(data, res);

	cudaThreadSynchronize();

	cutStopTimer(t1);

	float dt=cutGetTimerValue(t1)/1000.0f;

	cudaMemcpy(&GPU_result, res, sizeof(int), cudaMemcpyDeviceToHost);

	printf("Measured bandwidth=%.1fGB/s\n", 1E-9*SIZE*sizeof(int)*1000/dt);

	printf("Reduce size=%d, use time=%fs, GPU_result=%d, CPU_result=%d\n",

		SIZE, dt, GPU_result, CPU_result);

	printf("Test %s\n", (GPU_result==CPU_result) ? "PASSED" : "FAILED");

}

After the 'multiple adds/thread' optimization proved to be very useful, I wanted to find out why "#pragma unroll" doesn't help, since in fact they do the same thing.

I've checked the PTX generated with "#pragma unroll" and found that the performance loss was due to the poor code generated by NVCC. The generated code looks like this:

... ...

	ld.global.s32	 %r323, [%r220+0];

	add.s32	 %r21, %r323, %r21;

	ld.global.s32	 %r324, [%r221+0];

	add.s32	 %r21, %r324, %r21;

	ld.global.s32	 %r325, [%r222+0];

	add.s32	 %r21, %r325, %r21;

	ld.global.s32	 %r326, [%r223+0];

	add.s32	 %r21, %r326, %r21;

	ld.global.s32	 %r327, [%r224+0];

	add.s32	 %r21, %r327, %r21;

	ld.global.s32	 %r328, [%r225+0];

	add.s32	 %r21, %r328, %r21;

	ld.global.s32	 %r329, [%r226+0];

	add.s32	 %r21, %r329, %r21;

	ld.global.s32	 %r330, [%r227+0];

	add.s32	 %r21, %r330, %r21;

	ld.global.s32	 %r331, [%r228+0];

	add.s32	 %r21, %r331, %r21;

	ld.global.s32	 %r332, [%r229+0];

	add.s32	 %r21, %r332, %r21;

	add.s32	 %r4, %r4, 24576000;

	add.s32	 %r229, %r229, 98304000;

	add.s32	 %r228, %r228, 98304000;

	add.s32	 %r227, %r227, 98304000;

	add.s32	 %r226, %r226, 98304000;

	add.s32	 %r225, %r225, 98304000;

	add.s32	 %r224, %r224, 98304000;

	add.s32	 %r223, %r223, 98304000;

	add.s32	 %r222, %r222, 98304000;

	add.s32	 %r221, %r221, 98304000;

	add.s32	 %r220, %r220, 98304000;

	add.s32	 %r219, %r219, 98304000;

	add.s32	 %r218, %r218, 98304000;

	add.s32	 %r217, %r217, 98304000;

	add.s32	 %r216, %r216, 98304000;

	add.s32	 %r215, %r215, 98304000;

	... ...

So the memory-read instructions and the register-add instructions are separated in space, and the memory latency cannot be hidden efficiently. I am confident that "#pragma unroll" would be very powerful (even faster than multiple adds/thread) if NVCC generated better code.

PS: we also found that in general "#pragma unroll N" results in a performance loss with the current generated code, but "#pragma unroll 2" is an exception; it can bring an improvement of about 1.2GB/s over no unrolling, though this is still a little slower than the multiple adds/thread result.

Thanks for sharing. While the result is promising, I really want to find the explanation; it will be more helpful when you want to apply it to future devices.

Interesting, thanks for sharing your results with us.

I’d like to ask, are the “cudaThreadSynchronize()” calls necessary?

Only for benchmarking/timing.

Sorry, I am a newbie to CUDA. Could you please explain how that formula (999×2×448/8) was derived and what its components are?

999 (MHz memory clock) × 2 (DDR multiplier) × 448 (bit width of the memory interface) / 8 (bits per byte) = 111,888 MB/s ≈ 111.9 GB/s

Could you please explain why the ×20 multiplier was used for the constant M? My understanding is that 240 relates to the multiprocessors on the GPU board, but what is the ×20? Does it correlate somehow with the maximum number of active warps per SM (= 24)?