My simple but speedy reduction code (runs at 106.4GB/s on a GTX 295; 106.4/111.9 = 95.1% of the peak bandwidth)

Attached is my reduction code. It's quite simple and not fully optimized, but the measured performance is already fast and surprised me quite a bit.

The obtained result on my machine (GTX 295) is:

Use device 0: "GeForce GTX 295"

Measured bandwidth=106.4GB/s

Reduce size=135782400, use time=5.105375s, GPU_result=611004689, CPU_result=611004689

Test PASSED

and we know that the theoretical peak of the GTX 295 is 999×2×448/8 = 111.9GB/s per GPU. So the absolute efficiency is about 95%. I don't know what the fastest reduction code in the world is, but this score is clearly faster than the SDK reduction example, which measures about 98-99GB/s on the same machine (only 88% of the theoretical peak).

I don't know whether this code is as efficient on other GPUs as it is on the GTX 295; any comments are appreciated.

Note: The default parameters in the code need about 500MB of global memory, so if you do not have enough memory you should reduce the value of the macro K. Also, the default parameters are tuned for the GTX 295; on other GPUs they might not be optimal and should be tuned manually for the best performance.

(a more efficient code is posted on #13 of this thread)

#include "cutil_inline.h"

#define M		(240*20)

#define N		64			

#define K		442			// Please reduce this value if out of memory

#define SIZE		(M*N*K)

__global__ void reduce1(int *data, int *res) {

	__shared__ int shm[N];

	int inx=blockIdx.x*blockDim.x+threadIdx.x;

	int s=0;

// #pragma unroll 1000						 

	for(int j=inx; j<SIZE; j+=M*N) s+=data[j];

	shm[threadIdx.x]=s;

	__syncthreads();

	if(threadIdx.x==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		res[blockIdx.x]=s;

	}

}

__global__ void reduce2(int *data) {

	__shared__ int shm[N];

	int inx=threadIdx.x;

	int s=0;

	for(int j=inx; j<M; j+=N) s+=data[j];

	shm[threadIdx.x]=s;

	__syncthreads();

	if(inx==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		data[0]=s;

	}

}

void do_reduction(int *data, int *res) {

	reduce1<<<M,N>>>(data, res);

	reduce2<<<1,N>>>(res);

}

int A[SIZE];		// host input buffer

int main() {

	int GPU_result, CPU_result=0, dev=0;

	int *data, *res;

	unsigned int t1;

	srand(2010);

	cudaSetDevice(dev);

	cudaDeviceProp deviceProp;

	cutilSafeCall(cudaGetDeviceProperties(&deviceProp, dev));

	printf("Use device %d: \"%s\"\n", dev, deviceProp.name);

	cutCreateTimer(&t1);

	cutilSafeCall(cudaMalloc((void**)&data, SIZE*sizeof(int)));

	cutilSafeCall(cudaMalloc((void**)&res, SIZE*sizeof(int)/K));

	for(int i=0; i<SIZE; i++) A[i]=rand()%10, CPU_result+=A[i];

	cudaMemcpy(data, A, SIZE*sizeof(int), cudaMemcpyHostToDevice);

	do_reduction(data, res);

	cudaThreadSynchronize();

	cutStartTimer(t1);

	for(int ii=0; ii<1000; ii++) do_reduction(data, res);

	cudaThreadSynchronize();

	cutStopTimer(t1);

	float dt=cutGetTimerValue(t1)/1000.0f;

	cudaMemcpy(&GPU_result, res, sizeof(int), cudaMemcpyDeviceToHost);

	printf("Measured bandwidth=%.1fGB/s\n", 1E-9*SIZE*sizeof(int)*1000/dt);

	printf("Reduce size=%d, use time=%fs, GPU_result=%d, CPU_result=%d\n",

		SIZE, dt, GPU_result, CPU_result);

	printf("Test %s\n", (GPU_result==CPU_result) ? "PASSED" : "FAILED");

}

#pragma unroll 1000 in production code can be a bad idea - what if the dataset is smaller than 1000? Or is not evenly divisible by 1000?

Have you tried removing it? You are definitely memory bound so unrolling loops might not be so necessary.

What is the bandwidth given by the bandwidth tester in SDK? 95% of theoretical peak would be a lot even for a simple 100% coalesced memcopy.

Thanks, you are right. After I removed the "#pragma unroll 1000" statement, the performance of this code rises to 106.5GB/s, which is 0.1GB/s faster.

The bandwidth reported by the SDK bandwidth tester is only about 94.5GB/s on my machine; that is to say, the effective bandwidth of the reduction code above is about 12% higher than it.

Hi, Big Mac. I would like to make several additional points about this topic. (Sorry, my English is not good, but I am trying my best to express my points.)

First, perhaps I cannot agree that reaching 95% of the theoretical peak bandwidth is trivial (on platforms similar to mine), or that a simple coalesced memcopy can get an application to that level. If that were true, why can neither the SDK reduction example nor the SDK bandwidth tester even reach 90%?

As we know, the reduction example in the CUDA SDK is designed to serve as a good optimization example for memory-bound applications (and has been included as an example in many CUDA books). There is an associated document named reduction.pdf provided together with the source in the SDK. I don't know what your first impression of it was, but I was a little shocked when I read it for the first time. From kernel1 to kernel7, after taking advantage of non-divergent branching, bank-conflict-free shared memory access, and so on, the effective bandwidth increases from 2GB/s to above 60GB/s, a speedup of more than 30x. And there is a note "Kernel 7 on 32M elements: 73 GB/s!" on page 34, which is the final result after all optimizations. The theoretical peak of the GPU used in that test is 86.4GB/s, so the best score achieved there is about 84% of the peak. Readers would feel as if this were an impressive performance and that anything higher might be hard to reach.

After reading this example I also supposed the reachable peak bandwidth in a real application was around 85%, because it seemed that every kind of optimization had already been applied. But I thought the code was a little complicated and came with too many restrictions, so I tried to write a simpler version myself. I did not care about efficiency at first, and the code is not fully optimized at all. However, I was surprised by the benchmark result: it easily achieved >90% efficiency while the SDK example could not. I felt a little fooled when I saw this.

Even now, I am not very clear about why the SDK reduction's best performance is so far from the theoretical peak. On my platform, the SDK reduction reaches its best score at size=16M or 32M, and performance suffers if the size is increased further, whereas my code reaches its best score when the size is larger than 100M. Perhaps it is because the SDK reduction uses too much shared memory. We should reduce the use of shared memory and instead use registers as much as possible, shouldn't we? Maybe this is the right way to approach the peak of both bandwidth and Gflops.

Hi

I never said reaching 95% of peak is trivial, not at all :) I meant that the simple memcopy example, which should logically reach as much of the peak as possible, performs worse than your code. But actually it also performs worse than their reduction so what the hell.

By the way, results of my 8800 GTS 512 (compute 1.1 so res[blockIdx.x]=s; is uncoalesced):
bandwidthTest: 50.6 GB/s
SDK reduction (default params): 52 GB/s
Your reduction (K=200): 54.3 GB/s
Theoretical peak: 61 GB/s

Your reduction gives me ~90% of peak, very good result and the code’s pretty.

If I don’t forget I’ll try to run it on a S1070 later.

I'm sorry that I misunderstood the meaning of your sentences due to my poor English. :sweat:

As for the relatively low performance of the SDK bandwidth tester, I think the main reason is that it uses a memory copy to measure bandwidth, while the reduction program only needs memory reads; a copy has to both read and write every byte, so it is perhaps not too strange that pure reads come out faster than a copy.

Very interesting result. I can see that your approach has a more efficient second reduction and is friendlier to the execution pipeline, as it does not require copying data back to the host and so does not need to stop completely to synchronize with the host.
Testing with different input sizes, I can see that your code is significantly slower when the size is less than 16M, about the same speed at 32M, and faster above 32M on the GTX 260.

Thanks for your experiments.

But are you testing different sizes by changing the value of K? When K becomes small, the code loses performance significantly. A better way to test small sizes is to reduce the value of M, which usually gives much better performance.

Thank YOU, cuda210. For interest, I spliced your main() routine into our worker

thread func and ran it simultaneously on the 4 GPUs inside my two GTX 295’s:

bandwidth on device 2 =106.3GB/s (PASSED)

bandwidth on device 3 =106.3GB/s (PASSED)

bandwidth on device 0 =106.3GB/s (PASSED)

bandwidth on device 1 =104.4GB/s (PASSED)

PS: device 1 runs the sci.viz GUI.

PPS: this was after removing the #pragma unroll 1000

Fine cards, these 295’s.

Nigel

I can see your point: by reducing K you reduce the workload of each thread, but that is not what you want for performance; you want to reduce M to keep the workload per thread as high as possible. Still, I can see that the SDK version is faster for small input sizes, but when the input is large, somehow the synchronization becomes costly, and of course shared memory is slower than registers. When the load is high, the output time becomes negligible and is mostly hidden by the time spent reading the input; it does not matter whether you output 128 numbers as the SDK does or M numbers as your approach does, it costs about the same amount of time.

Hi, Linh Ha, just now I have also done a test on different input sizes for my code, however, the result is quite encouraging.

It seems my code can beat the SDK reduction at every input size, provided that the parameters M and K are properly chosen.

Here is a detailed result for different M and K on different input sizes, and the performance for the SDK reduction with the same sizes are also listed.

Note:

  1. These parameters are only optimized for the GTX 295 and may need further tuning on other GPUs (especially the parameter M).

  2. "#pragma unroll 1000" is removed in the following tests.

Result on different input sizes (on a GTX 295):


Input size   SDK reduction   My code (M, N, K)            Speedup

1<<19        48.9GB/s         60.3GB/s  (240, 64, 34)     23.3%
1<<20        65.6GB/s         76.0GB/s  (240, 64, 69)     15.8%
1<<21        79.2GB/s         86.6GB/s  (240, 64, 137)     9.3%
1<<22        89.0GB/s         94.2GB/s  (240, 64, 273)     5.8%
1<<23        94.8GB/s         99.5GB/s  (240, 64, 546)     5.0%
1<<24        97.7GB/s        103.1GB/s  (240, 64, 1092)    5.5%
1<<25        98.7GB/s        104.9GB/s  (240, 64, 2184)    6.3%
1<<26        98.6GB/s        105.8GB/s  (480, 64, 2184)    7.3%
1<<27        97.5GB/s        106.4GB/s  (720, 64, 2912)    9.1%

I don't like the fact that you use one thread to do the final adds, unless simplicity was your intention. Using one thread leaves 7 ALUs unused. I was about to say the 24-cycle read-after-write latency would be significant, but since your thread blocks are small, the scheduler will just switch to another block.

Hi, this is a good idea, thank you very much.

Following your comment, I applied the "multiple adds / thread" optimization described in the SDK reduction documentation to my code, and in the end a surprising performance of 108.0GB/s is obtained (on my GTX 295).

Now the result is 108.0/111.9 = 96.5% of the theoretical peak, 1.6GB/s faster than the original version. The performance for small input sizes is also improved by this revision. Perhaps this value is now very close to the 'real' reachable peak bandwidth.

Revised code:

/* A performance of 108.0GB/s is measured on GTX 295 with M=240*16, N=64, K=810 */

#include "cutil_inline.h"

#define M	(240*16)

#define N	(64)	

#define K	(810)			// Please reduce this value if out of memory

#define SIZE	(M*N*K)

__global__ void reduce1(int *data, int *res) {

	__shared__ int shm[N];

	int inx=blockIdx.x*blockDim.x+threadIdx.x;

	int s=0;

#if K%2==0

	for(int j=inx; j<SIZE; j+=M*N*2) s+=data[j]+data[j+M*N];

#else

	for(int j=inx; j<SIZE; j+=M*N) s+=data[j];

#endif

	shm[threadIdx.x]=s;

	__syncthreads();

	if(threadIdx.x==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		res[blockIdx.x]=s;

	}

}

__global__ void reduce2(int *data) {

	__shared__ int shm[N];

	int inx=threadIdx.x;

	int s=0;

#if M%(2*N)==0

	for(int j=inx; j<M; j+=N*2) s+=data[j]+data[j+N];

#else

	for(int j=inx; j<M; j+=N) s+=data[j];

#endif

	shm[threadIdx.x]=s;

	__syncthreads();

	if(inx==0) {

		int s=0;

		for(int i=0; i<N; i++) s+=shm[i];

		data[0]=s;

	}

}

void do_reduction(int *data, int *res) {

	reduce1<<<M,N>>>(data, res);

	reduce2<<<1,N>>>(res);

}

int A[SIZE];		// host input buffer

int main() {

	int GPU_result, CPU_result=0, dev=0;

	int *data, *res;

	unsigned int t1;

	srand(2010);

	cudaSetDevice(dev);

	cudaDeviceProp deviceProp;

	cutilSafeCall(cudaGetDeviceProperties(&deviceProp, dev));

	printf("Use device %d: \"%s\"\n", dev, deviceProp.name);

	cutCreateTimer(&t1);

	cutilSafeCall(cudaMalloc((void**)&data, SIZE*sizeof(int)));

	cutilSafeCall(cudaMalloc((void**)&res, SIZE*sizeof(int)/K));

	for(int i=0; i<SIZE; i++) A[i]=rand()%10, CPU_result+=A[i];

	cudaMemcpy(data, A, SIZE*sizeof(int), cudaMemcpyHostToDevice);

	do_reduction(data, res);

	cudaThreadSynchronize();

	cutStartTimer(t1);

	for(int ii=0; ii<1000; ii++) do_reduction(data, res);

	cudaThreadSynchronize();

	cutStopTimer(t1);

	float dt=cutGetTimerValue(t1)/1000.0f;

	cudaMemcpy(&GPU_result, res, sizeof(int), cudaMemcpyDeviceToHost);

	printf("Measured bandwidth=%.1fGB/s\n", 1E-9*SIZE*sizeof(int)*1000/dt);

	printf("Reduce size=%d, use time=%fs, GPU_result=%d, CPU_result=%d\n",

		SIZE, dt, GPU_result, CPU_result);

	printf("Test %s\n", (GPU_result==CPU_result) ? "PASSED" : "FAILED");

}

After the 'multiple adds/thread' optimization proved to be very useful, I wanted to find out why "#pragma unroll" doesn't help, since in fact they do the same thing.

I've checked the PTX generated with "#pragma unroll" and found that the performance loss was due to the poor code generated by NVCC. The generated code looks like this:

... ...

	ld.global.s32	 %r323, [%r220+0];

	add.s32	 %r21, %r323, %r21;

	ld.global.s32	 %r324, [%r221+0];

	add.s32	 %r21, %r324, %r21;

	ld.global.s32	 %r325, [%r222+0];

	add.s32	 %r21, %r325, %r21;

	ld.global.s32	 %r326, [%r223+0];

	add.s32	 %r21, %r326, %r21;

	ld.global.s32	 %r327, [%r224+0];

	add.s32	 %r21, %r327, %r21;

	ld.global.s32	 %r328, [%r225+0];

	add.s32	 %r21, %r328, %r21;

	ld.global.s32	 %r329, [%r226+0];

	add.s32	 %r21, %r329, %r21;

	ld.global.s32	 %r330, [%r227+0];

	add.s32	 %r21, %r330, %r21;

	ld.global.s32	 %r331, [%r228+0];

	add.s32	 %r21, %r331, %r21;

	ld.global.s32	 %r332, [%r229+0];

	add.s32	 %r21, %r332, %r21;

	add.s32	 %r4, %r4, 24576000;

	add.s32	 %r229, %r229, 98304000;

	add.s32	 %r228, %r228, 98304000;

	add.s32	 %r227, %r227, 98304000;

	add.s32	 %r226, %r226, 98304000;

	add.s32	 %r225, %r225, 98304000;

	add.s32	 %r224, %r224, 98304000;

	add.s32	 %r223, %r223, 98304000;

	add.s32	 %r222, %r222, 98304000;

	add.s32	 %r221, %r221, 98304000;

	add.s32	 %r220, %r220, 98304000;

	add.s32	 %r219, %r219, 98304000;

	add.s32	 %r218, %r218, 98304000;

	add.s32	 %r217, %r217, 98304000;

	add.s32	 %r216, %r216, 98304000;

	add.s32	 %r215, %r215, 98304000;

	... ...

So the memory-read instructions and the register-add instructions are separated in space, and the memory latency cannot be hidden efficiently. I am confident that "#pragma unroll" would be very powerful (even faster than multiple adds/thread) if NVCC generated better code.

PS: we also found that in general "#pragma unroll N" results in a performance loss with the current generated code, but "#pragma unroll 2" is an exception; it can bring an improvement of about 1.2GB/s over no unrolling, though this is still a little slower than the multiple adds/thread result.

Thanks for sharing. While the result is promising, I really want to find the explanation; it will be more helpful when you want to apply it to future devices.

Interesting, thanks for sharing your results with us.

I’d like to ask, are the “cudaThreadSynchronize()” calls necessary?

Only for benchmarking/timing.

Sorry, I am a newbie to CUDA. Could you please explain how that formula (999×2×448/8) was derived and what its components are?

999 (MHz memory clock) × 2 (DDR multiplier) × 448 (bit width of the memory interface) / 8 (bits per byte) = 111,888 MB/s ≈ 111.9 GB/s

Could you please explain why the ×20 multiplier was used for the constant M? My understanding is that 240 relates to the multiprocessors on the GPU board, but what is the ×20? Does it correlate somehow with the maximum number of active warps per SM (= 24)?