latency and throughput of MAD operation?

LSChien · December 10, 2009, 8:54am

Dear all:

From Volkov’s paper,

Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra. In SC â€™08: Preceedings of the 2008 ACM/IEEE conference on Supercomputing.

Piscataway, NJ, USA, 2008, IEEE Press.

( can be obtained in the thread http://forums.nvidia.com/index.php?showtopic=89084 )

authors show

(1) average time per instruction in cycles (a, b, c are register) for GPUs in Table 1

Table 1:

(2) average time per instruction in cycles (a, b: register, s[i]: shared memory) in Table 2

Table 2:

I focus on latency and throughput of three instructions on TeslaC1060,

one is “a = a b", one is "a = as[i] + c” and another is “a = a * b + c”.

from decuda program,

“a = a *b” is translated to “MUL dest src1 src2”

“a = a*s[i] + c” is translated to “MAD dest [smem] src2 src3”

“a = a*b + c” is translated to “MAD dest src1 src2 src3” and

From my experiment (show later), I conclude

(1) “a = a*b” has latency = 24.7 cycle and throughput 2.5 ~ 4.1 cycle per warp

(this number is very strange, I will show experimental data later)

(2) “a = a*s[i] + c” has latency 34.6 cycle and throughput 6 cycle per warp.

(this number is good and matches result in Table 2)

(3) “a = a*b + c” has latency 31.5 cycle and throughput = ? (maybe 4 cycle per warp)

latency = 31.5 cycle is larger than 24 cycle in Table 1,

Does anyone have verified these numbers?

The following is my experiment on TeslaC1060.

Step 1: calibrate “a = a*b”

The “CODE 1” sets “a” as register A_reg and “b” as register b_reg and

compute “A_reg = A_reg * b_reg ;” 256 times.

execution configuration is grid(1,1,1) and threads(NUM_THREADS, 1, 1)

where macro NUM_THREADS = number of threads per block.

CODE 1: kernel of “a = a *b”

static __global__ void MAD_latency(float * data,  timing_stats * timings)

{		

	int threadNum = threadIdx.x;  

  unsigned int start_time = 1664;

  unsigned int end_time = 1664;

	float A_reg = data[0];

	float b_reg = data[1];

	__syncthreads();

	start_time = clock();

#pragma unroll 

	for( int j = 0; j < 256; j++){			

		A_reg = A_reg * b_reg ;

	}

	end_time = clock();

	__syncthreads();

	timings[threadNum].start_time = start_time;

	timings[threadNum].end_time = end_time;	

	if ( 0 == threadNum ){	

		timings[NUM_THREADS].start_time = (int)A_reg;

	}

}

and “DECUDA 1” shows decuda’s result of

start_time = clock();

#pragma unroll 

	for( int j = 0; j < 256; j++){			

		A_reg = A_reg * b_reg ;

	}

	end_time = clock();

	__syncthreads();

DECUDA 1: “a = a *b” → “mul $r3 $r3 $r2”

when kernel’s execution, we compute average time over 256 per thread (see “code 2”) and

reports minimum time and maximum time in Table 3.

CODE 2: compute average number of cycles per “a = a*b”

dim3  grid(1, 1, 1);

	dim3  threads(NUM_THREADS, 1, 1);

	MAD_latency<<< grid, threads >>>(data_gpu, timings_gpu);

	cudaThreadSynchronize();

	CUDA_SAFE_CALL(cudaMemcpy(timings_cpu, timings_gpu, (NUM_THREADS+1)*sizeof(timing_stats), cudaMemcpyDeviceToHost));

// show all time report 

  int min_time = timings_cpu[0].end_time - timings_cpu[0].start_time; 

  int max_time = min_time;

	for( int i = 0; i < NUM_THREADS; i++ ){

		int time_gpu = timings_cpu[i].end_time - timings_cpu[i].start_time;		

		if ( min_time > time_gpu ) { min_time = time_gpu; }

		if ( max_time < time_gpu ) { max_time = time_gpu; }

	}

Table 3: result of “a = a*b”

From Table 3, NUM_THREADS=1 reports latency = 24.7 cycle, this is consistent with Volkov’s result.

To compute throughput, we define throughput = (average time of the instruction)/(number of warp)

and show throughput in Table 4.

Table 4: throughput of “a = a*b”

Table 4 shows throughput = 2.5 ~ 4.1 cycle per warp, this is very strange

since we have known 24-cycle pipeline latency can be hidden by 6 warps, see Gatt chart in figure 1

Figure 1: pipeline latency

and if we invoke more than 6 warps, total time for one instruction is corrected as

total time of one “a=a*b” is (4 cycle) x (number of warps), see figure 2

Figure 2: total time of one “a=a*b”

However when threads = 512, throughput is 2.5 cycle per warp, this is much smaller than 4 cycle per warp.

Remark: From test harness of @SPWorley in the thread

http://forums.nvidia.com/index.php?showtop…rt=#entry570370,

SPWorley uses one block of 192 threads to calibrate “how many clocks it takes”.

Under 192 threads, I will say throughput of “a=a*b” is 4.1 cycle per warp.

Step 2: calibrate “a = a*s[i] + c”

The “CODE 3” sets “a” as register A_reg and “s[i]” as shared memory b[i] and

“c” as register c_reg.

CODE 3: kernel of “a = a*s[i] + c”

static __global__ void MAD_latency(float * data,  timing_stats * timings)

{		

	__shared__ float b[BLOCKSIZE];

	int threadNum = threadIdx.x;  

  unsigned int start_time = 1664;

  unsigned int end_time = 1664;

	

	for( int j = threadNum; j < 16; j+=NUM_THREADS){			

		b[j] = data[j];

	}  

  __syncthreads();

	float A_reg = data[0];

	float c_reg = data[2];

	__syncthreads();

	start_time = clock();

#pragma unroll 

	for( int j = 0; j < 16; j++){			

#pragma unroll 

		for( int i = 0; i < 16; i++){		

			A_reg = A_reg * b[i] + c_reg;

		}

	}

	end_time = clock();

	__syncthreads();

	timings[threadNum].start_time = start_time;

	timings[threadNum].end_time = end_time;	

	if ( 0 == threadNum ){	

		timings[NUM_THREADS].start_time = (int)A_reg;

	}

}

and “DECUDA 2” shows decuda’s result of

start_time = clock();

#pragma unroll 

	for( int j = 0; j < 16; j++){			

#pragma unroll 

		for( int i = 0; i < 16; i++){		

			A_reg = A_reg * b[i] + c_reg;

		}

	}

	end_time = clock();

	__syncthreads();

DECUDA 2: “a = a*s[i] + c” → “mad $r3 s[…] $r3 $r2”

experimental result (see Table 5) shows

Latency of “a = a*s[i] + c” = 34.6 cycle

Throughput of “a = a*s[i] + c” is about 6 cycle per warp, this number matches Table 2.

Table 5: result of “a = a*s[i] + c”

Step 3: calibrate “a = a * b + c”

The “CODE 4” sets “a, ,b, c” as register A_reg, b_reg and c_reg.

CODE 4: kernel of “a = a * b + c”

static __global__ void MAD_latency(float * data,  timing_stats * timings)

{		

	int threadNum = threadIdx.x;  

  unsigned int start_time = 1664;

  unsigned int end_time = 1664;

	float A_reg = data[0];

	float b_reg = data[1];

	float c_reg = data[2];

	__syncthreads();

	start_time = clock();

#pragma unroll 

	for( int j = 0; j < BLOCKSIZE * MAXITE; j++){			

		A_reg = A_reg * b_reg + c_reg;

	}

	end_time = clock();

	__syncthreads();

	timings[threadNum].start_time = start_time;

	timings[threadNum].end_time = end_time;	

	if ( 0 == threadNum ){	

		timings[NUM_THREADS].start_time = (int)A_reg;

	}

}

and “DECUDA 3” shows decuda’s result of

start_time = clock();

#pragma unroll 

	for( int j = 0; j < BLOCKSIZE * MAXITE; j++){			

		A_reg = A_reg * b_reg + c_reg;

	}

	end_time = clock();

	__syncthreads();

DECUDA 3: “a = a * b + c” → “mad $r4 $r4 $r3 $r2”

Experimental result (see Table 6) shows latency = 31.5 cycle,

however minimum time and maximum time are much different when NUM_THREADS > 256, I don’t know why.

if we focus on NUM_THREADS=192, 224, 256, then throughput is 4 cycle per warp,

and we need 8 warps to hide pipeline latency (=31.5 cycle)

Table 6: result of “a = a * b + c”

Topic		Replies	Views
Instruction Latency CUDA Programming and Performance	18	43775	January 18, 2010
latency of shared memory of Tesla C1060 CUDA Programming and Performance	0	1239	December 8, 2009
How to understand the "hide latency" CUDA Programming and Performance	13	3593	August 8, 2024
Questin regarding latency CUDA Programming and Performance	6	4249	August 26, 2010
How many parallel threads? CUDA Programming and Performance	19	10080	October 1, 2021
Instruction timings More info than in the guide CUDA Programming and Performance	5	8287	May 21, 2007
How to accurately time individual memory operations CUDA Programming and Performance	17	6271	September 12, 2016
warp and core What's the relationship between warp and core? CUDA Programming and Performance	12	15614	February 4, 2011
Throughput and latency of mma.sync instruction CUDA Programming and Performance	5	175	September 2, 2024
Fast CUDA implementation for calculating cross-norm-distance of two matrices CUDA Programming and Performance cuda	0	938	April 8, 2021

latency and throughput of MAD operation?

Related topics