latency and throughput of MAD operation?

Dear all:

From Volkov’s paper,

Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra. In SC ’08: Preceedings of the 2008 ACM/IEEE conference on Supercomputing.

Piscataway, NJ, USA, 2008, IEEE Press.

( can be obtained in the thread http://forums.nvidia.com/index.php?showtopic=89084 )

authors show

(1) average time per instruction in cycles (a, b, c are register) for GPUs in Table 1

Table 1:

(2) average time per instruction in cycles (a, b: register, s[i]: shared memory) in Table 2

Table 2:

I focus on latency and throughput of three instructions on TeslaC1060,

one is “a = a b", one is "a = as[i] + c” and another is “a = a * b + c”.

from decuda program,

“a = a *b” is translated to “MUL dest src1 src2”

“a = a*s[i] + c” is translated to “MAD dest [smem] src2 src3”

“a = a*b + c” is translated to “MAD dest src1 src2 src3” and

From my experiment (show later), I conclude

(1) “a = a*b” has latency = 24.7 cycle and throughput 2.5 ~ 4.1 cycle per warp

(this number is very strange, I will show experimental data later)

(2) “a = a*s[i] + c” has latency 34.6 cycle and throughput 6 cycle per warp.

(this number is good and matches result in Table 2)

(3) “a = a*b + c” has latency 31.5 cycle and throughput = ? (maybe 4 cycle per warp)

latency = 31.5 cycle is larger than 24 cycle in Table 1,

Does anyone have verified these numbers?

The following is my experiment on TeslaC1060.

Step 1: calibrate “a = a*b”

The “CODE 1” sets “a” as register A_reg and “b” as register b_reg and

compute “A_reg = A_reg * b_reg ;” 256 times.

execution configuration is grid(1,1,1) and threads(NUM_THREADS, 1, 1)

where macro NUM_THREADS = number of threads per block.

CODE 1: kernel of “a = a *b”

static __global__ void MAD_latency(float * data,  timing_stats * timings)

{		

	int threadNum = threadIdx.x;  

  unsigned int start_time = 1664;

  unsigned int end_time = 1664;

	float A_reg = data[0];

	float b_reg = data[1];

	__syncthreads();

	start_time = clock();

#pragma unroll 

	for( int j = 0; j < 256; j++){			

		A_reg = A_reg * b_reg ;

	}

	end_time = clock();

	__syncthreads();

	timings[threadNum].start_time = start_time;

	timings[threadNum].end_time = end_time;	

	if ( 0 == threadNum ){	

		timings[NUM_THREADS].start_time = (int)A_reg;

	}

}

and “DECUDA 1” shows decuda’s result of

start_time = clock();

#pragma unroll 

	for( int j = 0; j < 256; j++){			

		A_reg = A_reg * b_reg ;

	}

	end_time = clock();

	__syncthreads();

DECUDA 1: “a = a *b” --> “mul $r3 $r3 $r2”

when kernel’s execution, we compute average time over 256 per thread (see “code 2”) and

reports minimum time and maximum time in Table 3.

CODE 2: compute average number of cycles per “a = a*b”

dim3  grid(1, 1, 1);

	dim3  threads(NUM_THREADS, 1, 1);

	MAD_latency<<< grid, threads >>>(data_gpu, timings_gpu);

	cudaThreadSynchronize();

	CUDA_SAFE_CALL(cudaMemcpy(timings_cpu, timings_gpu, (NUM_THREADS+1)*sizeof(timing_stats), cudaMemcpyDeviceToHost));

// show all time report 

  int min_time = timings_cpu[0].end_time - timings_cpu[0].start_time; 

  int max_time = min_time;

	for( int i = 0; i < NUM_THREADS; i++ ){

		int time_gpu = timings_cpu[i].end_time - timings_cpu[i].start_time;		

		if ( min_time > time_gpu ) { min_time = time_gpu; }

		if ( max_time < time_gpu ) { max_time = time_gpu; }

	}

Table 3: result of “a = a*b”

From Table 3, NUM_THREADS=1 reports latency = 24.7 cycle, this is consistent with Volkov’s result.

To compute throughput, we define throughput = (average time of the instruction)/(number of warp)

and show throughput in Table 4.

Table 4: throughput of “a = a*b”

Table 4 shows throughput = 2.5 ~ 4.1 cycle per warp, this is very strange

since we have known 24-cycle pipeline latency can be hidden by 6 warps, see Gatt chart in figure 1

Figure 1: pipeline latency

and if we invoke more than 6 warps, total time for one instruction is corrected as

total time of one “a=a*b” is (4 cycle) x (number of warps), see figure 2

Figure 2: total time of one “a=a*b”

However when threads = 512, throughput is 2.5 cycle per warp, this is much smaller than 4 cycle per warp.

Remark: From test harness of @SPWorley in the thread

http://forums.nvidia.com/index.php?showtop…rt=#entry570370,

SPWorley uses one block of 192 threads to calibrate “how many clocks it takes”.

Under 192 threads, I will say throughput of “a=a*b” is 4.1 cycle per warp.

Step 2: calibrate “a = a*s[i] + c”

The “CODE 3” sets “a” as register A_reg and “s[i]” as shared memory b[i] and

“c” as register c_reg.

CODE 3: kernel of “a = a*s[i] + c”

static __global__ void MAD_latency(float * data,  timing_stats * timings)

{		

	__shared__ float b[BLOCKSIZE];

	int threadNum = threadIdx.x;  

  unsigned int start_time = 1664;

  unsigned int end_time = 1664;

	

	for( int j = threadNum; j < 16; j+=NUM_THREADS){			

		b[j] = data[j];

	}  

  __syncthreads();

	float A_reg = data[0];

	float c_reg = data[2];

	__syncthreads();

	start_time = clock();

#pragma unroll 

	for( int j = 0; j < 16; j++){			

#pragma unroll 

		for( int i = 0; i < 16; i++){		

			A_reg = A_reg * b[i] + c_reg;

		}

	}

	end_time = clock();

	__syncthreads();

	timings[threadNum].start_time = start_time;

	timings[threadNum].end_time = end_time;	

	if ( 0 == threadNum ){	

		timings[NUM_THREADS].start_time = (int)A_reg;

	}

}

and “DECUDA 2” shows decuda’s result of

start_time = clock();

#pragma unroll 

	for( int j = 0; j < 16; j++){			

#pragma unroll 

		for( int i = 0; i < 16; i++){		

			A_reg = A_reg * b[i] + c_reg;

		}

	}

	end_time = clock();

	__syncthreads();

DECUDA 2: “a = a*s[i] + c” --> “mad $r3 s[…] $r3 $r2”

experimental result (see Table 5) shows

Latency of “a = a*s[i] + c” = 34.6 cycle

Throughput of “a = a*s[i] + c” is about 6 cycle per warp, this number matches Table 2.

Table 5: result of “a = a*s[i] + c”

Step 3: calibrate “a = a * b + c”

The “CODE 4” sets “a, ,b, c” as register A_reg, b_reg and c_reg.

CODE 4: kernel of “a = a * b + c”

static __global__ void MAD_latency(float * data,  timing_stats * timings)

{		

	int threadNum = threadIdx.x;  

  unsigned int start_time = 1664;

  unsigned int end_time = 1664;

	float A_reg = data[0];

	float b_reg = data[1];

	float c_reg = data[2];

	__syncthreads();

	start_time = clock();

#pragma unroll 

	for( int j = 0; j < BLOCKSIZE * MAXITE; j++){			

		A_reg = A_reg * b_reg + c_reg;

	}

	end_time = clock();

	__syncthreads();

	timings[threadNum].start_time = start_time;

	timings[threadNum].end_time = end_time;	

	if ( 0 == threadNum ){	

		timings[NUM_THREADS].start_time = (int)A_reg;

	}

}

and “DECUDA 3” shows decuda’s result of

start_time = clock();

#pragma unroll 

	for( int j = 0; j < BLOCKSIZE * MAXITE; j++){			

		A_reg = A_reg * b_reg + c_reg;

	}

	end_time = clock();

	__syncthreads();

DECUDA 3: “a = a * b + c” --> “mad $r4 $r4 $r3 $r2”

Experimental result (see Table 6) shows latency = 31.5 cycle,

however minimum time and maximum time are much different when NUM_THREADS > 256, I don’t know why.

if we focus on NUM_THREADS=192, 224, 256, then throughput is 4 cycle per warp,

and we need 8 warps to hide pipeline latency (=31.5 cycle)

Table 6: result of “a = a * b + c”