Yes, method 2 can hide pipeline latency if 6 warps are issued; moreover,
if we invoke 192 threads (6 warps) per block, then the Gantt chart is independent of whether a read-after-write hazard occurs or not.
Although method 2 is correct, it is tedious to draw the Gantt chart under method 2.
I did not adopt method 2 to draw the Gantt chart when comparing bandwidth between “float” and “double” in the thread
http://forums.nvidia.com/index.php?showtop…rt=#entry600634.
In that topic, I do not rely on pipeline latency; instead I calibrate the “index computation” via the block-wise test harness provided by SPWorley.
In the thread http://forums.nvidia.com/index.php?showtopic=103046 , @SPWorley uses one block of 192 threads to calibrate
“how many clocks the operation takes”.
The following code is the macro that launches the calibration kernel, copies back the result, and prints the measured cost.
#define ITEST(num) \
ikernel<itest_ ## num, EVALLOOPS> <<<1,192>>>(12345, d_result); \
cudaMemcpy( &h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost); \
printf("I" #num " %4.1lf %s\n", \
8*(h_result-ibase+1.0)/(192*EVALLOOPS*UNROLLCOUNT), \
itest_ ## num ().describe());
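As I read the printf expression above (my own interpretation, not something stated explicitly in that thread), it converts the measured block time into SP-cycles per operation:

\[
\text{cycles per operation} \approx \frac{8\,(\text{h\_result} - \text{ibase})}{192 \times \text{EVALLOOPS} \times \text{UNROLLCOUNT}},
\]

where h_result - ibase is the elapsed clock count of the block (ibase presumably being a baseline measurement), 192 * EVALLOOPS * UNROLLCOUNT is the total number of operations issued by the block, and the factor 8 accounts for the 8 SPs of one SM.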
@SPWorley’s code uses
(1) 192 threads to hide pipeline latency, and
(2) a large unrolled loop (see the sketch after this list).
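Here is a minimal sketch of what such a calibration kernel might look like. It is my own reconstruction, not SPWorley’s actual code: the functor form of the itest_ classes, the UNROLLCOUNT inner loop, and the clock() bookkeeping are assumptions; only the template shape and the <<<1,192>>> launch come from the macro above.

// Sketch of a calibration kernel in the spirit of the ITEST macro (assumptions marked).
struct itest_mad {                         // hypothetical example operation
    __device__ int operator()(int a) const { return a * 3 + 7; }   // dependent integer MAD
    const char *describe() const { return "dependent integer MAD"; }
};

#define UNROLLCOUNT 16                     // assumed value; only the name comes from the macro

template <class OP, int LOOPS>
__global__ void ikernel(int seed, int *d_result)
{
    OP op;
    int a = seed + threadIdx.x;            // per-thread value carrying the dependence chain

    clock_t start = clock();               // SM clock at the start of the timed section
    for (int i = 0; i < LOOPS; ++i) {
        #pragma unroll
        for (int j = 0; j < UNROLLCOUNT; ++j)
            a = op(a);                     // back-to-back dependent operations
    }
    clock_t stop = clock();

    if (threadIdx.x == 0)
        d_result[0] = (int)(stop - start); // elapsed clocks for the whole block
    if (a == 123456789)
        d_result[0] = a;                   // never true in practice; keeps the chain live
}

The essential points are exactly the two listed above: 192 threads (6 warps) keep the pipeline full, and the unrolled dependent chain makes the cost of the operation itself dominate the measurement.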
Question: what is the relationship between pipeline latency and SPWorley’s use of one block of 192 threads to calibrate “how many clocks the operation takes”?
First, suppose we want to evaluate operation S1 (a ← a * b + c); then we must execute S1 many times, say M times.
for i = 1: M
    S1 : a <-- a * b + c; // register read-after-write dependence
end
S2 : a <-- a * b + c; // register read-after-write dependence
Then it is easy to plot the Gantt chart of the above code: just modify the Gantt chart in figure 2 by repeating operation S1 M times; see figure 3.
[figure 3]
If M is large enough, then the average execution time of S1 on one SM is about 1 cycle
(when all 8 SPs execute S1 simultaneously, S1 effectively needs only 1 cycle per operation).
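A quick sanity check of that number (my own arithmetic, assuming S1 issues at a rate of one operation per SP per cycle): each loop iteration issues 192 MADs, which keep the 8 SPs busy for 192 / 8 = 24 cycles, so M iterations take roughly 24M cycles. Plugging this into a calibration expression in the style above gives

\[
\frac{8 \times 24M}{192 \times M} = 1 \ \text{cycle per operation}.
\]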
Similarly, suppose we want to calibrate the operation “S1 + S2” in the following code:
for i = 1: M
    S1 : a <-- a * b + c; // register read-after-write dependence
    S2 : a <-- a * b + c; // register read-after-write dependence
end
S3 : odata[index] <-- a; // write operation
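For concreteness, this is how I would write the loop above as an actual kernel body (only a sketch: odata, index, b, c and M are placeholders I introduce here, and the final store S3 is also what keeps the compiler from eliminating the dependent chain):

// Hypothetical kernel for the S1/S2 loop followed by the S3 store.
__global__ void calibrate_s1s2(float *odata, float b, float c, int M)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = (float)index;          // per-thread starting value

    for (int i = 0; i < M; ++i) {
        a = a * b + c;               // S1: read-after-write dependence on a
        a = a * b + c;               // S2: depends on the result of S1
    }

    odata[index] = a;                // S3: write operation, also keeps the loop live
}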
The corresponding Gantt chart is figure 4; the average execution time of S1 + S2 on one SM is about 2 cycles.
[figure 4]
The average execution time of S1 on one SM is 1 cycle; this means that one warp needs 4 cycles to execute the S1 instruction
in an average sense (32 threads per warp divided by 8 SPs = 4 cycles per warp instruction). We define method 3 as method 1, but using the average execution time of instructions.
Then the Gantt chart of the following code is figure 5:
//Example : execute instructions S1, S2, S3 and S4 in turn
S1 : a <-- a * b + c; // register read-after-write dependence
S2 : a <-- a * b + c; // register read-after-write dependence
S3 : odata[index] <-- a; // write operation
S4 : a <-- a * b + c;
[figure 5]
However, if we use method 2, then the Gantt chart is figure 6:
[figure 6]
Observation: differences between method 2 and method 3
(1) Method 3 hides the “index computation” in memory latency, while method 2 hides the “index computation” in pipeline latency.
(2) The space between two read/write operations (red rectangles) is larger in method 3.
However, the critical timing of method 2 and method 3 is the same, so we can use method 3 to plot the Gantt chart.
To sum up, I think it is reasonable to draw the Gantt chart by method 3, which is simpler.