Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32 and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each non- diverged SIMD instruction can produce 32 results every 4 cycles.Assume a kernel that has divergent branches that causes on average 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single-precision arithmetic and 20% are load/store. Since not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.
(1) Compute the throughput, in GFLOP/sec, for this kernel on this GPU.
(2)Assume that you have the following choices:
(1) Increasing the number of single precision lanes to 16
(2) Increasing the number of SIMD processors to 15 (assume this change doesn’t affect any other performance metrics and that the code scales to the additional processors)
(3) Adding a cache that will effectively reduce memory latency by 40%, which will increase
instruction issue rate to 0.95
What is speedup in throughput for each of these improvements?