Hi,
I have a question about whether nvcc can hoist a loop-invariant condition out of a loop nest.
I have the following code snippet in a larger computation:
...
#pragma unroll
for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
  #pragma unroll
  for (int occ_c = 0; occ_c < 2; ++occ_c) {
    #pragma unroll
    for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
      #pragma unroll
      for (int ic_block = 0; ic_block < 3; ++ic_block) {
        if (((((int)blockIdx.z) * 2) + (((int)threadIdx.z) / 8)) < n) {
...
The if condition doesn’t involve any of the loop variables, so it should be possible to move it up to the top level:
...
if (((((int)blockIdx.z) * 2) + (((int)threadIdx.z) / 8)) < n) {
  #pragma unroll
  for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
    #pragma unroll
    for (int occ_c = 0; occ_c < 2; ++occ_c) {
      #pragma unroll
      for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
        #pragma unroll
        for (int ic_block = 0; ic_block < 3; ++ic_block) {
...
The second kernel achieves 2x the performance of the first one. Does nvcc perform this kind of optimization?
My nvcc info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
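
For reference, here is a minimal host-side sketch of the transformation I am applying by hand, written as plain C++ so it compiles without a GPU. Everything in it is an assumption made for illustration: `body`, `guarded_inside`, and `guarded_outside` are hypothetical names, `body` is only a stand-in for the elided kernel body, and `block_z` / `thread_z` stand in for `blockIdx.z` / `threadIdx.z`. It only shows that the two forms compute the same result; it says nothing about what nvcc actually generates.

```cpp
// Hypothetical stand-in for the elided loop body (the real body is not shown).
static inline int body(int kw_inner, int occ_c, int oc_block_c, int ic_block) {
    return kw_inner + occ_c + oc_block_c + ic_block;
}

// Form 1: the loop-invariant check sits inside the innermost loop.
int guarded_inside(int block_z, int thread_z, int n) {
    int acc = 0;
    for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
        for (int occ_c = 0; occ_c < 2; ++occ_c) {
            for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
                for (int ic_block = 0; ic_block < 3; ++ic_block) {
                    // Does not depend on any loop variable.
                    if (block_z * 2 + thread_z / 8 < n) {
                        acc += body(kw_inner, occ_c, oc_block_c, ic_block);
                    }
                }
            }
        }
    }
    return acc;
}

// Form 2: the check is evaluated once and hoisted above the loop nest.
int guarded_outside(int block_z, int thread_z, int n) {
    const bool in_range = block_z * 2 + thread_z / 8 < n;
    int acc = 0;
    if (in_range) {
        for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
            for (int occ_c = 0; occ_c < 2; ++occ_c) {
                for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
                    for (int ic_block = 0; ic_block < 3; ++ic_block) {
                        acc += body(kw_inner, occ_c, oc_block_c, ic_block);
                    }
                }
            }
        }
    }
    return acc;
}
```

Calling both functions with the same `(block_z, thread_z, n)` should return the same value, whether the condition holds or not. In the real kernel each loop also carries `#pragma unroll`, which is left out here so that the sketch builds with an ordinary host compiler.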