Bra.uni optimization Compiler optimization

Dear all,

I am trying to measure the effect of replacing ordinary PTX branch instructions with the bra.uni instruction, but I see no gain, so I would like to check with you guys whether I am missing something. I run the kernel below with every element of the vector v set to the same value (50000.0) and with d = 600.0. nvcc produces a PTX program containing only plain bra instructions. I manually replace each bra with bra.uni and run the kernel again, but I get absolutely no gain. I am using a GeForce GTX 470, and the program takes over 2.6 billion cycles to run:

[codebox]

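__global__ void dotProductKernel(double* v, double d) {
	// Sample the cycle counter before and after the loops.
	int begin = clock();
	double aux = v[threadIdx.x];
	while (d > 0) {
		v[threadIdx.x] = aux;
		while (v[threadIdx.x] > 0) {
			v[threadIdx.x]--;
		}
		d--;
	}
	int end = clock();
	double time = end;
	// If the counter wrapped between the two samples, compensate by adding 2^32.
	if (end < begin) {
		time += 4294967296.0;
	}
	time -= begin;
	// Write the elapsed cycle count back to v.
	v[threadIdx.x] += time + 0.1;
}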

[/codebox]

When is this replacement a valid optimization?

Thank you very much,

Bruno

I would expect performance gains, according to the PTX manual:

(2008-10-17, SP-03483-001_v1.3, ISA Version 1.3):

"7.5. Divergence of Threads in Control Constructs

All control constructs are assumed to be divergent points unless the
control-flow instruction is marked as uniform, using the .uni suffix.
For divergent control flow, the optimizing code generator
automatically determines points of re-convergence. Therefore, a
compiler or code author targeting PTX can ignore the issue of
divergent threads, but has the opportunity to improve performance by
marking branch points as uniform when the compiler or author can
guarantee that the branch point is non-divergent."
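
To make the condition in the quoted passage concrete, here is a minimal CUDA sketch of the difference between a branch that is non-divergent and one that is not. The kernel names `uniformLoop` and `divergentLoop` are hypothetical and not taken from the original post, and the optimizer may well fold these simple loops away, so this illustrates when a branch could be marked .uni rather than what PTX nvcc will emit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The trip count comes from a kernel argument, so all threads of a warp agree
// on every loop-back branch: a candidate for the .uni suffix.
__global__ void uniformLoop(float* out, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += 1.0f;
    }
    out[threadIdx.x] = acc;
}

// The trip count depends on threadIdx.x, so threads of one warp leave the loop
// at different iterations: the loop-back branch diverges and must not be .uni.
__global__ void divergentLoop(float* out) {
    float acc = 0.0f;
    for (int i = 0; i < (int)threadIdx.x; ++i) {
        acc += 1.0f;
    }
    out[threadIdx.x] = acc;
}

int main() {
    const int threads = 32;  // one warp
    float h_out[threads];
    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(h_out));

    uniformLoop<<<1, threads>>>(d_out, 100);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("uniform:   out[0]=%.0f out[31]=%.0f\n", h_out[0], h_out[31]);

    divergentLoop<<<1, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("divergent: out[0]=%.0f out[31]=%.0f\n", h_out[0], h_out[31]);

    cudaFree(d_out);
    return 0;
}
```

In the kernel from the original question, the inner loop condition reads v[threadIdx.x], so the branch is uniform only because every element of v holds the same value at run time. That is a guarantee the code author can give but the compiler presumably cannot derive on its own.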

Yet, as we discussed on the Ocelot mailing list, it may be that the GPU hardware already handles non-divergent branches very efficiently. It would be good if someone familiar with the hardware could comment on this.

Fernando
