I’m trying to optimize my function with inline ptx assembly but I receive wrong results.
I guess that my ptx triggers some error (predicate registers overflow?) in ptxas but it doesn’t report any error.
Here is a relevant part of my function:
int start;
int size;
start = ichunk / 2;
size = 1 << (CHUNK_BITS - 1);
int graphBits = d_graphBits[i];
int mask = start+threadIdx.x;
int offset = expand(threadIdx.x,i);
int submask = (ichunk+(half?(1<<(level-1)):0))&graphBits;
uint64* p_data = d_data + offset;
int64 result = 0;
__shared__ int ind[24];
if (threadIdx.x<24) {
ind[threadIdx.x] = 8*((threadIdx.x<<(level-2))+collapse(ichunk,threadIdx.x));
}
__syncthreads();
asm ("{\n\t"
".reg .pred p1,p2,p3,p4;\n\t"
".reg .u64 res1,res2,res3,res4;\n\t"
".reg .u32 t1,t2,t3,t4;\n\t"
"mov.s64 %0,0;\n\t"
#define PRED_1(j,k) "and.b32 t" #k ",%2,1<<(" #j ");\n\t" "mov.u64 res" #k ",0;\n\t"
#define PRED_2(j,k) "setp.ne.u32 p" #k ",t" #k ",0;\n\t"
#define PRED_3(j,k) "ld.shared.u32 t" #k ",[%3+(" #j ")*4];\n\t"
#define PRED_31(j,k)"add.u32 t" #k ",t" #k ",%1;\n\t"
#define PRED_4(j,k) "@p" #k " ld.global.u64 res" #k ",[t" #k "];\n\t"
#define PRED_5(j,k) "add.s64 %0,%0,res" #k ";\n\t"
#define LOOP4(num,j) PRED_##num (j,1) PRED_##num (j+1,2) PRED_##num (j+2,3) PRED_##num (j+3,4)
#define LOOP_PREDS(j) LOOP4(1,j) LOOP4(2,j) LOOP4(3,j) LOOP4(31,j) LOOP4(4,j) LOOP4(5,j)
LOOP_PREDS(8)
LOOP_PREDS(12)
LOOP_PREDS(16)
LOOP_PREDS(20)
"}\n\t"
:"=l"(result)
:"r"(p_data),"r"(submask),"r"(ind)
);
// unoptimized version of the code above
int64 result1 = 0;
for (int j = CHUNK_BITS; j < 24 /* level */; j++) {
if ((submask & (1<<j)) != 0) {
result1 += p_data[ind[j]/8];
}
}
asm("//here");
if (result != result1) {
result = result1;
}
As you see, I calculate result both in inline ptx code and C code. Then I override result of inline ptx. The problem is that the program still returns
incorrect result. If I remove condition result!=result1, the program returns correct result (all my inline ptx code probably optimized out). I checked ptx code generated - the condition is in place. So the problem is not on ptx generation step. You can see generated ptx code in the attachment.
gpu.zip (30.7 KB)