ptxas compiles my program wrong CUDA 4.0RC2

I’m trying to optimize my function with inline ptx assembly but I receive wrong results.

I guess that my ptx triggers some error (predicate registers overflow?) in ptxas but it doesn’t report any error.

Here is a relevant part of my function:

int start;

		int size;

		start = ichunk / 2;

		size = 1 << (CHUNK_BITS - 1);

		int graphBits = d_graphBits[i];

		int mask = start+threadIdx.x;

		int offset = expand(threadIdx.x,i);

		int submask = (ichunk+(half?(1<<(level-1)):0))&graphBits;

		uint64* p_data = d_data + offset;

		int64 result = 0;

		__shared__ int ind[24];

		if (threadIdx.x<24) {

			ind[threadIdx.x] = 8*((threadIdx.x<<(level-2))+collapse(ichunk,threadIdx.x));

		}

		__syncthreads();

		asm ("{\n\t"

			".reg .pred p1,p2,p3,p4;\n\t"

			".reg .u64 res1,res2,res3,res4;\n\t"

			".reg .u32 t1,t2,t3,t4;\n\t"

			"mov.s64 %0,0;\n\t"

#define PRED_1(j,k) "and.b32 t" #k ",%2,1<<(" #j ");\n\t" "mov.u64 res" #k ",0;\n\t" 

#define PRED_2(j,k) "setp.ne.u32 p" #k ",t" #k ",0;\n\t" 

#define PRED_3(j,k) "ld.shared.u32 t" #k ",[%3+(" #j ")*4];\n\t"

#define PRED_31(j,k)"add.u32 t" #k ",t" #k ",%1;\n\t"

#define PRED_4(j,k) "@p" #k " ld.global.u64 res" #k ",[t" #k "];\n\t" 

#define PRED_5(j,k) "add.s64 %0,%0,res" #k ";\n\t"

#define LOOP4(num,j) PRED_##num (j,1) PRED_##num (j+1,2) PRED_##num (j+2,3) PRED_##num (j+3,4)

#define LOOP_PREDS(j) LOOP4(1,j) LOOP4(2,j) LOOP4(3,j) LOOP4(31,j) LOOP4(4,j) LOOP4(5,j)

			LOOP_PREDS(8)

			LOOP_PREDS(12)

			LOOP_PREDS(16)

			LOOP_PREDS(20)

			"}\n\t"

			:"=l"(result)

			:"r"(p_data),"r"(submask),"r"(ind)

			);

                // unoptimized version of the code above

		int64 result1 = 0;

		for (int j = CHUNK_BITS; j < 24 /* level */; j++) {

			if ((submask & (1<<j)) != 0) {

				result1 += p_data[ind[j]/8];

			} 

		}

		asm("//here");

		if (result != result1) {

			result = result1;

		}

As you see, I calculate result both in inline ptx code and C code. Then I override result of inline ptx. The problem is that the program still returns

incorrect result. If I remove condition result!=result1, the program returns correct result (all my inline ptx code probably optimized out). I checked ptx code generated - the condition is in place. So the problem is not on ptx generation step. You can see generated ptx code in the attachment.
gpu.zip (30.7 KB)

Are you building this for sm_1x or sm_2x ? If the latter, please note that since sm_2x supports generic pointers, any pointers bound to PTX inline assembly are generic pointers that need to be used with generic load instructions, rather than the address space specific load instructions used here. There are address space conversion instructions defined in PTX (see “cvta”), but I have no experience using them.

Your original loop looks straightforward, so I would expect the compiler to do a decent job on it. What particular issue is the use of PTX inline assembly trying to overcome? Please note that to assess the quality of any compiler-generated code one would want to look at the generated machine code rather than the PTX code which is just an intermediate representation. There is a fair bit of optimization happening in translating from PTX down to machine code.

This code is for sm_2x. I removed .shared and .global modifiers and it is working now. Thank you for help!

I was trying to implement instruction level parallelization in my kernel and see whether it would be faster or slower (due to more registers used). I use predicate registers in non-standard way and so can’t write this in C.