[Solved] CUDA inline PTX Internal Compiler Error

I want to measure the cache behavior of GPU global memory, and below is the micro-benchmark I designed. It loads a value from the global memory address r_addr0 and stores it into shared memory at s_tvalue[0]. For certain reasons, I need to replace the plain load from global memory with inline PTX code.

i = *r_addr0;
//asm("ldu.global.f64.cs %1, [%2];":"=l"(i):"l"(r_addr0));
s_tvalue[0] = i;

However, when I compile it with nvcc, it fails with the following error:

error: Internal Compiler Error (codegen): "asm operand index requested is larger than the number of asm operands provided!"

Does anybody know what the bug in my code is?

The complete code is below:

__global__ void global_latency (long long * my_array, long long array_length, int position, long long *d_time) {

unsigned int start_time, end_time;

__shared__ long long s_tvalue[2];//2: number of threads per block

int k;
long long i, j;
for(k=0; k<2; k++)
    s_tvalue[k] = 0L;
long long addr0,addr1;

addr0=(long long)my_array;

addr1 = ( addr0 ^ (1 << position));

long long *r_addr0, *r_addr1;
r_addr0 = (long long *)addr0;
r_addr1 = (long long *)addr1;

start_time = clock();
//i = *r_addr0;
asm("ldu.global.f64.cs %1, [%2];":"=l"(i):"l"(r_addr0));

s_tvalue[0] = i;
//j = *r_addr1;
asm("ld.global.f64.cs %3, [%4];" : "=l"(j):"l"(r_addr1));
s_tvalue[1] = j;


end_time = clock();

d_time[0] = end_time-start_time;
d_time[1] = s_tvalue[0];
printf("[%p]=%lld\n",addr0,d_time[1]);
d_time[2] = s_tvalue[1];
printf("[%p]=%lld\n",addr1,d_time[2]); 
}

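Operand numbering in inline asm starts at 0, not 1, and it restarts for every asm statement. Try using %0,%1 instead of %1,%2 (with only two operands supplied, the %2 presumably triggers the index-out-of-bounds error). The same applies to the second asm statement, which should use %0,%1 rather than %3,%4.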
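
For illustration, here is a minimal sketch of a load with the operands numbered from 0. The helper load_cs and the kernel load_demo are hypothetical names, not from the original post. The sketch also makes two changes beyond the numbering fix, both of which are my assumptions about what the instruction needs rather than something confirmed in this thread: it puts the .cs cache operator before the type (ld.global.cs.s64), and it uses an integer type because the "l" constraint binds a 64-bit integer register. It uses ld rather than ldu, since as far as I can tell from the PTX ISA syntax, ldu does not take a cache operator.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): loads one 64-bit
// integer from global memory with the .cs (cache-streaming) hint.
__device__ __forceinline__ long long load_cs(const long long *addr) {
    long long value;
    // %0 is the first operand listed (the output register),
    // %1 is the second operand (the address). Numbering starts at 0.
    // "volatile" and the "memory" clobber discourage the compiler from
    // moving or removing the load, which matters for timing code.
    asm volatile("ld.global.cs.s64 %0, [%1];"
                 : "=l"(value)
                 : "l"(addr)
                 : "memory");
    return value;
}

// Hypothetical kernel: copies one value through the inline-PTX load.
__global__ void load_demo(const long long *in, long long *out) {
    out[0] = load_cs(in);
}

int main() {
    long long h_in = 42, h_out = 0;
    long long *d_in = nullptr, *d_out = nullptr;

    cudaMalloc(&d_in, sizeof(long long));
    cudaMalloc(&d_out, sizeof(long long));
    cudaMemcpy(d_in, &h_in, sizeof(long long), cudaMemcpyHostToDevice);

    load_demo<<<1, 1>>>(d_in, d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("loaded value: %lld\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The "l" constraint for the pointer assumes a 64-bit build, where device addresses are 64 bits wide. In the kernel from the question, the two asm statements would each become a call like i = load_cs(r_addr0); and j = load_cs(r_addr1);.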

Thanks.
Although there are some other minor errors, you pointed out the most important one.