Problem with inline PTX code in a CUDA program

Dear all,

I have a problem with inline PTX code in a CUDA program.

The “Using Inline PTX Assembly in CUDA” PDF describes the syntax of the “asm()” statement. Within the PDF, all the examples pass values as input operands to the instructions. However, what should I do if I want to pass an address to an instruction in the “asm” statement?

Here is my code (“a” is a float parameter):

asm(".reg .f32 t1;\n\t" // temp reg f1
    "ld.global.f32 t1, %0;\n\t" // f1 =
    : : "f" (a));

ptxas reports this error:
ptxas /tmp/tmpxft_00003efc_00000000-2_test_float.ptx, line 68; error : Address expected for argument 1 of instruction 'ld'

Does anybody know how to deal with that? Thanks a lot!!!

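Here is a quick example program that shows how to use loads in inline PTX. The address operand goes in square brackets and is bound to a pointer through the “l” constraint (64-bit platform) or the “r” constraint (32-bit platform). This PTX code uses generic loads, so it assumes compute capability >= 2.0. The output on my machine is: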

~/tmp $ ./ptxload
res= 3.141592741e+00

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (float x, float *res)
{
    float r;
    float *p = &x;   // address of the kernel argument, used as the load address
    // generic load: %0 is the output register, [%1] is the address operand
    asm ("ld.f32 %0, [%1];" : "=f"(r) : "l"(p));   // for 64-bit platform
//    asm ("ld.f32 %0, [%1];" : "=f"(r) : "r"(p));   // for 32-bit platform
    *res = r;
}

int main (void) 
{
    float pi = 3.14159265358979323f;
    float res, *res_d;
    cudaMalloc ((void**)&res_d, sizeof(res_d[0]));
    kernel<<<1,1>>>(pi, res_d);   // a single thread is enough for this demo
    cudaMemcpy (&res, res_d, sizeof(res), cudaMemcpyDeviceToHost);
    cudaFree (res_d);
    printf ("res= %16.9e\n", res);
    return EXIT_SUCCESS;
}

Please note that the forum software swallows everything after the “less than” character and also eliminates backslashes. This means the #includes, the kernel launch, and escape sequences such as \n may not display correctly, so check for that and adjust if you decide to cut & paste the code.

Thank you very much!!!

Here is a revised kernel showing both load and store, as requested.

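__global__ void kernel (float x, float *res)
{
    float *p = &x;
    asm ("{\n\t"
         ".reg .f32 t;\n\t"        // temporary register, local to the { } scope
         "ld.f32 t, [%0];\n\t"     // generic load from the address in %0
         "st.f32 [%1], t;\n\t"     // generic store to the address in %1
         "}"
#if defined(__LP64__) || defined(_WIN64)
         : : "l"(p), "l"(res)      // 64-bit pointers
#else
         : : "r"(p), "r"(res)      // 32-bit pointers
#endif
        );
}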
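
If you specifically want the state-space-qualified ld.global / st.global from the original question instead of generic loads, a minimal sketch might look like the following. It is only an illustration under some assumptions: a 64-bit platform (hence the "l" constraints and .u64 registers), compute capability >= 2.0 for cvta, and pointers that really refer to global memory (here they come from cudaMalloc). The names copy_global and copy_one_float are made up for this example. I also added volatile and a "memory" clobber as a precaution, since the asm statement has no output operands but does write memory.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: copies one float from src to dst with explicit
// global-space loads and stores (64-bit platform assumed).
__global__ void copy_global (const float *src, float *dst)
{
    asm volatile ("{\n\t"
         ".reg .f32 t;\n\t"
         ".reg .u64 ps, pd;\n\t"
         "cvta.to.global.u64 ps, %0;\n\t"   // generic address -> global address
         "cvta.to.global.u64 pd, %1;\n\t"
         "ld.global.f32 t, [ps];\n\t"       // address operand goes in brackets
         "st.global.f32 [pd], t;\n\t"
         "}"
         : : "l"(src), "l"(dst) : "memory");
}

// Hypothetical host helper: sends one float through the kernel and back.
float copy_one_float (float value)
{
    float result = 0.0f;
    float *src_d = NULL, *dst_d = NULL;
    cudaMalloc ((void**)&src_d, sizeof(src_d[0]));
    cudaMalloc ((void**)&dst_d, sizeof(dst_d[0]));
    cudaMemcpy (src_d, &value, sizeof(value), cudaMemcpyHostToDevice);
    copy_global<<<1,1>>>(src_d, dst_d);
    // report launch or execution errors instead of silently returning 0
    cudaError_t err = cudaGetLastError ();
    if (err == cudaSuccess) err = cudaDeviceSynchronize ();
    if (err != cudaSuccess) {
        fprintf (stderr, "CUDA error: %s\n", cudaGetErrorString (err));
        exit (EXIT_FAILURE);
    }
    cudaMemcpy (&result, dst_d, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree (src_d);
    cudaFree (dst_d);
    return result;
}
```

Calling copy_one_float (3.25f) from main should hand back the same value if the load and store both work. The point relative to the original snippet is that ld expects an address in square brackets, bound to a pointer operand, rather than a float value bound with "f".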