[SOLVED] PTX assembly, mixing operand types

saulocpp · May 30, 2019, 9:59pm

I am trying a few things and have the useless code below, which just populates a managed array with the value of “i”, and in the kernel each thread will add its index value “tid”. i and tid are the same value, but one is indexing in host and one is in the device.

#include <iostream>

using namespace std;

__global__ void asm_kernel(float *data, const int LEN)
    {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    float tidf = static_cast <float> (tid);

    while(tid < LEN)
        {
        asm(
            "add.f32 %0, %1, %2;"
            : "=f"(data[tid])
            : "f"(data[tid]), "f"(tidf)
            );
        tid += gridDim.x * blockDim.x;
        tidf = static_cast <float> (tid);
        }
    }

int main(void)
    {
    const int LEN = 2000;
    float *data;

    cudaMallocManaged(&data, LEN * sizeof(float));
    cudaMemset(data, 0, LEN * sizeof(float));

    for(int i = 0; i < LEN; i++)
        data[i] = i;

    asm_kernel <<<4, 256>>> (data, LEN);
    cudaDeviceSynchronize();

    for(int i = 0; i < LEN; i++)
        cout << data[i] << endl;

    cudaFree(data);

    return 0;
    }

If in line #15 I replace “f”(tidf) with “r”(tid), that is, passing the integer “tid” instead of the float “tidf”, the value is not added, as the printing on the screen shows.
Is it illegal to use at least one of the operands that is not of the type of the operation even if the type is explicitly specified?

cbuchner1 · May 31, 2019, 9:30am

“r”(tid) results in the binary pattern stored in register holding the integer thread ID
to be interpreted as an IEEE 754 single precision floating point number.

it is about the same effect as using the CUDA intrinsic __int_as_float()

try printing the value of __int_as_float(1) with printf(“%f”) to see what you get.

My guess is that you would be adding up extremely tiny float values that get rounded to 0
when you do a printf. You can use %e in the printf to get scientific notation instead
to see how tiny the number really is.

the proper way is to convert the integer number to the closest floating point representation
doing something similar to this before passing it into the add instruction. The .rn means
“round to nearest”

cvt.rn.f32.s32 %2, %2

The static_cast would have done that for you automatically.

Christian

saulocpp · May 31, 2019, 11:24am

Christian, thanks for your explanation.
It makes sense and I can see that your suggestion indeed shows 0.00000 regardless if I pass “1” or “20000” to __int_as_float().
In addition to the .rn you showed, I also tried the static_cast while passing the argument, so the line #15 looks like:

: "f"(data[tid]), "f"(static_cast <float> (tid))

It works, so in this case I wouldn’t need the extra float variable “tidf” as the integer “tid” is being converted while passed as operand.
Performance-wise I have no idea about the cost of a static_cast in this asm() call as opposed to using the PTX directives for the conversion. But the purpose of the test is to familiarize with the thing and you made it clear how it can be done.
Thanks a lot.