Some problems with inline PTX

Hi everyone, I have two problems I'm hoping to get solved. First, I wrote the following code:

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>

using namespace std;

#define BLOCK_SIZE 32
#define GRID_SIZE 14

// Kernel definition 
__global__ void VecAdd(float* A, float* B, float* C) {
        int i = threadIdx.x;
        int j = blockIdx.x;
        int n = 32*j+i;

        /*for (int j = 0 ; j < 100000000 ; j++ ){
                C[i] = A[i] + B[i];
        }*/

        asm(".reg .f32	t1;"                // temp reg t1 t2 t3
            ".reg .f32	t2;"
            ".reg .f32	t3;"
            "ld.global.f32	t1, [%0];	"      // t1 =
            : : "r" (&A[n]));

        asm("ld.global.f32	t2, [%0];	"      // t2 =
            : : "r" (&B[n]));

        for (int m = 0 ; m < 10000000 ; m++) {
                asm("add.f32	t3, t1, t2;"
                    : :);
        }

        asm("mov.f32	%0, t3;" : "=f"(C[n]));

}

int main() {
        size_t size = BLOCK_SIZE*GRID_SIZE*sizeof(float);

        float* A = (float*)malloc(size);
        float* B = (float*)malloc(size);
        float* C = (float*)malloc(size);

        for (int i = 0 ; i < BLOCK_SIZE*GRID_SIZE ; ++i) {
                A[i]=1.0;
                B[i]=3.0;
        }

        float* gA;
        cudaMalloc((void**)&gA,size);
        cudaMemcpy(gA,A,size,cudaMemcpyHostToDevice);

        float* gB;
        cudaMalloc((void**)&gB,size);
        cudaMemcpy(gB,B,size,cudaMemcpyHostToDevice);

        float* gC;
        cudaMalloc((void**)&gC,size);

        dim3 grid(GRID_SIZE , 1 , 1);
        dim3 block(BLOCK_SIZE , 1 , 1); 

        // Kernel invocation 
        VecAdd<<<grid , block>>>(gA, gB, gC);

        cudaMemcpy(C,gC,size,cudaMemcpyDeviceToHost);

        for (int j=0;j<BLOCK_SIZE*GRID_SIZE;++j) {
                printf("%f\t%d\n", C[j], j);
        }

        cudaFree(gA);
        cudaFree(gB);
        cudaFree(gC);

        free(A);
        free(B);
        free(C);

        return 0;
}

1, I found that there is no positive correlation between the running time of the program and the trip count of the following ‘for’ loop:

for (int m = 0 ; m < 10000000 ; m++) {
                asm("add.f32	t3, t1, t2;"
                    : :);
        }

But if I use the following ‘for’ loop instead of the loop above (plain CUDA C instead of inline PTX assembly):

for (int j = 0 ; j < 100000000 ; j++ ){
             C[i] = A[i] + B[i];
     }

Then there is a positive correlation between the running time and the size of the loop. So is the program with inline PTX wrong?? How can I make it right??

2, I can’t use functions like sleep in a kernel. So what can I do if I want to make the program pause for a while before running the ‘for’ loop in the kernel??

Please help me or give me some hints!! Thank you!!

My email address is zhaoq0563@gmail.com. Welcome to send me email talking about these problems!

Debug this by generating the PTX for your working CUDA C version, then compare it to the PTX generated when you’re using the asm calls. There will be a mismatch. Using that difference, you can probably spot whether it’s a problem with your asm or a compiler issue. Inline asm doesn’t get (much) syntax checking from nvcc, so it’s easy to make even simple typos which the compiler does not catch, and even ptxas might not.

An interesting issue may be how the optimization passes treat the inline assembly… I’m not sure how smart ptxas is about identifying dead code, but it likely can’t “see” that your loop is useless. Of course this is confusing, since you say that the inline PTX does not scale with loop count (which could be explained by the loop being optimized away) but the C code does. If there were such an optimization, I’d expect the opposite: the C code optimized (and constant speed) and the PTX scaling linearly. But this is a side issue… you just need to examine the PTX to start debugging your main question of why the two versions differ.

Question number 2:
There is no sleep function. The best way is to exit the kernel, wait on an event, and launch a new kernel.
This is the only “legit” way to pause a kernel.
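A minimal host-side sketch of that pattern, assuming two hypothetical kernels `phase1` and `phase2` standing in for the code before and after the desired pause:

```cuda
// Sketch: split the work across two kernels and synchronize between them.
// phase1 / phase2 are placeholder kernels, not from the original post.
cudaEvent_t ev;
cudaEventCreate(&ev);

phase1<<<grid, block>>>(gA, gB, gC);
cudaEventRecord(ev, 0);

// Host-side "pause": wait until phase1 has finished...
cudaEventSynchronize(ev);

// ...then launch the second kernel. Note that registers do NOT survive
// across kernel launches; anything phase2 needs must be written to
// global memory by phase1.
phase2<<<grid, block>>>(gA, gB, gC);

cudaEventDestroy(ev);
```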

If you really need to “pause” a kernel and do some extra compute, sometimes sm_35 dynamic parallelism can work. It depends on why you need to pause the kernel! A dynamic launch will indeed halt your block and resume it after the launch has returned, something a host-side launch can’t do.
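As a rough sketch of what that looks like (requires sm_35+ hardware and compilation with `-rdc=true -lcudadevrt`; `child`/`parent` are hypothetical kernels, and note that device-side cudaDeviceSynchronize was deprecated in much later CUDA releases):

```cuda
// Sketch: a parent kernel "pausing" by launching a child kernel
// (CUDA dynamic parallelism, sm_35+).
__global__ void child(float* data) {
    // ... work performed "during the pause" ...
}

__global__ void parent(float* data) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        child<<<1, 32>>>(data);
        cudaDeviceSynchronize();  // block until the child completes
    }
    __syncthreads();
    // execution of the parent resumes here after the child returns
}
```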

BUT: if you are doing evil, sneaky, unsupported spinloops or ugly hacks that will get frowned upon… it can be useful to know that the SLOWEST function you can call, which will stall a warp the longest, is __threadfence_system(). This long pause is sometimes useful, as a terrible hack, to optimize spinwait performance. But don’t use spinwaits. They’re not the right paradigm for CUDA.

Note that your PTX-based loop is semantically very different from your CUDA-C based loop. The former repeats the same addition of two registers 1 million times, while the latter adds two 1 million element arrays. The compiler is presumably smart enough to recognize the redundancy of repeating the same addition 1 million times, so you likely get exactly one addition regardless of the trip count of the loop, which would explain why the runtime doesn’t change. As SPWorley suggests, check the generated machine code with cuobjdump --dump-sass.

[later:] On second thought: Why does the PTX version of the code have a loop at all? It seems you parallelized the vector addition (where is the magic 32 coming from, by the way?), so no loop should be required.

I am not sure that the way you share register bindings across multiple asm() statements works as intended. As far as I understand (but I could be wrong), gcc-style inline assembly requires the binding of all symbolic register references for each asm() statement. So I would suggest you make a single asm() statement that includes the loop, then bind the addresses of the three arrays as required. Note that if you want to keep your code portable, you will need to #ifdef the binding of addresses, because pointers are 64-bit and bound with “l” on 64-bit platforms.
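As an untested sketch of what a single asm() statement with all bindings in one place might look like (assuming a 64-bit platform, hence the “l” pointer constraint; the braces scope the .reg declarations so the statement can be repeated):

```cuda
// Sketch only: one asm() block that loads, adds, and stores, with every
// operand bound in the same statement. On a 32-bit platform the pointer
// constraint would be "r" instead of "l".
__global__ void VecAddPtx(float* A, float* B, float* C) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    asm volatile("{\n\t"
                 ".reg .f32 t1;\n\t"
                 ".reg .f32 t2;\n\t"
                 ".reg .f32 t3;\n\t"
                 "ld.global.f32 t1, [%1];\n\t"
                 "ld.global.f32 t2, [%2];\n\t"
                 "add.f32 t3, t1, t2;\n\t"
                 "st.global.f32 [%0], t3;\n\t"
                 "}"
                 : : "l"(&C[n]), "l"(&A[n]), "l"(&B[n]) : "memory");
}
```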

I think your thought is wrong. The latter didn’t add two 1-million-element arrays; it also adds the same two numbers 1 million times.

Question 1:
I’ll try.

Question 2:
I tried your method. However, it seems that the second kernel didn’t recognize the registers loaded in the first kernel when entering the second kernel. The only reason I want to stop the kernel for a while is that I want to run the add instruction many times without other instructions influencing it before or after the add instruction…

I may have misunderstood what you are trying to accomplish. My comment about semantic differences pertained to these two loops:

for (int m = 0 ; m < 10000000 ; m++) {
    asm("add.f32	t3, t1, t2;"
        : :);
}
for (int j = 0 ; j < 100000000 ; j++ ){
     C[i] = A[i] + B[i];
}

The first loop simply repeats the same addition 1 million times, and is equivalent to doing a single addition. Thus execution time would be invariant to trip count. The second loop adds two arrays of a million elements, clearly the work (and thus execution time) is proportional to the trip count.

The CUDA compiler will aggressively remove dead code and eliminate redundancies. So you need an unbroken dependency chain from a global memory or kernel argument input to a global memory output, and you also need to avoid other optimizations. For example:

int a = global1[tid];
int sum = 0;
for (int i = 0; i < N; i++) {
   sum += a;
}
global2[tid] = sum;

This code has an unbroken dependency chain from input to output, but the compiler can optimize to:

global2[tid] = N * global1[tid];

which is presumably not what you want (although I am still not clear what exactly you are trying to achieve).

As for your question 2, I think your observation confirms what I stated above about the necessity to bind registers for every asm() statement.