Pointer jumping technique, too fast to be true?!

Hello,

I’ve implemented some simple code to find the root of everyone in a forest (a set of trees).

The forest is basically an array where every element points to its father.

The roots of each tree point to themselves.

My kernel only operates in global memory and is basically

__global__ void findroots(int *forest, int n)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;

    if (ix < n) {
        // Pointer jumping: keep replacing my parent with my grandparent
        // until my parent points to itself (i.e., it is a root).
        while (forest[ix] != forest[forest[ix]]) {
            forest[ix] = forest[forest[ix]];
        }
    }
}

In a nutshell, the kernel makes each thread/position update itself with its father’s father’s value, unless there is nothing left to do (its father is already a root).
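
To make this concrete, here is how the jumps collapse a small chain (a hypothetical 6-node example, assuming all threads jump in lockstep):

forest = {0, 0, 1, 2, 3, 4}    // node i points to forest[i]; node 0 is the root
// after jump 1: {0, 0, 0, 1, 2, 3}
// after jump 2: {0, 0, 0, 0, 0, 1}
// after jump 3: {0, 0, 0, 0, 0, 0}   // everyone now points to the root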

I’ve timed this against a standard serial solution: first, identify the roots of each tree (those nodes that point to themselves); second, convert the array of parents into an adjacency-list representation of the forest; finally, run a DFS on this representation to find all nodes attached to each tree’s root. Once this is done, we know the root of every node in the forest.

The CPU code is standard but long so I’m not posting it here.
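
The gist of it is something like this (a simplified sketch, not my actual code):

#include <stack>
#include <vector>

// Sketch of the serial approach: find the roots, build children lists, DFS.
void findroots_serial(const std::vector<int> &parent, std::vector<int> &root)
{
    int n = (int)parent.size();
    std::vector<std::vector<int>> children(n);
    std::vector<int> roots;
    for (int i = 0; i < n; ++i) {
        if (parent[i] == i) roots.push_back(i);  // a root points to itself
        else children[parent[i]].push_back(i);   // adjacency-list conversion
    }
    root.assign(n, -1);
    for (int r : roots) {                        // DFS from each root
        std::stack<int> s;
        s.push(r);
        while (!s.empty()) {
            int v = s.top(); s.pop();
            root[v] = r;                         // v belongs to the tree rooted at r
            for (int c : children[v]) s.push(c);
        }
    }
}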

The GPU code is at least 500 times faster than the CPU code (just the kernel time vs. the CPU process time; memory copies ignored).

I am a bit suspicious that this might be too good to be true.

In particular, since I am operating in global memory and I have multiple blocks running, shouldn’t I need some __threadfence() or __syncthreads() calls or volatile qualifiers in my code to ensure it is doing the right thing? Shouldn’t I be worried about stale values sitting in L1 cache? Or about values in L1 not yet having been written back to global memory?

The code does work, I’ve checked. But I find it a bit strange.

Before you do anything else, write a test that confirms the code is actually behaving as you intend. You have many questions about correctness that would be best answered by you. If you need a C++ testing framework, I recommend using something like Catch. It’s header-only and is simple to use.

Maybe you’re not timing the code correctly (you might just be timing the kernel launch, not the execution time). Maybe your kernel isn’t running at all because you have a launch error. Impossible to say from what you have posted.

Hello Txbob,

Here is how I am timing my code. I think it is correct…

cudaEvent_t gpu_start, gpu_end;
cudaEventCreate(&gpu_start);
cudaEventCreate(&gpu_end);

cudaEventRecord(gpu_start, 0);                        // mark start on stream 0
findroots<<<(n + 1023) / 1024, 1024>>>(d_forest, n);  // ceil(n/1024) blocks of 1024 threads
cudaEventRecord(gpu_end, 0);                          // mark end on stream 0
cudaEventSynchronize(gpu_end);                        // wait for the kernel to finish
cudaEventElapsedTime(&gputime, gpu_start, gpu_end);   // elapsed kernel time, in ms
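
To rule out a silent launch failure, I can also check for errors right after the launch, along these lines (sketch):

cudaError_t err = cudaGetLastError();   // reports launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));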

I also check the input/output relation of my kernel. It is doing what it is supposed to do.

You say your CPU code is “standard”, but it’s using a totally different and significantly more complex algorithm to compute the forest roots. That algorithm may therefore have very different performance from the simple pointer-chase algorithm you’ve implemented on the GPU. Whether it’s faster or slower is very sensitive to the input forest itself.

For a better CPU<->GPU comparison, implement the same simple pointer-chase method on the CPU. Its behavior will likely be much more similar to the GPU’s… and it’s hard to say whether it’d be faster or slower than the GPU, since the GPU has more memory bandwidth but also higher latency, which does matter for pointer chasing.
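
Something as simple as this should do (sketch):

// The same pointer-jumping idea, run serially on the CPU.
void findroots_cpu(int *forest, int n)
{
    for (int i = 0; i < n; ++i) {
        while (forest[i] != forest[forest[i]])
            forest[i] = forest[forest[i]];   // jump to the grandparent
    }
}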

I would also generate a dozen wildly different input forests, ranging in both size and complexity, including corner cases like all-roots (great for the trivial algorithm) or one single chain in scrambled order (worst case for the trivial GPU pointer-chase algorithm). Different input sizes may also expose cache effects.
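
Generators for those two corner cases could look like this (sketch; the names are mine):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Corner case 1: every node is its own root (trivial input).
std::vector<int> make_all_roots(int n)
{
    std::vector<int> f(n);
    std::iota(f.begin(), f.end(), 0);       // f[i] = i
    return f;
}

// Corner case 2: one long chain, stored in scrambled memory order.
std::vector<int> make_scrambled_chain(int n, unsigned seed = 42)
{
    std::vector<int> order(n), f(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    f[order[0]] = order[0];                 // the chain's root points to itself
    for (int i = 1; i < n; ++i)
        f[order[i]] = order[i - 1];         // parent = previous link in the chain
    return f;
}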

Coming to your actual GPU code: you do have a memory race hazard, since threads are reading memory that other threads are simultaneously writing. BUT, and this is surprising and uncommon, in your very specific implementation that race might be harmless, since the result will still be correct whether a thread reads the old value or the new value.

I would at least throw in a “volatile” to ensure stale values don’t survive for more than one loop iteration.
Careful analysis would be needed to find out whether that is sufficient. Using atomicCAS() and forcing one more iteration if the values differ might help as well.
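
One possible shape for that, as an untested sketch (my reading of the idea; note the no-op CAS, which serves as a fresh atomic read of global memory):

__global__ void findroots_cas(int *forest, int n)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    if (ix >= n) return;
    while (true) {
        int parent = forest[ix];                               // only this thread writes forest[ix]
        int grandparent = atomicCAS(&forest[parent], -1, -1);  // no-op CAS == atomic read (indices are never -1)
        if (parent == grandparent) break;                      // parent points to itself: it is a root
        forest[ix] = grandparent;                              // jump to the grandparent
    }
}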

Hello SPWorley,

Yes, I think what you say is true… the code does seem to be formally correct whether we read the old value or the new value.
In either case, we always make progress towards the root of the particular tree we are dealing with. It might just be a peculiarity of this algorithm. Is it possible that when I read the memory I get an undefined value? Or do we always get a good value (whether it’s the old or the new value doesn’t matter)?

On a different note, do you know of other algorithms of this kind, in which we do not need to care about race conditions?

Yes, it makes sense to try the same kind of solution on the CPU. I will try that next.

Hello tera,

I have tried using volatile. It slows down the code substantially, and it produces the same correct results as the code without volatile.

__global__ void findrootofforests_volatile(int *fst, int n)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;

    // volatile forces every access to go out to memory instead of
    // reusing a possibly stale cached value.
    volatile int *pointer_fst = fst;

    if (ix < n) {
        while (pointer_fst[ix] != pointer_fst[pointer_fst[ix]]) {
            pointer_fst[ix] = pointer_fst[pointer_fst[ix]];
        }
    }
}

Thanks!

So many questions about correctness… So few tests… So many attempts at optimization…

I just… This is why we can’t have nice things!

This code is similar to “list ranking”. It basically shortens the pointer-chasing work exponentially.

Each processor chases its own pointer; since all processors do the same and share their results through memory, the distance from each node to its root roughly halves on every round, so a chain of length k resolves in about log2(k) iterations instead of k. That’s basically what you are seeing.