Shuffle instructions

Hi,
I have a question: as far as I know, shuffle instructions are mostly used when the lane id is known at compile time. In my case, I have to compute the lane id at runtime. How can I guarantee correctness?

The code looks like this:

int item_index = ch * NUM_STATES + all_states[s];

//which lane stores the state?
int target_lane = item_index % WARP_SIZE;

//what is the index for that lane?
int target_index = item_index / WARP_SIZE;

all_states[s] = __shfl_sync(0xffffffff, local_table[target_index], target_lane, WARP_SIZE);

target_lane is the lane id for the shuffle, and it is calculated at runtime. local_table is an array declared as "int local_table[N];". It is also read-only: I never update the values in local_table.

When I run my program, the result of __shfl_sync is incorrect.

Thanks a lot for any help!

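The source lane does not calculate the index used by the target lane. The target lane computes that index itself, to determine what it will provide: in a shuffle, each lane evaluates local_table[target_index] with its own value of target_index, and the calling lane receives whatever the target lane evaluated.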
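
One possible way around this, as a minimal sketch rather than a drop-in fix: loop over the register slots with an index that is the same in every lane, shuffle each slot, and keep only the value for the slot you wanted. The helper name warp_table_lookup, the table sizes and the test data below are assumptions made up for illustration; only __shfl_sync and the CUDA runtime calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;
constexpr int NUM_PER_THREAD = 4;                        // assumed size
constexpr int NUM_TRANSIT = WARP_SIZE * NUM_PER_THREAD;  // assumed size

// Hypothetical helper: returns entry item_index of a table striped across
// the warp's registers. All 32 lanes of the warp must call it together.
__device__ int warp_table_lookup(const int (&local_table)[NUM_PER_THREAD],
                                 int item_index)
{
    int target_lane  = item_index % WARP_SIZE;  // lane holding the entry
    int target_index = item_index / WARP_SIZE;  // slot within that lane
    int result = 0;
#pragma unroll
    for (int k = 0; k < NUM_PER_THREAD; k++) {
        // k is identical in every lane, so the lane we read from provides
        // exactly the slot we are asking about in this iteration.
        int v = __shfl_sync(0xffffffff, local_table[k], target_lane, WARP_SIZE);
        if (k == target_index) result = v;
    }
    return result;
}

__global__ void lookup_kernel(const int *trans_table, const int *queries, int *out)
{
    int lane = threadIdx.x % WARP_SIZE;
    int local_table[NUM_PER_THREAD];
#pragma unroll
    for (int i = 0; i < NUM_PER_THREAD; i++)
        local_table[i] = trans_table[lane + i * WARP_SIZE];

    out[lane] = warp_table_lookup(local_table, queries[lane]);
}

// Host driver: returns the number of lanes whose result differs from a plain
// table lookup, or -1 if copying the results back reported a CUDA error.
int run_lookup_demo()
{
    int h_table[NUM_TRANSIT], h_queries[WARP_SIZE], h_out[WARP_SIZE];
    for (int i = 0; i < NUM_TRANSIT; i++) h_table[i] = 1000 + i;
    for (int l = 0; l < WARP_SIZE; l++) h_queries[l] = (l * 37 + 5) % NUM_TRANSIT;

    int *d_table = nullptr, *d_queries = nullptr, *d_out = nullptr;
    cudaMalloc(&d_table, sizeof(h_table));
    cudaMalloc(&d_queries, sizeof(h_queries));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);
    cudaMemcpy(d_queries, h_queries, sizeof(h_queries), cudaMemcpyHostToDevice);

    lookup_kernel<<<1, WARP_SIZE>>>(d_table, d_queries, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_table);
    cudaFree(d_queries);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int l = 0; l < WARP_SIZE; l++)
        if (h_out[l] != h_table[h_queries[l]]) mismatches++;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_lookup_demo());
    return 0;
}
```

The trade-off is NUM_PER_THREAD shuffles per lookup instead of one. In exchange, local_table is only ever indexed by the unrolled loop counter, and each lane can still request a different target_lane.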

If that doesn’t help, then I would suggest providing a complete test case (not your whole code) as well as the results you get and the results you are expecting.

Hi,
Thanks a lot for your kind reply!
Here is the code

//Load transit table into register cache
    //notice that it is read-only
    int local_table[NUM_PER_THREAD];

    //each lane store NUM_PER_THREAD items of the transit table into registers
    //int start = NUM_PER_THREAD * lane;
    int index = lane;
    for (int i = 0; (i < NUM_PER_THREAD) && (index < NUM_TRANSIT); i++) {
        local_table[i] = trans_table[index];
        index += WARP_SIZE;
    }

    /*if((lane == 22)&&(offset < 100)){
        printf("the element is %d\n", local_table[1]);
    }*/



    //Initialize all states
    int all_states[NUM_STATES];
#pragma unroll
    for (int s = 0; s < NUM_STATES; s++) {
        all_states[s] = s;
    }


    int start_index = offset * num_per_thread;

    //number of items to process
    int num = num_per_thread;
    if (offset == num_thread - 1) {
        num = num_entry - start_index;
    }

    for (int j = 0; j < num; j++) {
        int index = start_index + j;
        int ch = input[index];

#pragma unroll
        for (int s = 0; s < NUM_STATES; s++) {

            int item_index = ch * NUM_STATES + all_states[s];

            //which lane stores the state?
            int target_lane = item_index % WARP_SIZE;

            //what is the index for that lane?
            int target_index = item_index / WARP_SIZE;

            all_states[s] = __shfl_sync(0xffffffff, local_table[target_index], target_lane, WARP_SIZE);

            /*if(offset < 10) {
                printf("offset is %d, ch is %d, s is %d, index is %d, all_states[s] is %d\n", offset, ch, s, item_index, all_states[s]);
            }*/
        }
    }

The result is that the value of all_states[s] is incorrect. For example, I checked the second element of the 22nd lane: it stores 1, but when I try to read the same value, I get 10. I guess there is a synchronization problem here.

Thanks a lot!

Can anyone give me some insights or suggestions?
Thanks a lot!

I suggest providing a test case that I can copy, paste, compile and run, without having to add anything or change anything. Put a printf in there to indicate the output you don’t understand.

The code contains several files in different folders.

I see one statement:
“When Should You Use Register Caching?
There are cases where the register cache is not applicable. First, the access pattern should be known at compile time…”

This is from the post at https://devblogs.nvidia.com/register-cache-warp-cuda/

Thanks a lot!

I think “the access pattern should be known at compile time” means that shuffle instructions cannot guarantee correctness when the lane id is calculated at runtime. Is my understanding correct?
Thanks a lot!

No, not correct. Register caching there refers to the idea that the compiler can put certain array-style access patterns into registers, if the access indexes can be computed at compile time.

It’s not directly related to warp shuffle. The warp shuffle parameters can all be computed at runtime.
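
To illustrate that last point, here is a small sketch in which the source lane depends on a kernel argument and is therefore only known at runtime. The kernel name, the shift parameter (assumed non-negative) and the lane * 10 test values are made up for the example; __shfl_sync itself is the real intrinsic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Each lane reads the value held by the lane `shift` positions above it,
// wrapping around the warp. `shift` is only known at runtime.
__global__ void rotate_kernel(int shift, int *out)
{
    int lane     = threadIdx.x % WARP_SIZE;
    int value    = lane * 10;                   // every lane provides this
    int src_lane = (lane + shift) % WARP_SIZE;  // computed at runtime
    out[lane] = __shfl_sync(0xffffffff, value, src_lane, WARP_SIZE);
}

// Host driver: returns the number of lanes with an unexpected result,
// or -1 if copying the results back reported a CUDA error.
int run_rotate_demo(int shift)
{
    int h_out[WARP_SIZE];
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(h_out));

    rotate_kernel<<<1, WARP_SIZE>>>(shift, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int l = 0; l < WARP_SIZE; l++)
        if (h_out[l] != ((l + shift) % WARP_SIZE) * 10) mismatches++;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_rotate_demo(5));
    return 0;
}
```

This works with a single shuffle because the shuffled variable is a plain scalar that every lane evaluates the same way. Only the choice of source lane varies at runtime.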