Shuffle instructions

xia.425 · February 26, 2019, 5:22pm

Hi,
I have a question: as far as I know, shuffle instructions are mostly used when the lane ID is known at compile time. Now I have to compute the lane ID at runtime; how can I guarantee correctness?

The code looks like this:
…
int item_index = ch * NUM_STATES + all_states[s];

//which lane stores the state?
int target_lane = item_index % WARP_SIZE;

//what is the index for that lane?
int target_index = item_index / WARP_SIZE;

all_states[s] = __shfl_sync(0xffffffff, local_table[target_index], target_lane, WARP_SIZE);
…

target_lane is the lane ID for the shuffle, which is calculated at runtime. local_table is an array declared as "int local_table[N];". It is read-only: I never update its values.

When I run my program, I find that the result of __shfl_sync is incorrect.

Thanks a lot for any help!

Robert_Crovella · February 26, 2019, 6:00pm

The lane that reads the value does not calculate the index used by the lane it reads from (target_lane). That lane computes its own target_index, and that determines what it will provide.
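
To illustrate the point: in `__shfl_sync(mask, var, srcLane)` each lane evaluates `var` with its own values, so the reader receives whichever `local_table` element the providing lane selected for itself, not the element the reader wanted. Below is a minimal sketch of one possible workaround, not the original poster's code. It assumes the table is striped across lanes as in the question (element `i` lives in lane `i % 32`, slot `i / 32`); the names `kSlots`, `lookup`, `demo` and `run_demo` are made up for this illustration. Every lane offers each slot in turn, and the reader keeps the one it needs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kSlots = 4;                       // hypothetical: table elements held per lane
constexpr int kTableSize = kWarpSize * kSlots;

// Read table[item_index] from a table striped across the warp's registers.
// All 32 lanes must call this together, because the shuffle uses the full mask.
__device__ int lookup(const int (&local_table)[kSlots], int item_index) {
    int src_lane = item_index % kWarpSize;      // lane that owns the element
    int src_slot = item_index / kWarpSize;      // slot inside that lane
    int result = 0;
#pragma unroll
    for (int k = 0; k < kSlots; k++) {
        // Every lane offers the same slot k, so the providing lane's own
        // src_slot no longer matters. The selection happens after the shuffle.
        int candidate = __shfl_sync(0xffffffff, local_table[k], src_lane, kWarpSize);
        if (k == src_slot) result = candidate;
    }
    return result;
}

__global__ void demo(const int *table, const int *queries, int *out) {
    int lane = threadIdx.x % kWarpSize;
    int local_table[kSlots];
#pragma unroll
    for (int k = 0; k < kSlots; k++) local_table[k] = table[lane + k * kWarpSize];
    out[lane] = lookup(local_table, queries[lane]);
}

// Host driver: returns the number of lanes whose result differs from
// table[query], or -1 if a CUDA call fails.
int run_demo() {
    int h_table[kTableSize], h_queries[kWarpSize], h_out[kWarpSize];
    for (int i = 0; i < kTableSize; i++) h_table[i] = 1000 + i;
    for (int i = 0; i < kWarpSize; i++) h_queries[i] = (i * 37 + 5) % kTableSize;

    int *d_table = nullptr, *d_queries = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_table, sizeof(h_table)) != cudaSuccess ||
        cudaMalloc(&d_queries, sizeof(h_queries)) != cudaSuccess ||
        cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);
    cudaMemcpy(d_queries, h_queries, sizeof(h_queries), cudaMemcpyHostToDevice);

    demo<<<1, kWarpSize>>>(d_table, d_queries, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_table);
    cudaFree(d_queries);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kWarpSize; i++)
        if (h_out[i] != h_table[h_queries[i]]) mismatches++;
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", run_demo());
    return 0;
}
```

This costs `kSlots` shuffles per lookup instead of one, so it only makes sense when each lane holds a small number of elements. Because `k` is a compile-time constant after unrolling, the compiler can keep `local_table` in registers.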

If that doesn’t help, then I would suggest providing a complete test case (not your whole code) as well as the results you get and the results you are expecting.

xia.425 · February 26, 2019, 6:10pm

Hi,
Thanks a lot for your kind reply!
Here is the code

//Load transit table into register cache
    //notice that it is read-only
    int local_table[NUM_PER_THREAD];

    //each lane stores NUM_PER_THREAD items of the transit table in registers
    //int start = NUM_PER_THREAD * lane;
    int index = lane;
    for (int i = 0; (i < NUM_PER_THREAD) && (index < NUM_TRANSIT); i++) {
        local_table[i] = trans_table[index];
        index += WARP_SIZE;
    }

    /*if((lane == 22)&&(offset < 100)){
        printf("the element is %d\n", local_table[1]);
    }*/



    //Initialize all states
    int all_states[NUM_STATES];
#pragma unroll
    for (int s = 0; s < NUM_STATES; s++) {
        all_states[s] = s;
    }


    int start_index = offset * num_per_thread;

    //number of items to process
    int num = num_per_thread;
    if (offset == num_thread - 1) {
        num = num_entry - start_index;
    }

    for (int j = 0; j < num; j++) {
        int index = start_index + j;
        int ch = input[index];

#pragma unroll
        for (int s = 0; s < NUM_STATES; s++) {

            int item_index = ch * NUM_STATES + all_states[s];

            //which lane stores the state?
            int target_lane = item_index % WARP_SIZE;

            //what is the index for that lane?
            int target_index = item_index / WARP_SIZE;

            all_states[s] = __shfl_sync(0xffffffff, local_table[target_index], target_lane, WARP_SIZE);

            /*if(offset < 10) {
                printf("offset is %d, ch is %d, s is %d, index is %d, all_states[s] is %d\n", offset, ch, s, item_index, all_states[s]);
            }*/
        }
    }

The result is that the value of all_states[s] is incorrect. For example, I checked the second element of lane 22: it stores 1, but when I try to read that same value, I get 10. I guess there is a synchronization problem here.

Thanks a lot!

xia.425 · February 26, 2019, 6:28pm

Can anyone give me some insights or suggestions?
Thanks a lot!

Robert_Crovella · February 26, 2019, 8:14pm

I suggest providing a test case that I can copy, paste, compile and run, without having to add anything or change anything. Put a printf in there to indicate the output you don’t understand.

xia.425 · February 27, 2019, 3:17pm

The code is spread over several files in different folders.

I see one statement:
“When Should You Use Register Caching?
There are cases where the register cache is not applicable. First, the access pattern should be known at compile time…”

in this post: https://devblogs.nvidia.com/register-cache-warp-cuda/

Thanks a lot!

xia.425 · February 27, 2019, 3:22pm

I think “the access pattern should be known at compile time” means that shuffle instructions cannot guarantee correctness when the lane ID is calculated at runtime. Is my understanding correct?
Thanks a lot!

Robert_Crovella · February 27, 2019, 5:11pm

No, not correct. Register caching there refers to the idea that the compiler can put certain array-style access patterns into registers, if the access indexes can be computed at compile time.

It’s not directly related to warp shuffle. The warp shuffle parameters can all be computed at runtime.
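
As a small check of that last statement, here is a sketch in which the source lane is derived from a kernel argument, so it cannot be known at compile time. The kernel and helper names (`rotate`, `run_rotate`) are hypothetical, and `shift` is assumed to be non-negative. Note that the shuffled variable is a plain scalar, so no per-lane array indexing is involved.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Each lane fetches the value held by the lane `shift` positions ahead of it.
__global__ void rotate(int shift, int *out) {
    int lane = threadIdx.x % kWarpSize;
    int value = lane * 10;                        // scalar held by this lane
    int src_lane = (lane + shift) % kWarpSize;    // only known at run time
    out[lane] = __shfl_sync(0xffffffff, value, src_lane, kWarpSize);
}

// Host driver: returns the number of lanes with an unexpected value,
// or -1 if a CUDA call fails.
int run_rotate(int shift) {
    int h_out[kWarpSize];
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;

    rotate<<<1, kWarpSize>>>(shift, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kWarpSize; i++)
        if (h_out[i] != ((i + shift) % kWarpSize) * 10) mismatches++;
    return mismatches;
}

int main(int argc, char **argv) {
    // Take the shift from the command line so it is a true run-time value.
    int shift = (argc > 1) ? atoi(argv[1]) : 5;
    printf("shift %d, mismatches: %d\n", shift, run_rotate(shift));
    return 0;
}
```

The compile-time restriction from the blog post applies to how a per-thread array such as `local_table` is indexed, not to the `srcLane` argument of the shuffle.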
