Questions about lane operation

Hello everyone, long time no see. I am confused about lane operations and hope you can give me some help. Many thanks!

Suppose I launch the kernel with the following parameters:

dim3 grid(Ceil(size, 32));
dim3 block(32, 32);

The kernel looks like the following (to simplify, I deleted some code; the full code is here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Embedding.cu)

static const int WARP_SIZE = 32;
static const int BLOCKDIMY = 32;

__global__ void embedding_backward_feature_kernel
(int64_t* indices,
int n)
{
  extern __shared__ char buf[];
  int* indices_batch = (int*)(buf + sizeof(int)*WARP_SIZE*blockDim.y);

  int tid = threadIdx.x + threadIdx.y*blockDim.x;
  if(tid < n)
    indices_batch[tid] = (int)indices[tid];

  int dst_row = indices_batch[threadIdx.y];

  int match_found_this_thread =(dst_row == indices_batch[threadIdx.x]);

  unsigned long long int matchmask =__ballot(match_found_this_thread);
  int first_remaining_peer = __ffsll(matchmask) - 1;

  if(threadIdx.y == first_remaining_peer){
    do_something();
  }
}

My questions are:

  1. Do indices_batch[threadIdx.y] and indices_batch[threadIdx.x] refer to the same value?
  2. Why can the code in orange find the warps that operate on the same dst_row?

  1. Generally, no. threadIdx.y and threadIdx.x are usually different, so the two expressions usually read different elements; the values are equal only when those elements happen to hold the same index.
  2. I don’t understand the English here.

Let me try to make my question clearer, though I am not sure I can, because I am still confused…

As I understand it, the variable dst_row holds one index per warp. For example, the threads with threadIdx.y=0 and threadIdx.x from 0 to 31 form warp0, and one of warp0’s indices is saved into dst_row.
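To make the warp layout above concrete, here is a minimal host-side sketch in plain C++ (not the kernel itself). It assumes the usual CUDA ordering, where the threads of a block are linearized with threadIdx.x varying fastest and each consecutive group of 32 forms a warp. The names kWarpSize, ThreadPos, linear_tid and warp_and_lane are made up for illustration.

```cpp
#include <cassert>

// Assumed warp width; matches WARP_SIZE in the post.
constexpr int kWarpSize = 32;

// Hypothetical helper type: which warp a thread belongs to and its lane in that warp.
struct ThreadPos {
    int warp;
    int lane;
};

// Linear thread id inside a 2D block: x varies fastest, as in
// tid = threadIdx.x + threadIdx.y * blockDim.x from the kernel.
inline int linear_tid(int tx, int ty, int block_dim_x) {
    assert(block_dim_x > 0 && tx >= 0 && tx < block_dim_x && ty >= 0);
    return tx + ty * block_dim_x;
}

// Split the linear id into (warp, lane).
inline ThreadPos warp_and_lane(int tx, int ty, int block_dim_x) {
    const int tid = linear_tid(tx, ty, block_dim_x);
    return ThreadPos{tid / kWarpSize, tid % kWarpSize};
}
```

With block(32, 32), warp_and_lane(5, 7, 32) gives warp 7 and lane 5. In other words, for this launch configuration the warp number equals threadIdx.y and the lane equals threadIdx.x, so every thread of a warp reads the same indices_batch[threadIdx.y].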

The code “int match_found_this_thread =(dst_row == indices_batch[threadIdx.x]);” then checks each thread in warp0: if the thread’s value equals dst_row, match_found_this_thread is true. __ballot(match_found_this_thread) then builds a mask over the threads in warp0, so this mask marks all the threads that have the same index.
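The vote described above can be emulated on the CPU with a small C++ sketch. As I read the snippet, bit x of the mask is set when indices_batch[x] equals this warp's dst_row, and since warp x is the one that owns indices_batch[x], the lowest set bit names the lowest-numbered warp with the same dst_row. That reading, the 32-bit mask (the post uses a 64-bit one with __ffsll), and the names emulate_ballot, first_set_bit and is_leader are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed warp width; matches WARP_SIZE in the post.
constexpr int kWarpSize = 32;

// Emulates the warp vote for the warp with threadIdx.y == warp_y:
// lane x contributes bit x when indices_batch[x] == dst_row.
uint32_t emulate_ballot(const std::vector<int>& indices_batch, int warp_y) {
    assert(indices_batch.size() == static_cast<std::size_t>(kWarpSize));
    assert(warp_y >= 0 && warp_y < kWarpSize);
    const int dst_row = indices_batch[warp_y];
    uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (indices_batch[lane] == dst_row) {
            mask |= (1u << lane);
        }
    }
    return mask;
}

// Same result as __ffs(mask) - 1: position of the lowest set bit, or -1 if none.
int first_set_bit(uint32_t mask) {
    for (int bit = 0; bit < 32; ++bit) {
        if (mask & (1u << bit)) {
            return bit;
        }
    }
    return -1;
}

// True when warp_y is the lowest-numbered warp among those sharing its dst_row,
// i.e. the condition threadIdx.y == first_remaining_peer from the kernel.
bool is_leader(const std::vector<int>& indices_batch, int warp_y) {
    return first_set_bit(emulate_ballot(indices_batch, warp_y)) == warp_y;
}
```

For example, if entries 3, 10 and 20 of indices_batch all hold the same index and no other entry does, the mask for warps 3, 10 and 20 has exactly bits 3, 10 and 20 set, and only warp 3 passes the is_leader check. A warp whose index is unique always has its own bit as the lowest one, so it is its own leader.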