Hello everyone, long time no see. I'm confused about lane operations and hope you can help me. Many thanks!
Suppose I launch the kernel with the following configuration:

```cpp
dim3 grid(Ceil(size, 32));
dim3 block(32, 32);
```
And the kernel looks like the following (to keep it simple I deleted some code; the full code is here: pytorch/Embedding.cu at master · pytorch/pytorch · GitHub):
```cpp
static const int WARP_SIZE = 32;
static const int BLOCKDIMY = 32;

__global__ void embedding_backward_feature_kernel
  (int64_t* indices,
   int n)
{
  extern __shared__ char buf[];
  int* indices_batch = (int*)(buf + sizeof(int)*WARP_SIZE*blockDim.y);

  int tid = threadIdx.x + threadIdx.y*blockDim.x;
  if(tid < n)
    indices_batch[tid] = (int)indices[tid];

  int dst_row = indices_batch[threadIdx.y];
  int match_found_this_thread = (dst_row == indices_batch[threadIdx.x]);
  unsigned long long int matchmask = __ballot(match_found_this_thread);
  int first_remaining_peer = __ffsll(matchmask) - 1;

  if(threadIdx.y == first_remaining_peer){
    do_something();
  }
}
```
My questions are:
- Do `indices_batch[threadIdx.y]` and `indices_batch[threadIdx.x]` refer to the same value?
- Why can the highlighted code (the `__ballot`/`__ffsll` part) find the warps that operate on the same `dst_row`?