__global__ void kernel(const int dstcnt, const int2 *dstactive, …)
{
    for (int activationIdId = threadIdx.x; activationIdId < dstcnt; activationIdId += blockDim.x)
    {
        int2 activationId = dstactive[activationIdId];
        int cid = activationId.x >> 16;
        activationId.x &= 0xffff;
        activationId.y &= 0xffff;
        …
    }
}
This code generates two global reads for activationId: one 16-bit and one 32-bit. Apparently the compiler tries to optimize activationId.y &= 0xffff away, but a single 64-bit read would be faster, since the buffer is well aligned.
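One possible workaround, sketched here under assumptions rather than offered as a confirmed fix, is to issue the load through volatile inline PTX so the compiler cannot split or narrow it. The helper load_int2_64, the kernel name unpack, the output buffer out, and the host function run_check are all hypothetical names introduced for illustration. The sketch assumes dstactive points to 8-byte-aligned global memory, which holds for int2 elements of a cudaMalloc'd buffer.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): fetch both halves of an
// int2 with one 64-bit vector load. The asm is volatile, so the compiler may
// not split it into narrower loads. Assumes p is 8-byte-aligned global memory.
__device__ __forceinline__ int2 load_int2_64(const int2 *p)
{
    int2 v;
    asm volatile("ld.global.v2.s32 {%0, %1}, [%2];"
                 : "=r"(v.x), "=r"(v.y)
                 : "l"(p));
    return v;
}

// Same unpacking as the kernel above. `out` is an assumed output buffer,
// added only so the result can be checked on the host.
__global__ void unpack(const int dstcnt, const int2 *dstactive, int3 *out)
{
    for (int i = threadIdx.x; i < dstcnt; i += blockDim.x)
    {
        int2 a = load_int2_64(&dstactive[i]);
        const int cid = a.x >> 16;
        a.x &= 0xffff;
        a.y &= 0xffff;
        out[i] = make_int3(cid, a.x, a.y);
    }
}

// Host-side check: returns the number of mismatching elements
// (0 on success) or -1 if a CUDA call fails.
int run_check(int n)
{
    assert(n > 0);
    std::vector<int2> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = make_int2(((i % 100) << 16) | (i & 0xffff),
                         (7 << 16) | ((3 * i) & 0xffff));

    int2 *d_in = nullptr;
    int3 *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int2)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_out, n * sizeof(int3)) != cudaSuccess)
    {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h.data(), n * sizeof(int2), cudaMemcpyHostToDevice);

    unpack<<<1, 128>>>(n, d_in, d_out);

    std::vector<int3> r(n);
    const cudaError_t err =
        cudaMemcpy(r.data(), d_out, n * sizeof(int3), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
    {
        const bool ok = r[i].x == (h[i].x >> 16) &&
                        r[i].y == (h[i].x & 0xffff) &&
                        r[i].z == (h[i].y & 0xffff);
        if (!ok)
            ++bad;
    }
    return bad;
}
```

Calling run_check(1000) from main should return 0 if the unpacked values match the host reference. Whether the load is actually emitted as a single 64-bit instruction, and whether that is faster on a given GPU, is worth confirming by inspecting the generated SASS (for example with cuobjdump -sass) and by timing both variants.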