yo.
I'm trying to learn how independent thread scheduling (ITS) works.
I have a simple kernel that runs a heavy or a light task depending on the warp lane.
If the warp lane is odd, the thread does three global memory read/writes; if it's even, it does no memory work at all.
__global__ void mb_its(int *mem, struct timing *timing)
{
    int idx = threadIdx.x;
    int lane = idx % 32;

    timing[lane].start = clock();
    if (lane % 2)
    {
        if (idx == 1) printf("%d in\n", idx);
        timing[lane].inloop = clock();
        if (idx == 1) printf("%d mem1\n", idx);
        mem[idx] = mem[idx] + idx;
        if (idx == 1) printf("%d mem2\n", idx);
        mem[idx] = mem[idx] + idx;
        if (idx == 1) printf("%d mem3\n", idx);
        mem[idx] = mem[idx] + idx;
        if (idx == 1) printf("%d out\n", idx);
        timing[lane].inloopend = clock();
    }
    else
    {
        if (idx == 0) printf("%d in\n", idx);
        timing[lane].inloop = clock();
        if (idx == 0) printf("%d out\n", idx);
        timing[lane].inloopend = clock();
    }
    timing[lane].outloop = clock();
}
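For anyone who wants to reproduce this, here is a rough sketch of a complete program around the kernel. It is only a guess at the missing parts: the post never shows `struct timing`, so the four `clock_t` fields are assumed from the names used in the kernel, and the single-warp launch (`<<<1, 32>>>`) is assumed too. The kernel is a condensed copy with the printf calls left out, so that the timestamps can be looked at on their own.

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Assumed layout: only the field names appear in the original kernel.
struct timing {
    clock_t start, inloop, inloopend, outloop;
};

// Condensed copy of the kernel from the post, without the printf calls.
__global__ void mb_its(int *mem, struct timing *timing)
{
    int idx = threadIdx.x;
    int lane = idx % 32;

    timing[lane].start = clock();
    if (lane % 2) {
        timing[lane].inloop = clock();
        mem[idx] = mem[idx] + idx;
        mem[idx] = mem[idx] + idx;
        mem[idx] = mem[idx] + idx;
        timing[lane].inloopend = clock();
    } else {
        timing[lane].inloop = clock();
        timing[lane].inloopend = clock();
    }
    timing[lane].outloop = clock();
}

int main()
{
    const int threads = 32;  // assumption: one block holding a single warp
    int *d_mem = NULL;
    struct timing *d_timing = NULL;
    struct timing h_timing[threads];

    cudaMalloc(&d_mem, threads * sizeof(int));
    cudaMalloc(&d_timing, threads * sizeof(struct timing));
    cudaMemset(d_mem, 0, threads * sizeof(int));
    cudaMemset(d_timing, 0, threads * sizeof(struct timing));

    mb_its<<<1, threads>>>(d_mem, d_timing);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_timing, d_timing, sizeof(h_timing), cudaMemcpyDeviceToHost);

    // Show lanes 0 and 1 relative to lane 0's start timestamp.
    long base = (long)h_timing[0].start;
    for (int lane = 0; lane < 2; ++lane) {
        printf("lane %d: in=%ld inend=%ld out=%ld\n", lane,
               (long)h_timing[lane].inloop - base,
               (long)h_timing[lane].inloopend - base,
               (long)h_timing[lane].outloop - base);
    }

    cudaFree(d_mem);
    cudaFree(d_timing);
    return 0;
}
```

Comparing lane 1's `in`/`inend` window with lane 0's gives a second view of whether the two branches overlapped in time, one that does not depend on the order of the printf output.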
My expectation here is NOT to see all the prints of thread 0 followed by all the prints of thread 1 (or vice versa). From what I have understood of ITS, the prints should be interleaved, like this:
1 in
0 in
1 mem1
0 out
…
But the actual output looks like this:
1 in
1 mem1
1 mem2
1 mem3
1 out
0 in
0 out
why??
Compiling with:
nvcc main.cu
Also tried:
nvcc main.cu -rdc=true
Running on a GTX 1650 Ti (CC 7.5).
nvcc version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
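Neither compile command above passes an `-arch` flag, so it may be worth checking which architecture the device code was actually built for. ITS arrived with Volta (compute capability 7.0), and as far as I understand the scheduling behaviour depends on the target the code was compiled for and not only on the GPU it runs on. A minimal sketch of such a check follows; `report_arch` is a made-up helper name, not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper kernel: prints the architecture this device code targets.
__global__ void report_arch()
{
#if defined(__CUDA_ARCH__)
    // __CUDA_ARCH__ is only defined while compiling device code,
    // e.g. 300 for compute_30 or 700 for compute_70.
    printf("device code built for __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}

int main()
{
    // Compute capability of the physical GPU (device 0 assumed).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    // Architecture the kernel was compiled for.
    report_arch<<<1, 1>>>();
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "report_arch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If the second line reports a value below 700 while the device itself is 7.5, the kernel was built for a pre-Volta virtual architecture. Even when both report 7.x, ITS does not promise any particular interleaving of the two branches, so a sequential-looking trace is not by itself proof that ITS is off.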