“For optimization reason, we like to launch least number of GPU threads but larger than number of source rows. It is predictable for the first step, but not easy job to estimate exact number of rows generated by join prior to execution”
I fail to see the optimization mentioned here.
In my opinion, if your kernel threads live long enough that redundant threads are actually a concern, and you launch enough of them, you can easily build flexibility into the kernel dimensions by moving to semi-persistent threads: each thread loops over several rows instead of handling exactly one.
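A device can only seat a limited number of threads at the same time (the number of SMs times the maximum resident threads per SM), and on many devices that is fewer than 30k.
Hence, there should be little difference between issuing kernel blocks with 30k threads and issuing kernel blocks with 1k threads that do the work of 30k threads.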
If a thread is going to get up just so that another can sit down, why should the first thread get up in the first place?
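To make that concrete, here is a minimal sketch of the semi-persistent idea using a grid-stride loop. It is only an illustration under my own assumptions: `scale_rows`, `resident_thread_capacity` and `count_mismatches` are hypothetical names, and the per-row work (multiplying by a factor) is a stand-in for whatever the real kernel does per row.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles rows tid, tid + stride, tid + 2*stride, ...
// so the launch shape no longer has to match the (possibly unknown) row count.
// The per-row work here is a placeholder, not the original poster's kernel.
__global__ void scale_rows(const float *in, float *out, int n_rows, float factor)
{
    int stride = gridDim.x * blockDim.x;
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n_rows; row += stride)
        out[row] = in[row] * factor;
}

// Hypothetical helper: upper bound on threads that can be resident at once.
// Real occupancy may be lower (registers, shared memory). Returns -1 on error.
int resident_thread_capacity(int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}

// Hypothetical helper: runs the kernel with a fixed launch shape and returns
// the number of rows with a wrong result (0 means success), or -1 on a CUDA error.
int count_mismatches(int n_rows, int n_blocks, int n_threads)
{
    std::vector<float> h_in(n_rows), h_out(n_rows, 0.0f);
    for (int i = 0; i < n_rows; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = (size_t)n_rows * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Far fewer threads than rows: the loop inside the kernel covers the rest.
    scale_rows<<<n_blocks, n_threads>>>(d_in, d_out, n_rows, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n_rows; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++bad;
    return bad;
}
```

Calling `count_mismatches(30000, 4, 256)` processes 30000 rows with only 1024 threads. With this pattern the row count becomes a kernel argument rather than a launch dimension, so a grid sized from something like `resident_thread_capacity()` could be reused even when the join output size is only known late.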