Oops, and mine runs 8, 8, 6… What I meant by the last question: like this code, any complex app has to grab its maximum number of threads for each block up front. If it then does as here (always using threads 0 up to the number required, which is a neat method), the top threads (warps) will just keep skipping from sync to sync without doing any computation until they are required. The question is: does this cost anything in performance?
Edit: OK, each idle thread must at least do a TID comparison.
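
To make the pattern concrete, here is a minimal sketch of what I mean. It is not the code from this thread: the kernel name `phased`, the parameters `nPhaseA` and `nPhaseB`, and the sizes are all made up for illustration. The block is launched with the largest thread count any phase needs, each phase uses threads 0 up to its own count, and the barrier sits outside the guard so every thread reaches it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical two-phase kernel (names and sizes are illustrative only).
// The block is launched with the maximum thread count any phase needs;
// each phase uses threads 0..n-1 and the remaining threads fall through.
__global__ void phased(float *data, int nPhaseA, int nPhaseB)
{
    int tid = threadIdx.x;

    // Phase A: threads with tid >= nPhaseA only evaluate the comparison.
    if (tid < nPhaseA) {
        data[tid] *= 2.0f;
    }

    // The barrier is outside the guard, so idle threads still arrive here.
    __syncthreads();

    // Phase B: a narrower phase, so whole warps at the top do no work.
    // Each active thread writes data[tid] and reads data[tid + nPhaseB],
    // which no thread writes in this phase (assumes 2 * nPhaseB <= blockDim.x).
    if (tid < nPhaseB) {
        data[tid] += data[tid + nPhaseB];
    }
}

int main()
{
    const int maxThreads = 256;            // assumed maximum over all phases
    float h[maxThreads];
    for (int i = 0; i < maxThreads; ++i) h[i] = 1.0f;

    float *d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    // One block; phase A uses all 256 threads, phase B only the first 64.
    phased<<<1, maxThreads>>>(d, maxThreads, 64);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[0] = %f, h[200] = %f\n", h[0], h[200]);

    cudaFree(d);
    return 0;
}
```

In phase B, threads 64 to 255 (six full warps) do nothing except the `tid < nPhaseB` test and, in any phase followed by a sync, the barrier arrival. That per-phase test plus the barrier is the overhead the question is about.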