I have read and heard that the number of threads assigned to a block should always be a multiple of the warp size (32), otherwise performance drops. But it is not clear to me why performance drops.
In my case, every thread processes one element. The number of elements in one task is K, and K never exceeds 1024, i.e. K can be any value from 1 to 1024.
- Can I set block_size to K?
- Or is it better to set block_size to ((K + 31) >> 5) << 5, i.e. K rounded up to the next multiple of 32?
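To make the two options concrete, here is a minimal host-side sketch of the arithmetic I have in mind. The names `kWarpSize`, `round_up_to_warp` and `warps_in_block` are hypothetical helpers of my own, not CUDA API calls, and the sketch assumes a warp size of 32 and 1 <= K <= 1024.

```cpp
#include <cassert>

// Assumption: warp size is 32, as stated in the question.
const int kWarpSize = 32;

// Second option: K rounded up to the next multiple of 32.
// Same expression as ((K + 31) >> 5) << 5.
inline int round_up_to_warp(int k) {
    assert(k >= 1 && k <= 1024);  // range of K given in the question
    return ((k + kWarpSize - 1) >> 5) << 5;
}

// Number of warps a block of `threads` threads is split into.
// A block whose size is not a multiple of 32 ends with a partially filled warp.
inline int warps_in_block(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}
```

For example, with K = 100 the first option launches 100 threads and the second launches 128; as far as I understand, both are split into 4 warps. With the padded size the kernel would presumably also need a guard such as `if (threadIdx.x < K)` so the extra threads do not touch elements that do not exist.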