SM occupancy question

My kernel is launched with ~1000 blocks of ~1024 threads each.

However, one section of the code is only executed by the first 5 blocks, i.e. I have a conditional like:

if (blockIdx.x < 5) { do x; } else { do nothing; }
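For concreteness, here is a minimal sketch of the setup I'm describing. The kernel name `process`, the buffers `in` and `out`, the shared array `tile`, and the `v *= 2.0f` stand-in for "do x" are all made up for illustration; my real kernel does more work, but the shape is the same: every block loads into shared memory, and only the first 5 blocks run the extra section.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every block stages its data in shared memory,
// but only blocks 0..4 execute the extra section.
// Assumes blockDim.x <= 1024 so the shared tile is large enough.
__global__ void process(const float* in, float* out, int n)
{
    __shared__ float tile[1024];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global into shared memory (the step I want to do only once).
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    float v = tile[threadIdx.x];

    // Block-level condition: it evaluates the same for every thread in a block.
    if (blockIdx.x < 5) {
        v *= 2.0f;  // placeholder for "do x"
    }
    // else: do nothing

    if (gid < n) {
        out[gid] = v;
    }
}

int main()
{
    const int threads = 1024;
    const int blocks  = 1000;
    const int n       = threads * blocks;

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    process<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    // One element from block 0 (runs "do x") and one from block 5 (skips it).
    printf("block 0: %f, block 5: %f\n", out[0], out[5 * threads]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```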

My question is: how “idle” will the streaming multiprocessors be while only those 5 blocks have work to do?

I really don’t want to launch a second kernel with only 5 blocks of ~1024 threads, because I want to reduce the number of times I load from global into shared memory (shared memory isn’t persistent between kernel launches).