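The easiest way to do a per-warp critical section is to make all threads in the warp operate on exactly the same data. For example, if you want only the first element of an array to be processed by a given warp, have all threads in the warp process element 0. Since the warp executes in lockstep, every thread computes the same result that a single thread would have produced on its own. Note that lockstep execution is only guaranteed on pre-Volta hardware; on Volta and newer GPUs, threads in a warp can be scheduled independently, so put a __syncwarp() between the read and the write if the section updates data in place.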
So, to do a warp-wide critical section, simply make all threads in the warp run through identical code paths with identical input data.
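
As a rough illustration of the idea, here is a minimal sketch in which every thread of a single warp redundantly updates element 0. The kernel name `warp_section`, the host helper `run_warp_section`, and the particular update (`v * 2 + 1`) are made up for this example, and error checking is left out to keep it short. The `__syncwarp()` calls assume CUDA 9 or later.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every lane of the warp performs the same update on data[0].
__global__ void warp_section(float *data)
{
    // All lanes read the same element (the hardware broadcasts the load).
    float v = data[0];

    // Make sure every lane has finished reading before any lane writes.
    // Redundant under strict lockstep, but needed on Volta and newer GPUs.
    __syncwarp();

    // Identical input and identical code path, so every lane gets the same value.
    v = v * 2.0f + 1.0f;

    // All lanes store the same value to the same address. One of the stores
    // wins, and it does not matter which one.
    data[0] = v;
    __syncwarp();
}

// Hypothetical host helper: runs the kernel with one block of exactly one warp
// and returns the updated element 0.
float run_warp_section(float first)
{
    float *d = nullptr;
    float out = 0.0f;
    cudaMalloc(&d, sizeof(float));
    cudaMemcpy(d, &first, sizeof(float), cudaMemcpyHostToDevice);
    warp_section<<<1, 32>>>(d);
    cudaMemcpy(&out, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

Calling `run_warp_section(3.0f)` should give back 7.0f, the same as if a single thread had done the update. This only works because every lane writes the same value; as soon as the lanes would compute different results, you need atomics or a single elected lane instead.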