Sources of "warp serialize" events in profiler output?

I’m profiling a kernel that I ported in full from CPU code to OpenCL, using the NVIDIA OpenCL profiler. I’m seeing a high “warp serialize” event count (24 million). I’ve already changed my accesses to local (shared) memory to avoid bank conflicts, which lowered the “warp serialize” count only slightly (from 25 million to 24 million). I also changed every use of double to float, which had no effect.

So I’m running out of possible candidates. I make heavy use of texture memory. Does each texture access incur a warp serialize? If that is the case, it would explain the high count. There is also a single barrier call per kernel, but that shouldn’t cause that many warp serializations. I have a few divergent branches (~3.6K).

Here’s the raw profiler data; perhaps I’m chasing the wrong thing. Sorry, I don’t know a better format for sharing this.