pop-count operations on GPUs

I am wondering if there is any instruction level support for doing pop-count operations

(counting the number of 1s in an arbitrary sized vector) on GPUs. Thanks.

No. But if you look at the reduction sample in the SDK you can modify it to do so easily.

Not really specific to CUDA on Linux…

You may already be aware of these resources, but just in case:

Pop-count in general is something that it’s nice to have an instruction for, but there are good ways of doing it with a few machine instructions, no loops and no memory accesses (always a good thing on CUDA).

See the fantastic 'Hackers Delight" from Henry S. Warren and the page “Bit Twiddling Hacks”:

[url=“Bit Twiddling Hacks”]http://graphics.stanford.edu/~seander/bith...BitsSetParallel[/url]

There are several ways to do pop-count like you said; however I am wondering how the performance of nvidia GPU will be , will be compared to Penryn/Nehalem which will have a 64-bit popcnt instruction.

Use the bit counts without mod (%), multiplication (*) or lookup tables. The ones using only bitmasks, + and bitwise logic ops are the fastest in my experience on G80 hardware. There is no instruction support for doing it directly, but you hardly have to worry about the speed of basic bitwise ops in CUDA as they’re very fast.

Yes, that’s my experience, too. I would recommend the ‘horizontal add’ variant that I think was the specific target of my link.

Penryn has SSE 4.1, but no existing architecture has SSE 4.2. POPCNT will be available in Nehalem and beyond, only.

As for speed comparisons, neither architecture is likely to be bottlenecked on this operation (as a rule, integer operations out of registers are cheap).

Remember that you can do up to 128 of the G80 POPCNT sequences at once; it’s an open question as to whether a Nehalem will be able to do more than 4-8 POPCNT instructions simultaneously. That’s a guess; 4 cores x 0.5-1 cycle throughput - this is optimistic, as ‘horizontal’ operations on SSE currently take >1 cycles, and Intel may not pull out all stops for an ‘application-specific acceleration’ instruction). If it was really easy to do POPCNT with high throughput with existing functional units, I suspect it would already be in SSE 4.1 or earlier.

And that’s comparing a 2006 G80 with a 2H 2008 Intel CPU.