CUDA intrinsics?

Hi all,

So I am hoping to use CUDA to speed up my image processing convolution.
I am using the Maxwell GPU on my Jetson TX1, though I will be upgrading to another embedded system with a more recent GPU.
I have worked through the sample code for separable convolution (as my 5x5 kernel is separable); however, this works with floats.
I am hoping to convolve an input image of 8-bit ints and output 8-bit ints (though I’d like a 16-bit int intermediate representation).
Hence I was hoping to use SIMD intrinsics as well as the parallelisation CUDA provides to get even more speed up.
A bit of googling suggests some SIMD intrinsic support has been provided since Kepler devices (see CUDA Math API :: CUDA Toolkit Documentation), but further reading suggests these were only implemented in hardware on Kepler and have been emulated on later architectures.
I also read that newer GPUs (Pascal onwards) support a different form of intrinsics (see https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/).

My question is: what is the difference, and which one should I use going forward?

My other question may be slightly daft… but when I’ve used Intel intrinsics in the past I’ve used pointers. Why do the CUDA versions use functions which presumably take copies? Doesn’t this lead to poor performance? Or does it get optimised out?

Thanks in advance for any suggestions.

Keep in mind that the SIMD hardware instructions on Kepler generally only ran at 1/4 throughput, so for many common SIMD intrinsics, the emulated version is almost as fast, consisting mostly of instructions with full throughput. GPUs after Kepler retain the hardware support for sum-of-absolute-differences, which would be costly to emulate.

If the limited collection of operations provided by the SIMD intrinsics is sufficient for your processing of byte-wise data (this is usually the case for sequence matching in bioinformatics, for example), I would suggest giving that a try, as the word-wise handling of byte-sized data also has benefits for achieving good memory bandwidth.
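For example, a minimal sketch of byte-wise processing with these intrinsics might look like the following (kernel and variable names are mine; it assumes the image size is a multiple of four pixels and the buffers are 4-byte aligned):

// Per-byte absolute difference of two 8-bit images, four pixels per thread.
// Loading packed bytes as 32-bit words also helps memory throughput.
__global__ void absdiff_u8(const unsigned char *a, const unsigned char *b,
                           unsigned char *out, int numPixels)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // index in 4-pixel units
    if (idx * 4 < numPixels) {
        unsigned int pa = reinterpret_cast<const unsigned int *>(a)[idx];
        unsigned int pb = reinterpret_cast<const unsigned int *>(b)[idx];
        // __vabsdiffu4: per-byte |pa - pb|, packed into one 32-bit result
        reinterpret_cast<unsigned int *>(out)[idx] = __vabsdiffu4(pa, pb);
    }
}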

When considering use of some of the newer hardware instructions, carefully check which platforms you are going to use. For example, other than P100, Pascal-family GPUs have very low throughput for FP16 operations.

No idea what you mean. SSE intrinsics typically take __m128, __m128i, or __m128d arguments, while AVX intrinsics use __m256, __m256i, or __m256d arguments. These arguments then represent registers in the generated machine code.
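A small host-side sketch of what that looks like (names are mine); the intrinsic operates on __m128i values, which the compiler maps to registers, so nothing is copied through memory:

#include <emmintrin.h>   // SSE2

void add_bytes_16(const unsigned char *a, const unsigned char *b, unsigned char *out)
{
    // The loads produce __m128i values; _mm_add_epi8 takes and returns
    // such values, not pointers.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    __m128i vr = _mm_add_epi8(va, vb);   // 16 byte-wise additions
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), vr);
}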

Many thanks for your quick reply.
Thanks for clarifying the Kepler hardware support issue.

I am going to load patches of the image into thread blocks, and then each thread is going to combine multiple pixels into an output value. The input pixels are 1 byte; I was going to widen them to 2 bytes, do some multiplying and adding, then shift back down to a 1-byte output pixel. I was hoping to use intrinsics so that I could get a speed-up by combining my pixel operations into single operations.

Re. the new hardware instructions, this may be a silly question, but if throughput is low for some operations, how do I know whether the instructions are worth using for optimisation? Is there reference material, or do you just know from experience?

I think my final question, which you didn’t understand, didn’t actually make sense. For an erroneous reason I was expecting a function like:

__device__ unsigned int __vadd4(unsigned int a, unsigned int b)

to take pointer arguments. This was because I had seen them used with dereferenced pointers, and I was getting confused! Sorry.

To my recollection, none of the SIMD intrinsics include multiplication, so you would have to implement that part of the packed processing yourself. The __byte_perm() intrinsic may come in handy for that.
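A sketch of how the unpack/multiply/repack could go with __byte_perm() (one possible convention, with names of my own choosing; it assumes 8-bit pixels and a normalized kernel whose coefficients sum to at most 256, so each per-lane sum fits in 16 bits):

// Unpack four packed 8-bit pixels into two words of two zero-extended
// 16-bit lanes each; selector nibble 4 picks a zero byte from the second
// operand.
__device__ void unpack_u8x4(unsigned int pix, unsigned int &lo, unsigned int &hi)
{
    lo = __byte_perm(pix, 0, 0x4140);   // lanes: {pix.b1, pix.b0}
    hi = __byte_perm(pix, 0, 0x4342);   // lanes: {pix.b3, pix.b2}
}

// Repack the low byte of each 16-bit lane into four packed 8-bit pixels.
__device__ unsigned int pack_u8x4(unsigned int lo, unsigned int hi)
{
    return __byte_perm(lo, hi, 0x6420);
}

// One filter tap: a single 32-bit multiply scales both 16-bit lanes at once,
// provided each lane's running sum stays below 65536.
__device__ void mac_u16x2(unsigned int &acc, unsigned int lanes, unsigned int coeff)
{
    acc += lanes * coeff;
}

You would keep two such accumulators (one for the low pixel pair, one for the high pair); after the last tap, shift each right by the normalization amount and call pack_u8x4(). For shifts of eight bits or less, a whole-word shift does not disturb the extracted low bytes.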

The low throughput of FP16 operations on sm_61 Pascal devices is documented by NVIDIA, but I do not recall whether it is in the Programming Guide or elsewhere. From memory, FP16 throughput is 1/64 of the FP32 throughput. You could also measure the rates quite easily, at least approximately. The take-home message is that this low-throughput FP16 support is useful for prototyping (to establish functional correctness), but not much more.
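If you want to measure it yourself, a crude probe along these lines (my own sketch; requires sm_53 or later and cuda_fp16.h) chains half2 FMAs and reports an approximate rate. Running the same kernel with float in place of half2 gives you the FP16:FP32 ratio:

#include <cstdio>
#include <cuda_fp16.h>

// Long chain of half2 FMAs per thread; with enough threads in flight this
// approximates arithmetic throughput rather than latency.
__global__ void fp16_fma_chain(half2 *out, int iters)
{
    half2 a = __float2half2_rn(1.0009765625f);
    half2 b = __float2half2_rn(0.5f);
    half2 x = __float2half2_rn(0.0f);
    for (int i = 0; i < iters; ++i) {
        x = __hfma2(a, x, b);   // x = a * x + b, two FP16 FMAs per instruction
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the compiler honest
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    half2 *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(half2));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fp16_fma_chain<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 2 FLOPs per FMA x 2 lanes per half2
    double flops = 4.0 * double(iters) * blocks * threads;
    printf("approx. FP16 rate: %.1f GFLOP/s\n", flops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}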

Good point - I didn’t realise that.

So I found my info on the new instructions here:

https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

I had held out hope for big improvements as they said:

“The new NVIDIA Tesla P100, powered by the GP100 GPU, can perform FP16 arithmetic at twice the throughput of FP32. The GP102 (Tesla P40 and NVIDIA Titan X), GP104 (Tesla P4), and GP106 GPUs all support instructions that can perform integer dot products on 2- and 4-element 8-bit vectors, with accumulation into a 32-bit integer. These instructions are valuable for implementing high-efficiency deep learning inference, as well as other applications such as radio astronomy.”

They go on to suggest even better results with 8-bit ints… Maybe I have misunderstood?

As the accurate description says: sm_60 has the fast FP16 processing, while sm_61 has a couple of specialized instructions, DP2A and DP4A, for integer dot products. These are not in sm_60, as best I understand.

If your processing can make use of a low-precision integer dot product, then great. Since you mentioned multiplication, it may be a good fit. The way to find out would be to hack together a quick prototype.
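If it turns out to be applicable, a rough sketch of one group of taps with the unsigned __dp4a() overload could look like this (names are mine; sm_61 or later only, so not the TX1's Maxwell GPU; this handles aligned groups of four pixels, whereas a sliding window needs the overlapping groups assembled first, e.g. with __byte_perm(), and negative coefficients need the signed overload):

// Dot four packed 8-bit pixels against four packed 8-bit coefficients,
// accumulating into a 32-bit integer, then normalize and clamp to a byte.
__global__ void dot4_u8(const unsigned int *src, unsigned char *dst,
                        unsigned int packedCoeffs, int shift, int numWords)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numWords) {
        unsigned int pixels = src[i];                        // four pixels
        unsigned int acc = __dp4a(pixels, packedCoeffs, 0u); // 4-way dot product
        dst[i] = (unsigned char)min(acc >> shift, 255u);
    }
}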

Many thanks for your help. It is much appreciated.