CUDA intrinsics?

Hi all,

So I am hoping to use CUDA to speed up my image processing convolution.
I am using the Maxwell GPU on my Jetson TX1, though I will be upgrading to another embedded system with a more recent GPU.
I have worked through the sample code for separable convolution (as my 5x5 kernel is separable); however, this works with floats.
I am hoping to convolve an input image of 8-bit ints and output 8-bit ints (though I’d like a 16-bit int intermediate representation).
Hence I was hoping to use SIMD intrinsics as well as the parallelisation CUDA provides to get even more speed up.
A bit of googling suggests some SIMD intrinsic support has been provided since Kepler devices (see CUDA Math API :: CUDA Toolkit Documentation), but further reading suggests these were only implemented in hardware on Kepler and have been emulated on later architectures.
I also read that newer GPUs (Pascal onwards) support a different form of intrinsics (see https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/).

My question is: what is the difference, and which one should I use going forward?

My other question may be slightly daft… but when I’ve used Intel intrinsics in the past I’ve used pointers. Why do the CUDA versions use functions which presumably take copies? Doesn’t this lead to poor performance? Or does it get optimised out?

Thanks in advance for any suggestions.

Keep in mind that the SIMD hardware instructions on Kepler generally only ran at 1/4 throughput, so for many common SIMD intrinsics, the emulated version is almost as fast, consisting mostly of instructions with full throughput. GPUs after Kepler retain the hardware support for sum-of-absolute-differences, which would be costly to emulate.

If the limited collection of operations provided by the SIMD intrinsics is sufficient for your processing of byte-wise data (this is usually the case for sequence matching in bioinformatics, for example), I would suggest giving that a try, as the word-wise handling of byte-sized data also has benefits for achieving good memory bandwidth.
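For example, a minimal sketch of byte-wise processing with these intrinsics might look like the following (kernel and variable names are mine; it assumes the image size is a multiple of four pixels and the buffers are 4-byte aligned):

// Per-byte absolute difference of two 8-bit images, four pixels per thread.
// Loading packed bytes as 32-bit words also helps memory throughput.
__global__ void absdiff_u8(const unsigned char *a, const unsigned char *b,
                           unsigned char *out, int numPixels)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // index in 4-pixel units
    if (idx * 4 < numPixels) {
        unsigned int pa = reinterpret_cast<const unsigned int *>(a)[idx];
        unsigned int pb = reinterpret_cast<const unsigned int *>(b)[idx];
        // __vabsdiffu4: per-byte |pa - pb|, packed into one 32-bit result
        reinterpret_cast<unsigned int *>(out)[idx] = __vabsdiffu4(pa, pb);
    }
}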

When considering use of some of the newer hardware instructions, carefully check which platforms you are going to use. For example, other than P100, Pascal-family GPUs have very low throughput for FP16 operations.

No idea what you mean. SSE intrinsics typically take __m128, __m128i, or __m128d arguments, while AVX intrinsics use __m256, __m256i, or __m256d arguments. These arguments then represent registers in the generated machine code.
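A small host-side sketch of what that looks like (names are mine); the intrinsic operates on __m128i values, which the compiler maps to registers, so nothing is copied through memory:

#include <emmintrin.h>   // SSE2

void add_bytes_16(const unsigned char *a, const unsigned char *b, unsigned char *out)
{
    // The loads produce __m128i values; _mm_add_epi8 takes and returns
    // such values, not pointers.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    __m128i vr = _mm_add_epi8(va, vb);   // 16 byte-wise additions
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), vr);
}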

Many thanks for your quick reply.
Thanks for clarifying the Kepler hardware support issue.

I am going to load patches of the image into thread blocks, and then each thread is going to combine multiple pixels into an output value. The input pixels are 1 byte; I was going to widen them to 2 bytes, do some multiplying and adding, then shift back down to a 1-byte output pixel. I was hoping to use intrinsics so that I could get a speed-up by combining my pixel operations into single operations.

Re. the new hardware instructions, this may be a silly question, but if throughput is low for some operations, how do I know whether the instructions are worth using for optimisation? Is there reference material, or do you just know from experience?

I think my final question, which you didn’t understand, didn’t actually make sense. For an erroneous reason I was expecting a function like:

__device__ unsigned int __vadd4(unsigned int a, unsigned int b)

to take pointer arguments. This was because I had seen them used with dereferenced pointers, and I was getting confused! Sorry.

To my recollection, none of the SIMD intrinsics include multiplication, so you would have to implement that part of the packed processing yourself. The __byte_perm() intrinsic may come in handy for that.
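A sketch of how the unpack/multiply/repack could go with __byte_perm() (one possible convention, with names of my own choosing; it assumes 8-bit pixels and a normalized kernel whose coefficients sum to at most 256, so each per-lane sum fits in 16 bits):

// Unpack four packed 8-bit pixels into two words of two zero-extended
// 16-bit lanes each; selector nibble 4 picks a zero byte from the second
// operand.
__device__ void unpack_u8x4(unsigned int pix, unsigned int &lo, unsigned int &hi)
{
    lo = __byte_perm(pix, 0, 0x4140);   // lanes: {pix.b1, pix.b0}
    hi = __byte_perm(pix, 0, 0x4342);   // lanes: {pix.b3, pix.b2}
}

// Repack the low byte of each 16-bit lane into four packed 8-bit pixels.
__device__ unsigned int pack_u8x4(unsigned int lo, unsigned int hi)
{
    return __byte_perm(lo, hi, 0x6420);
}

// One filter tap: a single 32-bit multiply scales both 16-bit lanes at once,
// provided each lane's running sum stays below 65536.
__device__ void mac_u16x2(unsigned int &acc, unsigned int lanes, unsigned int coeff)
{
    acc += lanes * coeff;
}

You would keep two such accumulators (one for the low pixel pair, one for the high pair); after the last tap, shift each right by the normalization amount and call pack_u8x4(). For shifts of eight bits or less, a whole-word shift does not disturb the extracted low bytes.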

The low throughput of FP16 operations on sm_61 Pascal devices is documented by NVIDIA, but I do not recall whether it is in the Programming Guide or elsewhere. From memory, FP16 throughput is 1/64 of the FP32 throughput. You could also measure the rates quite easily, at least approximately. The take-home message is that this low-throughput FP16 support is useful for prototyping (to establish functional correctness), but not much more.
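If you want to measure it yourself, a crude probe along these lines (my own sketch; requires sm_53 or later and cuda_fp16.h) chains half2 FMAs and reports an approximate rate. Running the same kernel with float in place of half2 gives you the FP16:FP32 ratio:

#include <cstdio>
#include <cuda_fp16.h>

// Long chain of half2 FMAs per thread; with enough threads in flight this
// approximates arithmetic throughput rather than latency.
__global__ void fp16_fma_chain(half2 *out, int iters)
{
    half2 a = __float2half2_rn(1.0009765625f);
    half2 b = __float2half2_rn(0.5f);
    half2 x = __float2half2_rn(0.0f);
    for (int i = 0; i < iters; ++i) {
        x = __hfma2(a, x, b);   // x = a * x + b, two FP16 FMAs per instruction
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the compiler honest
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    half2 *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(half2));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fp16_fma_chain<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 2 FLOPs per FMA x 2 lanes per half2
    double flops = 4.0 * double(iters) * blocks * threads;
    printf("approx. FP16 rate: %.1f GFLOP/s\n", flops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}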

Good point - I didn’t realise that.

So I found my info on the new instructions here:

https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

I had held out hope for big improvements as they said:

“The new NVIDIA Tesla P100, powered by the GP100 GPU, can perform FP16 arithmetic at twice the throughput of FP32. The GP102 (Tesla P40 and NVIDIA Titan X), GP104 (Tesla P4), and GP106 GPUs all support instructions that can perform integer dot products on 2- and 4-element 8-bit vectors, with accumulation into a 32-bit integer. These instructions are valuable for implementing high-efficiency deep learning inference, as well as other applications such as radio astronomy.”

They go on to suggest even better results with 8-bit ints… Maybe I have misunderstood?

As the accurate description says: sm_60 has the fast FP16 processing, while sm_61 has a couple of specialized instructions, DP2A and DP4A, for integer dot products. These are not in sm_60, as best I understand.

If your processing can make use of a low-precision integer dot product, then great. Since you mentioned multiplication, it may be a good fit. The way to find out would be to hack together a quick prototype.
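If it turns out to be applicable, a rough sketch of one group of taps with the unsigned __dp4a() overload could look like this (names are mine; sm_61 or later only, so not the TX1's Maxwell GPU; this handles aligned groups of four pixels, whereas a sliding window needs the overlapping groups assembled first, e.g. with __byte_perm(), and negative coefficients need the signed overload):

// Dot four packed 8-bit pixels against four packed 8-bit coefficients,
// accumulating into a 32-bit integer, then normalize and clamp to a byte.
__global__ void dot4_u8(const unsigned int *src, unsigned char *dst,
                        unsigned int packedCoeffs, int shift, int numWords)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numWords) {
        unsigned int pixels = src[i];                        // four pixels
        unsigned int acc = __dp4a(pixels, packedCoeffs, 0u); // 4-way dot product
        dst[i] = (unsigned char)min(acc >> shift, 255u);
    }
}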

Many thanks for your help. It is much appreciated.