CUDA for quaternions (hyper-complex numbers) operations


I’m quite new to GPGPU and CUDA, and i need to find a way to do some quaternions operations on GPU. (Such as the product of a vector of quaternions with an other vector of quaternions). To do so i need to use the hamilton product ( equivalent for the product of two quaternions) :

In the end my goal is to perform a kind of feed forward phase doing W * Q where W is a vector of N quaternions and Q an other vector of M quaternions.

I’m firstly looking for a library, but if i need to code my own, i will need some help understanding how i can perform this hamilton product in parallel computing … An interresting thing is that a quaternion can be represented as a matrix of 4*4 real numbers, and the product of two quaternions represented in this way is equivalent to the hamilton product. (But how to represent this in a GPU “way”)

Thanks a lot !

Instead of trying to map operations for a single quaternion to multiple threads, consider running a sequential algorithm working with quaternions for many quaternions in parallel - each thread processing one quaternion only.

Essentially it means taking a CPU implementation of a quaternion class and simply using it on the GPU with almost no modifications.

I’ve only used quaternions for implementing spatial rotations in CUDA, so I probably can’t help you much with your particular problem.

Here’s my quaternion helper class. Probably not the prettiest implementation, but it does the job. Note the use of a float4 vector to hold the 1,i,j,k components of the rotation Quanternion. These map to the x,y,z and w components of the float4 vector. Not sure why I went for a typedef approach for declaring struct _Quaternion as Quaternion.

It’s all 32bit float precision and the entire object can stay in registers when you declare a quaternion as a local variable within the CUDA kernel.

// Quaternion helper class describing rotations
// this allows for a nice description and execution of rotations in 3D space.
typedef struct _Quaternion
    // rotation around a given axis (given sine and cosine of HALF the rotation angle)
    static __device__ __forceinline__ struct _Quaternion describe_rotation(const float3 v, const float sina_2, const float cosa_2)
        struct _Quaternion result;
        result.q = make_float4(cosa_2, sina_2*v.x, sina_2*v.y, sina_2*v.z);
        return result;

    // rotation around a given axis (angle without range restriction)
    static __device__ __forceinline__ struct _Quaternion describe_rotation(const float3 v, const float angle)
        float sina_2 = sinf(angle/2);
        float cosa_2 = cosf(angle/2);
        struct _Quaternion result;
        result.q = make_float4(cosa_2, sina_2*v.x, sina_2*v.y, sina_2*v.z);
        return result;

    // rotate a point v in 3D space around the origin using this quaternion
    // see EN Wikipedia on Quaternions and spatial rotation
    __device__ __forceinline__ float3 rotate(const float3 v) const
        float t2 =   q.x*q.y;
        float t3 =   q.x*q.z;
        float t4 =   q.x*q.w;
        float t5 =  -q.y*q.y;
        float t6 =   q.y*q.z;
        float t7 =   q.y*q.w;
        float t8 =  -q.z*q.z;
        float t9 =   q.z*q.w;
        float t10 = -q.w*q.w;
        return make_float3(
            2.0f*( (t8 + t10)*v.x + (t6 -  t4)*v.y + (t3 + t7)*v.z ) + v.x,
            2.0f*( (t4 +  t6)*v.x + (t5 + t10)*v.y + (t9 - t2)*v.z ) + v.y,
            2.0f*( (t7 -  t3)*v.x + (t2 +  t9)*v.y + (t5 + t8)*v.z ) + v.z

    // rotate a point v in 3D space around a given point p using this quaternion
    __device__ __forceinline__ float3 rotate_around_p(const float3 v, const float3 p)
        return p + rotate(v - p);

    // 1,i,j,k
    float4 q;
} Quaternion;

I’m currently trying to implement something with float4. I’ll try using your approach but does the performance gain is worth it ?The gain will be N times faster where N is the number of quaternion processed ?

Thanks a lot,

it’s not as simple as assuming N CUDA threads run N times faster than the CPU.

a) the CPU is also multithreaded and has 8x (AVX) vector units too.
b) the GPU has very different clock rates and different instruction throughput per clock cycle.
c) the number of CUDA cores available on your GPU is a limiting factor.
d) a lot of CUDA applications will be limited by memory throughput and not by computational speed. Memory throughput is a function of the hardware you buy (in the range of 200 GB/sec to 300 GB/sec for the better nVidia GPUs). This can be about 10x the throughput of a typical single CPU Intel system.

I just noticed my above posted code could make use of the __sincosf function to accelerate these two lines further.

float sina_2 = sinf(angle/2);
float cosa_2 = cosf(angle/2);


float sina_2, cosa2;
__sincosf(angle/2, &sina_2, &cosa_2);

I’m currently programming on GTX 870M but the final Neural Network will run on TITAN X. I’ve tested this approach and it work, but i currently need a huge number of input quaternions to be faster than CPU (Because of the memory alloc and copy on the device). Hope that the increasement of device computation will reduce this number inputs needed (currently only a vector * vector and a matrix * vector + vector on the futur)

Thanks for your advices

Consider storing your quaternions in half float precision (ushort). This about halves the required memory bandwidth for transferring/reading the data.

If you have professional Tesla P100 cards, they can also do half float computations at high speed. The consumer devices (Geforce etc) will have to use __half2float intrinsics and perform the computations in 32 bit float precision for best performance.

also, have you seen this thread?

I’ve already seen this thread, just before posting. And i’m trying to code my own using his solution as a model. I’ll try with half precision float. I also have a question regarding this code :

In Quaternion.cpp the hamilton product is defined as following

__device__ __host__ void Quaternion::hamilton(Quaternion q2){

        q = make_float4(q.w * q2.q.w - q.x * q2.q.x + q.y * q2.q.y + q.z * q2.q.z,
        q.w * q2.q.x + q.x * q2.q.w + q.y * q2.q.z - q.z * q2.q.y,
        q.w * q2.q.y - q.x * q2.q.z + q.y * q2.q.w + q.z * q2.q.x,
        q.w * q2.q.z + q.x * q2.q.y - q.y * q2.q.x + q.z * q2.q.w);

In Main.cpp this product is called in this kernel :

__global__ void mulKernel(Quaternion *q, Quaternion *q2, int size){
        int idx = threadIdx.x + blockIdx.x*blockDim.x;
        if(idx < blockDim.x)

each q. and q2. in the hamilton function reprensent one single read in the global memory ? If so, maybe the use of shared memory would be better no ?