slower thrust reductions on Windows 7/8 compared to Linux?

Hi all, I worked out a simple example to show the differences. I have a computational electromagnetics code that ends up doing anywhere from 90-360 reductions in the best case (on the order of 60k reductions in the worst case, but that’s a topic for a different thread…)

I noticed a while back that when I tried my code on a GTX Titan on Windows, the performance was pretty bad, and I realized the cause was the thrust-based reductions. Basically I have to reduce the real/imaginary parts of the x, y, z components of the electric/magnetic fields over some number of observation points; the 90-360 (up to 60k) figure is the number of observation points. I found that if I used a double3 struct, I got much better performance in the reductions via a thrust custom reduction functor, like the one here:

While that’s all fine and dandy, the reductions still take a long time in Windows. Here are my data points:

90 reductions on an 80k element double array:
GT740 (GK208) Linux 35 ms
GT740 (GK208) Windows (WDDM) 74 ms

Quadro K6000 Windows (TCC) 86 ms
Quadro K6000 Windows (WDDM) 150 ms
Quadro K6000 Linux 20 ms

My first question for now is… why do these thrust reductions take considerably longer in Windows, even with the TCC driver? Code attached below:

struct mydouble3 {
    double x, y, z;
};

struct mydouble3complex {
    double rx, ry, rz;
    double ix, iy, iz;
};


// binary functor: component-wise sum of two mydouble3
struct add_mydouble3 {
    __device__ mydouble3 operator()(const mydouble3& a, const mydouble3& b) const {
        mydouble3 r;
        r.x = a.x + b.x;
        r.y = a.y + b.y;
        r.z = a.z + b.z;
        return r;
    }
};

// binary functor: component-wise sum of two mydouble3complex
struct add_mydouble3complex {
    __device__ mydouble3complex operator()(const mydouble3complex& a, const mydouble3complex& b) const {
        mydouble3complex var;
        var.rx = a.rx + b.rx;
        var.ry = a.ry + b.ry;
        var.rz = a.rz + b.rz;

        var.ix = a.ix + b.ix;
        var.iy = a.iy + b.iy;
        var.iz = a.iz + b.iz;

        return var;
    }
};
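For completeness, the functors above would be used with the init-value overload of thrust::reduce, something like this (a sketch; `reduce_fields` is a hypothetical wrapper name, not from my code, but the thrust::reduce(first, last, init, binary_op) signature is the standard one):

```cuda
#include <thrust/reduce.h>
#include <thrust/device_ptr.h>

// Hypothetical usage: reduce n mydouble3 elements in ONE thrust::reduce call
// instead of three separate double reductions (one per component).
mydouble3 reduce_fields(mydouble3 *d_fields, int n)
{
    thrust::device_ptr<mydouble3> p(d_fields);   // wrap the raw device pointer
    mydouble3 init = {0.0, 0.0, 0.0};            // identity element for the sum
    return thrust::reduce(p, p + n, init, add_mydouble3());
}
```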

#include <stdio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>

#include <thrust/reduce.h>
#include <thrust/device_ptr.h>

#define NBLOCKS 256 // threads per block (unused in this reduced test case)

#define CUDA_CHECK_ERROR // Enable this for error checking

#define CudaSafeCall( err )     __cudaSafeCall( err, __FILE__, __LINE__ )
#define CudaCheckError()        __cudaCheckError( __FILE__, __LINE__ )

inline void __cudaSafeCall( cudaError err, const char *file, const int line )
{
#ifdef CUDA_CHECK_ERROR

#pragma warning( push )
#pragma warning( disable: 4127 ) // Prevent warning on do-while(0);

    do
    {
        if ( cudaSuccess != err )
        {
            printf("cudaSafeCall() failed at %s:%i : %s\n",
                     file, line, cudaGetErrorString( err ) );
            // exit( -1 );
        }
    } while ( 0 );

#pragma warning( pop )

#endif  // CUDA_CHECK_ERROR

    return;
}

inline void __cudaCheckError( const char *file, const int line )
{
#ifdef CUDA_CHECK_ERROR

#pragma warning( push )
#pragma warning( disable: 4127 ) // Prevent warning on do-while(0);

    do
    {
        cudaError_t err = cudaGetLastError();
        if ( cudaSuccess != err )
        {
            printf("cudaCheckError() failed at %s:%i : %s.\n",
                     file, line, cudaGetErrorString( err ) );
            // exit( -1 );
        }

        // More careful checking. However, this will affect performance.
        // Comment if not needed.
        err = cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated
        if( cudaSuccess != err )
        {
            printf("cudaCheckError() with sync failed at %s:%i : %s.\n",
                     file, line, cudaGetErrorString( err ) );
            // exit( -1 );
        }
    } while ( 0 );

#pragma warning( pop )

#endif // CUDA_CHECK_ERROR

    return;
}

int main(void) {

int nT = 80000;
int nr = 90;
int ind1, i;

cudaError_t cudaStatus;
cudaEvent_t start, stop;
float elapsedtime;

int sizeT = sizeof(double)*nT;

double * arr_h, * arr_d, * output;
arr_h  = new double[nT];
output = new double[nr];

for(i=0; i<nT; i++) {
	arr_h[i] = i;
}

CudaSafeCall( cudaMalloc( (void**)&arr_d, sizeT ) );
CudaSafeCall( cudaMemcpy(arr_d, arr_h, sizeT, cudaMemcpyHostToDevice) );

// wrap raw pointer with a device_ptr
thrust::device_ptr<double> arr_ptr(arr_d);


// cuda timer start
cudaEventCreate(&start);
cudaEventCreate(&stop); 
cudaEventRecord(start, 0);

// simulates nr reductions
for (ind1=0; ind1<nr; ind1++) {
	output[ind1] = thrust::reduce(arr_ptr, arr_ptr + nT);
}

cudaStatus = cudaDeviceSynchronize();
CudaSafeCall( cudaStatus );
cudaEventRecord(stop, 0); 
cudaEventSynchronize(stop); 
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
// cuda timer end


for (ind1=0; ind1<nr; ind1++) {
	printf("output[%d] = %g\n", ind1, output[ind1]);
}

printf ("Time for reductions: %f ms\n", elapsedtime);

CudaSafeCall( cudaFree(arr_d) );
delete[] arr_h;
delete[] output;

return 0;
}
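One thing worth noting about the loop above: each of the nr thrust::reduce calls launches its own kernels and copies a single double back to the host, so you pay the per-call overhead 90 times, which is exactly what WDDM punishes. If the segments can be laid out back-to-back in one device array, a single thrust::reduce_by_key call would batch all of them; this is a sketch I'm adding for illustration, not code from my application (the iterator/functor names are mine, but reduce_by_key, counting_iterator, and make_transform_iterator are standard thrust):

```cuda
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// key of element i in the concatenated array: segment index i / nT
struct seg_of {
    int nT;
    __host__ __device__ int operator()(int i) const { return i / nT; }
};

// reduce nr segments of nT doubles each with ONE thrust call
void batched_reduce(thrust::device_vector<double>& data,   // size nr*nT
                    thrust::device_vector<double>& sums,   // size nr
                    int nr, int nT)
{
    thrust::device_vector<int> out_keys(nr);   // discarded segment ids
    seg_of op; op.nT = nT;
    thrust::counting_iterator<int> idx(0);
    thrust::reduce_by_key(
        thrust::make_transform_iterator(idx, op),            // keys begin
        thrust::make_transform_iterator(idx + nr * nT, op),  // keys end
        data.begin(), out_keys.begin(), sums.begin());
}
```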

Using that exact code on my W7 machine I get an average of 32 ms for 90 reductions, though there is some variance when I run it multiple times (as high as 56 ms and as low as 31 ms).

This is using a K20c with the TCC driver, Visual Studio 2010 x64.

What you point out is odd; maybe not using thrust would help.

I did some quick math before I wrote the post and think NVIDIA’s reduction SDK sample might be about the same time-wise, but I’ll try it out and see how it performs.

I also saw variations myself, especially under Windows, both with and without TCC mode, about as large as yours. Strangely enough, on Linux there isn’t more than 1-2 ms of variation in the times I found, whereas in Windows it’s more like 20-30 ms of variance for this specific case.

What I have not seen anywhere is an example of how to do reductions on, say, a float3 or double3 variable. Every example I’ve seen assumes a simple array, but given that I’m trying to reduce x, y, z components over many reductions, it seems like it would be faster to do all three variables at once, avoiding the extra kernel overhead.

I did play around with different versions of double reduction code, and found it is not hard to beat thrust::reduce() even using atomics (I mean the unofficial double atomicAdd() function in combination with shared memory).

For thrust::reduce() on an array of 2^25 doubles it takes about 2 ms, as does my simple implementation, which has each thread sum 32 memory-strided elements and then uses shared memory to sum the block. That block total is atomicAdd()ed to a single global double (initialized to zero).
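The scheme just described can be sketched roughly like this (a sketch, not my exact code; the kernel name and the 256-thread block size are assumptions, and atomicAddDouble is the well-known CAS-based workaround for the missing native double atomicAdd on these GPUs):

```cuda
// CAS-based double atomicAdd for GPUs without a native one
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long *address_as_ull = (unsigned long long *)address;
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// launch with blocks = (n + 256*32 - 1) / (256*32), *out zeroed beforehand
__global__ void reduce_atomic(const double *in, double *out, int n)
{
    __shared__ double s[256];                 // assumes blockDim.x == 256
    int tid    = threadIdx.x;
    int i      = blockIdx.x * blockDim.x + tid;
    int stride = blockDim.x * gridDim.x;

    // each thread sums up to 32 grid-strided elements
    double sum = 0.0;
    for (int k = 0; k < 32 && i < n; ++k, i += stride)
        sum += in[i];
    s[tid] = sum;
    __syncthreads();

    // shared-memory tree reduction within the block
    for (int ofs = blockDim.x / 2; ofs > 0; ofs >>= 1) {
        if (tid < ofs) s[tid] += s[tid + ofs];
        __syncthreads();
    }

    // one atomicAdd of the block total into the global result
    if (tid == 0) atomicAddDouble(out, s[0]);
}
```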

On my 3.9GHz CPU it took about 77 ms for the same double reduction code.

My non-optimal CUDA version also took 2 ms like thrust::reduce(), but if you use an even better implementation like Jimmy P’s:

https://devtalk.nvidia.com/default/topic/520792/?comment=3838737

you should be able to beat thrust::reduce() for 64 bit doubles.

Is there a native aligned double3 type? I see ulong3 and ulong4, but no 64-bit 3-element type.

What would be the optimal way to implement an aligned double3 type? This seems like something that would come up frequently.
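For what it’s worth, I believe vector_types.h does define a double3, but only with the natural 8-byte alignment, since there is no 24-byte vector load instruction anyway. The choices are roughly these (a sketch; the padded struct name is mine):

```cuda
// option (a): pad to 32 bytes with 16-byte alignment, so the struct can be
// read as two 16-byte (double2-style) transactions at the cost of 8 wasted
// bytes per element
struct __align__(16) double3pad {
    double x, y, z;
    double pad;   // padding only, never touched
};

// option (b): keep the natural 8-byte alignment and accept three 8-byte
// loads per element -- this is effectively what the built-in double3 does
struct double3nat {
    double x, y, z;
};
```

Whether the padding in (a) pays off depends on the access pattern; for fully coalesced streaming reads the bandwidth wasted on the pad may cancel out the wider transactions.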

Oh, and this testing was done using Windows 7 with the TCC driver. If you want to see the reduction code mentioned, PM me, but Jimmy’s is probably faster (though that version uses more memory than mine).

Also, does __shfl work for 64-bit numbers? I see only 32-bit examples.
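On Kepler, __shfl() itself operates on 32-bit registers, but a double can be shuffled by splitting it into its high/low 32-bit halves with the standard intrinsics (a sketch; the wrapper name is mine, the intrinsics are real):

```cuda
// shuffle a 64-bit double by shuffling its two 32-bit halves separately
__device__ double shfl_double(double v, int srcLane)
{
    int hi = __double2hiint(v);        // upper 32 bits
    int lo = __double2loint(v);        // lower 32 bits
    hi = __shfl(hi, srcLane);          // two 32-bit shuffles
    lo = __shfl(lo, srcLane);
    return __hiloint2double(hi, lo);   // reassemble the double
}
```

(I believe later CUDA versions add __shfl_sync() overloads that accept 64-bit types such as double directly, making this splitting unnecessary.)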