Theoretical performance using doubles

Hi,

It is exciting to know that NVIDIA GPUs will finally have support for double precision. Since it's a new feature, I was wondering what it means for the performance of raw compute or memory operations. It would be great if someone could give numbers or explain what to expect theoretically.

I performed a quick experiment to see the effect of using doubles in clearly compute-bound and clearly memory-bound situations. In the first case, kernel execution time increases by a factor of around 7.6 when all data (both host and device) is changed to doubles. In the memory-bound case, the slowdown is a factor of around 3.3. I guess the penalty would be smaller for real-life kernels.

I’m not sure what to infer from these numbers. Could someone please shed some light on what should be expected, and why? Here are the kernels I used:

  1. Compute-bound: This kernel performs a series of dependent MADs, reading an input from one array and eventually writing the result to another. (A sketch of the double-precision variant I compared against follows after the two kernels.)
__global__ void
gpu_cb(float myseed, float *myI, float *myO, unsigned int length)
{
        float x, acc;
        int idx=blockIdx.x*blockDim.x + threadIdx.x;

        x=*(myI+idx);
        acc=x*myseed;
#pragma unroll
        for(int i=0;i<256;i++)
        {
                acc+=(acc*x);
        }
        *(myO+idx)=acc;
}
  2. Memory-bound: This kernel performs a series of dependent (coalesced) memory operations. I can see from the PTX and the profiler that the expected number of memory operations actually occurs (the compiler does not optimize any of them out). In this case, the 'add' operation performed after every access might have interfered with the performance measurement, especially because the accesses are coalesced. What is a better way to design such a micro-benchmark?
__global__ void
gpu_mb(float myseed, float *myI, float *myO, unsigned int length)
{
        float x, acc;
        int idx=blockIdx.x*blockDim.x + threadIdx.x;
        float *add;

        add=myI+idx;
        x=*add;
#pragma unroll
        for(int i=0;i<256;i++)
        {
                add += 16;
                x+=*add;
        }
        acc=x*myseed;
        *(myO+idx)=acc;
}
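
For reference, the double-precision variant of the compute-bound kernel that I compared against is essentially the same code with float replaced by double (this is only a sketch of the substitution; it assumes compilation with -arch=sm_13 so the doubles are not demoted to floats):

__global__ void
gpu_cb_double(double myseed, double *myI, double *myO, unsigned int length)
{
        double x, acc;
        int idx=blockIdx.x*blockDim.x + threadIdx.x;

        x=*(myI+idx);
        acc=x*myseed;
#pragma unroll
        for(int i=0;i<256;i++)
        {
                acc+=(acc*x);   /* chain of dependent double-precision MADs */
        }
        *(myO+idx)=acc;
}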

Thanks for reading, and I apologize for any errors,

Anjul

For memory bandwidth testing: see the bw_test program posted here http://forums.nvidia.com/index.php?showtop…ndpost&p=292058
Given the new coalescing rules on sm_13 hardware, bw_test should be expanded with some new types (like char, double2, double4 (is there a double4?)).

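As a starting point, a minimal copy kernel templated on the element type could be instantiated for each of those types (just a sketch of mine, not the actual bw_test code):

template <typename T>
__global__ void copy_kernel(const T *in, T *out, unsigned int n)
{
        unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx < n)
                out[idx] = in[idx];     // one load + one store per element
}

// e.g.  copy_kernel<double2><<<grid, block>>>(d_in, d_out, n);
//       copy_kernel<char><<<grid, block>>>(d_in_c, d_out_c, n_c);

The achieved bandwidth for each instantiation is then 2*n*sizeof(T)/time (one read plus one write per element).
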
For benchmarking the compute abilities: see the test by Simon Green at http://forums.nvidia.com/index.php?showtop…ndpost&p=250179

I’ll be testing these myself as soon as I get the engineering sample running.

Hello,

I need programming advice for the new board:
I run a lattice Boltzmann kernel for fluid simulation.

On an 8800 Ultra I reach 560 MLUPS (just a number), i.e. a throughput of 60 GByte/sec with single precision.

On the new board I get:

single precision: 632 MLUPS, throughput 68 (!) GByte/sec
double precision: 114 MLUPS, throughput 12 GByte/sec

The numerical results are fine for the CFD simulation in both cases.

Double-precision case:
I load the values as floats, convert them to doubles inside the local collision algorithm, and convert them back to floats before writing them out.
(I hoped to accelerate the calculation by using floats only for the shift operation, so that I can exploit the bandwidth better.)
Is this probably a bad thing to do? (A rough sketch of the pattern follows below.)
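
In code, the pattern is roughly this (a simplified sketch with made-up names, not my real lattice Boltzmann kernel):

__global__ void collide(const float *f_in, float *f_out, unsigned int n)
{
        unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx >= n) return;

        // load as float so each value costs only 4 bytes of bandwidth
        double fd = (double)f_in[idx];

        // local collision arithmetic in double precision
        fd = fd - 0.5*(fd - 1.0);       // placeholder for the real collision step

        // convert back to float before writing out
        f_out[idx] = (float)fd;
}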

Is it better to also load and store the values in device memory as double precision?
(But then the performance will be divided by at least 2 due to the bandwidth.)

Best
Jonas