CUDA basics (looping, branching): why is the GPU faster than the CPU when both use C?

We were directed to ask questions here. Any ideas will be deeply appreciated.

Thanks.

  1. First of all, we are curious about the sample C program in CUDA compared to the typical C programs we have written for a long time (p. 29/38 of the Tesla Computer Overview slides). The new CUDA program gets rid of the loop and maps the computation directly onto the dimensions of the array. Does that amount to a direct mapping of threads to array elements (see the sketch after this list)? We believe that this way there are fewer data dependencies to worry about for data parallelism. What other significant changes does a CUDA program have from a typical C program?

  2. Second, CUDA is based on C, and while a CUDA C program is written in an unusual style, it is not so different from a typical C program. So if you run comparable C code on the CPU and CUDA code on the GPU, why can't the CPU run as fast as the GPU? Besides ILP, what else do we have to worry about in CUDA C programming? What about stepping back, taking a traditional C program, and forcing it into CUDA? Would there be large performance deficiencies because of the possible branching outcomes and data hazards, or perhaps the inefficient double-precision (64-bit) calculations on the 16 SPs?

  3. On the same page (p. 29/38), when we use integers to compute the indices, it takes longer than floating point. For the statement:

if (i < N && j < N) c[index] = a[index] + b[index];

We think this is the right way to store the data, but is there a better way to improve this instruction?
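For concreteness, here is roughly the contrast we mean in question 1, sketched from memory (the function names and the guard are ours, not the slide's exact code):

// Typical C: a single loop walks the whole array.
void add_cpu(const float *a, const float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

// CUDA: the loop disappears; each thread computes one index from its
// block/thread coordinates (the "direct mapping" we asked about).
__global__ void add_gpu(const float *a, const float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)  // guard threads that fall past the end of the array
        c[i] = a[i] + b[i];
}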

Thanks,
Paul

Performance is not a function of what programming language you use!

The G80 hardware is mainly faster because it has 128 processors, each processing a separate piece of data in parallel.

Have you read the programming guide yet? There is loads of good information in there.

Having said that, I will offer some advice. Analyzing the performance is highly dependent on the algorithm you are implementing. Let's assume for the moment that your algorithm can be made data-parallel, which is the first and hardest task in implementing an algorithm on the GPU. If you cannot process tens of thousands of pieces of data independently of one another, you aren't going to implement your algorithm efficiently on the GPU.

With that assumption, the first step in analyzing your performance is to calculate what performance you would expect if the device were operating near its theoretical limits. If your algorithm is memory access bound, you can reasonably expect ~60 to ~70 GB/s of data transferred from global memory into registers (the theoretical max is around ~80 GB/s). If you are compute bound instead, you can reasonably expect around ~150 to 200 GFLOP/s, depending highly on the details of the computation. If every single operation is a multiply-add, the theoretical max is 340 GFLOP/s.
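As a concrete illustration of that first step, here is a rough host-side timing fragment using CUDA events; the kernel name add_gpu, the device pointers, and the launch configuration are placeholders for the sketch, not code from the guide. Comparing the measured figure against the ~70 GB/s number above tells you how close you are to the memory-bound limit.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
add_gpu<<<blocks, threads>>>(d_a, d_b, d_c, N);  // hypothetical kernel
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

// A vector add moves 3 floats per element (two reads, one write).
double gbs = 3.0 * sizeof(float) * (double)N / (ms * 1e-3) / 1e9;
printf("effective bandwidth: %.1f GB/s\n", gbs);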

I have the most experience with memory bound applications on the GPU. The rules in the programming guide should be prioritized as follows:

  1. Memory access pattern
  2. Memory access pattern
  3. Memory access pattern
  4. Get a reasonable warp occupancy (at least ~66% or so)
  5. Avoid shared memory bank conflicts (see the sketch after this list)
  6. (Probably not worth mentioning) avoid divergent warps
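To illustrate item 5, here is a minimal kernel sketch contrasting a conflict-free shared memory read with a two-way bank conflict (the names and sizes are mine, not from the guide; G80 has 16 banks of 32-bit words, and conflicts are resolved per half-warp):

// Launch with exactly 256 threads per block for this illustration.
__global__ void bank_demo(const float *in, float *out)
{
    __shared__ float s[256];
    s[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // Conflict-free: thread t reads s[t], so the 16 threads of a
    // half-warp land in 16 different banks.
    float good = s[threadIdx.x];

    // Two-way conflict: a stride-2 index maps pairs of threads in a
    // half-warp onto the same bank, halving shared memory throughput.
    float bad = s[(2 * threadIdx.x) % 256];

    out[threadIdx.x] = good + bad;
}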

In my experience, pushing to get that last few % warp occupancy only changes performance by a tiny fraction, if at all, so it isn’t worth it. But there is a fair amount of performance to be gained going from, say, 30% occupancy to 60%.
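To show where a figure like that comes from: a G80 multiprocessor can hold at most 24 warps (768 threads). If, say, you run 256-thread blocks (8 warps each) and register pressure limits you to two resident blocks per multiprocessor (an assumed scenario, not a measurement), you get 16 active warps out of 24, i.e. roughly 67% occupancy.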

I got the impression from the guide that divergent warps are the root of all evil and should be avoided at all costs. At least in my application, this is not so. The implementation with divergent warps performs 10% faster than my divergence-free implementation, which needs to perform ~15% "wasted" computations to avoid divergence. Obviously, the hardware is quite capable of handling divergent warps. Maybe there would be a bigger difference in a more compute-bound application.

Hope this helps.

Thank you, sir. I will look into it and get back to you soon. Thanks a lot for the brief guidelines.

Thanks.

You said the priority should be like:

  1. Memory access pattern
  2. Memory access pattern
  3. Memory access pattern
  4. Get a reasonable warp occupancy (at least ~66% or so)
  5. Avoid shared memory bank conflicts
  6. (Probably not worth mentioning) avoid divergent warps

May I ask what exactly the first three are about? Are they referring to different memories, like device memory, local memory, and constant memory? Please let me know.

You also talked about divergent warps. Could you cite an example or two from the CUDA samples, or from your own programs, showing how divergent warps arise? I am still digesting the material in the guide, so I would like more help with it.

Thanks.

I just repeated the first three for emphasis. Since I was specifically discussing applications where memory access is the performance bottleneck, the thing that matters most is casting the algorithm in such a way that the memory access pattern is optimal on the device.

In particular, this means making sure that global memory reads are fully coalesced (read the guide for details). If your app needs to access memory in a less clean pattern, then use textures, but make sure that you have good data locality in your accesses across all threads in a warp.

Fully coalesced memory reads (and writes) should give ~70 GB/s data throughput.

Uncoalesced memory reads (and writes) may only give ~2 GB/s.

In my application, completely random reads from a 1D texture run at ~10-20 GB/s, and random reads with good data locality at ~64 GB/s.
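To make the contrast concrete, here are two minimal copy kernels, offered as an illustration only (the names and the modulo trick are mine, not from the guide):

// Coalesced: thread i reads in[i], so each half-warp touches one
// contiguous, aligned segment of global memory.
__global__ void copy_coalesced(const float *in, float *out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = in[i];
}

// Uncoalesced: consecutive threads read addresses 'stride' elements
// apart (the modulo just keeps the index in bounds), so a half-warp
// turns into many separate memory transactions.
__global__ void copy_strided(const float *in, float *out, int N, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = in[((long long)i * stride) % N];
}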

Depending on your particular application, memory access patterns for the shared and constant memory spaces can be significant. I only emphasize global memory access patterns in this post because, in my experience with my application, they matter the most.

For a really simple example of a divergent warp, consider this simple code to multiply all elements of a vector by a constant. N is the length of the vector.

__global__ void mul_const(float c, float *a, int N)
{
    // each block handles one chunk of the vector; the index into a is generated like so
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // don't process values past N in the array
    if (idx < N)
    {
        float val = a[idx];
        a[idx] = val * c;
    }
}
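For completeness, a host-side launch for this kernel might look like the following (the block size and the device pointer name d_a are my choices):

int threads = 256;                          // threads per block
int blocks = (N + threads - 1) / threads;   // round up so all N elements are covered
mul_const<<<blocks, threads>>>(c, d_a, N);  // d_a is the device copy of a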

For illustration purposes, let's say N = 35. The first warp will handle the first 32 elements; the if statement follows the same branch for all 32 threads, so there is no divergence in that warp. The second warp will have some threads that follow one branch and some that follow the other, so that warp is divergent.

In this really simple code, it's easy to write a divergence-free version: just allocate a little extra memory at the end of the array and remove the if statement. A few extra "dummy" computations run at the end of the last block, but in this simple example it won't make a huge difference.
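A sketch of that divergence-free variant (the padding arithmetic and the names are my own illustration):

// No bounds check: every thread in every warp takes the same path.
__global__ void mul_const_nodiv(float c, float *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] *= c;
}

// Host side: round the allocation up to a whole number of blocks so
// the trailing "dummy" elements are safe to process.
int threads = 256;
int blocks = (N + threads - 1) / threads;
cudaMalloc((void **)&d_a, blocks * threads * sizeof(float));
mul_const_nodiv<<<blocks, threads>>>(c, d_a);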

If a lot more computation were done inside the if, more computation would be wasted, making the divergent solution more attractive.

Things get a lot more complicated when you start adding loops and if statements that depend on the values read from the data; it really depends on the application. It's hard to think of an example simple enough to illustrate this.