GPU (Cuda) vs. SSE3 Results of my Vector Addition


I have a performance problem… my vector addition results for 1000 elements are:

Normal: 0.00587469060 (msecs)

SSE3: 0.00281704427 (msecs)

execution only (without copy operations and allocations):

GPU (Cuda): 0.08995556831 (msecs)

I think the GPU should be faster for 1000 elements.

The attached PDF shows the comparison graphs for SSE3 and the GPU from 1000 to 1000000 elements.

What am I doing wrong?

Please help me.

Here is my code:




VectorAddition<<< grid, threads >>>( vector1_gpu, vector2_gpu, result_gpu);


performanceMetrics.exe_GPU_time += cutGetTimerValue(hTimer);


__global__ void VectorAddition(float* vector1, float* vector2, float* result)
{
    const int blocksPerRow    = gridDim.x;
    const int threadsPerRow   = blockDim.x;
    const int rowsPerBlock    = blockDim.y;
    const int threadsPerBlock = threadsPerRow * rowsPerBlock;

    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Flatten the 2D grid and 2D block into a single linear index.
    const int idx = (((by * blocksPerRow) + bx) * threadsPerBlock) + ((ty * threadsPerRow) + tx);

    result[idx] = vector1[idx] + vector2[idx];
}


VectorAddition_f.pdf (30.9 KB)

The PDF shows that for a large number of elements the GPU is much faster, doesn’t it?

It is okay that you don’t get a noticeable speedup for small datasets (I consider 1000 floats a small dataset; actually, 1000000 is also pretty small for a GPU).
Calling a GPU kernel has much higher overhead than calling a CPU function, and this is one of the reasons why the CPU is faster on a small number of floats.

Besides, you won’t get really good GPU performance with such a simple kernel: you have only one addition per three memory accesses (two reads and one write)! This means that your kernel is bound by memory bandwidth and won’t run faster than approximately 6 billion additions per second (70 GB/sec of bandwidth divided by 3 × 4 bytes).

If you add more math to your kernel you’ll find an amazing thing: it won’t slow down (until the math complexity outweighs the memory access costs, of course).

Spikes on your graph are probably caused by some display activity (moving mouse, typing text, refreshing window and so on).

I don’t see anything strange, really. GPU is just a bit specific.

Thanks for this information… I am going to think about it now…

But why does calling a GPU kernel have so much higher overhead than calling a CPU function?

Calling a GPU kernel requires a context switch into the graphics driver, communication over the PCI-Express bus, and some kind of busy loop so the CPU knows when the GPU is finished. Additionally, the first time you call a kernel, the driver has to do some processing on the compiled kernel code itself.

In comparison, SSE instructions have essentially no overhead. Adding two 1000-element vectors only requires 250 SSE additions (plus 500 SSE loads and 250 SSE stores), which can happen very quickly on a modern CPU. There is no mode switching or interrupt handshaking with an external device.
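A minimal sketch of that SSE path in C (unaligned loads are used so the snippet works on any buffer; the remainder loop you would need for an n that is not a multiple of 4 is omitted):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128 holds 4 floats */

/* Add two float vectors four elements at a time.
   Assumes n is a multiple of 4 (true for n = 1000: 250 iterations,
   i.e. 250 SSE additions, 500 loads, and 250 stores). */
static void vector_add_sse(const float *a, const float *b,
                           float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats */
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* add and store */
    }
}
```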

I thought that the bad timing results were caused by the occupancy factor. I mean, if the code isn’t arithmetically intensive and the block and grid dimensions are small, the memory latency cannot be hidden, which means that memory transfers (global memory to/from registers, shared/local memory) are dominant. Is that right? I didn’t know that context switches were a major factor.

The only thing that calling a kernel does is write some values into the command FIFO of the card. There is no “context switch” or other such overhead involved (it might be another story if you use OpenGL interoperability). This is all done completely from user space.

The busy loop only runs for as long as the kernel executes on the GPU, so is no issue when you’re doing benchmarks.

A question for you: let’s say I’m dealing with a HiDef movie of 4K frames, 10-bit, compressed, 4190x219 pixels, on an 8800GT/512MB @ 650MHz, a quad-core CPU @ 3.2GHz, 4GB RAM, XP SP2, and an FSB of 1300MHz. How many operations could I do before bogging down the card? Assume nothing else is going on, just grinding data.


The answer is : “Trying is knowing”

Your question is really hard to answer, it depends on too many factors.

Let’s see if I can simplify: if X% of the on-board RAM has just received some data, that leaves (100 − X)% for copies, moves, additions, anything that can be done in-line.

If there are Y cores with Z pipes, how many operations can be done?

If the on-board RAM is completely full, how many in-line operations can be done?

How FPGA-like is the GPU architecture?

If the ram is completely full you can not do anything, since you have to fetch the data you are going to process from global memory.

If you have enough memory to store your input and output, you can do all your processing; it is just that video-card operations will also take time, and my guess is that they will be interleaved with the CUDA processing.

OK, what about the partially-full case at X%? The data can be buffered around 1/X times between one memory allocation and the next. How many simple (no inversions) math operations should the thousands of parallel pipes be able to do? How many adds? Multiplications? All at clock speed.

I want to make sure that the GPU is 100% hammered but not overloaded (hogging system RAM while waiting for the GPU to finish).

You are really looking at this too simply. You will almost never get full GFLOPS performance in a kernel, due to global memory latency and similar effects, so the theoretically achievable peak performance is never reached in reality.

I am also not aware of a way to overload the GPU. One kernel will run at a time, and you cannot start a second one until the first one is finished…

well maybe overloaded needs a better definition.

If you have an FPGA, like a Xilinx for example, you know exactly how many and what type of math operations you can do, as the scheduling and latency are deterministic, especially under an OS like QNX or another RTOS. You cannot overload a hardware-based PLD; it is raw FIFO. Now with a non-RTOS, like DoZe, the system cannot process that fast with all of the overhead, so you have to buffer data and wait. A good example is hardware-based radar processing using FFT boards: you know exactly how many operations a 1024-point FFT will take on double-precision data.

With these GPU calculations it is unclear to me what the best strategy is to get maximum performance. There must be an optimal number of math operations for each amount of memory. I was using the CUDA occupancy calculator, but I will admit my understanding of what it is telling me is limited. It talks of threads, warps, and blocks, whereas I want to know the amount of math that is being done.

I would suggest you buy a CUDA-capable card and just start programming. It is not as clear-cut with CUDA, I am afraid. The performance you can get is highly dependent on your algorithm, so you will not be able to predict the performance of your algorithm without actually implementing it: benchmarking, adjusting parameters like the number of threads per block and the number of blocks, changing code to make better use of shared memory, making accesses aligned, and benchmarking again.

As I read it, you want to do some processing on a ‘continuous’ stream of data, so it might be advantageous for you to use a Compute 1.1 capable card (G92-based), as you can transfer data to the GPU while processing the previous dataset. These devices are for now slower at computing, but this overlap might get you more performance in the end.

But as you understand, you’ll have to buy a card or two and run a pilot project.