GPU (Cuda) vs. SSE3 Results of my Vector Addition

Acid_2 · December 21, 2007, 11:36am

Hallo,

I have a performance problem… my vector additon result for 1000 elements are.

Normal: 0.00587469060 (msecs)

SSE3: 0.00281704427 (msecs)

only exekution (without copy operations and alocations)

GPU (Cuda): 0.08995556831 (msecs)

i think the gpu must be for 1000 elements faster.

The pdf shows the real comparing graphs for the sse3 & gpu by 1000 to 1000000 elements.

Whats my fault?

Please help me.

here my code:

CallerCode

CUT_SAFE_CALL(cutResetTimer(hTimer));

CUT_SAFE_CALL(cutStartTimer(hTimer));

VectorAddition<<< grid, threads >>>( vector1_gpu, vector2_gpu, result_gpu);

CUDA_SAFE_CALL(cudaThreadSynchronize());

performanceMetrics.exe_GPU_time += cutGetTimerValue(hTimer);

KernelCode

__global__ void VectorAddition(float* vector1, float* vector2, float* result){

    const int blocksPerRow = gridDim.x;

   int bx = blockIdx.x;

    int by = blockIdx.y;

   const int threadsPerRow = blockDim.x;

    const int rowsPerBlock  = blockDim.y;

    const int threadsPerBlock = threadsPerRow * rowsPerBlock;

   int tx = threadIdx.x;

    int ty = threadIdx.y;

   int idx = (((by * blocksPerRow) + bx) * threadsPerBlock) + ((ty * threadsPerRow) + tx);

	

    result[idx] = vector1[idx] + vector2[idx];

}

VectorAddition_f.pdf (30.9 KB)

AndreiB · December 21, 2007, 12:25pm

PDF shows that for large number of elements GPU is much faster, isn’t it?

It is okay that you don’t get noticable speedup for small datasets (I consider 1000 floats a small dataset, actually 1000000 is also pretty small for GPU).
Calling GPU kernel have much higher overhead than calling CPU function and this is one of reasons whyCPU is faster on small number of floats.

Besides, you won’t get real good GPU performance with such simple kernel: you have only one addition for three memory accesses (two reads and one write)! This means that your kernel is bounded by memory bandwidth and won’t run faster than approximately 6 billion additions per second (70GB/sec of bandwidth divided by 3*4).

If you add more math to your kernel you’ll find amazing thing: it won’t slow down (until math complexity overweights memory access costs, of course).

Spikes on your graph are probably caused by some display activity (moving mouse, typing text, refreshing window and so on).

I don’t see anything strange, really. GPU is just a bit specific.

Acid_2 · December 27, 2007, 12:09am

Thanks for thees information … i am going to think about this now…

but why is Calling a GPU kernel so much higher overhead than calling a CPU function??

seibert · December 27, 2007, 3:52am

Calling a GPU kernel requires a context switch into the graphics driver, communication over the PCI-Express bus, and some kind of busy loop so the CPU knows when the GPU is finished. Additionally, the first time you call a kernel, the driver has to do some processing on the compiled kernel code itself.

In comparison, SSE instructions have essentially no overhead. Adding two 1000 element vectors only requires 250 SSE additions (+500 SSE loads and 250 SSE stores), which can happen very quickly on a modern CPU. There is no mode switching or interrupt handshaking with an external device.

sicb0161 · January 8, 2008, 2:40pm

I thought that the bad timing results is caused by the occupancy factor. I mean if the code isn’t arithmeticly intensive, block and grid dimensions become small, the memory latency cannot be hidden which means that memory transfers. global to/from registers, shared/local memory are dominant. ??? Didn’t know that context switches are a major factor.

wumpus · January 8, 2008, 3:09pm

The only thing that calling a kernel does is write some values into the command FIFO of the card. There is no “context switch” or other stuff involved (might be another story if you use OpenGL interoperability) This is all completely done from user space.

The busy loop only runs for as long as the kernel executes on the GPU, so is no issue when you’re doing benchmarks.

G.Alexander · January 9, 2008, 9:49am

question for you, let’s say i’m dealing with HiDef movie of 4k frames, 10-bit, compressed, 4190x219 pixels, say 8800GT/512MB @650MHz, quad CPU @3.2GHz, 4Gb ram, XP sp2, FSB of 1300MHz. how many operations could i do before bogging down down card? assume nothing else going on, just grind data.

thanks

DenisR · January 9, 2008, 11:02am

The answer is : “Trying is knowing”

Your question is really hard to answer, it depends on too many factors.

G.Alexander · January 9, 2008, 11:18am

let’s see if i can simplify, if the X% of the on board ram has just gotten some data. that leaves 1-X% for copy, move, additions, anything that can be done in-line.

if there are Y cores with Z pipes, how many operations can be done?

if the onboard ram is completely full, how many in-line operations can be done?

how FPGA like is the GPU architecture?

DenisR · January 9, 2008, 12:52pm

If the ram is completely full you can not do anything, since you have to fetch the data you are going to process from global memory.

If you have enough memory to store your input and output you can do all your processing, it is just that video-card operation will take also time and it will in my guess be interleaved with CUDA-processing

G.Alexander · January 9, 2008, 2:22pm

ok what about the partially full at X%, the data can be buffered around 1/X times between one memory allocation and the next. the thousands of parallel pipes should be able to do how many simple (no inversions) math operations? how many adds? mulitplications? all at clock speed.

i want to make sure that the GPU is 100% hammered but not overloaded (hogging up system ram waiting for GPU to finish)

DenisR · January 9, 2008, 8:32pm

You are really looking at this too simply. You will almost never ever get full GFLOPS performance in a kernel due to global memory latency, etc. So the peak performance that is theoretically achievable is never achieved in reality.

I am also not aware of a way to overload the GPU. One kernel will run at a time, and you cannot start a second one until the first one is finished…

G.Alexander · January 10, 2008, 1:55am

well maybe overloaded needs a better definition.

if you have an FPGA like Xilinx for example, you know exactly how many and what type of math operations you can do, as the scheduling and latency are deterministic especially under an OS like QNX or other RTOS. you cannot overload a hardware based PLD, it is raw FIFO. now with with a non-RTOS, like DoZe, the system cannot process that fast with all of the overhead, so you have to buffer data and wait. a good example are hardware based radar processing using FFT boards, you know to do a 1024 point FFT how many operations it will take on double precision data.

with these GPU calculations it is unclear to me the best strategy to get maximum performance. there must be an optimal number of math operations for each amount of memory. i was using that cuda occupancy calculator but will admit my understanding of what it is telling is limited. it talks of threads, warps and block, whereas i want to know the amount of math that it is doing.

DenisR · January 10, 2008, 7:13am

I would suggest you buy a CUDA capable card, and just start programming. It is not as clear-cut with CUDA I am afraid. The performance you can get is highly dependend on your algorithm. So I am afraid you will not be able to predict the performance of your algorithm without actually implementing it, benchmarking, adjusting parameters like number of threads per block, number of blocks. Changing code to make better use of shared memory, making accesses aligned. Benchmarking again.

As I read you want to do some processing on a ‘continuous’ stream of data it might be advantageous for you to use a Compute 1.1 Capable card (G92 based) as you can transfer data to the GPU while processing the previous dataset. These devices are for now slower in computing, but this overlap might be getting you more performance in the end.

But as you understand, you’ll have to buy a card or 2 and have a pilot project.

Topic		Replies	Views
Sequential code is faster than parallel, how is it possible? CUDA Programming and Performance	8	9252	August 3, 2016
Confused about GPU vs CPU speed in multiplication CUDA Programming and Performance	8	6538	February 19, 2009
Is GPU worth it? GPU currently too slow. CUDA Programming and Performance	16	6035	December 8, 2008
Can speed up ratio greater than the number of GPU processor cores? CUDA Programming and Performance	11	2734	June 3, 2010
CUDA book by Kirk & Whu available CUDA Programming and Performance	44	12108	February 10, 2010
Could GPU computing speed up computation in data-larger-than-RAM cases? CUDA Programming and Performance	5	1851	April 19, 2016
300x to 600x times faster... really? CUDA Programming and Performance	92	34393	February 8, 2010
Cuda code performance CUDA Programming and Performance	14	3118	December 16, 2014
GPU Perfomance How much GFlops??? CUDA Programming and Performance	27	37241	August 30, 2009
Modern GPU CUDA Programming and Performance	30	5659	April 11, 2016

GPU (Cuda) vs. SSE3 Results of my Vector Addition

Related topics