What's the benefit of more than one local thread?

Hey …

Can you help me understand why NVIDIA chose to implement more than one thread in the local pool? Does that make some problems easy to solve? I am using Intel SSE (vectors) and I am still wondering whether there are problems that would be very easy to implement on CUDA, taking advantage of more than one thread in the local pool. In fact, I have seen that in CUDA there is no speedup from using vectors, so why did NVIDIA choose to implement many threads? Could you give me an example that is hard to deal with in Intel SSE but easy in CUDA?

Thanks in advance … :)

Not really understanding your question, but maybe I can help.

CUDA GPUs can keep the state of 1024 threads resident on a multiprocessor that has only 8 calculation units. The reason they do this is global memory latency: it is beneficial to have many active threads, so that threads which are not waiting for data from memory can run while the others wait. This is called latency hiding.
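Something like this (just a sketch, the kernel and launch numbers are my own; the latency figure is a rough number for that generation of hardware):

__global__ void copyKernel(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i]; // a global load can stall for hundreds of cycles
}

// launching far more threads than there are calculation units lets the
// scheduler run ready threads while others are stalled on memory:
// copyKernel<<<4096, 256>>>(dst, src, n);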

Personally, I find working on one element at a time, instead of packing everything into vectors, simpler. If I compare my CUDA code to my vectorized MATLAB code, I find the CUDA code more readable and clear.

That's my question: is there an example that is very hard to do in SSE but very easy with CUDA threads?

The second question: why did NVIDIA choose to implement the local thread pool (threads in groups)? What is the benefit of that? In other words, if we have about 1000 threads in total, we could use them as 1000 threads in a global pool (every thread does some of the work). But what I have seen in CUDA is that the 1000 threads are used as 10 groups, every group having 100 threads. So what's the benefit? I would prefer an example where the second design solves a problem easily while the first design can't.

SSE is a set of vector instructions, but they aren’t general. They tend to be math ops.

So for some cases, CUDA and SSE style vector programming are similar.

In CUDA,

x=x+4;

SSE:

__m128 temp = _mm_set1_ps(4.0f);

x = _mm_add_ps(x, temp);

Logic ops are possible in CUDA and SSE, but simple CUDA code like

if (x>3.0) y+=z;

tends to become awkward in SSE, with multiple masks for setting just some bits and not others. The code even for simple cases like the above is no longer readable without some thought.
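For example, a sketch of what that branch might look like with SSE masks (assuming x, y, z are __m128 vectors of floats):

__m128 three = _mm_set1_ps(3.0f);
__m128 mask  = _mm_cmpgt_ps(x, three);  // all-ones lanes where x > 3.0
__m128 sum   = _mm_add_ps(y, z);        // compute y + z in every lane
y = _mm_or_ps(_mm_and_ps(mask, sum),    // take y + z where the mask is set
              _mm_andnot_ps(mask, y));  // keep the old y everywhere else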

But the comparison finally breaks down… CUDA is general and pretty much full C, but SSE is just math and logic ops.

In particular, a simple common pointer indirection in CUDA like:

x=y[a];

Just isn’t directly possible in SSE. You have to break out of SSE and start writing loops yourself.

[Hmm, though maybe there’s new ops in SSE4 or AVX I don’t know about…]
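In CUDA, by contrast, the indirection is one line per thread. A minimal sketch (hypothetical kernel, the names are mine):

__global__ void gather(float *out, const float *y, const int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = y[a[i]]; // every thread does its own pointer indirection
}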

Anyway, think of CUDA as being readable and automatically transparently vectorized.

SSE is parallel math, which is hard to debug and is only for math/logic operations, not all code.

So, don’t even try to compare SSE and CUDA, they’re different beasts.

Thanks a lot …

So what about the second question … can anyone help me? :)

In other words, is there an example where I need more than one thread in the local pool?

You are talking about a parallel programming language, so asking why you need more than one thread is a bit strange.

If by local pool you mean a thread block, then the answer is: look at the reduction example and how it uses the synchronization between threads in a block.

And that is apart from the way the hardware works, which by itself already needs more than one thread per block.

I was talking about the local pool, meaning a local block of threads, and I gave an example with my question. My question was why I need more than one local thread. I still need more than one thread in the program, but they could be in global blocks (which means there is no synchronization between them).

I have taken a look at the reduction example, and I think it could be implemented with global blocks of threads: you can allocate 1000 global threads, every thread does the work on 1/1000 of the original array, and afterwards I get 1000 results (one from each thread). I can then do a serial reduction on those (because it's a small number, just 1000), and that's it.
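A minimal sketch of that scheme, under my own assumptions about the names (1000 independent threads, each summing one contiguous chunk, partial results reduced serially on the host):

__global__ void chunkSum(const float *in, float *partial, int chunk)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int j = 0; j < chunk; ++j)
        sum += in[t * chunk + j]; // each thread walks its own 1/1000 slice
    partial[t] = sum;             // 1000 partial results, reduced on the CPU
}

// e.g. launched as chunkSum<<<10, 100>>>(in, partial, N / 1000);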

Yes, and it would be waaaay slower than how the reduction example works. There you have it: the reason for more than one thread in a block is the fast, low-latency shared memory that is accessible from all threads in a block.
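To make that concrete, here is a minimal sketch of the idea behind the SDK example (not the actual SDK code): the threads of a block cooperate through shared memory and __syncthreads(), which only exists within a block.

__global__ void blockReduce(const float *in, float *out, int n)
{
    extern __shared__ float s[];   // one shared-memory slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();               // barrier between the threads of this block

    // tree reduction in shared memory, log2(blockDim.x) steps
    // (assumes blockDim.x is a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];    // one partial sum per block
}

// launched with the shared-memory size as the third parameter, e.g.:
// blockReduce<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);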

I think you should try to understand the SDK examples like reduction and scan, and what they do (and why); then it will all become clear. On the CUDA Zone (www.nvidia.com/cuda) you can find links to presentations and course material that will also help you understand why you want more than one thread in a block.

Ok that interests me. Can you compile the stuff with CLR support (in Windows XP/Vista, Visual Studio 2005)?

Before I had a look at CUDA, I tried to use SSE, too.

My MFC program using SSE ran without problems, but as soon as I copied the SSE code to a CLR console app I got access violations all the time. A guy from blackbeltvb@yahoogroups.com told me you couldn't use SSE in combination with CLR. So how did you do it?

I think you are now talking about the NVIDIA architecture, because on the Intel architecture I can simply divide the array into two halves (Core 2 Duo, 2 threads), every thread works on its own half, and it works quite well. So I think this issue is specific to the NVIDIA architecture. Is that right?

For example, the per-thread code on the Intel architecture would be something like this (every thread runs it on its own slice of the array):

// every thread sums its slice of the array, four ints at a time
__m128i accumulator = _mm_setzero_si128();
for (int i = begin; i < end; i += 4)
    accumulator = _mm_add_epi32(accumulator,
                                _mm_loadu_si128((const __m128i *)&array[i]));

Yes, I am talking about how to do it with CUDA, as the reduction example is written for that. I thought you wanted to know why you would want more threads per block in CUDA.