What's the benefit of more than one local thread?

Hey …

Can you help me understand why NVIDIA chose to implement more than one thread in the local pool? Does that make some problems easy to solve? I am using Intel SSE (vectors) and I am still wondering whether there are problems that would be very easy to implement on CUDA, taking advantage of more than one thread in the local pool. In fact, I have seen that in CUDA there is no speedup from using vectors, so why did NVIDIA choose to implement many threads? Could you give me an example that is hard to deal with in Intel SSE but easy in CUDA?

Thanks in advance … :)

Not really understanding your question, but maybe I can help.

CUDA GPUs can keep the state of 1024 threads resident on a multiprocessor that has only 8 calculation units. The reason they do this is global memory latency: it is beneficial to have many active threads, so that threads which are not waiting for data from memory can run while the others wait. This is called latency hiding.
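Something like this (just a sketch, the kernel and launch numbers are my own; the latency figure is a rough number for that generation of hardware):

__global__ void copyKernel(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i]; // a global load can stall for hundreds of cycles
}

// launching far more threads than there are calculation units lets the
// scheduler run ready threads while others are stalled on memory:
// copyKernel<<<4096, 256>>>(dst, src, n);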

Personally, I find working on one element at a time, instead of packing everything into vectors, simpler. If I compare my CUDA code to my vectorized MATLAB code, I find the CUDA code more readable and clear.

That's my question: is there an example that is very hard to do in SSE but very easy with CUDA threads?

The second question: why did NVIDIA choose to implement the local thread pool (threads in groups)? What is the benefit of that? In other words, if we have about 1000 threads in total, we could use them as 1000 threads in a global pool (every thread does some of the work). But what I have seen in CUDA is that the 1000 threads are used as 10 groups, every group having 100 threads. So what's the benefit? I would prefer an example where the second design solves a problem easily while the first design can't.

SSE is a set of vector instructions, but they aren’t general. They tend to be math ops.

So for some cases, CUDA and SSE style vector programming are similar.

In CUDA,

x=x+4;

SSE:

__m128 temp = _mm_set1_ps(4.0f);

x = _mm_add_ps(x, temp);

Logic ops are possible in CUDA and SSE, but simple CUDA code like

if (x>3.0) y+=z;

tends to become awkward in SSE, with multiple masks for setting just some bits and not others. The code even for simple cases like the above is no longer readable without some thought.
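For example, a sketch of what that branch might look like with SSE masks (assuming x, y, z are __m128 vectors of floats):

__m128 three = _mm_set1_ps(3.0f);
__m128 mask  = _mm_cmpgt_ps(x, three);  // all-ones lanes where x > 3.0
__m128 sum   = _mm_add_ps(y, z);        // compute y + z in every lane
y = _mm_or_ps(_mm_and_ps(mask, sum),    // take y + z where the mask is set
              _mm_andnot_ps(mask, y));  // keep the old y everywhere else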

But the comparison finally breaks down… CUDA is general and pretty much full C, but SSE is just math and logic ops.

In particular, a simple common pointer indirection in CUDA like:

x=y[a];

Just isn’t directly possible in SSE. You have to break out of SSE and start writing loops yourself.

[Hmm, though maybe there’s new ops in SSE4 or AVX I don’t know about…]
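In CUDA, by contrast, the indirection is one line per thread. A minimal sketch (hypothetical kernel, the names are mine):

__global__ void gather(float *out, const float *y, const int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = y[a[i]]; // every thread does its own pointer indirection
}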

Anyway, think of CUDA as being readable and automatically transparently vectorized.

SSE is parallel math, which is hard to debug and is only for math/logic operations, not all code.

So, don’t even try to compare SSE and CUDA, they’re different beasts.

Thanks a lot …

So what about the second question … can anyone help me? :)

In other words, is there an example where I need more than one thread in the local pool?

You are talking about a parallel programming language, so asking why you need more than one thread is a bit strange.

If by local pool you mean a thread block, then the answer is: look at the reduction example and how it uses the synchronization between threads in a block.

And that is apart from the way the hardware works, which by itself already needs more than one thread per block.

I was talking about the local pool, meaning a local block of threads, and I gave an example with my question. My question was why I need more than one local thread. I still need more than one thread in the program, but they could be in global blocks (which means there is no synchronization between them).

I have taken a look at the reduction example, and I think it could be implemented with global blocks of threads: you can allocate 1000 global threads, every thread does the work on 1/1000 of the original array, and afterwards I get 1000 results (one from each thread). I can then do a serial reduction on those (because it's a small number, just 1000), and that's it.
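A minimal sketch of that scheme, under my own assumptions about the names (1000 independent threads, each summing one contiguous chunk, partial results reduced serially on the host):

__global__ void chunkSum(const float *in, float *partial, int chunk)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int j = 0; j < chunk; ++j)
        sum += in[t * chunk + j]; // each thread walks its own 1/1000 slice
    partial[t] = sum;             // 1000 partial results, reduced on the CPU
}

// e.g. launched as chunkSum<<<10, 100>>>(in, partial, N / 1000);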

Yes, and it would be waaaay slower than how the reduction example works. There you have it: the reason for more than one thread in a block is the fast, low-latency shared memory that is accessible from all threads in a block.
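To make that concrete, here is a minimal sketch of the idea behind the SDK example (not the actual SDK code): the threads of a block cooperate through shared memory and __syncthreads(), which only exists within a block.

__global__ void blockReduce(const float *in, float *out, int n)
{
    extern __shared__ float s[];   // one shared-memory slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();               // barrier between the threads of this block

    // tree reduction in shared memory, log2(blockDim.x) steps
    // (assumes blockDim.x is a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];    // one partial sum per block
}

// launched with the shared-memory size as the third parameter, e.g.:
// blockReduce<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);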

I think you should try to understand the SDK examples like reduction and scan, and what they do (and why); then it will all become clear. On the CUDA Zone (www.nvidia.com/cuda) you can find links to presentations and course material that will also help you understand why you want more than one thread in a block.

Ok that interests me. Can you compile the stuff with CLR support (in Windows XP/Vista, Visual Studio 2005)?

Before I had a look at CUDA, I tried to use SSE, too.

My MFC program using SSE ran without problems, but as soon as I copied the SSE code to a CLR console app I got access violations all the time. A guy from blackbeltvb@yahoogroups.com told me you couldn't use SSE in combination with CLR. So how did you do it?

I think you are now talking about the NVIDIA architecture, because on the Intel architecture I can simply divide the array into two halves (Core 2 Duo, 2 threads), every thread works on its own half, and it works quite well. So I think this issue is specific to the NVIDIA architecture. Is that right?

For example, the per-thread code on the Intel architecture would be something like this (every thread runs it on its own slice of the array):

// every thread sums its slice of the array, four ints at a time
__m128i accumulator = _mm_setzero_si128();
for (int i = begin; i < end; i += 4)
    accumulator = _mm_add_epi32(accumulator,
                                _mm_loadu_si128((const __m128i *)&array[i]));

Yes, I am talking about how to do it with CUDA, as the reduction example is written for that. I thought you wanted to know why you would want more threads per block in CUDA.