A few questions

Hello,

A few weeks ago I started reading about and coding in CUDA.
A few issues bother me, since I couldn't find answers by reading.

  1. Can I run several kernels simultaneously?
  2. Since computation is done on the device (the graphics card), and the graphics card is also used for the display,
     what do we have to do so that the display won't jam during a calculation?
  3. How can I control which multiprocessor my kernel will run on?
  4. Can two threads read from the same variable/register/memory (global/shared) simultaneously?
  5. What is unrolling, and why do we need it?
  6. How can we solve bank conflicts?
  7. What exactly is a half-warp, and how do I control it?

Thanks
Miki

  2. Since computation is done on the device (the graphics card), and the graphics card is also used for the display,
     what do we have to do so that the display won't jam during a calculation?

I think the GPU itself can handle both the display and CUDA computation well.
But note that if your CUDA program uses a lot of memory, heavy display load can make the CUDA program fail, because the display has priority over CUDA.

Hi!

No. Only one kernel can execute at a time. On later cards you can do memory copies and kernel execution simultaneously.
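
Something like this (just a rough sketch I'm making up for illustration, with made-up names and sizes) shows the copy/kernel overlap with two streams - you need a card that supports concurrent copy and execution, and pinned host memory for the async copy:

#include <cuda_runtime.h>

__global__ void scale(float *d_a, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_a[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *h_b, *d_a, *d_b;
    cudaMallocHost(&h_b, n * sizeof(float));  // pinned host buffer, needed for async copies
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The kernel in stream s0 and the copy in stream s1 can overlap on capable hardware.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n, 2.0f);
    cudaMemcpyAsync(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice, s1);

    cudaDeviceSynchronize();  // wait for both streams to finish

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_b);
    return 0;
}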

One trick you can do to get around the one-kernel-at-a-time limit is to split a kernel with an if/else so that threads (or blocks) above a certain index take a completely different execution path. You have to be careful not to get divergence in this switch, and register usage can be a problem if one path uses a lot more than the other (the lighter path may be unnecessarily slowed), but it's doable. A rough sketch follows below.
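
Here's the sketch (array names, tasks and the split point are all made up); splitting on the block index keeps whole blocks on one path, so warps don't diverge:

// One launch does two unrelated jobs: blocks before splitBlock work on a,
// the remaining blocks work on b.
__global__ void fusedKernel(float *a, int nA, float *b, int nB, int splitBlock)
{
    if (blockIdx.x < splitBlock) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nA)
            a[i] += 1.0f;   // "first" task
    } else {
        int i = (blockIdx.x - splitBlock) * blockDim.x + threadIdx.x;
        if (i < nB)
            b[i] *= 2.0f;   // "second" task
    }
}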

If you accidentally write out of bounds, bad things may happen to the display. If you change the memory the display requires while a CUDA program is running (by doing a mode switch, for example), you can cause problems for CUDA. You might also see slowdowns if you're using fancy GPU-accelerated desktop effects.
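
For the out-of-bounds point, the usual guard looks like this (names made up) - the grid is normally rounded up, so the last block has spare threads that must not write:

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // without this check the spare threads write past the end of the buffer
        data[i] += 1.0f;
}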

You can’t. All the threads in one block are guaranteed to run on the same multiprocessor however.

The access restrictions are defined in the Programming Guide.

Unrolling loops lets you cut the overhead of iterating through a loop: instead of paying for the counter and branch at runtime, the compiler replicates the loop body at compile time. This can give a significant speedup (though it can also make no difference) and can save registers. I'd google "loop unrolling" - there are plenty of results that explain the process. A small example follows below.
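
A small made-up example of what this looks like in CUDA - with a trip count known at compile time, #pragma unroll lets nvcc replicate the body and drop the loop overhead (you can also unroll by hand):

#define TAPS 4   // compile-time constant, so the loop can be fully unrolled

__global__ void dot4(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + TAPS <= n) {
        float sum = 0.0f;
        #pragma unroll
        for (int k = 0; k < TAPS; ++k)
            sum += a[i + k] * b[i + k];   // body is replicated TAPS times by the compiler
        out[i] = sum;
    }
}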

See Programming Guide.

Thanks Tigga,
you've helped me a lot.

May I ask another question?

If I allocate shared memory inside/outside the kernel:

__shared__ int s_mem[width][height]; // size = width*height*sizeof(int)

extern __shared__ int s_mem[]; // size = width*height*sizeof(int)

What is the difference? Which is better to do? …
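
To make the question concrete, here is roughly how I understand each form being used (the sizes and kernel names are just examples I made up):

#define WIDTH  16
#define HEIGHT 16

// Static: the dimensions must be compile-time constants, and the size is
// fixed when the kernel is compiled.
__global__ void kernelStatic(int *out)
{
    __shared__ int s_mem[WIDTH][HEIGHT];
    s_mem[threadIdx.y][threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.y * WIDTH + threadIdx.x] = s_mem[threadIdx.y][threadIdx.x];
}

// Dynamic: the array arrives unsized; the size comes from the third launch
// parameter in the <<< >>> configuration.
__global__ void kernelDynamic(int *out, int width)
{
    extern __shared__ int s_mem[];
    int i = threadIdx.y * width + threadIdx.x;
    s_mem[i] = threadIdx.x;
    __syncthreads();
    out[i] = s_mem[i];
}

// Launches (example sizes):
//   kernelStatic<<<1, dim3(WIDTH, HEIGHT)>>>(d_out);
//   kernelDynamic<<<1, dim3(w, h), w * h * sizeof(int)>>>(d_out, w);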

Thanks again
Miki