How to use just the GPU cores without any threading

I have an algorithm that needs to be run “really simultaneously”, not using threads. I want to run it on different cores of the GPU without having more than one program on one node. How can I launch this kernel so that it runs on different nodes, with a single thread on each core? What should the grid size and block size be?

Thanks in advance,
Sadegh

Threading is the basic premise of the CUDA programming model. You can’t have CUDA without threads. But that doesn’t imply the CUDA threading model is like multithreading on an SMP computer. I think you need to go back and re-read the section of the programming guide that describes the execution model, because the question you are asking doesn’t make much sense: there is no concept of “nodes” or “cores” in the CUDA programming model, so it isn’t obvious what you are asking about.

It doesn’t sound like your algorithm is very CUDA-friendly, but in any case:

Each block is scheduled onto one multiprocessor (SM), i.e. it can’t be split. Different blocks are spread across the multiprocessors. If you don’t create more blocks than there are multiprocessors, then each block should go to its own multiprocessor (i.e. 30 blocks on a GTX 285, for example). The problem is that different multiprocessors don’t run really simultaneously; they are independent. A warp, or half-warp, or 8 threads, depending on what you are trying to do, will run “really simultaneously” unless they are serialized due to memory access patterns or conditionals.
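As a rough sketch of the "one block per multiprocessor" launch, the program below queries the SM count, launches exactly that many blocks, and has each block report which SM it ran on via the PTX `%smid` register. The names `recordSm`, `currentSmId` and `countDistinct` are made up for illustration, and one warp per block is an assumption; nothing here guarantees that the scheduler places each block on a different SM, which is exactly what the output lets you check.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Read the ID of the multiprocessor (SM) the calling thread is running on.
__device__ unsigned int currentSmId() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical kernel: thread 0 of each block records the SM the block ran on.
__global__ void recordSm(unsigned int *smOfBlock) {
    if (threadIdx.x == 0) {
        smOfBlock[blockIdx.x] = currentSmId();
    }
}

// Host helper: how many different SM IDs appear in the list.
int countDistinct(const std::vector<unsigned int> &ids) {
    return static_cast<int>(std::set<unsigned int>(ids.begin(), ids.end()).size());
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No usable CUDA device found\n");
        return 1;
    }
    const int numBlocks = prop.multiProcessorCount;  // e.g. 30 on a GTX 285
    const int threadsPerBlock = prop.warpSize;       // assumption: one warp per block

    unsigned int *dSm = nullptr;
    cudaMalloc(&dSm, numBlocks * sizeof(unsigned int));
    recordSm<<<numBlocks, threadsPerBlock>>>(dSm);
    cudaDeviceSynchronize();

    std::vector<unsigned int> hSm(numBlocks);
    cudaMemcpy(hSm.data(), dSm, numBlocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dSm);

    for (int b = 0; b < numBlocks; ++b) {
        std::printf("block %d ran on SM %u\n", b, hSm[b]);
    }
    std::printf("%d blocks landed on %d distinct SMs\n",
                numBlocks, countDistinct(hSm));
    return 0;
}
```

Even if every block does land on its own SM, the blocks still start and finish independently of each other, so this does not give lock-step execution across multiprocessors.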

If you set 1 thread per block you will get 1 thread per multiprocessor (very wasteful, as you will actually be running a full warp of 32 threads and discarding 31 of the results). Setting the number of blocks equal to the number of SMs will get you one block per SM, but doing this means no latency hiding, very wasteful resource usage, etc., and it is probably not going to perform in a way that justifies using a GPU (my guess is that it will be worse than a single core on the CPU).
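To put a number on the waste from tiny blocks, here is a small host-side sketch. It assumes the hardware schedules whole warps (32 threads on NVIDIA GPUs, as reported by `warpSize`), so a block is effectively rounded up to a multiple of the warp size. The helper name `activeLaneFraction` is hypothetical, and the calculation ignores latency hiding, which makes small launches look better than they really are.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fraction of warp lanes doing useful work for a given block size.
// Blocks are executed in whole warps, so partially filled warps waste lanes.
double activeLaneFraction(int threadsPerBlock, int warpSize) {
    int warps = (threadsPerBlock + warpSize - 1) / warpSize;  // round up
    return static_cast<double>(threadsPerBlock) / (warps * warpSize);
}

int main() {
    int warpSize = 32;  // assumed default if no device can be queried
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        warpSize = prop.warpSize;
    }

    const int blockSizes[] = {1, 8, 16, 32, 48, 256};
    for (int threads : blockSizes) {
        std::printf("%3d threads/block -> %5.1f%% of lanes active\n",
                    threads, 100.0 * activeLaneFraction(threads, warpSize));
    }
    return 0;
}
```

With a warp size of 32, a single-thread block keeps only 1 lane in 32 busy, which is why a block size that is a multiple of the warp size is the usual starting point.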

You would need to explain better what you need, so that we can see whether it’s possible to help you or whether you actually need a single-core CPU to run the algorithm serially.