I want to implement 10,000 cores on a GPU, each evaluating an arithmetic equation. Is this possible?

I’m choosing a topic for my thesis. Please let me know how feasible and convenient you think it would be to do this with CUDA on a GPU.

I want to implement 10,000 cores, each of them calculating, in parallel and independently, an equation like this (for example): A = B^2 + 1 / C, where A, B and C are always positive real numbers: 15.38, 0.459988, etc.

When all 10,000 cores finish, they communicate with each other over a simple “torus” network to summarize their results, and then start again.

Roughly, would you say this makes sense to do in CUDA? Using a GPU and the CUDA development software, how difficult would it be to implement? My programming skills are fair (assembly, C), but I have never worked with CUDA or GPUs, and I’m not sure about the limitations of the technology and other aspects.

I have only a few months to finish a project like this. That’s why I’m asking here about its feasibility.

Finally, do you know of another forum for CUDA besides the NVIDIA website?

The threads (“nodes”) in CUDA aren’t connected by a torus interconnect the way nodes are in some supercomputers. The threads run in units called blocks. Within each block, threads can exchange data either through shared memory (which is quite fast) or within groups of 32 threads (called warps) at warp speed ;)
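To make the two in-block exchange paths concrete, here is a minimal sketch of a per-block sum that uses warp shuffles first and shared memory second. The kernel name `blockSum`, the block size of 256, and the all-ones test input are assumptions for illustration only; the sketch also assumes the block size is a multiple of 32 and a toolkit recent enough to provide `__shfl_down_sync` (CUDA 9 or later).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block writes the sum of its elements to blockSums.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float warpSums[32];                 // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;              // out-of-range threads contribute 0

    // Step 1: exchange within a warp through registers (warp shuffle).
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) warpSums[warp] = v;             // lane 0 holds the warp's sum
    __syncthreads();                               // make shared memory visible block-wide

    // Step 2: the first warp combines the per-warp sums read from shared memory.
    if (warp == 0) {
        int numWarps = blockDim.x / 32;
        v = (lane < numWarps) ? warpSums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) blockSums[blockIdx.x] = v;
    }
}

int main()
{
    const int n = 10000, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> hIn(n, 1.0f), hSums(blocks);
    float *dIn, *dSums;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dSums, blocks * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(dIn, dSums, n);
    cudaMemcpy(hSums.data(), dSums, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;                            // combine the block sums on the host
    for (float s : hSums) total += s;
    printf("total = %f\n", total);

    cudaFree(dIn);
    cudaFree(dSums);
    return 0;
}
```

The block sums are combined on the host here only to keep the sketch short; they could just as well be reduced by a second kernel.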

Also note that the 32 threads of a warp always execute the same instruction, because they share an instruction scheduler. So it’s best to avoid divergent code paths within a warp (divergence causes serialization and loss of efficiency). It’s actually best if all 32 threads compute the same arithmetic equation (different parameters are OK though).

Data exchange among arbitrary threads residing in different blocks can also be realized, but it is more difficult to do properly. In most cases it requires synchronizing data through the (relatively slow) global memory of the graphics card. So for best performance you should group the threads that need frequent data exchange within the same block or warp.
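One simple way to get an exchange across blocks, sketched below, is to split the work into two kernel launches: the first writes every result to global memory, and the second reads the neighbours' values. Kernels launched into the same stream run in order, so the launch boundary acts as the global synchronization point. The 100 x 100 torus layout, the kernel names `compute` and `exchange`, and the "sum of self plus four neighbours" rule are all assumptions made for illustration; the original question does not specify them.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed layout: 10,000 nodes arranged as a 100 x 100 torus.
constexpr int W = 100, H = 100, N = W * H;

// Phase 1: every thread evaluates A = B^2 + 1/C and stores it in global memory.
__global__ void compute(const float* B, const float* C, float* A)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) A[i] = B[i] * B[i] + 1.0f / C[i];
}

// Phase 2: every thread reads its four torus neighbours from global memory.
// The combination rule (a plain sum) is a placeholder.
__global__ void exchange(const float* A, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int x = i % W, y = i / W;
    int left  = y * W + (x + W - 1) % W;           // wrap around in x
    int right = y * W + (x + 1) % W;
    int up    = ((y + H - 1) % H) * W + x;         // wrap around in y
    int down  = ((y + 1) % H) * W + x;
    out[i] = A[i] + A[left] + A[right] + A[up] + A[down];
}

int main()
{
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    const size_t bytes = N * sizeof(float);

    std::vector<float> hB(N, 0.459988f), hC(N, 15.38f), hOut(N);
    float *dA, *dB, *dC, *dOut;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    // Same (default) stream: exchange starts only after compute has finished.
    compute<<<blocks, threads>>>(dB, dC, dA);
    exchange<<<blocks, threads>>>(dA, dOut);

    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", hOut[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dOut);
    return 0;
}
```

Repeating the "compute, then exchange" cycle is then just a host-side loop around the two launches.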

Hope that helps.


@cbuchner1, that is definitely very interesting. From what you are saying, CUDA on a GPU should fit my needs, because I could study the performance of warps versus thread interconnection in my application.

I forgot to mention that, in fact, every core (thread?) will be performing exactly the same equation. But from your answer, it seems I would be limited to 32 threads if I want to take advantage of this.

Could you comment more specifically on implementing the 10,000 blocks? Suppose there is no speed requirement; in fact, my thesis is only about implementing the application functionally and “studying” the effects of my technology decision.

Could you also comment on the complexity of the equation above? Can it be handled by one thread without any problems?

Launching e.g. 39 blocks of 256 threads each gives you 9,984 threads, so about 10,000 in total (40 blocks with a bounds check would cover all 10,000).
Each of these can solve the equation independently with different parameters.
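A minimal sketch of such a launch might look like the following. The kernel name `evalEquation`, the use of single-precision `float`, and the constant test inputs are assumptions for illustration; the grid is rounded up to 40 blocks and the `if (i < n)` guard keeps the extra 240 threads idle.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread evaluates A = B^2 + 1/C for one element.
__global__ void evalEquation(const float* B, const float* C, float* A, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // 40 * 256 = 10240 > n, so guard
        A[i] = B[i] * B[i] + 1.0f / C[i];
}

int main()
{
    const int n = 10000;
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // rounds up to 40
    const size_t bytes = n * sizeof(float);

    // Placeholder inputs: every thread gets the same B and C here.
    std::vector<float> hB(n, 0.459988f), hC(n, 15.38f), hA(n);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    evalEquation<<<blocks, threadsPerBlock>>>(dB, dC, dA, n);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(hA.data(), dA, bytes, cudaMemcpyDeviceToHost);
    printf("A[0] = %f\n", hA[0]);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Error checking on the CUDA calls is left out to keep the sketch short; real code should check the return values.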

I suggest you look at a few CUDA tutorials to see how the CUDA programming model works,
and in particular familiarize yourself with the concepts of grid, blocks, warps and threads.

Each thread knows exactly which block it’s in and what its thread ID within that block is.

Also note that 10000 threads is not a big workload for a GPU at all.


Are you familiar with “task parallelism”? GPU computation is a form of that: you give the GPU a lot of independent tasks, and once they have all finished, you give it the next group of tasks. Since tasks can be executed in arbitrary order, communication between them should be limited (although you can use atomics to summarize data or allocate resources).
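As a rough illustration of summarizing with atomics, the sketch below has every thread add its result to a single global accumulator with `atomicAdd`. The kernel name `evalAndSum` and the choice of a plain sum as the "summary" are assumptions. Because the order of the additions is not fixed, the floating-point total can differ slightly from run to run.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread evaluates A = B^2 + 1/C and atomically adds it to *total.
__global__ void evalAndSum(const float* B, const float* C, float* total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = B[i] * B[i] + 1.0f / C[i];
        atomicAdd(total, a);                // safe concurrent update across all blocks
    }
}

int main()
{
    const int n = 10000, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    std::vector<float> hB(n, 0.459988f), hC(n, 15.38f);
    float *dB, *dC, *dTotal;
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));   // all-zero bytes represent 0.0f

    evalAndSum<<<blocks, threads>>>(dB, dC, dTotal, n);

    float hTotal = 0.0f;
    cudaMemcpy(&hTotal, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum of all A = %f\n", hTotal);

    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dTotal);
    return 0;
}
```

With only 10,000 threads, contention on one accumulator is unlikely to matter; for much larger inputs, reducing within each block first and issuing one atomic per block is a common refinement.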

Also, modern GPUs run at around 1-10 TFLOPS, i.e. 10^12 to 10^13 floating-point operations per second. So in terms of raw arithmetic your computation will finish in less than 1 microsecond, which is probably less than the overhead of queueing the command.
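As a rough check of that estimate, assume about three floating-point operations per thread (one multiply, one divide, one add; in practice a division costs more than one operation) and ideal throughput at the lower figure of 10^12 operations per second:

$$
t \approx \frac{10^{4}\ \text{threads} \times 3\ \text{operations}}{10^{12}\ \text{operations/s}} = 3 \times 10^{-8}\ \text{s} = 30\ \text{ns}
$$

At 10^13 operations per second the same count gives about 3 ns. Either figure is well below 1 microsecond, so launch and memory-transfer overhead, not arithmetic, would be expected to dominate the runtime.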