I will implement an algorithm that is largely hard to parallelize across lightweight threads, which conflicts with the SIMT architecture, so the kernel will mostly be serialized by the warp scheduler. In this case I think the number of multiprocessors, in other words the total number of warps that can be resident concurrently, matters more for performance than the other constraints, because warps within a multiprocessor run asynchronously.
In a GTX 480 (1344.96 GFLOPS) there are 15 multiprocessors → 15 * 48 = 720 concurrent warps
In a GTX 470 SLI setup (1088.64 GFLOPS * 2) there are 14 * 2 multiprocessors → 14 * 2 * 48 = 1344 concurrent warps
I haven't had the chance to test these two systems. How much performance difference should I expect between them for my case? I estimate that about 40% of my algorithm is parallelizable in the SIMT sense. Is that ratio enough to make GPU computing with CUDA worthwhile?
Yes, occupancy is another factor for performance, but occupancy as determined by block size, registers and shared memory will be the same on these two systems, since both are compute capability 2.0. In my case, the total number of active warps that run in parallel and independently is the key point, because of the kernel's limited parallelism within a warp. In other words, I benefit from warp-level parallelism per multiprocessor rather than thread-level parallelism within a warp. I want to be sure that the GTX 470 SLI setup would give a clear performance gain over a single GTX 480 for serialized instructions, thanks to the larger total active-warp capacity. Also, the GTX 480 hits hardware limits and overheats under long computations, and I think the GTX 470 is more stable than the GTX 480.
In conclusion, my assumptions and calculations may be wrong. Maybe I should compare these two systems for serialized flows by the maximum number of resident blocks and the number of cores per multiprocessor rather than by warps. Maybe I should think in terms of block parallelism rather than warp parallelism, but the ratio between the two systems is the same either way (15 / (14 * 2)). The breakdown below, followed by a small device-query sketch, shows the numbers I am working from.
GTX 480
According to max resident blocks: 15 MP * 8 = 120 resident blocks
According to cores: 15 MP * 32 = 480 cores
GTX 470 SLI
According to max resident blocks: 14 MP * 2 * 8 = 224 resident blocks
According to cores: 14 MP * 2 * 32 = 896 cores
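As a side note, here is a minimal host-side sketch of where these numbers come from; the multiprocessor count is read from the CUDA runtime, while the 48 warps, 8 blocks and 32 cores per SM are the compute-capability-2.0 limits I assumed above:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        // Per-SM limits for compute capability 2.0 (Fermi), as assumed above.
        const int warpsPerSM  = 48;  // max resident warps per multiprocessor
        const int blocksPerSM = 8;   // max resident blocks per multiprocessor
        const int coresPerSM  = 32;  // CUDA cores per multiprocessor

        printf("Device %d: %s, %d MPs -> %d resident warps, %d resident blocks, %d cores\n",
               d, prop.name, prop.multiProcessorCount,
               prop.multiProcessorCount * warpsPerSM,
               prop.multiProcessorCount * blocksPerSM,
               prop.multiProcessorCount * coresPerSM);
    }
    return 0;
}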
Depends on what you consider efficient (or worth the effort). With only 40% of the work parallelizable, Amdahl's law (http://en.wikipedia.org/wiki/Amdahl's_law) caps the overall speedup at less than about 1.67x, i.e. less than a 67% improvement. If you are happy with that, go ahead.
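For reference, a quick worked version of that bound, assuming the 40% parallel fraction stated above (P is the parallel fraction, S the speedup of the parallel part):

Speedup = 1 / ((1 - P) + P / S)
With P = 0.4 and S → ∞ (the parallel part made arbitrarily fast): max speedup = 1 / (1 - 0.4) = 1 / 0.6 ≈ 1.67x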
If you have a kernel where each thread in a warp takes a different code path, execution will take 32x as long. That is likely to negate any advantage CUDA has over a CPU version.
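As a concrete illustration, here is a minimal CUDA sketch (the kernel name and the per-lane operations are just placeholders) of that worst case: every lane of a warp takes its own branch, so the SIMT hardware executes the branches one after another with the other lanes masked off.

__global__ void fullyDivergentKernel(float *data, int n)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;   // lane index within the warp
    if (i >= n) return;

    // Each of the 32 lanes selects a different case, so the warp
    // serializes: the branches run one after another while the
    // remaining lanes sit idle (masked off).
    switch (lane) {
        case 0:  data[i] += 1.0f; break;
        case 1:  data[i] *= 2.0f; break;
        case 2:  data[i] -= 3.0f; break;
        // ... one distinct operation per lane ...
        default: data[i]  = 0.0f; break;
    }
}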
It is impossible to parallelize my algorithm completely across lightweight threads on the SIMT architecture because of path divergence and conditional flows. My kernel has lots of dynamic conditions that do not depend on the threadIdx variable, and lots of dynamic inner loops under those conditions. Under SIMT, threads in the same warp (or block) diverge and serialize on flows like these, then try to reconverge at a common instruction later in the process.
But there is another side to path divergence: I know every warp (or block) runs independently and asynchronously. As I understand it, the algorithm still parallelizes across warps (or blocks) during path divergence, even though it does not parallelize across the threads within a warp. My question is about this. Amdahl's law gives us the improvement of the parallel portion versus the serial one.
Have you considered using a different algorithm? There are usually many ways to address the same problem, some of which are more parallel-friendly than others.
Yes, I tried different flows. Because the if-conditions depend on matrix values and the loop counts change with those values, path divergence always occurs and #pragma unroll cannot be used. A small sketch of such a flow follows.
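For completeness, a minimal sketch of the kind of flow I mean (the matrix name, the condition and the loop bound are just placeholders): both the branch and the trip count depend on the data rather than on threadIdx, so threads in the same warp diverge unpredictably and #pragma unroll is not applicable, yet different warps still make progress independently.

__global__ void dataDependentKernel(const float *matrix, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v   = matrix[i];
    float acc = 0.0f;

    // The condition depends on the matrix value, not on threadIdx,
    // so neighbouring threads in a warp may take different paths.
    if (v > 0.0f) {
        // The trip count is data-dependent, so it cannot be unrolled
        // at compile time and varies from thread to thread.
        int iters = (int)v;   // placeholder bound derived from the data
        for (int k = 0; k < iters; ++k)
            acc += v / (k + 1.0f);
    } else {
        acc = -v;
    }

    out[i] = acc;
}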