How many threads run in parallel at a given moment? + PCI Express bus

I am using CUDA C in my project and I have three major questions:

  1. I want to explain the performance that I get, but I don’t understand how many threads can actually run in parallel at a given moment: not merely scheduled, but physically executing at the same time.

  2. Related to the first question, can one core execute more than one thread in parallel (again, not scheduled, but really at the same time)?

  3. I would like to estimate practical bandwidth values for two PCI Express configurations: 2.0 x16 (with a Quadro 600 GPU) and 3.0 x16 (with a GT650M GPU). According to what I have read, the theoretical values are 8 GB/s and 16 GB/s, respectively.

thanks in advance


On a well-configured system, you will see PCIe transfer rates of about 6 GB/s for PCIe gen2 and 12 GB/s for PCIe gen3 (where 1 GB/s = 1e9 bytes/second) for large transfers (say 16 MB). This is for transfers in one direction. PCIe supports full-duplex operation, so if your GPU has two copy engines (e.g. a Tesla card) you can transfer in both directions simultaneously; however, your host system may not be able to provide sufficient bandwidth to achieve full speed in both directions.

And if your GPU doesn’t have two copy engines, you can still overlap one copy with a kernel that uses mapped memory to perform the transfer in the opposite direction to achieve about the same throughput.
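To put the 6 GB/s and 12 GB/s figures next to the theoretical numbers from the question, here is a small sketch in plain C that derives the peak one-direction rate of a link from the per-lane signalling rate and line encoding (5 GT/s with 8b/10b for gen2, 8 GT/s with 128b/130b for gen3). The function names are made up for illustration, and the result ignores packet and protocol overhead, which is one reason measured rates come out lower.

```c
/* Peak bandwidth of one direction of a PCIe link in GB/s (1 GB = 1e9 bytes),
 * before any packet/protocol overhead.
 *   gt_per_s      raw signalling rate per lane in GT/s (5.0 gen2, 8.0 gen3)
 *   payload_bits  data bits per encoded group (8 for gen2, 128 for gen3)
 *   total_bits    bits on the wire per group (10 for gen2, 130 for gen3)
 *   lanes         link width (16 for an x16 slot)
 * Function names here are illustrative, not part of any CUDA API. */
double pcie_peak_gb_per_s(double gt_per_s, int payload_bits, int total_bits,
                          int lanes)
{
    double data_bits_per_s = gt_per_s * 1e9 * payload_bits / total_bits;
    double bytes_per_s_per_lane = data_bits_per_s / 8.0;
    return bytes_per_s_per_lane * lanes / 1e9;
}

/* Fraction of the theoretical peak that a measured rate reaches. */
double link_efficiency(double measured_gb_per_s, double peak_gb_per_s)
{
    return measured_gb_per_s / peak_gb_per_s;
}
```

Calling `pcie_peak_gb_per_s(5.0, 8, 10, 16)` gives 8 GB/s for gen2 x16, so 6 GB/s is 75% of peak. For gen3 x16 the same formula gives roughly 15.75 GB/s (the "16 GB/s" usually quoted is rounded), so 12 GB/s is about 76% of peak.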

The answer to your first question is a lot more difficult, because it depends on what you consider “running at the same time”. For the practical purpose of understanding your kernel’s performance, however, we can mostly ignore these more esoteric questions. The most important factor affecting the kernel’s performance is the achievable throughput for each instruction, which is listed in the “Maximize Instruction Throughput” section of the Programming Guide.

If throughput considerations alone don’t explain your findings, the next important concept is that of latency. Each instruction needs a certain amount of time to perform its operation. Independent instructions that follow in the instruction stream can be issued before the results of the previous instructions are available. However, if an instruction that depends on one of the previous results comes up, execution has to wait for that result to become available.
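The difference between dependent and independent instructions can be illustrated with a small sketch. It is written in plain C so it compiles anywhere, but the same loop bodies could sit inside a kernel; the function names are invented for this example. In the first function every add needs the result of the previous one, so each add waits out the full latency. In the second, the four adds of one iteration do not depend on each other and can be in flight together.

```c
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add,
 * so the adds form a single dependent chain. */
float sum_one_chain(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Four accumulators: the four adds inside one iteration are independent
 * of each other, giving four shorter dependent chains. */
float sum_four_chains(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n is not a multiple of 4 */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Both functions compute the same sum, although for general floating-point input the result may differ in the last bits because the additions are performed in a different order. Whether the second version is actually faster depends on the hardware and on how many other warps are available to fill the gaps.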

Latency for instructions operating entirely in registers is about 22 cycles (less for Kepler). Latency for shared memory accesses is a bit over 30 cycles. Latency for an L1 cache hit is presumably about the same. Latency for global memory access is somewhere between 400 and more than 1000 cycles, depending on memory bus load and a couple of other factors. Unfortunately these numbers are not published by Nvidia.
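A rough way to turn these latencies into a requirement is Little's law: the number of independent instructions that must be in flight equals the latency divided by the interval between issues. The sketch below is plain C with invented function names. The issue interval and the number of independent instructions per warp are assumed inputs that you would have to estimate for your own GPU and kernel, not published figures.

```c
/* Independent instructions that must be in flight to cover a given latency,
 * assuming one instruction can be issued every `cycles_per_issue` cycles.
 * This is Little's law (latency / issue interval), rounded up. */
unsigned in_flight_needed(unsigned latency_cycles, unsigned cycles_per_issue)
{
    return (latency_cycles + cycles_per_issue - 1) / cycles_per_issue;
}

/* Warps needed if each warp contributes `independent_per_warp` independent
 * instructions at a time (1 means a fully dependent instruction stream). */
unsigned warps_needed(unsigned latency_cycles, unsigned cycles_per_issue,
                      unsigned independent_per_warp)
{
    unsigned needed = in_flight_needed(latency_cycles, cycles_per_issue);
    return (needed + independent_per_warp - 1) / independent_per_warp;
}
```

With the 22-cycle register latency and an assumed issue interval of one cycle, `warps_needed(22, 1, 1)` is 22 warps, while two independent instructions per warp halve that to 11. Covering a 400-cycle global memory access under the same assumption would take 400 independent accesses in flight, which is why global memory latency is rarely hidden completely.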

thanks to you all :)