njuffa: Let me first note that all the runs described in this post were done on an M2050 on Amazon EC2. The M2050 has 14 SMs.
First I will give the timing results for my kernel, which follows:
#define NUMTHREADS 384   // 384 threads per block in all runs reported below

__global__ void
__launch_bounds__(NUMTHREADS, 1536/NUMTHREADS)
copystride(double *list, int n, double *copy)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x*gridDim.x;
    // grid-stride loop: each thread copies every stride-th element
    for (int i = tid; i < n; i += stride)
        copy[i] = list[i];
}
This kernel is practically identical to the kernel you use. For timing I used CUDA events (recommended by the Best Practices Guide, Section 2.1).
The value of n used was
n = 384*56*4000;   // n is 86016000
This makes for perfect load balancing across the threads.
With 384 threads per block and 56 thread blocks, I measured a bandwidth of 89 GB/s.
With 384 threads per block and 56000 thread blocks, I measured a bandwidth of 100 GB/s. Both these numbers were very consistent across runs.
56 and 56000 are natural choices for the number of thread blocks because the M2050 has 14 SMs, each of which can hold 4 of these blocks (14 x 4 = 56), and the chosen value of n gives perfect load balancing.
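For reference, here is a sketch of the host-side timing I used (a reconstruction rather than the exact code; error checking is omitted, and the factor of 2 counts one read plus one write per element):

#include <cstdio>

int main(void)
{
    const int threads = 384;
    const int blocks  = 56;               // or 56000 for the large-grid run
    const int n       = 384 * 56 * 4000;  // 86016000 elements
    double *list, *copy;
    cudaMalloc((void**)&list, n * sizeof(double));
    cudaMalloc((void**)&copy, n * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copystride<<<blocks, threads>>>(list, n, copy);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = 2.0 * n * sizeof(double) / (ms * 1.0e6);   // GB/s, with GB = 1e9 bytes
    printf("bandwidth = %.1f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(list);
    cudaFree(copy);
    return 0;
}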
I accept your point that having a large number of thread blocks makes for better bandwidth.

tera: "#pragma unroll 4" made no difference. I did not check the PTX to see whether the compiler actually unrolled; my guess is it did not. When I unrolled explicitly, things got slightly slower.
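For the record, this is roughly where I placed the pragma (the trip count of the loop is not known at compile time, so the compiler is free to ignore the request, which would be consistent with seeing no difference):

    #pragma unroll 4
    for (int i = tid; i < n; i += stride)
        copy[i] = list[i];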
Next I compiled your dcopy.cu.
nvcc --ptxas-options=-v -arch=sm_20 dcopy.cu -o dcopy.exe
When I ran it on the M2050, I got the following output:
[root@ip-10-17-129-108 bq-gpu]# dcopy.exe
dcopy: operating on vectors of 10000000 doubles (= 8.000e+07 bytes)
dcopy: using 384 threads per block, 26042 blocks
dcopy: mintime = 1.735 msec throughput = 92.22 GB/sec
On other runs the throughput reported was as low as 82 GB/s.
I noticed that you were using a CPU timer in your second() function. So, following Best Practices, Section 2.1, I inserted cudaThreadSynchronize() as the very first line of your second() (for the unix version only). I got the following output.
[root@ip-10-17-129-108 bq-gpu]# dcopy.exe
dcopy: operating on vectors of 10000000 doubles (= 8.000e+07 bytes)
dcopy: using 384 threads per block, 26042 blocks
dcopy: mintime = 2.571 msec throughput = 62.23 GB/sec
On other runs the reported throughput was as high as 71 GB/s.
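For reference, the modified timer looked roughly like this (a sketch only; I am assuming the unix second() wraps gettimeofday(), so the synchronization call is the only new line):

#include <sys/time.h>

static double second(void)
{
    cudaThreadSynchronize();   // drain pending GPU work before reading the CPU clock
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}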
-
I think using a power of 10 for the size of the vector is not good for load balance.
-
The peak bandwidth of the M2050 is calculated from its memory clock of about 1.55 GHz, two transfers per cycle, and the 384-bit width of the memory interface (Best Practices, Section 2.2.1): 1.546e9 x 2 x 48 bytes per transfer works out to about 148 GB/s.
-
One would expect better bandwidth from a program that does more reading than writing. So I wrote a kernel which adds all the numbers in a list. With 56 blocks I got a bandwidth of 95 GB/s. With 56000 blocks I got a bandwidth of 108 GB/s. Note that with 56000 blocks, the kernel writes 56000*384 doubles to global memory so that each thread may save its result. That is a lot of writing.
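For reference, a sketch of that summation kernel (written from memory; the name and the partial-sum array are illustrative):

__global__ void
__launch_bounds__(NUMTHREADS, 1536/NUMTHREADS)
sumstride(const double *list, int n, double *partial)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x*gridDim.x;
    double sum = 0.0;
    // many reads per thread, but only a single write at the end
    for (int i = tid; i < n; i += stride)
        sum += list[i];
    partial[tid] = sum;   // with 56000 blocks this is 56000*384 doubles written
}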
-
My guess is that one may approach the peak bandwidth if the program only reads into a register but never writes. I tried to write some inline PTX assembly to do that, but could not get it to work. My impression is that nvcc's support for inline assembly is limited.
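Something along these lines is what I had in mind (a rough sketch only, not a working program from my runs; the constraint letters follow NVIDIA's inline-PTX notes, "d" for an .f64 register and "l" for a 64-bit address):

__global__ void readonly(const double *list, int n)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x*gridDim.x;
    double v;
    for (int i = tid; i < n; i += stride) {
        // the volatile asm keeps the load from being optimized away,
        // and nothing is ever stored back to global memory
        asm volatile ("ld.global.f64 %0, [%1];" : "=d"(v) : "l"(list + i));
    }
}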
-
I would like to understand why having 56000 blocks instead of 56 helps. With 56 blocks, each thread block can be scheduled on an SM and the number of warps is (384/32)*56 = 672. Perhaps that is not enough to hide the latency to DRAM. I tried 560 blocks. Surely 6720 warps is enough to hide the latency to DRAM. Yet the performance was only 90 GB/s, not the 100 GB/s that I got with 56000 blocks. What is going on? Why does using so many blocks help? I would expect that every time a block finishes running and another block is brought in, there is overhead to set up the registers of the new block. Why does that overhead not hurt? How much is that overhead? The NVIDIA documentation is completely silent on this issue.