Last instalment for the moment, a puzzle instead of your Sunday crossword - updating the test to:
#include <stdio.h>
#include <cutil.h>

#define RESULTS 4

__global__ void
testKernel(int* result)
{
    int t1, t2, t3, t4;
    __shared__ int x, y, z;

    z = clock();
    __syncthreads();
    t1 = clock();
    __syncthreads();
    x = z + 534;
    __syncthreads();
    t2 = clock();
    __syncthreads();
    y = x % 624;
    __syncthreads();
    t3 = clock();
    __syncthreads();
    z = y * 729;
    __syncthreads();
    t4 = clock();
    __syncthreads();

    if (threadIdx.x == 0)
    {
        result[3] = z;                   // Result is used
        result[0] = t2 - t1;             // Total add/sync clock shared read&write time
        result[1] = t3 - t2 - (t2 - t1); // mod over add
        result[2] = t4 - t3 - (t2 - t1); // mul32 over add
    }
}
int
main(int argc, char** argv)
{
    int result[RESULTS];
    int* d_data;

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, sizeof(int) * RESULTS));

    dim3 grid;      // defaults to 1x1x1
    dim3 threads;   // defaults to 1x1x1, i.e. a single thread

    // Two launches at the default single-thread configuration; only the
    // second launch's results are read back.
    testKernel<<<grid, threads>>>(d_data);
    CUT_CHECK_ERROR("Kernel execution failed");
    testKernel<<<grid, threads>>>(d_data);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS, cudaMemcpyDeviceToHost));
    printf("%8d%8d%8d%8d\n", threads.x, result[0], result[1], result[2]);

    for (threads.x = 32; threads.x <= 512; threads.x += 32)
    {
        testKernel<<<grid, threads>>>(d_data);
        CUT_CHECK_ERROR("Kernel execution failed");
        CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS, cudaMemcpyDeviceToHost));
        printf("%8d%8d%8d%8d\n", threads.x, result[0], result[1], result[2]);
    }

    CUDA_SAFE_CALL(cudaFree(d_data));
    return 0;
}
gives, on the 0.8 release (so double every number to get new GPU clocks):
First run:

 threads   t2-t1 mod-add mul-add
       1      52     908      48
      32     172     914      50
      64     180     908      50
      96     180     904      50
     128     188     912     162
     160     310     914      46
     192     326     964      48
     224     366    1008       2
     256     338    1082      68
     288     386    1138      74
     320     450    1174      68
     352     470    1286      76
     384     488    1404     120
     416     532    1502     126
     448     584    1558     114
     480     700    1652      68
     512     646    1798     150

Second run:

 threads   t2-t1 mod-add mul-add
       1      52     892      48
      32     172     914      50
      64     180     906      50
      96     180     912      50
     128     188     918     162
     160     310     936      46
     192     326    1000      48
     224     366     956      62
     256     338    1048      70
     288     380    1134      80
     320     450    1190      64
     352     470    1310      74
     384     488    1412     106
     416     532    1508     126
     448     578    1616     120
     480     692    1658      76
     512     646    1776     158
There is a lot of information one can deduce from these numbers, some of which goes against the obvious interpretation of the guide.
The single-threaded run cannot have any register conflicts, shared memory conflicts or anything else like that, so it is likely pretty clean. If we trust the guide that mul32 takes 4x as long as add, then the 48 in the mul-add column (96 new GPU clocks for the 3 extra instruction cycles of a mul32) says the minimum instruction cycle time of an MP core is 32 new GPU clocks - 8 times slower than the 4 clocks per instruction one is led to assume. No wonder single-threaded code runs so slowly… As warps are added, things don't change a lot until those 32 clocks are used up, which makes me think that when a warp stalls it probably still uses instruction cycles.
Given this assumption, the mod op definitely looks like 58 or so instruction cycles (the mod-add column sits around 908, i.e. about 1816 new clocks, which is roughly 57 cycles of 32 clocks on top of the add's single cycle).
The other interesting thing is that the single-threaded add time (clock to clock) is only 104 new clocks and the mul32 time is an additional 96, so we can conclude the kernel launcher deleted the bar.wait instructions, as it is entitled to do within a warp (which does mean the code is being reloaded on each kernel call - latency). That leaves only 8 clocks for the load-clock-into-register op… the one thing that does not add up??
The next interesting thing is that the add clock-to-clock time for 1 warp is 240 new GPU clocks more than the single-threaded version. Now, the launcher still would not have put the bar.wait instructions back in yet, and since this app uses 14 registers the loader should have allocated them to separate register banks, so there should be no register bank conflict (assuming 16+ register banks and the same technology as shared memory); there should also be no read-after-write or write-after-read conflicts, as there is plenty of time between setting and using registers and shared memory. That leaves only one possibility: a shared memory write conflict. Here we have a 32-way bank conflict with a full register write cycle time each - which gives 2 whole instruction times (8 clocks) as the write cycle time of shared memory - and it blows my theory that, if one were designing hardware to do this, one would have to decide which order to do the writes in and so would just do the last, as all the others are redundant. This is the second big surprise.
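One way to test the write-conflict explanation (just a sketch, not run, with an illustrative kernel name) is a variant where only thread 0 touches the shared variables, so the 32-way same-address write should disappear from the timed add section - assuming the compiler keeps the guarded write as a single-thread write, which may not hold:

    // Hypothetical variant of the timing kernel: only thread 0 writes the shared
    // variables, so the 32-way same-address write conflict should disappear from
    // the add section (assuming the compiler keeps the guarded writes as coded,
    // and bearing in mind the guard itself adds a divergent branch to the warp).
    __global__ void
    testKernelOneWriter(int* result)
    {
        int t1, t2;
        __shared__ int x, z;

        if (threadIdx.x == 0)
            z = clock();
        __syncthreads();
        t1 = clock();
        __syncthreads();
        if (threadIdx.x == 0)
            x = z + 534;            // same add as before, but a single writer
        __syncthreads();
        t2 = clock();
        __syncthreads();

        if (threadIdx.x == 0)
        {
            result[3] = x;          // keep x live so the add cannot be optimised away
            result[0] = t2 - t1;    // compare against result[0] from the original kernel
        }
    }

Launched from the same main() over the same thread counts, the change in result[0] relative to the original kernel would put a number directly on the cost of the conflicting write (at the price of the divergent branch from the guard).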
Any other ideas or further interpretation as the number of warps increases? There is certainly a lot going on here and it is a complex set of equations to solve. (I did not run this code.)
Eric
ed: Just thought of a good reason for mul32 and add taking the same time in one of the 224-thread runs: the hardware detects that the top bytes are zero and does a mul24 for you, so there is no need to explicitly code mul24…
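A way to check this (sketch only, not run; the kernel name is just for illustration) would be to time a plain 32-bit multiply against an explicit __mul24() on the same operand with zero top bytes - if the hardware narrows the multiply on its own, the two sections should time the same:

    // Hypothetical check of the automatic-mul24 theory: time a plain 32-bit
    // multiply against an explicit __mul24() on the same operand with zero top
    // bytes. Equal timings would mean the hardware already narrows the multiply.
    __global__ void
    testMul24(int* result)
    {
        int t1, t2, t3;
        __shared__ int x, y, z;

        z = clock() & 0xffff;       // small operand, top bytes zero
        __syncthreads();
        t1 = clock();
        __syncthreads();
        x = z * 729;                // plain mul32 on the small operand
        __syncthreads();
        t2 = clock();
        __syncthreads();
        y = __mul24(z, 729);        // explicit 24-bit multiply on the same operand
        __syncthreads();
        t3 = clock();
        __syncthreads();

        if (threadIdx.x == 0)
        {
            result[3] = x + y;      // keep both products live
            result[0] = t2 - t1;    // mul32 section (write conflict present in both)
            result[1] = t3 - t2;    // mul24 section
        }
    }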
ed: (to save a post) A quick revisit, and my suggestion is that at less than 33% occupancy the G80 is throttled by instruction fetch bandwidth. It seems to go into a funny mode, like a super-warp of 8 warps (not strictly, but 8 warps are run taking the full instruction cycle times even though as many as 7 are doing nothing). The load clock does add up, as it does not need the ALUs (a direct broadcast of the hardware clock register to 32 thread registers) and so can happen in 1 clock, which implies 32 banks in the register store - believable, as it is twice as big as shared memory and an obvious design decision to minimise conflicts.
The write conflict to shared mem is definitely there and appears to take 32 clocks for a 32-way bank (bus) conflict. It seems the hardware has to get all the data before it decides what to write (as suggested by the Good programming thread, and by the current compiler's use of these concurrent writes even when the write is explicitly coded as single-threaded - which could actually be a bug). Not as bad as mentioned above, but good to know the cost: 8x that of writes to separate shared memory locations without a bank conflict, if all threads are running (code not divergent). I expect this delay could get much smaller in future revisions of the hardware if the bus conflict cycles can be aborted as soon as the result is known. That may already be the case, with writing the same value from all 32 threads being the worst-case scenario that requires all the data to be checked.
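A sketch (not run, names for illustration only) that would measure the two cases side by side - every thread writing the same shared location versus every thread writing its own slot:

    // Hypothetical side-by-side timing: every thread writing the same shared
    // location (32-way write conflict per warp) versus every thread writing its
    // own slot (no bank conflict), with the same value written in both cases.
    #define MAX_THREADS 512

    __global__ void
    testSharedWrite(int* result)
    {
        int t1, t2, t3;
        __shared__ int x;
        __shared__ int buf[MAX_THREADS];

        t1 = clock();
        __syncthreads();
        x = 729;                        // all threads hit one location
        __syncthreads();
        t2 = clock();
        __syncthreads();
        buf[threadIdx.x] = 729;         // one location per thread, conflict free
        __syncthreads();
        t3 = clock();
        __syncthreads();

        if (threadIdx.x == 0)
        {
            result[3] = x + buf[0];     // keep both writes live
            result[0] = t2 - t1;        // conflicting write + syncs
            result[1] = t3 - t2;        // conflict-free write + syncs
        }
    }

The ratio of result[0] to result[1] over the same range of thread counts as above would show directly whether the 8x figure holds.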
Along the way I concluded that syncs do not cost 4 clocks each; they only cost anything to the last warp to reach them, and then probably less than 4 clocks.
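A sketch (not run, name for illustration only) that would put a number on the sync cost by timing eight back-to-back barriers against back-to-back clock reads:

    // Hypothetical measurement of the barrier cost: time eight back-to-back
    // __syncthreads() against two back-to-back clock() reads in the same kernel.
    __global__ void
    testSyncCost(int* result)
    {
        int t1, t2, t3;

        t1 = clock();
        t2 = clock();               // baseline: nothing between the two reads
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        t3 = clock();

        if (threadIdx.x == 0)
        {
            result[0] = t2 - t1;    // clock-to-clock baseline
            result[1] = t3 - t2;    // baseline plus eight barriers
        }
    }

(result[1] - result[0]) / 8 then gives the per-sync cost as the block size grows; within a single warp the launcher may strip the barriers as discussed above, so the multi-warp numbers are the interesting ones.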