Here are the kernel execution times for 128 threads:
128x1: ~0.0932 ms
64x2 : ~0.0931 ms
32x4 : ~0.0932 ms
16x8 : ~0.0939 ms
8x16 : ~0.0937 ms
4x32 : ~0.0938 ms
2x64 : ~0.0946 ms
1x128: ~0.160 ms
I guess the difference between 64 threads and 128 threads for 8x8 vs 16x8 has to have something to do with register spilling to local memory, as insmvb00 suggested…
For those of you who start off by reading this post:
The difference in time between the different configurations has already been explained in the previous post.
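For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the eight block shapes could be timed with CUDA events. The kernel `scale` and the 1024x1024 problem size are assumptions made up for illustration; they are not the kernel or data measured above, so the numbers it prints will differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread scales one element of a 2D array.
__global__ void scale(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;
}

int main()
{
    const int width = 1024, height = 1024;   // assumed problem size
    float *d_data = nullptr;
    cudaMalloc(&d_data, width * height * sizeof(float));
    cudaMemset(d_data, 0, width * height * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Every block shape with 128 threads: 128x1, 64x2, ..., 1x128.
    for (int bx = 128; bx >= 1; bx /= 2) {
        dim3 block(bx, 128 / bx);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);

        scale<<<grid, block>>>(d_data, width, height);   // warm-up launch
        cudaEventRecord(start);
        scale<<<grid, block>>>(d_data, width, height);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                      // wait for the kernel

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3ux%-3u: %.4f ms\n", block.x, block.y, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Averaging over many launches instead of timing a single one would likely give steadier figures at this sub-millisecond scale.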
I have tried changing maxrregcount (it was set to 32, which should be enough) to 64, but I still see the same number of st.local and ld.local instructions in the ptx output (I'm using Visual Studio). If I set maxrregcount to 8 it should spill more registers to local memory, right? But it still shows the same number of st.local and ld.local instructions.
When I changed maxrregcount to 8 I saw (I missed it the last time) that the lmem in the ptxas output increased (from 56 to 92 bytes), hence (as you said) it had to put some of the registers needed by the thread in local memory. I did not, however, see any difference in the ptx file that I generated; this is what confused me =)
Do st.local and ld.local always mean that some registers had to be put in local memory, and that this could be avoided?
If I compile with maxrregcount set anywhere from 14 to 64, it shows a local memory usage of 56 bytes per thread (I don't know if that was your question).
I tried -use_fast_math and, as you said, it did improve the kernel execution time. The ptxas output showed that it only needed 8 registers instead of 14 as before, and no lmem was reported.
So if I got this right:
The 56 bytes of local memory are registers that have been spilled to local memory? If so, do you think this is a result of (as you mentioned before) a precaution by the compiler, since I'm so close to the 16K register space?
Once again, thank you for clearing things up for me.
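On the st.local / ld.local question, a small experiment may help. To my understanding, PTX is written with virtual registers and the actual register allocation happens later in ptxas, which would explain why changing maxrregcount moves the lmem figure in the ptxas output without changing the .ptx file. Local loads and stores that appear in the PTX itself usually come from per-thread arrays (or variables whose address is taken), not from register pressure. The sketch below uses a hypothetical kernel, `dynamicIndex`, with a made-up 16-element scratch array; it is not the kernel discussed in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a per-thread array indexed with a run-time value.
// Registers cannot be addressed dynamically, so the compiler typically
// places such an array in local memory regardless of the register limit.
__global__ void dynamicIndex(const int *idx, float *out)
{
    float scratch[16];                       // 16 * 4 = 64 bytes per thread
    for (int i = 0; i < 16; ++i)
        scratch[i] = i * 0.5f + threadIdx.x;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = scratch[idx[tid] & 15];       // index known only at run time
}

// Suggested experiments (file name is an assumption):
//   nvcc -Xptxas -v local.cu                      reports registers and lmem
//   nvcc -Xptxas -v --maxrregcount=8 local.cu     same, with a register cap
//   nvcc -ptx local.cu                            writes the PTX to inspect
int main()
{
    const int n = 128;
    int h_idx[n];
    for (int i = 0; i < n; ++i)
        h_idx[i] = i;

    int *d_idx = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);

    dynamicIndex<<<1, n>>>(d_idx, d_out);

    float h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[5] = %f\n", h_out[5]);

    cudaFree(d_idx);
    cudaFree(d_out);
    return 0;
}
```

If the ptxas lmem figure for this kernel stays the same across register limits that are high enough, that part of local memory comes from the array and not from spilling. Only the extra bytes that appear under a tight limit would be spilled registers.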