When PTX code is generated, an atomic function is turned into either an atom or a red instruction.
But in my case, the atom instruction is always generated.
Can you give me an example of generating red instruction?
Well, I can’t. According to the PTX specification, [font=“Courier New”]red[/font] is not atomic. So whenever it would be generated, it would be in error.
I don’t think that’s true; my understanding is that the reduction instructions are just like the atomic instructions, except they don’t return the original value at the address.
So, in theory, we could generate red instructions when you don’t use the return value of one of the atomic* intrinsics. I haven’t verified this, however.
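Something like this is what I have in mind (just a sketch with made-up names; I haven’t checked what the compiler actually emits for it):
[codebox]
// The first atomicAdd discards the old value, so in theory it could be lowered
// to a one-way red; the second consumes the old value, so it has to stay an atom.
__global__ void red_candidate(int *counter, int *out)
{
    atomicAdd(counter, 1);                      // return value unused -> red candidate
    out[threadIdx.x] = atomicAdd(counter, 1);   // old value needed -> must remain atom
}
[/codebox]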
Here is what the PTX ISA says about the red instruction on page 137:
while I would read appendix B.11 of the Programming Guide as asserting atomicity with respect to normal stores as well:
Right, but red instructions would be considered atomic with respect to other red instructions, just not with respect to arbitrary st instructions. This rarely matters in real applications; how often do you end up writing something like this pseudocode:
if(threadIdx.x == 0)
{
    reductionAdd(value, 1);
}
else
{
    *value = 0;
}
?
I agree it rarely happens, but sometimes code like this does get written. I certainly don’t want the documentation to be inaccurate just because somebody decides that it would rarely matter in practice.
If Nvidia ever decides to change the current behavior, I’d expect them to update the documentation accordingly and place a big caveat in the release notes.
__global__ void atomic_unit_test(int *g_data)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int nthreads = blockDim.x;
    int id = tid + bid*nthreads;

    if ( tid == 0 )
        atomicAdd(g_data, 1);
    else
        *g_data = 0;
}
I think that your pseudocode and this code have the same meaning. But the red instruction is not generated.
I agree that the documentation should be as unambiguous as possible, I was speaking more towards the usefulness of a red instruction as opposed to treating everything as atomic.
From a hardware perspective, the semantics of an atomic instruction require a round trip from an SM issuing a request to an atomic ALU attached to the last level cache that locks access to a cache line, services it, and sends the original value back to the SM. A red instruction requires only a one sided message to the atomic ALU that only needs to process one red at a time and can share access to the cache with other write ports without the need for locking. I have no idea if nvidia implements it like this, but other architectures do and there is generally a performance advantage when using one sided reduction instructions on these processors.
As with many other features, just because it is exposed in the ISA does not mean that the compiler will take advantage of it… I am really trying to push for an open compiler implementation for PTX so that people can take advantage of these features by implementing them if they really want them. There are some really cool new ISA features for Fermi, such as selective barriers and controlled cache accesses, as well as some old ones that have been around since Tesla (predication!) that are currently not being generated by nvcc as far as I can tell.
On the other hand, just because it is not exposed in PTX does not mean that ptxas will not take advantage of it. :)
In fact ptxas will turn atom PTX instructions into REDs (opcode D6) when the output register is not reused. (Which suggests that the documentation is inaccurate…)
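Roughly, the pattern it recognizes would look like this (just a sketch with a made-up kernel; I did not keep the exact machine code around):
[codebox]
// The atom writes its result into "old", which is never read again,
// so ptxas is free to replace the ATOM with a RED in the generated machine code.
__global__ void dead_result(unsigned int *addr)
{
    unsigned int old;
    asm volatile("atom.global.add.u32 %0, [%1], %2;"
                 : "=r"(old)
                 : "l"((unsigned long)addr), "r"(1U)
                 : "memory");
}
[/codebox]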
Also, it uses predication a lot.
Target-dependent opcodes are cool, but I believe they should not be exposed in PTX when it can be avoided… In a perfect world, they would just get inferred automagically by the backend compiler.
Although I admit that we still need many of them in practice.
And unfortunately, ptxas is a black box…
That is interesting. So you could run into correctness issues with conflicting st/atomic instructions if the destination register in PTX is not used, as long as the hardware actually implements red differently from atom. On the other hand, we do exactly the same thing in the CPU backend of Ocelot (atomics can be messed up if you insert a store in between) for performance reasons (see bug 35: http://code.google.com/p/gpuocelot/issues/detail?id=35 ).
Good to know. I’m glad it is getting used after all.
I’ve come to the conclusion that ptxas is nearly a full-blown optimizing compiler, which completely goes against the concept of having a platform-independent ISA like PTX for doing high-level optimizations. What really gets me is that this is not the first example of this that I have seen. Jello in LLVM does the same thing, as do at least two other GPU backend compilers that I have seen (although I can cut them some slack because they were probably developed for shader languages before the existence of PTX/IL/LLVM/etc).
Sylvain, you are brilliant. I had seen this happen just a week or two ago, but had forgotten about it.
Which makes me wonder whether [font=“Courier New”]red[/font] in fact is atomic with respect to stores. I’ve been hammering at it for a while with the following program, but so far it seems to behave entirely atomically (of course, that will never prove that some subtle timing difference wouldn’t hit the race if there is one).
[codebox]
#include <stdio.h>

#define N 30000

__device__ volatile int anvil[512][32];
__device__ int violations = 0;
__device__ int blocks = 0;

// Each pair of blocks (2k, 2k+1) shares one row of anvil: the odd block hammers
// it with atomics (atomicAnd with 0xf leaves the stored 0/1 values unchanged),
// the even block uses plain stores and checks whether any of them got lost.
__global__ void hammer(void)
{
    const int block = blockIdx.x/2;
    const int thread = threadIdx.x;
    int* anvil_p = (int*)&(anvil[block][thread]);
    int __attribute__ ((unused)) dummy;

    if (blockIdx.x & 1) {
        for (int i=0; i<2*N; i++) {
            atomicAnd(anvil_p, 0xf);
//          asm("red.global.and.b32 [%1], 0xf;" : "=r" (dummy) : "l" ((unsigned long)anvil_p));
        }
    } else {
        for (int i=0; i<N; i++) {
            anvil[block][thread] = 1;
            if (anvil[block][thread] != 1)
                atomicAdd(&violations, 1);
            anvil[block][thread] = 0;
            if (anvil[block][thread] != 0)
                atomicAdd(&violations, 1);
        }
    }

    if (threadIdx.x==0)
        atomicAdd(&blocks, 1);
}

int main(void)
{
    int v, b;

    hammer<<<1024,32>>>();
    cudaMemcpyFromSymbol(&v, violations, sizeof(v), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&b, blocks, sizeof(b), 0, cudaMemcpyDeviceToHost);
    printf("%s.\n", cudaGetErrorString(cudaGetLastError()));
    printf("%d violations detected after execution of %d blocks.\n", v, b);

    return 0;
}
[/codebox]
I’ve run this on sm_11 and sm_13 devices so far. I wonder whether Fermi would make a difference; it would be interesting if someone could test this.
By the way, the same cubin is produced if the [font=“Courier New”]red[/font] instruction from the asm is used, which confirms Sylvain’s statement.
I only looked at the code generated for Tesla. There is still a possibility that red behaves differently on Fermi, and that the compiler and the documentation are both correct.
As for predication, I think it was used from the start (say, CUDA 0.8…). The documentation even mentions the thresholds used for deciding when to apply if-conversion.
What was not properly handled until 2.0 or 2.1 were the flags that could be written by any instruction. Like:
add.f32 p1|r2, r0, r1 # r2 = r0 + r1
@p1.leu br label # jump when r2 is negative, zero or NaN
which saves a compare instruction.
But this feature was dropped for Fermi anyway…
(I personally find the Fermi ISA more boring than Tesla’s…)
I am not a compiler guy, but it seems to make sense to me to make the virtual ISA as high-level as possible, and refrain from performing any optimization that would not benefit every target?
Even high-level optimizations like inlining, loop unrolling or if-conversion have a different impact for every target.
I believe that languages like PTX were partly designed because shader assembly languages were considered too low-level. Maybe you’ve already seen Norm Rubin’s slides explaining how the Microsoft compiler and their own compiler were fighting each other: CGO2008 keynote
:">
Actually, testing suggests that red has the same performance as atom on Tesla:
http://strobe.cc/cuda_atomics/
So it is believable that they also act the same way with respect to stores.
But Fermi is a different beast…