When PTX code is generated, an atomic function is turned into either an atom or a red instruction.
But in my case, the atom instruction is always generated.
Can you give me an example of generating red instruction?
Well, I can’t. According to the PTX specification, [font=“Courier New”]red[/font] is not atomic. So whenever it would be generated, it would be in error.
I don’t think that’s true; my understanding is that the reduction instructions are just like the atomic instructions, except they don’t return the original value at the address.
So, in theory, we could generate red instructions when you don’t use the return value of one of the atomic* intrinsics. I haven’t verified this, however.
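Something like this is what I have in mind (just a sketch with made-up names; I haven’t checked what the compiler actually emits for it):
[codebox]
// The first atomicAdd discards the old value, so in theory it could be lowered
// to a one-way red; the second consumes the old value, so it has to stay an atom.
__global__ void red_candidate(int *counter, int *out)
{
    atomicAdd(counter, 1);                      // return value unused -> red candidate
    out[threadIdx.x] = atomicAdd(counter, 1);   // old value needed -> must remain atom
}
[/codebox]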
Here is what the PTX ISA says about the red instruction on page 137:
while I would read appendix B.11 of the Programming Guide as asserting atomicity with respect to normal stores as well:
Right, but red instructions would be considered atomic with respect to other red instructions, just not with respect to arbitrary st instructions. This rarely matters in real applications; how often do you end up writing something like this pseudocode:
if(threadIdx.x == 0)
{
    reductionAdd(value, 1);
}
else
{
    *value = 0;
}
?
I agree it rarely happens, but sometimes code like this does get written. I certainly don’t want the documentation to be inaccurate just because somebody decides that it would rarely matter in practice.
If Nvidia ever decides to change the current behavior, I’d expect them to update the documentation accordingly and place a big caveat in the release notes.
__global__ void atomic_unit_test(int *g_data)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int nthreads = blockDim.x;
    int id = tid + bid*nthreads;

    if ( tid == 0 )
        atomicAdd(g_data, 1);
    else
        *g_data = 0;
}
I think that your pseudocode and this code have the same meaning. But the red instruction is not generated.
I agree that the documentation should be as unambiguous as possible, I was speaking more towards the usefulness of a red instruction as opposed to treating everything as atomic.
From a hardware perspective, the semantics of an atomic instruction require a round trip from an SM issuing a request to an atomic ALU attached to the last level cache that locks access to a cache line, services it, and sends the original value back to the SM. A red instruction requires only a one sided message to the atomic ALU that only needs to process one red at a time and can share access to the cache with other write ports without the need for locking. I have no idea if nvidia implements it like this, but other architectures do and there is generally a performance advantage when using one sided reduction instructions on these processors.
As with many other features, just because it is exposed in the ISA does not mean that the compiler will take advantage of it… I am really trying to push for an open compiler implementation for PTX so that people can take advantage of these features by implementing them if they really want them. There are some really cool new ISA features for Fermi, such as selective barriers and controlled cache accesses, as well as some old ones that have been around since Tesla (predication!) that are currently not being generated by nvcc as far as I can tell.
On the other hand, just because it is not exposed in PTX does not mean that ptxas will not take advantage of it. :)
In fact ptxas will turn atom PTX instructions into REDs (opcode D6) when the output register is not reused. (Which suggests that the documentation is inaccurate…)
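Roughly, the pattern it recognizes would look like this (just a sketch with a made-up kernel; I did not keep the exact machine code around):
[codebox]
// The atom writes its result into "old", which is never read again,
// so ptxas is free to replace the ATOM with a RED in the generated machine code.
__global__ void dead_result(unsigned int *addr)
{
    unsigned int old;
    asm volatile("atom.global.add.u32 %0, [%1], %2;"
                 : "=r"(old)
                 : "l"((unsigned long)addr), "r"(1U)
                 : "memory");
}
[/codebox]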
Also, it uses predication a lot.
Target-dependent opcodes are cool, but I believe they should not be exposed in PTX when it can be avoided… In a perfect world, they would just get inferred automagically by the backend compiler.
Although I admit that we still need many of them in practice.
And unfortunately, ptxas is a black box…
That is interesting. So you could run into correctness issues with conflicting st/atomic instructions if the destination register in PTX is not used, as long as the hardware actually implements red differently from atom. On the other hand, we do exactly the same thing in the CPU backend of Ocelot (atomics can be messed up if you insert a store in between) for performance reasons (see bug 35: http://code.google.com/p/gpuocelot/issues/detail?id=35 ).
Good to know. I’m glad it is getting used after all.
I’ve come to the conclusion that ptxas is nearly a full-blown optimizing compiler, which completely goes against the concept of having a platform-independent ISA like PTX for doing high-level optimizations. What really gets me is that this is not the first example of this that I have seen. Jello in LLVM does the same thing, as do at least two other GPU backend compilers that I have seen (although I can cut them some slack because they were probably developed for shader languages before the existence of PTX/IL/LLVM/etc).
Sylvain, you are brilliant. I had seen this happen just a week or two ago, but had forgotten about it.
Which makes me wonder whether [font=“Courier New”]red[/font] in fact is atomic with respect to stores. I’ve been hammering at it for a while with the following program, but so far it seems to behave entirely atomically (of course, that will never prove that some subtle timing difference wouldn’t hit the race if there is one).
[codebox]
#include <stdio.h>

#define N 30000

__device__ volatile int anvil[512][32];
__device__ int violations = 0;
__device__ int blocks = 0;

// Each pair of blocks (2k, 2k+1) shares one row of anvil: the odd block hammers
// it with atomics (atomicAnd with 0xf leaves the stored 0/1 values unchanged),
// the even block uses plain stores and checks whether any of them got lost.
__global__ void hammer(void)
{
    const int block = blockIdx.x/2;
    const int thread = threadIdx.x;
    int* anvil_p = (int*)&(anvil[block][thread]);
    int __attribute__ ((unused)) dummy;

    if (blockIdx.x & 1) {
        for (int i=0; i<2*N; i++) {
            atomicAnd(anvil_p, 0xf);
//          asm("red.global.and.b32 [%1], 0xf;" : "=r" (dummy) : "l" ((unsigned long)anvil_p));
        }
    } else {
        for (int i=0; i<N; i++) {
            anvil[block][thread] = 1;
            if (anvil[block][thread] != 1)
                atomicAdd(&violations, 1);
            anvil[block][thread] = 0;
            if (anvil[block][thread] != 0)
                atomicAdd(&violations, 1);
        }
    }

    if (threadIdx.x==0)
        atomicAdd(&blocks, 1);
}

int main(void)
{
    int v, b;

    hammer<<<1024,32>>>();
    cudaMemcpyFromSymbol(&v, violations, sizeof(v), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&b, blocks, sizeof(b), 0, cudaMemcpyDeviceToHost);
    printf("%s.\n", cudaGetErrorString(cudaGetLastError()));
    printf("%d violations detected after execution of %d blocks.\n", v, b);

    return 0;
}
[/codebox]
I’ve run this on sm_11 and sm_13 devices so far. I wonder whether Fermi would make a difference; it would be interesting if someone could test this.
By the way, the same cubin is produced if the [font=“Courier New”]red[/font] instruction from the asm is used, which confirms Sylvain’s statement.
I only looked at the code generated for Tesla. There is still a possibility that red behaves differently on Fermi, and that the compiler and the documentation are both correct.
As for predication, I think it was used from the start (say, CUDA 0.8…). The documentation even mentions the thresholds used for deciding when to apply if-conversion.
What was not properly handled until 2.0 or 2.1 were the flags that could be written by any instruction. Like:
add.f32 p1|r2, r0, r1 # r2 = r0 + r1
@p1.leu br label # jump when r2 is negative, zero or NaN
which saves a compare instruction.
But this feature was dropped for Fermi anyway…
(I personally find the Fermi ISA more boring than Tesla’s…)
I am not a compiler guy, but it seems to make sense to me to make the virtual ISA as high-level as possible, and refrain from performing any optimization that would not benefit every target?
Even high-level optimizations like inlining, loop unrolling or if-conversion have a different impact for every target.
I believe that languages like PTX were partly designed because shader assembly languages were considered too low-level. Maybe you’ve already seen Norm Rubin’s slides explaining how the Microsoft compiler and their own compiler were fighting each other: CGO2008 keynote
:">
Actually, testing suggests that red has the same performance as atom on Tesla:
http://strobe.cc/cuda_atomics/
So it is believable that they also act the same way with respect to stores.
But Fermi is a different beast…