Programming the PTX virtual machine: many high-level issues resolved

I have been playing with the PTX virtual machine assembler all weekend, and a
number of things have come up. So here are my “Top 10” questions. I’d
appreciate it if you would share your experiences here.

  1. I am not sure what section 7.6 of the PTX manual wants to tell me. Is it
    just a warning to be careful to use the cvt instructions correctly? Or what
    is the deal with the 16-bit machine, i.e. can two “half instructions” run on
    one ALU, as with the half-float pipeline on previous GPUs?

  2. It would be very useful if NVIDIA extended the instruction descriptions in
    section 7.7 of the PTX manual with instruction issue clock cycle
    counts. I had to experiment to see how rsqrt and sqrt actually
    behave; similarly, for mul24/mul I needed to cross-reference the CUDA
    programming manual.

  3. 3D texture fetch works. However, I was not able to come up with a CUDA
    array definition method that works correctly for all dimension sizes.
    Sometimes I just got lucky and the fetch returned the correct data. The
    problem seems to be computing the correct alignment for the 2D slices
    within the array. Can someone from NVIDIA shed some light on how to compute
    this? Or, even better, provide a runtime API method?

  4. Subroutine calls seem to work only sometimes. I probably create bugs when
    deciding whether a call is divergent or not, which is quite a pain. Does
    anyone have an idea how to determine this quickly (without looking through
    all the following instructions)? A related question to the NVIDIA folks:
    will the parameter model for the .func directive be changed to
    pass-by-address anytime soon (that would allow recursion)?

  5. Selective thread synchronization is very interesting. Will the sync group
    be exposed in the high-level __syncthreads() anytime soon?

  6. Atomic operations on floating point numbers are possible. Why is this not
    allowed in the high-level code? I do not have a sm_11 GPU. Has anyone with
    such a card tried the atomic floating point add?

  7. Hardware debugging using the trap: can someone from NVIDIA shed some light
    on how to catch the interrupt? How do I continue threads suspended with brkpt?

  8. Did I understand the .entry directive correctly: if I define a default
    thread block layout, I do not need to push the block dimensions onto the
    param stack (driver API)?

  9. How do .visible/.extern declared values enter the other scope at runtime?
    Do I need to concat the text or data segments somehow, or load them some
    other way?

  10. The .surf state space is currently not supported. Am I correct to assume
    that it will refer to framebuffer memory in the future? Will this include
    multisample buffers?

If you don’t understand what the above list is talking about, read the PTX
manual and do some assembler programming yourself first. Otherwise, any
comments are very welcome.


  1. Selective thread synchronization is very interesting. Will the sync group
    be exposed in the high-level __syncthreads() anytime soon?

You mean this works on the card?
How would bar.sync with different numbers behave?

I agree, that would be quite handy. Maybe we could have an architecture-specific list in the appendix?

What I’d like to know (see my last post):

Does PTX allow add-with-carry? Simon said the hardware has support for it, but I could not find anything in the manual, not even a wide add, which might be a possible workaround (and faster than nailing the MSB).


No. You can, however, add u16 or u32 values into a u64 register, mask out the carry, and convert it to a predicate with setp.u64 p1,1,0x…,reg . Then you can mask the carry-calc cases or move the carry on with another setp.or.u64 p2,a,0x…,reg,!p1 . Not that this is very elegant, though … :blink:


Umm, thanks for the idea. I don’t think this will fit my purposes best, but hey, I didn’t say anything about them ;-).
First, I think using u64 will incur a performance penalty, since we don’t have a 64-bit platform and ptxas will generate some (real) instructions to emulate it. Second, using u64 will cost one register (again, talking of 32 bit) just to save the carry bit. Last but not least, all that extract-carry-to-predicate and convert-to/from-u64 will most likely cost some cycles. I am not sure how many, but I guess it will be more than my current solution (nailing the MSB).


I’m not really an expert on PTX, but I’ll try and answer some of your questions.

Agreed, we’ll try and improve the detail here.

3D textures are not supported in the current release. As you’ve found, there are instructions for performing the fetch, but the API support for allocating 3D arrays is not complete.

I’m not sure on this; I’ll check with the compiler guys.

We could support real function calls and recursion by putting the parameters on the stack, but this would be slow because they would be in local memory.

There is currently no plan to expose the more flexible barrier synchronization at the CUDA level, although if you have compelling applications for this please let us know.

Atomic operations on floats are supported by the PTX abstraction, but not by any current hardware. sm_11 has atomics on integers only.

This is for future use.

I’ll have to check with the compiler team on these.

Yes. I’m not sure about support for multisample buffers.

Thanks Simon. Please also put this in the “Target ISA notes” section of the .atom instruction description.


Peter, here’s an answer to your first question from the PTX team:

Section 7.6 refers to the 16-bit integer types { .b16, .u16, .s16 } for the instructions { add, sub, mul, mad, div, rem, sad, min, max, set, setp, shr, shl, mov, ld, st, tex, cvt, … }.

The 16-bit PTX instructions generally read and write 16-bit PTX registers. PTX 16-bit registers use half the space of 32-bit registers.

Current 8-series GPUs support 16-bit registers, but future GPUs may implement them as 32-bit registers. Section 7.6 is trying to say that the semantics of the 16-bit PTX instructions are specified such that a GPU may promote 16-bit registers and instructions to 32 bits, which allows some results, such as shift right, to be machine-specific rather than strictly 16-bit. The 8-series GPUs execute 16-bit instructions with the same performance as 32-bit instructions, so the main value is the reduced register space.

Thanks Simon, that clears up the question.


Any way to get at SAD or MAD from CUDA (C compiler support)?

MAD can be had by writing a = a*b + c.

To get SAD and stuff, you can always write your own C compiler.