Programming the PTX virtual machine: many high-level issues resolved

I have been playing with the PTX virtual machine assembler all weekend, and a
number of things have come up. So here are my “Top 10” questions. I’d
appreciate it if you would share your experiences here.

  1. I am not sure what section 7.6 of the PTX manual wants to tell me. Is it
    just a warning to be careful to use the cvt instructions correctly? Or what
    is the deal with the 16-bit machine, i.e. can two “half instructions” run on
    one ALU, as with the half-float pipeline on previous GPUs?

  2. It would be very useful if NVIDIA extended the instruction descriptions in
    section 7.7 of the PTX manual with instruction issue clock cycle
    counts. I had to experiment to see how rsqrt and sqrt actually
    behave; similarly, for mul24/mul I needed to cross-reference the CUDA
    programming manual.

  3. 3D texture fetch works. However, I was not able to come up with a CUDA
    array definition method that works correctly for all dimension sizes.
    Sometimes I just got lucky and the fetch returned the correct data. The
    problem seems to be computing the correct alignment for the 2D slices
    within the array. Can someone from NVIDIA shed some light on how to compute
    this? Or, even better, provide a runtime API method?

  4. Subroutine calls seem to work only sometimes. I probably create bugs when
    deciding whether a call is divergent or not, which is quite a pain. Does
    anyone have an idea how to determine this quickly (without looking through
    all the following instructions)? A related question to the NVIDIA folks:
    will the parameter model for the .func directive be changed to
    pass-by-address anytime soon (that would allow recursion)?

  5. Selective thread synchronization is very interesting. Will the sync group
    be exposed in the high-level __syncthreads() anytime soon?

  6. Atomic operations on floating point numbers are possible. Why is this not
    allowed in the high-level code? I do not have a sm_11 GPU. Has anyone with
    such a card tried the atomic floating point add?

  7. Hardware debugging using the trap: can someone from NVIDIA shed some light
    on how to catch the interrupt? How do I continue threads suspended with brkpt?

  8. Did I understand the .entry directive correctly: if I define a default
    thread block layout, I do not need to push the block dimensions onto the
    param stack (driver API)?

  9. How do .visible/.extern declared values enter the other scope at runtime?
    Do I need to concat the text or data segments somehow, or load them some
    other way?

  10. The .surf state space is currently not supported. Am I correct to assume
    that it will refer to framebuffer memory in the future? Will this include
    multisample buffers?

If you don’t understand what the above list is talking about, read the PTX
manual and do some assembler programming yourself first. Otherwise, any
comments are very welcome.


  1. Selective thread synchronization is very interesting. Will the sync group
    be exposed in the high-level __syncthreads() anytime soon?

You mean this works on the card?
How would bar.sync with different numbers behave?

I agree, that would be quite handy. Maybe we could have an architecture-specific list in the appendix?

What I’d like to know (see my last post):

Does PTX allow add-with-carry? Simon said the hardware has support for it, but I could not find anything in the manual, not even a wide add, which might be a possible workaround (and faster than nailing the MSB).


No. You can, however, add u16 or u32 values into a u64 register, mask out the carry, and convert it to a predicate with setp.u64 p1,1,0x…,reg . Then you can mask the carry-calc cases or move the carry on with another setp.or.u64 p2,a,0x…,reg,!p1 . Not that this is very elegant, though … :blink:


Umm, thanks for the idea. I don’t think this will fit my purposes best, but hey, I didn’t say anything about them ;-).
First, I think using u64 will incur a performance penalty, since we don’t have a 64-bit platform and ptxas will generate some (real) instructions to emulate it. Second, using u64 will cost one register (again, talking of 32 bit) just to save the carry bit. Last but not least, all that extract-carry-to-predicate and convert-to/from-u64 will most likely cost some cycles. I am not sure how many, but I guess it will be more than my current solution (nailing the MSB).


I’m not really an expert on PTX, but I’ll try and answer some of your questions.

Agreed, we’ll try and improve the detail here.

3D textures are not supported in the current release. As you’ve found, there are instructions for performing the fetch, but the API support for allocating 3D arrays is not complete.

I’m not sure on this; I’ll check with the compiler guys.

We could support real function calls and recursion by putting the parameters on the stack, but this would be slow because they would be in local memory.

There is currently no plan to expose the more flexible barrier synchronization at the CUDA level, although if you have compelling applications for this please let us know.

Atomic operations on floats are supported by the PTX abstraction, but not by any current hardware. sm_11 has atomics on integers only.

This is for future use.

I’ll have to check with the compiler team on these.

Yes. I’m not sure about support for multisample buffers.

Thanks Simon. Please also put this in the “Target ISA notes” section of the .atom instruction description.


Peter, here’s an answer to your first question from the PTX team:

Section 7.6 refers to the 16-bit integer types { .b16, .u16, .s16 } for the instructions { add, sub, mul, mad, div, rem, sad, min, max, set, setp, shr, shl, mov, ld, st, tex, cvt, … }.

The 16-bit PTX instructions generally read and write 16-bit PTX registers. PTX 16-bit registers use half the space of 32-bit registers.

Current 8-series GPUs support 16-bit registers, but future GPUs may implement them as 32-bit registers. Section 7.6 is trying to say that the semantics of the 16-bit PTX instructions are specified such that a GPU may promote 16-bit registers and instructions to 32 bits, which allows some results, such as shift right, to be machine-specific rather than strictly 16-bit. The 8-series GPUs execute 16-bit instructions with the same performance as 32-bit instructions, so the main value is the reduced register space.

Thanks Simon, that clears up the question.


Any way to get at SAD or MAD from CUDA (C compiler support)?

MAD can be had by writing a = a*b + c.

To get SAD and stuff, you can always write your own C compiler.