I have been playing with the PTX virtual machine assembler all weekend, and
a number of things have come up. So here are my “Top 10” questions. I’d
appreciate it if you shared your experiences here.
I am not sure what section 7.6 of the PTX manual wants to tell me. Is it
just a warning to be careful to use the cvt instructions correctly? Or what
is the deal with the 16-bit machine, i.e., can two “half instructions” run on
one ALU, as with the half-float pipeline on previous GPUs?
It would be very useful if NVIDIA extended the instruction descriptions in
section 7.7 of the PTX manual with the instruction issue clock cycle
counts. I had to experiment to see how rsqrt and sqrt actually
behave. Similarly, for mul24/mul I needed to cross-reference to the CUDA
3D texture fetch works. However, I was not able to come up with a CUDA
array definition method that is correct for all dimension sizes. Sometimes
I just got lucky and the fetch returned the correct data. The problem seems
to be computing the correct alignment for the 2D slices within the array. Can
someone from NVIDIA shed some light on how to compute this? Or, even better,
provide a runtime API method?
Subroutine calls seem to work only sometimes. I probably introduce bugs when
deciding whether the call is divergent or not, which is quite a pain. Does anyone
have an idea how to determine this quickly (without looking through all
following instructions)? Related question for the NVIDIA folks: will the
parameter model for the .func directive be changed to pass-by-address
sometime soon (that would allow recursion)?
Selective thread synchronization is very interesting. Will the sync group
be exposed in the high-level __syncthreads() anytime soon?
Atomic operations on floating point numbers are possible. Why is this not
allowed in the high-level code? I do not have an sm_11 GPU. Has anyone with
such a card tried the atomic floating point add?
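
In case it helps anyone thinking about the atomic floating point question: where only an integer compare-and-swap is available, a float add can in principle be emulated with a CAS retry loop. Below is a minimal host-side C++ sketch of that idea, not GPU code and not tested on any sm_11 hardware. std::atomic is only a stand-in for a 32-bit atomic CAS, and the names atomic_add_f32, float_to_bits and bits_to_float are my own, not part of PTX or CUDA.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit word and back (memcpy avoids aliasing issues).
static uint32_t float_to_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Emulated atomic float add on a 32-bit word holding float bits.
// Returns the value that was stored before the add.
// Assumption: std::atomic<uint32_t> plays the role of a 32-bit atomic CAS.
float atomic_add_f32(std::atomic<uint32_t>& word, float value) {
    uint32_t observed = word.load();
    for (;;) {
        uint32_t desired = float_to_bits(bits_to_float(observed) + value);
        // On failure, compare_exchange_weak refreshes `observed` with the
        // current contents, so the sum is recomputed on the next iteration.
        if (word.compare_exchange_weak(observed, desired)) {
            return bits_to_float(observed);
        }
    }
}
```

The loop only retries when another thread changed the word between the read and the swap, so the result is the same as a real atomic add as long as all writers go through the same routine.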
Hardware debugging using the trap: can someone from NVIDIA shed some light
on how to catch the interrupt? How do I continue threads suspended with brkpt?
Did I understand the .entry directive correctly: if I define a default
thread block layout, I do not need to push the block dimensions on the
param stack (driver API)?
How do .visible/.extern declared values enter the other scope at runtime?
Do I need to concatenate the text or data segments somehow? Or load them
The .surf state space is not supported currently. Am I correct to assume that
this one will refer to framebuffer memory in the future? Will this include
If you don’t understand what the above list talks about, read the PTX manual
and do some assembler programming yourself first. Otherwise, any comments are