Problems with hand-made PTX and the driver API: difficulty getting a simple hand-written PTX program to work

Hello,

I’m trying to write some PTX code by hand, compile it to a .cubin, and have the driver API execute it directly, and I’m running into some problems getting off the ground. The reason I’m writing it this way instead of using inline PTX is that I think I can do significantly better than the nvcc compiler at allocating registers for what I’m trying to do.

Anyway, I’ve written an extremely simple PTX program, and I can’t figure out what’s wrong. I tried looking at the examples, and I’m not having any luck. My program works as follows: the main program (written in C) creates a single-element array, sets it to “777”, sets up my CUDA device/context/module, and copies that array to the device. From there, my kernel (the cubin file) is launched. The kernel is comically simple: one thread writes “2222” to the single-element array and exits. Once control returns to the host, the program copies the data back, prints the result (which is some random number, not “777” or “2222”), and finally deallocates the device memory.

I check for errors after every cuXXX call, and oddly enough none occur except when I free the device memory, where I get error code 700, which is CUDA_ERROR_LAUNCH_FAILED. If anyone can comment on what’s causing that, I’d appreciate it.

I’m using version 4.0 of the nvcc compiler, and I have the latest drivers installed.

Attached are the code for my main function, the PTX code, and my Makefile. The executable (simple_test) and the cubin file (kernel.cubin) are copied one directory level up, which is where I run the code.

I would greatly appreciate it if anyone can explain what I’m doing wrong.

Thanks.

kernel.ptx:

.version 2.3
.target sm_20
.address_size 32

.entry kernel( .param .u32 A)
{
	.reg .u32 %a;

	ld.param.u32 %a, [A];
	st.global.u32 [%a], 2222;
	exit;
}

main.cu:

#include <stdio.h>
#include <cuda.h>

int main(int argv, char* argc[])
{
	unsigned int h_a[1] = {777};

	//using driver api
	//setup something to put error codes in
	CUresult error;

	//initialize
	error = cuInit(0);
	printf("cuinit returns: %d\n",int(error));

	//get device 0, put the handle in cuDevice0
	CUdevice cuDevice0;
	error = cuDeviceGet(&cuDevice0, 0);
	printf("cuDeviceGet returns: %d\n",int(error));

	//create a context to run on device 0
	CUcontext cuContext0;
	error = cuCtxCreate(&cuContext0, 0, cuDevice0);
	printf("cuCtxCreate returns: %d\n",int(error));

	//create module from cubin file
	CUmodule cuModule0;
	error = cuModuleLoad(&cuModule0, "kernel.cubin");
	printf("cuModuleLoad returns: %d\n",int(error));

	//device memory pointers
	CUdeviceptr d_a;

	//allocate device memory
	unsigned int allocsize = 1*sizeof(unsigned int);
	error = cuMemAlloc(&d_a, allocsize);
	printf("allocating d_a returns: %d\n",int(error));

	//copy host arrays to the device
	error = cuMemcpyHtoD(d_a, h_a, allocsize);
	printf("copy data for a returns: %d\n",int(error));

	//get function handle
	CUfunction kernel0;
	error = cuModuleGetFunction(&kernel0, cuModule0, "kernel");
	printf("getting the function handle returns: %d\n",int(error));

	//launch kernel
	void *args[] = { &d_a };
	error = cuLaunchKernel(kernel0, 1, 1, 1, 1, 1, 1, 0, NULL, args, NULL);
	printf("kernel launch returns: %d\n",int(error));

	error = cuMemcpyDtoH(h_a, d_a, allocsize);
	printf("copy from device to host returns: %d\n",int(error));

	//free up device memory
	error = cuMemFree(d_a);
	printf("freeing up d_a returns: %d\n",int(error));

	//print out the result
	printf("h_a[1] = %u\n",h_a[1]);
	return 0;
}

Makefile:

NVCC_FLAGS = --ptxas-options -v
FLAGS =

all: simple_test

simple_test: main.o kernel.cubin
	gcc $(FLAGS) -L /usr/local/cuda/lib64 -lcuda -lcudart -o simple_test main.o
	cp simple_test ../.

main.o:	main.cu
	nvcc $(NVCC_FLAGS) -c -gencode arch=compute_20,code=sm_20 -o main.o main.cu

kernel.cubin:
	nvcc $(NVCC_FLAGS) -cubin -gencode arch=compute_20,code=sm_20 -o kernel.cubin kernel.ptx
	cp kernel.cubin ../.

clean:
	-rm *.o
	-rm simple_test
	-rm ../simple_test
	-rm kernel.cubin
	-rm ../kernel.cubin

.PHONY: all clean

Please note that register allocation is performed by PTXAS (or the equivalent JIT component inside the CUDA driver) when translating from PTX to machine code. It is possible that carefully crafting virtual register allocation at the PTX level may lead to better allocation of real registers in the machine code for some codes, but I would expect the difference in general to be at noise level. PTX code is generated in SSA style, which basically means that each time a new result is created it is assigned to a new register. Wikipedia has an article on SSA (static single assignment) which you may find interesting: Static single-assignment form - Wikipedia

There is a problem when nvcc re-associates expressions to increase ILP, which limits PTXAS’s ability to keep the register count sane. E.g., a + b + c + d + e + g + h + i will be re-associated as (a + b) + (c + d) + (e + g) + (h + i), creating four temporaries when only one is needed. I’ve found this to be the most annoying aspect of the compiler, as it becomes pathologically bad at points. Sometimes I outsmart nvcc by storing a temp that I generate into volatile shared memory, which imposes some order on the thing. The optimizer shouldn’t try to achieve an ILP past 4.0 anyway (or better yet, past some user-defined constant), as it becomes counter-productive.
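(For illustration, here is a minimal sketch of the volatile-shared-memory workaround described above. The kernel name, block size, and offsets are made up; the point is only that writing the running sum through a volatile shared slot and reading it back pins the evaluation order, so the compiler cannot re-associate the chain into parallel partial sums.)

// Hypothetical sketch, not production code: one slot per thread of a 128-thread block.
__global__ void accumulate(const int *in, int *out)
{
    __shared__ volatile int tmp[128];
    int t = threadIdx.x;

    int s = in[t];
    tmp[t] = s + in[t + 128];      // force this partial sum to be materialized
    s = tmp[t] + in[t + 256];      // read it back before the next add
    tmp[t] = s + in[t + 384];
    out[t] = tmp[t];
}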

kleboeuf - you should use launch_bounds on all your kernels. It does what -maxrregcount is supposed to do, except it actually works, and it lets you specify the launch bounds per kernel instead of per module. If you specify launch_bounds(128, 5), for instance, it will attempt to make a kernel with an occupancy of 5 blocks per SM, with 128 threads per block. This implies a register limit, which you can get from an occupancy table (I made one here: http://www.moderngpu.com/intro/workflow.html#Occupancy )

If you used too much shared memory to fit 5 blocks per SM (you only get 48k), then launch_bounds rounds down to 4, tries again, etc., and adjusts the register ceiling accordingly.
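(For reference, a minimal sketch of the launch_bounds syntax being described; the kernel name and body are placeholders.)

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// ask the compiler to cap register use so that 5 blocks of 128 threads
// can be resident per SM.
__global__ void __launch_bounds__(128, 5) scale_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}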

Inline PTX is a non-starter because it doesn’t support memory operations and can generate very funky code. I only use it for accessing instructions (like bfe, bfi, and prmt) that are either not available as CUDA intrinsics or are badly implemented (e.g., __byte_perm always generates an unnecessary and unwanted mask when used with a dynamic gather code). Inline PTX is not an effective way to improve register allocation. External PTX is intended more as a target for building architecture-portable compilers on top of. I wouldn’t try writing it by hand. Better to learn the quirks of nvcc and deal with them.

Where are you calling ptxas? You’d need to use that to convert your .ptx file into a .cubin before you can call your kernel through the CUDA driver. (Or you can use the cuModuleLoadData/cuModuleLoadDataEx driver API calls, if you want to pass the PTX directly.)
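(In case it helps, a minimal sketch of the cuModuleLoadDataEx route, with error checking omitted; the register-cap JIT option is just an example of what can be passed, not something you are required to use.)

#include <stdint.h>
#include <cuda.h>

/* ptx_source is assumed to hold the NUL-terminated text of kernel.ptx */
CUmodule load_ptx(const char *ptx_source)
{
    CUmodule mod;
    CUjit_option opt[] = { CU_JIT_MAX_REGISTERS };         /* optional JIT knob */
    void *optval[]     = { (void *)(uintptr_t)32 };        /* e.g. cap at 32 registers */

    cuModuleLoadDataEx(&mod, ptx_source, 1, opt, optval);  /* driver JITs the PTX */
    return mod;
}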

I assume we are talking about integer arithmetic here? To my knowledge we do not allow re-association for floating-point arithmetic as this is not an identity transformation due to the non-associativity of floating-point arithmetic. If you have a repro case where the effect you describe has a significant negative impact on an application’s performance, I would suggest filing a bug.

I am not aware of any failure of -maxrregcount to limit register use to what has been specified. Could you please elaborate in which way it does not work? Obviously -maxrregcount provides only a simple control mechanism with per-compilation-unit granularity, while launch_bounds provides a more sophisticated control as it works on a per-kernel basis, and is also more flexible when used across different architectures as it does not specify the register count explicitly. Historically the simple -maxrregcount control came first, and at a later stage we added launch_bounds as a more flexible mechanism. In my experience, these days it is usually sufficient to let the compiler pick a register count target to get good performance as the necessary heuristics have improved a lot from the early days of CUDA. I would agree that launch_bounds is the preferred way of controlling register usage where necessary at this point in time.

I am not sure what you mean by “PTX […] doesn’t support memory operations”. Could you elaborate / clarify, please? I wrote various little example apps to test drive inline PTX before this feature was rolled out. One example involved a CORDIC implementation accessing a table of constants using the LD instruction. I’d be happy to post it if there is interest.

The __byte_perm() intrinsic is defined as a pure permutation intrinsic for maximum compatibility with similar intrinsics on other platforms and guards against accidental invocation of the special sign-extension mode of the underlying instruction by masking these bits in the control word. This requires a simple AND with 0x7777 which gets optimized away if the control word is a constant, but it does create a bit of overhead if the control word is generated dynamically. So far I have not encountered an application where this has significant performance impact, but of course such applications may exist, in which case PTX inline assembly can be used to access the PRMT instruction directly.
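(For completeness, a small sketch of going straight to PRMT via inline PTX when the control word is dynamic; the wrapper name is made up.)

// Raw PRMT: no masking of the control word, unlike __byte_perm().
__device__ unsigned int prmt_raw(unsigned int a, unsigned int b, unsigned int sel)
{
    unsigned int r;
    asm("prmt.b32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(sel));
    return r;
}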

Any use of an inline PTX memory instruction results in the entire module using generic addressing. Really bad when you just want to use a shared mem address as an operand. I tried it a lot of ways when writing my sort and it always caused a > 20% increase in total instruction count. Also if I recall, it caused local memory to be used when doing things like red.shared (which I don’t believe is accessible from an intrinsic). I don’t have the results on hand now, but I do distinctly recall some seriously deviant code when using inline PTX with mem operations.

As far as register runaway due to re-association, next time I encounter it I’ll send you the source.

In C and C++ “a pointer is a pointer is a pointer”, i.e. there is no notion of different memory spaces, as a pointer can point anywhere. This causes trouble when operating on a device that actually does have different memory spaces, hence the compiler warning on sm_1x devices: “can not figure out what pointer is pointing to, assuming global”. For sm_2x we therefore introduced generic addressing in the hardware to match the high-level language concept of generic pointers.

For inline assembly on sm_2x, all pointers bound by the constraints of an asm() statement are generic pointers, and inside the inline PTX generic loads (LD without a state-space suffix) have to be used. For PTX inline assembly on sm_1x, memory-space-specific loads can be used, but of course the programmer has to ensure that a pointer to the appropriate memory space is bound to the constraint. This is not always possible, and in that sense the use of memory accesses from inside inline PTX is somewhat limited.
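(A minimal sketch of the sm_2x case just described; the function name is made up, and the "l" constraint assumes a 64-bit build; "r" would be used for 32-bit pointers.)

// The pointer bound to the constraint is generic, so the load inside the
// asm() carries no state-space suffix (plain ld, not ld.global or ld.shared).
__device__ unsigned int load_generic(const unsigned int *p)
{
    unsigned int v;
    asm("ld.u32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}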

For any functional issues / bugs encountered with the compiler, as well as significant performance problems, please file bugs via the registered developer website. I am well aware that it takes effort to construct simple, self-contained repro cases and type up a bug report, so thank you in advance for your help.

First off, thank you for all the replies.

I realize that writing PTX code by hand is generally not the best way to do things, but I feel that it’s the only way to do what I want at this point. To clarify:

I’m writing in PTX for two reasons actually. The first has to do with keeping the register count down like I mentioned before. The second is that I’m working on some multiprecision arithmetic problems, and need to use addition with carry, which has no intrinsic. I tried using inline PTX, and this did not work as expected. As SeanB said, it definitely generated some ‘very funky’ code. Specifically, it looked like a lot more registers were being used than there should have been. I haven’t saved my results from that attempt, but it definitely wasn’t working the way I intended, and I eventually gave up.
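(For reference, a minimal sketch of what an inline-PTX carry chain looks like for two 32-bit limbs; the function name and signature are illustrative only, not the actual code from my attempt.)

// add.cc sets the carry flag, addc consumes it.
__device__ void add64(unsigned int a_lo, unsigned int a_hi,
                      unsigned int b_lo, unsigned int b_hi,
                      unsigned int *r_lo, unsigned int *r_hi)
{
    unsigned int lo, hi;
    asm("add.cc.u32  %0, %2, %4;\n\t"   // low limb, generates carry-out
        "addc.u32    %1, %3, %5;"       // high limb, consumes carry-in
        : "=r"(lo), "=r"(hi)
        : "r"(a_lo), "r"(a_hi), "r"(b_lo), "r"(b_hi));
    *r_lo = lo;
    *r_hi = hi;
}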

I also agree with what SeanB said about PTXAS’s ability to keep the register count down. I tried using launch bounds and maxrregcount to keep this under control, which had the ultimate effect of forcing things into local memory. I eventually found the “volatile trick” (see [forums.nvidia.com]), which went a long way toward fixing that problem.

Anyway, to recap, the reasons I want to write some code directly in PTX are:

  1. I think I can get better register use writing it by hand, even after considering the volatile trick, maxrregcount, and launch bounds
  2. I need access to some instructions that have no intrinsic (add with carry in / carry out)

The problem I am having at this point is that my very simple PTX program does not work. I suspect I’m missing something really simple, and was hoping someone might be able to point it out.

As a last resort, today I will try the driver API vector addition example from the SDK and remove little bits at a time until I get what I need. If anyone gets a chance to take a look at my code and finds the problem before then, that would be great, though!

Thanks again for your replies!

Norbert,

I have not followed the whole discussion and have no particular need for inline PTX examples, but just out of technical curiosity, I would absolutely love to see this.

(Is it just an artificial example, or is there any real application for CORDIC on a GPU?)

Thought I would update anyone that’s following this on my progress:

I’ve been taking a closer look at things, and I think I’ve found the problem, or at least a problem with what I was trying to do.

In my PTX code, I am treating the address passed to my function through the .param state space as a 32-bit value. I now believe that this is incorrect. Back in the C code calling my PTX function, the actual type being passed is ‘CUdeviceptr’, which I originally assumed to be a 32-bit value. According to the API reference (page 176, heading 4.27.2.15): typedef unsigned int CUdeviceptr, and an unsigned int is 32 bits on my (and probably everyone’s) host machine architecture.

I dug deeper and actually checked the header file (cuda.h) containing the definition for CUdeviceptr that the compiler was using. Here’s what I get:

/**
 * CUDA device pointer
 */
#if __CUDA_API_VERSION >= 3020
#if defined(__x86_64) || defined(AMD64) || defined(_M_AMD64)
typedef unsigned long long CUdeviceptr;
#else
typedef unsigned int CUdeviceptr;
#endif

So in other words I’m basically being forced to use 64-bit addressing, which is going to end up eating a few more of my precious registers.

While I confirm that this is actually what is going on, I was wondering if anyone knows of a way to force 32-bit addressing on a 64-bit machine… should I start a new thread?

Got it!

Just posting this in case anyone comes to this thread looking for an answer.

I updated my PTX code to use 64-bit values for my addresses and it worked. Also, there was a small bug in the host code that did not print the correct array element. Below is the updated PTX code, and the corrected host code stub:

working PTX code:

.version 2.3
.target sm_20
.address_size 64

.entry kernel( .param .u64 A)
{
	.reg .u64 %a;

	ld.param.u64 %a, [A];
	st.global.u32 [%a+0], 2222;	// 32-bit store, matching the single unsigned int allocated on the host
	exit;
}

host code fix for main.cu, all the way at the bottom of the code posted earlier:

(...)

	//print out the result
	printf("h_a[0] = %u\n", h_a[0]);
	return 0;
}

I’m still left wondering if there is a way to force 32-bit addressing, but I guess that is the topic of another thread.

Thanks for bringing the documentation issue regarding CUdeviceptr to our attention. I have filed a bug to get this fixed for future CUDA releases.

Hmm… Looks like I’m late for this discussion. But I just can’t resist making a little ad here.

If you want to hand tune the native ISA instructions, check out asfermi :)

kleboeuf,

You don’t have to use 64-bit addressing if you know the device pointer is actually a device pointer (and not a pointer to device-mapped host memory). If that’s the case, the pointer will be a 32-bit pointer, but per the CUDA/PTX docs, the pointer needs to be aligned to the size of a host pointer (in your case, 64 bits / 8 bytes). So, you could just leave your parameter defined as a .u64 and use an ‘ld.param.u32’ to load the value, or you could define the parameter as ‘.param .align 8 .u32 A’.

Important note: I don’t think this will work on the new Teslas (the 6 GB models), and even if it does, it’s not going to be future-proof.