Ability to run PTX directly

I have been writing some CUDA microbenchmarks using PTX to work around some of the compiler's optimizations. We started with Ocelot (http://code.google.com/p/gpuocelot/) and extended the CUDA Runtime API to support executing inlined PTX.

I thought I would post this to see whether the interface would be useful to people here.

An example of executing a kernel in PTX is shown below:

// Requires Ocelot's public interface header (which declares
// ocelot::registerPTXModule and ocelot::getKernelPointer) in addition to
// the standard headers below.
#include <cuda_runtime.h>
#include <sstream>
#include <string>

long long unsigned int atomic_increment( long long unsigned int iterations,
	unsigned int threads, unsigned int ctas )
{
	long long unsigned int* counter;

	cudaMalloc( (void**)&counter, sizeof(long long unsigned int) );
	cudaMemset( counter, 0, sizeof(long long unsigned int) );

	cudaConfigureCall( dim3( ctas, 1, 1 ), dim3( threads, 1, 1 ), 0, 0 );
	cudaSetupArgument( &iterations, sizeof(long long unsigned int), 0 );
	cudaSetupArgument( &counter, sizeof(long long unsigned int),
		sizeof(long long unsigned int) );

	// The PTX kernel: each thread atomically increments the global counter
	// until the value it reads back reaches the requested iteration count.
	std::string program = ".version 1.4\n";
	program += ".target sm_10, map_f64_to_f32\n\n";
	program += ".entry atomic_increment( .param .u64 iterations, .param .u64 counter )\n";
	program += "{\n";
	program += " .reg .u64 %r<3>;\n";
	program += " .reg .pred %p0;\n";
	program += " Entry:\n";
	program += "  ld.param.u64 %r0, [iterations];\n";
	program += "  mov.u64 %r1, 0;\n";
	program += "  ld.param.u64 %r2, [counter];\n";
	program += "  setp.ge.u64 %p0, %r1, %r0;\n";
	program += "  @%p0 bra Exit;\n";
	program += " Loop_Begin:\n";
	program += "  atom.global.add.u64 %r1, [%r2], 1;\n";
	program += "  setp.lt.u64 %p0, %r1, %r0;\n";
	program += "  @%p0 bra Loop_Begin;\n";
	program += " Exit:\n";
	program += "  exit;\n";
	program += "}\n";

	// Register the PTX with Ocelot as module "atomics" and launch the
	// kernel through the normal runtime API.
	std::stringstream stream( program );

	ocelot::registerPTXModule( stream, "atomics" );
	const char* kernelPointer = ocelot::getKernelPointer(
		"atomic_increment", "atomics" );
	cudaLaunch( kernelPointer );

	long long unsigned int counterValue;

	cudaMemcpy( &counterValue, counter, sizeof(long long unsigned int),
		cudaMemcpyDeviceToHost );
	cudaFree( counter );

	return counterValue;
}
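For reference, a minimal driver for this function might look like the following sketch; the iteration, thread, and CTA counts are just illustrative values, not part of the benchmark itself:

#include <iostream>

int main()
{
	// Illustrative launch configuration only; any values will work.
	long long unsigned int iterations = 1 << 20;
	unsigned int threads = 128;
	unsigned int ctas = 30;

	long long unsigned int result = atomic_increment( iterations, threads, ctas );
	std::cout << "final counter value: " << result << std::endl;

	return 0;
}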

This will probably be included in a future release of Ocelot. I was wondering whether anyone would find it useful, or has any suggestions on how to improve the interface.

Is it then possible to have this kernel executed either in some CPU emulation mode or on an actual CUDA device?

If so, this would be ocelent! ;)

That is the end goal, but unfortunately we aren't quite there yet. We have about 60-70% of the applications in the CUDA SDK running on GPUs, but there are still a few bugs to smooth out in some of the more complex CUDA features and in some of the optimizations that we apply to PTX programs before executing them on GPUs. Every application that we have tested (including Thrust and several other large apps) currently runs on the emulator. The idea is to simply export the different backends as different CUDA devices; a sketch of selecting between them through the standard runtime API follows the listing below. For example, this is DeviceQuery running on my machine:

There are 4 devices supporting CUDA

Device 0: "Ocelot PTX Emulator"
  Major revision number:                          1
  Minor revision number:                          3
  Total amount of global memory:                  944578560 bytes
  Number of multiprocessors:                      8
  Number of cores:                                64
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  944578560
  Warp size:                                      512
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Maximum memory pitch:                           944578560 bytes
  Texture alignment:                              1 bytes
  Clock rate:                                     2.00 GHz
  Concurrent copy and execution:                  No

Device 1: "Ocelot LLVM JIT-Compiler"
  Major revision number:                          1
  Minor revision number:                          3
  Total amount of global memory:                  944578560 bytes
  Number of multiprocessors:                      8
  Number of cores:                                64
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  944578560
  Warp size:                                      1
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Maximum memory pitch:                           944578560 bytes
  Texture alignment:                              1 bytes
  Clock rate:                                     2.00 GHz
  Concurrent copy and execution:                  No

Device 2: "Tesla C1060"
  Major revision number:                          1
  Minor revision number:                          3
  Total amount of global memory:                  4294705152 bytes
  Number of multiprocessors:                      30
  Number of cores:                                240
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  16384
  Warp size:                                      32
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Maximum memory pitch:                           262144 bytes
  Texture alignment:                              256 bytes
  Clock rate:                                     1.30 GHz
  Concurrent copy and execution:                  No

Device 3: "GeForce 8800 GTS"
  Major revision number:                          1
  Minor revision number:                          0
  Total amount of global memory:                  334823424 bytes
  Number of multiprocessors:                      12
  Number of cores:                                96
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  8192
  Warp size:                                      32
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Maximum memory pitch:                           262144 bytes
  Texture alignment:                              256 bytes
  Clock rate:                                     1.30 GHz
  Concurrent copy and execution:                  No
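Since each backend is exported as an ordinary CUDA device, an application should be able to pick the emulator, the LLVM JIT, or a hardware GPU with the standard runtime calls. A minimal sketch is below; selectDeviceByName is a hypothetical helper, and the name strings are just the ones reported in the listing above:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Hypothetical helper: select a CUDA device by the name reported in
// DeviceQuery, e.g. "Ocelot PTX Emulator" or "Tesla C1060".
int selectDeviceByName( const char* name )
{
	int count = 0;
	cudaGetDeviceCount( &count );
	for( int device = 0; device != count; ++device )
	{
		cudaDeviceProp properties;
		cudaGetDeviceProperties( &properties, device );
		if( std::strcmp( properties.name, name ) == 0 )
		{
			cudaSetDevice( device );
			return device;
		}
	}
	std::printf( "No device named '%s' was found.\n", name );
	return -1;
}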

No guarantees, but we are pushing to finish validating our implementation by mid-December or so.