PTX Emulator Released

A while ago I posted about a CUDA/PTX emulator that my research group developed last semester.

http://forums.nvidia.com/index.php?showtop…amp;hl=emulator

We finally released the source code today under the BSD license. The emulator implements the PTX virtual machine and executes programs using a single CPU thread, one instruction at a time. We have verified that all of the CUDA SDK examples from 2.1 and 2.2 run on the emulator, except for the programs that use the driver-level API, which we do not support. Like Barra and GPGPU-sim, we provide a set of libraries that replace libcudart.so, so you should be able to link any CUDA program against the emulator and have it transparently replace the NVIDIA driver and runtime.

The emulator has hooks for trace generators that can examine the complete system state after each instruction is executed. Several trace generators that record all memory traffic and inter-thread communication through shared memory are already in place, and it should be fairly easy to add others.

The code base also includes a set of program analysis tools for PTX that let you generate control flow graphs, dominator trees, and dataflow graphs, and convert PTX to pure SSA form.

The entire project can be downloaded here http://gpuocelot.googlecode.com/files/ocelot-0.4.50.tar.gz . API documentation can be found here: http://www.gdiamos.net/classes/translator/api/index.html . We have a mailing list here in case you would like to contribute ideas or hear about updates: http://groups.google.com/group/gpuocelot . Finally, we have put together a quick tutorial for running a CUDA program on the emulator: http://code.google.com/p/gpuocelot/wiki/Installation .

We plan to continue developing this project, with the goal of eventually having a complete compilation chain from CUDA to both x86 CPUs and NVIDIA GPUs, along with analysis tools supporting each path.

Hopefully people here find this useful.

So what would happen if I run a CUDA kernel with invalid memory accesses through Ocelot? Will it crash, indicate bad reads in Valgrind, or what?

Excellent! This is indeed a great new resource.

Can you explain a little bit how warp emulation works? From your install docs:

Does this mean the emulator warp size is configurable? Or is always 1 in your current version?

Intermediate-level CUDA programming often relies on implicit (free) warp synchronization, but this causes headaches in nvcc's emulation mode.

I think he wrote in an older post that it was to be configurable, so that you could make sure your code worked with any warp size (in case it ever changes from 32). I don’t know if this is the case for sure though…just an educated guess.

In the current version it is set equal to the CTA/block size, so it depends on the number of threads that you launch. You can think of it as being set to the maximum number of threads per CTA (512).

In future versions we are at least considering allowing the optimizer to determine the warp size at runtime, in which case it could change each time you execute the program or even be different for different CTAs in the same program.

Right now we detect invalid memory accesses to shared/constant/local memory but not global memory, so you will get an error message for shared/constant/local memory and a segfault for global memory. We are working on detecting invalid global memory accesses as well.

You should come to the GPU Technology Conference–I’ll buy you a beer for this. :)

Thanks, I might just hold you to that :) . I plan on presenting a poster on either this or another project at the research summit.

If people think that it is important enough, I can try to implement detection of all memory errors in the next week or so. Out-of-bounds accesses should be pretty easy, but detecting uninitialized values as Valgrind does will be much harder. I’ve added an issue for it here: http://code.google.com/p/gpuocelot/issues/detail?id=10

Otherwise I was planning on working on code analysis for the llvm backend over the next few weeks…

Many (if not most) device memory errors come from accidentally using host pointers on the device. If you could detect those cases and distinguish them from a generic memory access failure, it’d be a great debugging clue. It may not be completely possible, of course (since there’s no way to look at an address and just know it’s a host value), but perhaps you could flag all host memory regions that were used for cudaMemcpy() or cudaMallocHost() and, on the segfault, print “this memory was used previously for host access, you may be using a host pointer” or some such hint. This may be impractical, but maybe it’ll stick in your head if it comes up somehow.

So this is now implemented… Actually it was very easy to do. Our library already intercepts every cudaMalloc call and keeps an entry associated with a specific device so that it can tell which memory blocks were allocated by which CUDA contexts. This is needed to ensure that one context cannot deallocate memory allocated by another context. All that was needed was to propagate this information into the emulator and have it look up addresses in the memory allocation table…

This is in the head revision now. I still need to add some more debugging messages to make it useful and run it through all the regression tests…

Two beers, then!

(I hope people start banging on this a lot, because this sounds exactly like what deviceemu should be.)

bah! double post

awwww!! Why now! I won’t have any time to play with this for more than a week :(

You can be sure that as soon as I’ve got the free time, I’ll be seeing if all of HOOMD’s unit tests pass when running with Ocelot :)

aw, Mister (Doctor?) Anderson, you know I’ll buy you a beer anytime :)

Alright so the memory checking functionality seems to be working correctly. I actually found a few SDK bugs thanks to this:

  1. in scanLargeArray_kernel.cu in SDK 2.1 the following section starting on line 244:
__global__ void uniformAdd(float *g_data, 
						   float *uniforms, 
						   int n, 
						   int blockOffset, 
						   int baseIndex)
{
	__shared__ float uni;
	if (threadIdx.x == 0)
		uni = uniforms[blockIdx.x + blockOffset];

	unsigned int address = __mul24(blockIdx.x, (blockDim.x << 1)) + baseIndex + threadIdx.x; 
	__syncthreads();

	// note two adds per thread
	g_data[address]			  += uni;
	g_data[address + blockDim.x] += (threadIdx.x + blockDim.x < n) * uni;
}

writes off the end of the array on the last line. Ocelot detects this as follows:

==Ocelot== Emulator failed to run kernel "_Z10uniformAddPfS_iii" with exception: 
==Ocelot== [PC 33] [thread 16] [cta 0] ld.global.f32 %f5, [%rd11 + 0] - Global memory address 0x1acfbe0 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot==  Global Variable Allocations
==Ocelot==   [0x4dd123] -  [0x4dd123] (0 bytes) (shared)
==Ocelot==	****0x1acfbe0****
==Ocelot== 
==Ocelot==  Device Memory Allocations
==Ocelot==   [0x1aa00e0] - [0x1aa0130] (80 bytes) (global)
==Ocelot==   [0x1abc350] - [0x1ac5f90] (40000 bytes) (global)
==Ocelot==   [0x1ac5fa0] -  [0x1acfbe0] (40000 bytes) (global)
==Ocelot==	****0x1acfbe0****
==Ocelot==

Changing it to

// note two adds per thread
	g_data[address]			  += uni;
	if(threadIdx.x + blockDim.x < n) g_data[address + blockDim.x] += uni;

fixes the problem.

  2. Similarly, in OceanFFT, in the file oceanFFT_kernel.cu, starting on line 88:
float2 h0_k = h0[i];
	float2 h0_mk = h0[(((height-1)-y)*width)+x];
	float2 h_tilda = complex_add( complex_mult(h0_k, complex_exp(w * t)),
								  complex_mult(conjugate(h0_mk), complex_exp(-w * t)) );
	// output frequency-space complex values
	if ((x < width) && (y < height)) {
		ht[i] = h_tilda;
	}

This reads in invalid data on the first two lines, causing the Ocelot error message:

==Ocelot== Emulator failed to run kernel "_Z13c2c_radix4_spifiPvS_i11tfStride_st" with exception: 
==Ocelot== [PC 41] [thread 63] [cta 0] ld.global.v2.f32 {%f7, %f8}, [%rd14 + 0] - Global memory address 0x7f7fe8161408 of size 8 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot==  Nearby Global Variable Allocations
==Ocelot==   [0x7f7fedf033c8] -  [0x7f7fedf033c8] (0 bytes) (shared)
==Ocelot==	****0x7f7fe8161408****
==Ocelot==   [0x7f7fee1fe400] - [0x7f7fee1fe588] (392 bytes) (const)
==Ocelot==   [0x7f7fee1fe5c0] - [0x7f7fee1fe748] (392 bytes) (const)
==Ocelot==   [0x7f7fee1fe780] - [0x7f7fee1fe908] (392 bytes) (const)
==Ocelot==   [0x7f7fee1fe940] - [0x7f7fee1feac8] (392 bytes) (const)
==Ocelot==   [0x7f7fee1feb00] - [0x7f7fee1fec88] (392 bytes) (const)
==Ocelot== 
==Ocelot==  Nearby Device Memory Allocations
==Ocelot==   [0x27d6160] - [0x2816160] (262144 bytes) (global)
==Ocelot==   [0x7f7fe805f010] - [0x7f7fe80df010] (524288 bytes) (global)
==Ocelot==   [0x7f7fe8121010] -  [0x7f7fe8161010] (262144 bytes) (global)
==Ocelot==	****0x7f7fe8161408****
==Ocelot==   [0x7f7fe8162010] - [0x7f7fe81a2010] (262144 bytes) (global)
==Ocelot==

It is corrected by moving the entire section inside the guard:

if ((x < width) && (y < height)) {
		float2 h0_k = h0[i];
		float2 h0_mk = h0[(((height-1)-y)*width)+x];
		float2 h_tilda = complex_add( complex_mult(h0_k, complex_exp(w * t)),
									  complex_mult(conjugate(h0_mk), complex_exp(-w * t)) );
		// output frequency-space complex values
		ht[i] = h_tilda;
	}

Running the 2.2 SDK now… Confirmed that the same errors exist in SDK 2.2, but luckily there aren’t any others…

Also, I wrote the following example that shows how the memory checker works:

#include <cstring>
#include <string>

__global__ void init(int* data)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	data[tid] = tid;
}

void wrongMemcpyDirection()
{
	int* hostA = new int[128];
	memset(hostA, 0, sizeof(int)*128);
	int* deviceA;
	cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
	// bug: host pointer used as the destination of a host-to-device copy
	cudaMemcpy( hostA, deviceA, sizeof(int)*128, cudaMemcpyHostToDevice );
	cudaFree( deviceA );
	delete[] hostA;
}

void hostPointerOnDevice()
{
	int* hostA = new int[128];
	memset(hostA, 0, sizeof(int)*128);
	int* deviceA;
	cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
	cudaMemcpy( deviceA, hostA, sizeof(int)*128, cudaMemcpyHostToDevice );
	// bug: host pointer passed to the kernel instead of deviceA
	init<<< 4, 32 >>>( hostA );
	cudaFree( deviceA );
	delete[] hostA;
}

void bufferOverrun()
{
	int* hostA = new int[128];
	memset(hostA, 0, sizeof(int)*128);
	int* deviceA;
	cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
	cudaMemcpy( deviceA, hostA, sizeof(int)*128, cudaMemcpyHostToDevice );
	// bug: 4 * 33 = 132 threads write into a 128-element array
	init<<< 4, 33 >>>( deviceA );
	cudaFree( deviceA );
	delete[] hostA;
}

int main( int argc, char** argv )
{
	if( argc > 1 )
	{
		if( std::string(argv[1]) == "0" )
		{
			wrongMemcpyDirection();
		}
		else if( std::string(argv[1]) == "1" )
		{
			hostPointerOnDevice();
		}
		else if( std::string(argv[1]) == "2" )
		{
			bufferOverrun();
		}
	}
	else
	{
		wrongMemcpyDirection();
	}
	return 0;
}

When run it generates the following outputs:

normal@phenom:~/checkout/gpuocelot/ocelot$ nvcc --cuda memoryErrors.cu 
normal@phenom:~/checkout/gpuocelot/ocelot$ g++ -o memoryErrors memoryErrors.cu.cpp -L .libs/ -lcudart -lOcelotIr -lOcelotParser -lOcelotExecutive -lOcelotTrace -lOcelotAnalysis -lhydrazine
normal@phenom:~/checkout/gpuocelot/ocelot$ LD_LIBRARY_PATH=.libs ./memoryErrors 0
terminate called after throwing an instance of 'hydrazine::Exception'
  what():  Invalid destination 0x1d4da50 ( 512bytes) in host to device memcpy.
Device 0 : Ocelot PTX Emulator
 Nearby Global Variable Allocations
  No Allocations.
Nearby Device Memory Allocations
  [0x1d4dd20] -  [0x1d4df20] (512 bytes) (global)
   ****0x1d4da50****
Aborted

normal@phenom:~/checkout/gpuocelot/ocelot$ LD_LIBRARY_PATH=.libs ./memoryErrors 1
==Ocelot== Emulator failed to run kernel "_Z4initPi" with exception: 
==Ocelot== [PC 9] [thread 0] [cta 0] st.global.s32 [%rd4 + 0], %r3 - Global memory address 0x1921a50 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot==  Nearby Global Variable Allocations
==Ocelot==   No Allocations.
==Ocelot== 
==Ocelot==  Nearby Device Memory Allocations
==Ocelot==   [0x1921d20] -  [0x1921f20] (512 bytes) (global)
==Ocelot==	****0x1921a50****
==Ocelot== 

normal@phenom:~/checkout/gpuocelot/ocelot$ LD_LIBRARY_PATH=.libs ./memoryErrors 2
==Ocelot== Emulator failed to run kernel "_Z4initPi" with exception: 
==Ocelot== [PC 9] [thread 29] [cta 3] st.global.s32 [%rd4 + 0], %r3 - Global memory address 0xf52f20 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot==  Nearby Global Variable Allocations
==Ocelot==   No Allocations.
==Ocelot== 
==Ocelot==  Nearby Device Memory Allocations
==Ocelot==   [0xf52d20] -  [0xf52f20] (512 bytes) (global)
==Ocelot==	****0xf52f20****
==Ocelot==

Heh, neat finds.

I got around to building this yesterday, but I’m getting weird errors on both the regression tests and my app.

Do you have a guess as to what is happening here in the Reduction sample:
Regression: what(): ==Ocelot== For parameter __cudaparm__Z12reduce6_sm10IiLj128ELb1EEvPT_S1_j_g_idata PTX size 8 does not match specified size 4
Regular SDK: reduction: ocelot/executive/implementation/CooperativeThreadArray.cpp:1090: ir::PTXU32 executive::CooperativeThreadArray::operandAsU32(int, const ir::PTXOperand&): Assertion `0 == “invalid address mode of operand”’ failed.

The regression and regular SDK deviceQuery programs happily report the existence of an Ocelot device on my system. Any ideas as to what is happening here?

Ubuntu 8.10, Boost 1.39, GNU 4.3.2

I’m using the most recently posted Ocelot

Thanks much,
Ben

The short answer is that this seems to be a bug with Ocelot on 32-bit systems. I assume that you are running 32-bit Ubuntu…

So to interpret what is happening here:

First of all, all of the names are mangled by the NVIDIA compiler to avoid namespace collisions, so __cudaparm__Z12reduce6_sm10IiLj128ELb1EEvPT_S1_j_g_idata refers to the variable g_idata in the kernel reduce6_sm10.

The error means that the CUDA runtime tried to set up a parameter (g_idata in this case) with a variable of size 4 (e.g. sizeof(g_idata) == 4), but the variable was declared in PTX to be of size 8. This is the declaration in PTX from my copy of reduction:

.param .u64 __cudaparm__Z12reduce6_sm10IiLj512EEvPT_S1_j_g_idata;

This is a problem because the pointer size is 4 bytes on 32-bit systems and 8 bytes on 64-bit systems… I think that the correct behavior should be to set the lower 32 bits and zero out the upper 32.

Let me try to find a 32-bit system and see if I can reproduce this…

I was able to reproduce this bug on a 32-bit ubuntu system. See the following change for a fix:

http://code.google.com/p/gpuocelot/source/detail?r=50

I’ll post an updated package after running through all of the regression tests…

Alright! Sounds great. I’ll try out the patches for now, but I’ll be switching to a 64-bit system anyway, so it’s not a pressing problem. I was just using my home machine for filesystem convenience :).

Thanks for the response (and patch),
Ben