The patch worked wonders. Most of the regression tests are running now, so that is good.
On a small app of mine, I’m getting these errors:
queue: ocelot/executive/implementation/CooperativeThreadArray.cpp:1090: ir::PTXU32 executive::CooperativeThreadArray::operandAsU32(int, const ir::PTXOperand&): Assertion `0 == "invalid address mode of operand"' failed.
This doesn’t seem to be an issue with your 32-bit system. This error means that an instruction tried to read a register whose type was not bound. This really should not happen under any circumstances and probably indicates a bug in our PTX parser or a register not getting set during allocation.
I tried downloading and running the SGEMM function in the LINPACK library that you posted, but it ran through my copy without incident. If you want to send me a copy of your code I could take a look at it. If you don’t want to do that, you might want to try looking at the debugging guide for Ocelot… http://code.google.com/p/gpuocelot/wiki/Debugging
Also, if you send me the object files for your app, I should be able to pull out all of the PTX source and watch the device function calls from that… which should be enough to figure out what is wrong.
I finally got around to building on a different system. The error goes away on a 64-bit Ubuntu 8.10 machine.
If this could still be useful for you, I can package up the code and send it your way sometime tomorrow. It’s not the prettiest stuff in the world, so keep a trashcan nearby if you have a weak stomach :).
Sounds nice… but like everything else out there, it’s Runtime API only… which essentially makes it useless for production environments where the Runtime API is unacceptable (for various reasons: linking, an additional DLL to redistribute (which adds another security issue), etc.).
I’m curious why most of these emulators don’t implement the Driver API, and build the Runtime API on top of the Driver API… just seems so obvious to me…
Nice work in any case, invaluable to those who can actually spend time maintaining an additional Runtime API codebase for debugging purposes. (Oh how I envy you)
The main reason was that building on top of the Driver API would require us to either reimplement the Runtime API ourselves or use NVIDIA’s. We wanted to be able to statically link our programs, so we couldn’t use NVIDIA’s implementation since it only ships as a .so. A secondary reason is that we eventually want to export a dynamic CUDA device that selects either an x86 target or a GPU target depending on the characteristics of the kernel being launched. To do this, we need to be able to launch kernels on a GPU from within our runtime implementation, and we plan to use the Driver API internally for this.
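To make that last part concrete, here is a rough sketch of what the GPU path of such a dynamic device could look like. None of this is Ocelot code: the Target enum, the selection heuristic, and launch() are placeholders, and only the Driver API calls themselves (cuInit, cuCtxCreate, cuModuleLoadData, cuModuleGetFunction, cuLaunchGrid) are real.

// Illustrative only, not Ocelot code: the shape of a per-kernel target
// decision, with the GPU path going through the Driver API.
#include <cuda.h>      // NVIDIA Driver API header
#include <cstddef>
#include <string>

enum Target { X86Emulator, NVIDIAGPU };

// Placeholder heuristic; the real selection criteria are still undecided.
Target selectTarget(std::size_t totalThreads)
{
    return totalThreads < 4096 ? X86Emulator : NVIDIAGPU;
}

void launch(const std::string& ptx, const std::string& kernelName,
    int gridX, int gridY, Target target)
{
    if (target == X86Emulator)
    {
        // hand the PTX to the emulator instead (omitted here)
        return;
    }

    // GPU path: load the same PTX through the Driver API and launch it
    // (error checking and kernel parameter setup omitted for brevity)
    CUdevice device;   CUcontext context;
    CUmodule module;   CUfunction function;

    cuInit(0);
    cuDeviceGet(&device, 0);
    cuCtxCreate(&context, 0, device);
    cuModuleLoadData(&module, ptx.c_str());
    cuModuleGetFunction(&function, module, kernelName.c_str());
    cuLaunchGrid(function, gridX, gridY);
    cuCtxDestroy(context);
}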
If it is really valuable to you and others we might consider doing a Driver API port at some later time. I would estimate it being about a solid week or two’s worth of work. I might end up doing it over winter break if I’m bored…
Ah, it’s just a CUDA 2.2 + Makefile test with OMP timers, so it’d probably be an easy build.
I have my stuff running (somewhat) happily in the emulator now.
How functional are the tools in the bin directory? I got CFG to build me some cool directed graphs, but I didn’t see an obvious way to get the other binaries to work.
Is there any way to automatically get the traces (rather than using the API stuff in TestTrace.cpp)?
If not, is there a way to launch kernels more easily?
In general, are there any opportunities for me to be lazy but still get interesting data for analysis :)?
I’ll get that stuff packed up and sent to you in a few.
I think that all of the programs in the bin directory should work. DFG and CFG take input PTX programs; run DFG --help for info. The trace analyzers require you to point them at traces to generate output. Each trace type might use a different format, so they all need different analyzers.
The TestTrace.cpp example is very cumbersome to use because it uses the internal trace generator API. In addition to this, there are hooks into the CUDA Runtime API to add a trace generator. Right now, there are several prebuilt trace generators in ocelot/cuda/interface/CudaRuntime.h. Enable them by turning on CUDA_GENERATE_TRACE and selecting one of them. It will get called for each kernel that is launched. I do not include this in the documentation because it is going to change very soon, and it is disabled by default because traces can get very big very fast.
I am planning on adding an additional API call that can be used to bind a specific trace generator to a specific kernel as in:
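Roughly along these lines; nothing below exists yet, and the name, namespace, and signature are only a sketch of what it might end up looking like:

// Hypothetical sketch only: this call does not exist yet, and the names
// here are placeholders for whatever the final API ends up being.
#include <string>

namespace trace { class TraceGenerator; }   // existing internal generator base

namespace ocelot
{
    // Bind 'generator' to the kernel named 'kernelName'; every other
    // kernel would keep launching without tracing.
    void addKernelTraceGenerator(const std::string& kernelName,
        trace::TraceGenerator& generator);
}

// Intended usage (also hypothetical), e.g. tracing only one sgemm kernel:
//   trace::MemoryTraceGenerator memoryTracer;
//   ocelot::addKernelTraceGenerator("sgemmNN_dist", memoryTracer);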
It looks like the long arguments (size > 2) cause the asserts in Hydrazine’s argument parser to fail.
The machine-readable flag is set to true manually in quite a few of the function calls. I went through and got rid of those, and the analyzer is happily spitting out human-readable information now.
For the -Koverlapped argument, I get output that looks like:
Kernel sgemmNN_dist
path: /home/bales/jimkernel_queue/traces/sgemmNN_dist_2_5.trace
module: sgemm.cu
global addr range: 0x6bc600 - 0x979568
working set size: 2871144 bytes
segments of 128 bytes: 22431
global OOB refs: 0
global stored words: 8000
global load words: 22800
x-cta load words: 800
How could I intelligently interpret this?
Also, for the Histogram, is the output format:
Address, Number of Accesses?
Have you considered reimplementing the Driver API as some kind of filter over the actual NVIDIA library?
For instance, by creating a few symlinks:
ln -s /path_to_ocelot/ocelotcuda.so cuda.so
ln -s /path_to_cuda_driver/cuda.so nvcuda.so
Then linking your ocelotcuda.so with nvcuda.so…
That would require symlinks all over the place, but seems feasible…
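Roughly this kind of thing; a bare-bones sketch of the forwarding idea, where only cuInit is a real Driver API entry point and the file names and error handling are just placeholders:

// Minimal sketch of the "filter" idea: export a Driver API entry point and
// forward it to the real NVIDIA library, renamed/symlinked to nvcuda.so as
// above. Build as a shared library; illustrative only.
#include <dlfcn.h>
#include <cstdio>

typedef int CUresult;                        // stand-in for the real enum
typedef CUresult (*cuInitFunction)(unsigned int);

static void* realDriver = 0;

static void* realSymbol(const char* name)
{
    if (!realDriver) realDriver = dlopen("nvcuda.so", RTLD_NOW | RTLD_LOCAL);
    return realDriver ? dlsym(realDriver, name) : 0;
}

// Exported under the same name as the real call, so the application links
// against this library instead; decide here whether to emulate or forward.
extern "C" CUresult cuInit(unsigned int flags)
{
    std::fprintf(stderr, "intercepted cuInit(%u)\n", flags);
    cuInitFunction real = (cuInitFunction)realSymbol("cuInit");
    return real ? real(flags) : 1;           // 1 == CUDA_ERROR_INVALID_VALUE
}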
I don’t know how much of it can be reused, but you could use the implementation of the Driver API from Barra as a base or reference. It’s licensed under BSD, and I will be happy to share it if it can help.
I would actually recommend asking about this on the Ocelot mailing list (http://groups.google.com/group/gpuocelot?pli=1). There is another student in my lab (Andrew Kerr) who wrote the memory trace analyzers, so I would ask him directly. If I had to guess, I think that the intent of this type of analyzer in overlapped mode was to track data sharing between CTAs in consecutive kernels. So I think that 800 x-cta load words means that 800 memory accesses read data that was produced by a previous CTA rather than via a cudaMemcpy. I could be wrong about this, though.
If you have any questions about the Branch or Parallelism traces, I did actually write those and have much more info on them :)
We hadn’t actually considered this; thanks for the tip. I would probably like to stick with the Runtime-level API implementation that we have now since it enables static linking, but in the future I will definitely consider this. I’ll also take a look at Barra’s Driver-level API implementation when we start working on this.