Those of you who work with wrapper libraries for the CUDA runtime may have noticed the quiet introduction of ‘cudaGetExportTable’ and ‘cuGetExportTable’ around version 3.0 of the runtime, and that several nvidia libraries such as cublas and cufft started making calls to them.
This has been particularly painful for re-implementations of the CUDA runtime, such as the one in Ocelot or other GPU emulators/simulators, because these functions are not documented and, as we will see, extend the CUDA API to include internal driver functions. I spent some time poking around them and believe that I have an explanation of what they do and how to work around them if the CUDA driver is not installed.
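For reference, the prototypes look roughly like this in cuda.h and cuda_runtime_api.h (paraphrased; the headers offer nothing beyond the declarations themselves):

// Paraphrased from the CUDA headers; no documentation accompanies
// these prototypes
CUresult cuGetExportTable(const void **ppExportTable,
    const CUuuid *pExportTableId);

cudaError_t cudaGetExportTable(const void **ppExportTable,
    const cudaUUID_t *pExportTableId);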
My first successful implementation of ‘cudaGetExportTable’ looked like this:
cudaError_t cuda::CudaRuntime::cudaGetExportTable(const void **ppExportTable,
    const cudaUUID_t *pExportTableId) {
    report("Getting export table");

    // Initialize the real driver and create a context so that the
    // forwarded call below has something to execute against
    cuda::CudaDriver::cuInit(0);
    CUcontext context;
    cuda::CudaDriver::cuCtxCreate(&context, 0, 0);

    // Pass the request straight through to the real driver
    cuda::CudaDriver::cuGetExportTable(ppExportTable, pExportTableId);
    return cudaSuccess;
}
This passes the parameters straight through into a newly created driver context, but the rest of the calls are executed on Ocelot’s emulated device. Getting this far allowed me to determine a few things:
1. The export table ppExportTable returned by the driver is an array of pointers. Manual inspection with a debugger showed that these addresses were mapped into the address space of the driver.
2. Assembly-level debugging showed that the internal functions in CUBLAS and CUFFT retrieve elements from this array of pointers and call them using a standard calling convention.
3. From 1) and 2), I concluded that the export table is a list of function pointers to internal driver functions (a sketch of the calling pattern follows this list).
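To make that calling pattern concrete, here is a minimal sketch of what the disassembly suggested the client libraries do. The slot index and the zero-argument signature are assumptions for illustration only; the real entries presumably take real arguments:

#include <cstdint>

// Hypothetical illustration of the observed pattern: index into the
// array returned by cuGetExportTable and invoke the entry as a
// function. The signature is an assumption, not the real interface.
typedef int (*InternalDriverFn)();

int callThroughTable(const void *exportTable, int slot) {
    const void *const *table =
        static_cast<const void *const *>(exportTable);
    InternalDriverFn fn = reinterpret_cast<InternalDriverFn>(
        reinterpret_cast<std::uintptr_t>(table[slot]));
    return fn(); // standard calling convention, as seen in the assembly
}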
Armed with this knowledge, I continued to step through the CUBLAS assembly until several of these function pointers were called and their results were returned. My options at this point were to dump the assembly of the individual functions being called and re-implement them in C (which would consume more of my time than I would like, and I would have to deal with calls to other internal driver functions ad nauseam), or to write some dummy functions that returned close-enough results. The following code passes through enough of CUBLAS and CUFFT to execute the SDK samples.
// This is a horrible hack to deal with another horrible hack.
// Thanks nvidia for creating a backdoor interface to your driver
// rather than extending the API in a sane and documented way.

// Stub entries for the export table: dummy0 returns zero, and dummy1
// returns 2 << 20 (2 MiB), which turned out to be close enough to
// satisfy the callers in CUBLAS and CUFFT
int dummy0() { return 0; }
int dummy1() { return 2 << 20; }

typedef int (*ExportedFunction)();

static ExportedFunction exportTable[3] = { &dummy0, &dummy1, &dummy1 };

cudaError_t cuda::CudaRuntime::cudaGetExportTable(const void **ppExportTable,
    const cudaUUID_t *pExportTableId) {
    report("Getting export table");
    *ppExportTable = &exportTable;
    return cudaSuccess;
}
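If you want to see which slots a given library actually touches before settling on return values, a slightly noisier variant of the same hack (my sketch, not Ocelot's actual code) replaces the dummies with logging stubs:

#include <cstdio>

// Logging stubs: print the slot index on every call so you can watch
// which entries CUBLAS/CUFFT exercise. Slot 0 keeps the zero return;
// the others keep 2 << 20, matching the dummies above.
template<int Slot>
static int loggingDummy() {
    std::printf("export table slot %d called\n", Slot);
    return Slot == 0 ? 0 : (2 << 20);
}

typedef int (*ExportedFunction)();

static ExportedFunction exportTable[3] = {
    &loggingDummy<0>, &loggingDummy<1>, &loggingDummy<2>
};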
I thought that I would post this here in case anyone else was having trouble simulating CUBLAS/CUFFT.
Someone could probably take this further by going back to my first approach, copying out the regions pointed to by the function pointers in the table returned by the driver, disassembling and decompiling them, and then re-implementing them in a more correct way. Or nvidia could actually document these functions…
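For anyone who wants to try the dumping route, here is a rough sketch. The 4 KB window and the slot count are guesses on my part, since nothing tells you how large each function is, and reading past the end of a mapping will fault:

#include <cstdio>

// Dump a fixed-size window of bytes at each table entry to a file for
// offline disassembly. Both 'slots' and the window size are guesses.
void dumpExportTable(const void *const *table, int slots) {
    const std::size_t window = 4096;
    for (int i = 0; i < slots; ++i) {
        char filename[64];
        std::snprintf(filename, sizeof(filename),
            "export_slot_%d.bin", i);
        if (std::FILE *out = std::fopen(filename, "wb")) {
            std::fwrite(table[i], 1, window, out);
            std::fclose(out);
        }
    }
}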