How to get the program binaries using the STL bindings?

Hi, I’m trying to get the binaries for my OpenCL program in order to speed up my application. I was hoping to reduce the overhead of program building: instead of loading the source, I’d load a binary into the program.

Question is, how do I do this using the cl.hpp bindings?

In this tutorial, http://www.mss.cbi.uni-erlangen.de/KkuDatabase/files/files/316/main.pdf, there’s the following code snippet:

// Allocate some memory for all the kernel binary data
const std::vector<size_t> binSizes = program.getInfo<CL_PROGRAM_BINARY_SIZES>();
std::vector<unsigned char> binData(std::accumulate(binSizes.begin(), binSizes.end(), 0));
unsigned char* binChunk = &binData[0];

// A list of pointers to the binary data
std::vector<unsigned char*> binaries;
for (size_t i = 0; i < binSizes.size(); ++i) {
    binaries.push_back(binChunk);
    binChunk += binSizes[i];
}

program.getInfo(CL_PROGRAM_BINARIES, &binaries[0]);

std::ofstream binaryfile("kernels.bin");
if (!binaryfile.good()) throw std::runtime_error("Failed to open kernels.bin for writing");
for (size_t i = 0; i < binaries.size(); ++i)
    binaryfile << binaries[i];

However, this snippet isn’t working for Nvidia binaries. When I compile for Intel OpenCL it generates the .bin file, but it’s only 5 bytes, so I suspect it isn’t working for Intel either. For Nvidia I get error -30 (CL_INVALID_VALUE), which probably means there is a size mismatch in where I’m trying to save the binaries.

Can you help with this? How can I retrieve the binaries using the STL bindings and write them to a file?

TY.

Depending on your version of cl.hpp you may suffer from bug 254. As I was affected by this bug, too, I’ve created a fork of cl.hpp at repo.or.cz with this issue (and a few others) fixed. Get the file from here.

I still get the same -30 error with the patched cl.hpp. Can you tell me if the code is OK? My knowledge of the vector class is somewhat limited, to say the least. Or can you provide an alternative way to get the binaries using the STL bindings?

TY

With the patched cl.hpp the API has changed slightly. You need to use it like this:

typedef VECTOR_CLASS<cl::Device> DeviceList;

DeviceList devices=g_context->getInfo<CL_CONTEXT_DEVICES>();

// ...

cl_int result;
for (size_t d=0;d<devices.size();++d) {
    // For NVIDIA, show the PTX source code in case of compile errors.
    VECTOR_CLASS< VECTOR_CLASS<unsigned char> > ptx=g_program->getInfo<CL_PROGRAM_BINARIES>(&result);
    if (result==CL_SUCCESS && !ptx.empty()) {
        // Ensure the PTX source code is NULL-terminated.
        ptx[d].push_back('\0');
        printf("Program binary for device %d:\n%s",(int)d,&ptx[d].front());
    }
}

Thank you for this precious snippet. I can get the PTX code for Nvidia, but not for the ATI/AMD platform; for the AMD case, I get ELF. For those platforms I’m aware we can set an environment variable to have the CAL code dumped, but I was hoping the OpenCL API would do this automatically for all platforms.

In any case, I’m trying to get cross-platform and cross-device portability, so I’ll at least need to compile the source code once for a given machine, but afterwards I can load only the binaries. Do you reckon a significant amount of time can be saved by loading a binary instead of compiling the source?

That’s right.

That really depends on your OpenCL platform and kernel source code size. In the early days, I remember the ATI Stream CPU platform taking quite a long time to compile the kernel code. However, subsequent compiles were much faster because the compiler was already loaded into memory. For my rather small kernels, I guess the time savings come from not loading the compiler rather than from saving the time to compile the source code when loading binaries. So, while it’s certainly good practice to pre-compile your kernel on the user’s machine on the first run of the application, I would probably add this feature late in the development process as an optimization. And I would surely not ship pre-compiled binaries, as it’s almost certain these will not work on the user’s machine due to different hardware / drivers.

You’re right about the pre-compiled binaries.

Now that you mention it, it actually does take a long time to load the compiler, especially for Intel and AMD; the Nvidia compiler is pretty fast, though. So, despite the 2k lines of kernel code I’m testing (most of which are repetitions of code portions after unpacking words out of a vector), the compile time seems reasonable enough to get by without loading binaries. I got around the compile latency on AMD by stalling the application after compilation and taking input arguments only after the program and kernels were built, so I could test different parameters. But manual input is a productivity killer compared to a bash script calling a fully compiled executable with different arguments, IMHO.

Once again, thank you for the patched cl.hpp; now I can inspect my kernels’ PTX.