Pre-Compiling OpenCL Kernels Tutorial

I searched many forums for a concrete example of how to pre-compile OpenCL kernels, but never found all the pieces in one place. Here are a few functions I have created, gathered, and modified to accomplish this. First, write out the binary file using the code in the first code block. Call it after clCreateProgramWithSource and clBuildProgram, since the binary only exists once the program has been built.

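[codebox]// Requires <fstream>, <iostream>, <vector> and the OpenCL header.
// cpProgram is the cl_program that has already been built with clBuildProgram.
void writeBinaries()
{
    cl_uint program_num_devices = 0;
    clGetProgramInfo(cpProgram, CL_PROGRAM_NUM_DEVICES, sizeof(cl_uint), &program_num_devices, NULL);

    if (program_num_devices == 0)
    {
        std::cerr << "no valid binary was found" << std::endl;
        return;
    }

    // One size per device the program was built for.
    std::vector<size_t> binaries_sizes(program_num_devices);
    clGetProgramInfo(cpProgram, CL_PROGRAM_BINARY_SIZES, program_num_devices * sizeof(size_t), &binaries_sizes[0], NULL);

    // Allocate one buffer per device, then let OpenCL fill them in.
    unsigned char **binaries = new unsigned char*[program_num_devices];
    for (size_t i = 0; i < program_num_devices; i++)
        binaries[i] = new unsigned char[binaries_sizes[i]];

    // The size passed here is that of the pointer array, not of the size array.
    clGetProgramInfo(cpProgram, CL_PROGRAM_BINARIES, program_num_devices * sizeof(unsigned char*), binaries, NULL);

    // Write raw bytes in binary mode; operator<< would stop at the first zero byte.
    // With more than one device the binaries are simply written back to back.
    std::ofstream myfile("kernel.ptx", std::ios::binary);
    for (size_t i = 0; i < program_num_devices; i++)
        myfile.write(reinterpret_cast<const char*>(binaries[i]), binaries_sizes[i]);

    for (size_t i = 0; i < program_num_devices; i++)
        delete [] binaries[i];
    delete [] binaries;
}[/codebox]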

Next, comment out your load-program-from-source routine (e.g. oclLoadProgSource), the clCreateProgramWithSource call, and the call to writeBinaries(). Then add this code in their place.

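[codebox]// Must be the file written by writeBinaries(); rename it or adjust the path as needed.
FILE* fp = fopen("oclLLtoUTM.ptx", "rb");
if (fp == NULL)
{
    cout<<"Could not open the binary file, Line "<<__LINE__<<" in file "<<__FILE__<<" "<<endl;
    // Stop here: the calls below need a valid file.
}

// Find the file size, then rewind so fread starts at the beginning.
fseek(fp, 0, SEEK_END);
const size_t lSize = ftell(fp);
rewind(fp);

unsigned char* buffer = (unsigned char*) malloc(lSize);
fread(buffer, 1, lSize, fp);
fclose(fp);

cl_int status;   // per-device result of loading the binary
cpProgram = clCreateProgramWithBinary(cxGPUContext, 1, (const cl_device_id *)cdDevices,
            &lSize, (const unsigned char**)&buffer,
            &status, &ciErr1);

if (ciErr1 != CL_SUCCESS)
{
    cout<<"Error in clCreateProgramWithBinary, Line "<<__LINE__<<" in file "<<__FILE__<<" "<<endl;
}

// The binary still has to be built for the device before kernels can be created.
ciErr1 = clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
free(buffer);[/codebox]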

This reads in your ptx file and creates the program from the saved binary instead of compiling it from source. That is all there is to it.
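
One thing worth spelling out: a program binary is not guaranteed to be text. It may contain zero bytes, so it is safer to treat it as raw bytes than as a C string. Here is a minimal sketch of binary-safe file helpers in plain C++. The names saveBinary and loadBinary are my own (hypothetical, not part of OpenCL), and the sketch only covers the file I/O part.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: write `size` raw bytes to `path`.
// Returns false if the file could not be opened or written.
bool saveBinary(const std::string& path, const unsigned char* data, std::size_t size)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(data), static_cast<std::streamsize>(size));
    return static_cast<bool>(out);
}

// Hypothetical helper: read the whole file back as raw bytes.
// Returns an empty vector if the file could not be opened.
std::vector<unsigned char> loadBinary(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}
```

The vector returned by loadBinary owns its memory, so its data pointer and size can be handed to clCreateProgramWithBinary as the binary and its length without a separate malloc/free pair.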


Hey there. Thanks for posting this.

The CUDA docs (old ones) say that this should only work if produced and consumed by the same driver, and might be removed in future versions.

Any idea what the reality behind this is? Is the ptx format a viable option for distribution to multiple NVIDIA devices?
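
One way to stay on the safe side of that restriction is to treat the saved binary as a local cache rather than something to distribute: key the file name on the device and driver that produced it, and rebuild from source whenever no matching file exists. Below is a minimal sketch of that idea, assuming a hypothetical helper binaryCacheName (my own name, not an OpenCL function). The two strings would come from clGetDeviceInfo with CL_DEVICE_NAME and CL_DRIVER_VERSION.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: build a file name that changes whenever the device
// or the driver version changes, so a binary from another driver is not reused.
std::string binaryCacheName(const std::string& kernelName,
                            const std::string& deviceName,
                            const std::string& driverVersion)
{
    std::string name = kernelName + "_" + deviceName + "_" + driverVersion;
    // Replace anything that is not a letter, digit, '.', '-' or '_'
    // so the result is usable as a file name.
    for (char& c : name) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (!std::isalnum(uc) && c != '.' && c != '-' && c != '_')
            c = '_';
    }
    return name + ".ptx";
}
```

If the file with that name is missing, or clCreateProgramWithBinary or clBuildProgram fails on it, the program falls back to clCreateProgramWithSource and writes a fresh binary.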