Hi,
I am trying to measure the time taken to compile OpenCL code through NVIDIA’s OpenCL compiler on Windows. Does clBuildProgram handle the whole job (compiling & linking), or does it offload some of its work to clEnqueueNDRangeKernel, clCreateKernel, or some other API?
I’m not sure about this, and could not find any mention of it in the spec, but I vaguely remember reading somewhere that clBuildProgram may use background threads to do the actual compiling and linking. If that is the case, then any call to the cl*Kernel functions would probably have to wait for the compilation to finish. But as I said, I’m not sure where I got this from.
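One thing the spec does say: clBuildProgram() takes an optional notification callback, and only with a NULL callback is it required to block until the build has completed. So if you want guaranteed blocking behavior for timing, something like this sketch (untested; `program` assumed to be a valid cl_program, error checking omitted):

#include <stdio.h>
#include <CL/cl.h>

/* Called by the runtime once the build has finished. */
static void CL_CALLBACK build_done(cl_program prog, void *user_data)
{
    printf("build finished\n");
}

void build_blocking(cl_program program)
{
    /* NULL callback: clBuildProgram() must block until the build completes,
       so wrapping this call in a timer captures the full compile time. */
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
}

void build_async(cl_program program)
{
    /* Non-NULL callback: the call may return immediately and build on a
       background thread; build_done() fires on completion. */
    clBuildProgram(program, 0, NULL, NULL, build_done, NULL);
}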
I’ve timed compilation in the past by timing only the call to clBuildProgram(), and that seemed to work fine. However, if you call clBuildProgram() multiple times in a row, subsequent calls may take less time because the compiler itself no longer needs to be loaded into memory. You could try unloading the compiler in between for consistent numbers by calling clUnloadCompiler(), but that is only a hint to the implementation. The better way is a dummy call to clBuildProgram() to make sure the compiler is already loaded for all measured compilations, and then to run those compilations quickly after one another (the compiler might unload after clBuildProgram() has not been called for a certain amount of time).
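As a sketch of that dummy-build idea (untested; `dummy` and `program` assumed to be valid cl_program objects created with clCreateProgramWithSource(), error checking omitted):

#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

void time_build(cl_program dummy, cl_program program)
{
    /* Throwaway build so the compiler is already loaded into memory. */
    clBuildProgram(dummy, 0, NULL, NULL, NULL, NULL);

    /* Now time only the build of interest. */
    clock_t t0 = clock();
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    clock_t t1 = clock();

    printf("clBuildProgram() took %.3f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC);
}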
Also, to be on the safe side, you could try to force-trigger the compilation by requesting the binaries via clGetProgramInfo() / CL_PROGRAM_BINARIES as part of the timing …
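That is, something along these lines (a sketch assuming a single device; error checking omitted):

#include <stdlib.h>
#include <CL/cl.h>

/* Requesting the device binary forces any deferred compilation to finish
   before the call returns, so include this in the timed region. */
void force_binary(cl_program program)
{
    size_t size = 0;  /* one entry per device; one device assumed here */
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *bin = (unsigned char *)malloc(size);
    unsigned char *bins[1] = { bin };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
    free(bin);
}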
I suspect determining these numbers is not too simple, especially if you do not have information on how NVIDIA’s runtime works internally.
I know for sure only that Apple’s and AMD’s platforms both cache intermediate code (Apple’s platform even caches too aggressively, not regenerating PTX for changed preprocessor macros and so forth). For NVIDIA I can only suspect it, but caching of the GPU instruction stream generated from PTX might even be outside of OpenCL’s influence and handled in the CUDA backend.
If anyone is willing to measure and check all that, I’d be curious myself, but I am not willing to spend the time, especially as the behavior may change with every new OpenCL and CUDA driver release.
So, do I understand you correctly: if I have a file with several kernels, I ought to generate a separate file for each kernel, and instead of compiling one program with several kernels, compile the programs sequentially, one after another (one per kernel)? This should minimize the compilation time?
(And what did you mean by “multiple times in a row”?)
Another question: how have you timed clBuildProgram()? Did you use some OpenCL profiling function/tool, or just clock()?
Well, what I was seeing was this: if I compile OpenCL source code and immediately afterwards compile the same source code again, the second compilation takes less time than the first one. This is probably due to two things: 1) the first call loads the compiler itself into memory, which doesn’t need to be done again for the second call; 2) the driver might cache compilation results and not actually compile anything if the result for some source code is already known.
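If you want to rule out 2) in your measurements, one trick (my own workaround, not anything NVIDIA documents, and it assumes the driver keys its cache on source plus build options) is to make every build unique via a changing, unused preprocessor define:

#include <stdio.h>
#include <CL/cl.h>

/* Inject a changing define that the kernel source never references,
   so each build looks unique to a potential compilation cache. */
void build_uncached(cl_program program, int iteration)
{
    char options[64];
    snprintf(options, sizeof(options), "-D CACHE_BUSTER=%d", iteration);
    clBuildProgram(program, 0, NULL, options, NULL, NULL);
}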
If the second source code is different from the first, you’ll still benefit from the compiler already being loaded into memory. However, the time you save should be negligible compared to the compilation time itself, so I would not recommend artificially splitting existing source code into multiple files; it would just mean more API calls to load the source, which again takes more time.
I’ve been using my own timing class on Windows, based on QueryPerformanceCounter().
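The core of it is just this (a minimal sketch):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   /* ticks per second */

    QueryPerformanceCounter(&start);
    /* ... call clBuildProgram() here ... */
    QueryPerformanceCounter(&stop);

    printf("elapsed: %.6f s\n",
           (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart);
    return 0;
}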