Is there a chance that ptxas.exe will use all cores of the CPU? This would be a great improvement o

After adding 10 lines of code to my kernel, compilation time has grown from 4 minutes to hours. Ptxas uses only one core of the CPU, so a multi-threaded ptxas would be a very valuable CUDA improvement.

This sounds more like a bug, or like excessive unrolling. How large are the generated files before and after adding the ten lines?

That’s definitely an algorithmic problem in ptxas. My kernel is rather big: the cubin files are about 400K in both cases (with and without those magic 10 lines of code), so it seems ptxas does not do its job well for large kernels. The only thing those 10 lines do is one extra access to shared memory; there are no loops that could be unrolled or anything like that. Yet that one extra access increases the compilation time from 4 minutes to 3 hours.

I have another interesting observation regarding the reliability of ptxas :-) If I specify noinline for all the device functions I use (about 50 large functions), the kernel compiles in just one minute. However, it then does not work at all and fails with an unspecified launch failure (I work with Fermi and compile for sm20).
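For readers unfamiliar with the experiment described above, here is a minimal sketch of how a function can be marked as non-inlined. `__noinline__` is the CUDA qualifier; the macro name `DEVICE_NOINLINE` and the function `scale_and_offset` are hypothetical and are not taken from the poster's project. The macro falls back to the host compiler's own attribute so the snippet also builds without nvcc.

```cpp
// DEVICE_NOINLINE is a hypothetical convenience macro, not part of CUDA.
#if defined(__CUDACC__)
  // Under nvcc: callable from host and device, with inlining suppressed.
  #define DEVICE_NOINLINE __host__ __device__ __noinline__
#elif defined(__GNUC__)
  // Plain GCC/Clang host build.
  #define DEVICE_NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
  // Plain MSVC host build.
  #define DEVICE_NOINLINE __declspec(noinline)
#else
  #define DEVICE_NOINLINE
#endif

// Illustrative stand-in for one of the "large device functions".
DEVICE_NOINLINE float scale_and_offset(float x, float scale, float offset)
{
    return x * scale + offset;
}
```

Suppressing inlining keeps each function body as a separate unit, which is one plausible reason the compile time drops so sharply in the report above. It does not explain the launch failure.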

Are you using the CUDA 3.2 toolchain? If so, the dramatic increase in PTXAS compile time resulting from a minor source code change is something that our compiler team should look into. If you are a registered developer, it would be helpful if you could file a bug. Thanks.

I’m using CUDA 3.1. Should I try moving to 3.2? If it really makes sense (if the compiler in 3.2 contains serious fixes or improvements), please confirm.

New releases of CUDA include a significant number of bug fixes in addition to the various new features and performance improvements they provide. The only way to find out whether the CUDA 3.2 toolchain addresses the particular issue you are encountering with the CUDA 3.1 toolchain is to give it a try. If it turns out that the compiler from CUDA 3.2 doesn’t fix the problem, it would be very helpful to file a bug.

OK, I’ll give 3.2 a try and let you know the results. Thanks for your help!

I have installed 3.2 and compiled my project with it. Compilation time has dropped from hours to 20 seconds … but only when the GPU debug info generation option is on :-) After turning it off I see the following:

  1. Compilation time is still huge. It has already been running for 10 minutes with no indication of how long it will take in total.

  2. Memory consumption has grown significantly: ptxas 3.1 consumed about 3 GB, while ptxas 3.2 eats almost 4 GB.

  3. Even the code without those magic 10 lines (the ones that make ptxas run seemingly forever) now runs MUCH SLOWER! The test run now takes 98 seconds instead of 80 seconds.

As for compilation, ptxas 3.1 and ptxas 3.2 behave almost the same: the ptxas.exe process allocates a lot of memory and then uses 100% of one CPU core for 3 hours. In general, it seems the problem is still in the compiler.

I am a registered developer, but I’m unsure how to go about posting the bug report. There is practically no chance of extracting a test case that reproduces the issue: the project is huge and complex, and I can hardly imagine isolating the relevant part.

Just in case, here is the code that makes ptxas run for three hours when added:


		if (!bSkipSet && nPos > 0 && ((nSkip = SKIP(nPos - 1)) != 0))
		{
			if (fValue == GPTRUE)
			{
				GPINT nSkipSize1 = (nSkip >> 16);
				GPINT nSkipSize2 = (nSkip << 16) >> 16;
				nSkipStart = nPos + nSkipSize1;
				nSkipLength = nSkipSize2;
			}
			else
			{
				GPINT nSkipSize1 = (nSkip >> 16);
				nSkipStart = nPos;
				nSkipLength = nSkipSize1;
			}
			bSkipSet = true;
		} // if (nSkip != 0)

		if (bSkipSet && nPos == nSkipStart)
		{
			nPos += nSkipLength;
			STACK_PUSH(Parms, 0); // Dummy substitution of skipped operations
			bSkipSet = false;
		}

SKIP(nPos - 1), STACK_VALUE_TOP(Parms) and STACK_PUSH(Parms, 0) are macros that reference float numbers in shared memory. My experiments show that removing just one of the shared memory accesses (SKIP(nPos - 1)) makes ptxas run for six minutes instead of 3 hours. I should also note that ptxas 3.2 takes almost 6 minutes for code that ptxas 3.1 handles in 4-5 minutes.
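As an aside, the two shift expressions in the snippet read two signed 16-bit fields out of a single value. The sketch below shows that decoding in isolation, assuming nSkip is a 32-bit integer (the thread does not show the definition of GPINT). The names `SkipFields`, `pack_skip` and `unpack_skip` are hypothetical. The casts go through 16-bit types instead of shifting a signed value left, which sidesteps undefined behaviour in older C++ standards.

```cpp
#include <cstdint>

// Hypothetical holder for the two fields decoded from nSkip.
struct SkipFields {
    std::int32_t high; // corresponds to (nSkip >> 16)
    std::int32_t low;  // corresponds to (nSkip << 16) >> 16, sign-extended
};

// Pack two signed 16-bit values into one 32-bit word (high half, low half).
inline std::uint32_t pack_skip(std::int16_t high, std::int16_t low)
{
    return (static_cast<std::uint32_t>(static_cast<std::uint16_t>(high)) << 16)
         | static_cast<std::uint16_t>(low);
}

// Recover both halves; the int16_t casts perform the sign extension.
// (Out-of-range narrowing is implementation-defined before C++20, but wraps
// modulo 2^16 on two's-complement targets.)
inline SkipFields unpack_skip(std::uint32_t packed)
{
    SkipFields f;
    f.high = static_cast<std::int16_t>(packed >> 16);
    f.low  = static_cast<std::int16_t>(packed & 0xFFFFu);
    return f;
}
```

For example, `unpack_skip(pack_skip(2, -5))` yields `high == 2` and `low == -5`, matching what the two shifts would produce for a 32-bit signed nSkip with arithmetic right shift.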

The real concern is not compilation speed but runtime performance: 98 seconds instead of 80 is 22.5% slower.

Sorry to hear the 3.2 toolchain didn’t fix the compilation time issue. A repro case is indispensable for the compiler team to get to the bottom of an issue such as this one. Since the problem (per your diagnosis) seems to be in PTXAS, all that should be needed for repro is the one .ptx file that causes PTXAS to get “stuck”, and the commandline for the corresponding PTXAS invocation (from nvcc --verbose).

Understood. How should I provide the .ptx? Via personal message to you?

It would be more advantageous for you to file a bug, attaching the .ptx file (obtained with --keep) and the ptxas commandline invocation (obtained with --verbose), which should allow the compiler engineers to reproduce the issue with the excessive compilation time. As the filer of the bug, you will then have visibility into the bug’s progress.

If you prefer, you could also send the file in question and the commandline information to me in a personal message via the forum (it is possible to attach files to PMs). Any internally filed bug would however be visible only within NVIDIA.