PTX question

Hello!

I have a problem with a kernel: it uses too many registers. When I look in the PTX file, I find the following:

mov.f32  %f1, 0f00000000;      // 0
mov.f32  %f2, %f1;             //
mov.f32  %f3, 0f00000000;      // 0
mov.f32  %f4, %f3;             //
mov.f32  %f5, 0f00000000;      // 0
...
mov.f32  %f59, 0f00000000;     // 0
mov.f32  %f60, %f59;           //
mov.f32  %f61, 0f00000000;     // 0
mov.f32  %f62, %f61;           //
mov.f32  %f63, 0f00000000;     // 0
mov.f32  %f64, %f63;           //

The actual C code looks like this:

float c1[16] = {0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f};
float c2[16] = {0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f};

Instead of using 32 registers for the two arrays, the compiler uses 64. Interestingly, the even-numbered registers are never used as a source, but whenever an odd-numbered one is set, a copy of it is stored in the following even-numbered one.

Does anybody know what happened there?

Thanks!

Moritz

You’re looking at PTX. It’s not real assembly code; it will be re-optimized when it is compiled into a cubin. You might want to continue your investigation using decuda.
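Something like this, for example (the file name is a placeholder, and the exact decuda invocation depends on the version you have, since it is a Python script you may need to run through python explicitly):

nvcc -cubin kernel.cu       # run the whole toolchain, ptxas included, and keep the cubin
decuda kernel.cubin         # disassemble the machine code that actually ended up in the cubin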

Thank you for the tip.

Seems like PTX doesn't have much to do with the actual assembly.

I think PTX has got a lot to do with assembly… but it gets further optimized by the CUDA run-time system… So I don't think even the CUBIN holds the actual binary output.

It all depends on the CUDA run-time system…

I might be wrong though… just my inference from the nvcc -help description (the -code and -arch options).
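For what it's worth, this is the split those two options describe, as far as I understand it (the command line below is just an illustration, not from the original post):

nvcc -arch=compute_11 -code=compute_11,sm_11 kernel.cu
# -arch names the virtual (PTX) architecture the code is compiled against
# -code lists what gets embedded in the binary: compute_11 embeds PTX for
# the run-time system to JIT-compile later, sm_11 embeds ready-made
# machine code for that particular chip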

As far as I know, cubin is the actual machine code.
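Either way, for the original register question you don't need to disassemble anything: asking ptxas to be verbose makes it report what it actually allocated (the output below is from memory, so treat the exact wording and numbers as approximate):

nvcc --ptxas-options=-v kernel.cu
# ptxas then prints a per-kernel summary, something like:
#   ptxas info : Used 12 registers, 32 bytes smem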