PTX compute 2.0 not forward compatible with GTX 970?!

Hello,

This kernel was generated for compute capability 2.0, so I think it should be forward compatible with devices of higher compute capability, for example a GTX 970 (compute capability 5.2).

However, users of this kernel are reporting a launch failure with the following launch dimensions:

BlockWidth: 1024
BlockHeight: 1
BlockDepth: 1
GridWidth: 13691407
GridHeight: 1
GridDepth: 1
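For reference, the launch is issued through the CUDA driver API roughly like this. This is only a sketch: the function name, the kernel argument variables, and the error handling are placeholders, not my exact code; the grid/block dimensions are the ones reported above.

#include <cuda.h>
#include <stdio.h>

// Hypothetical sketch of the failing launch via the driver API.
// "hFunc" is the CUfunction obtained from the JIT-compiled PTX module;
// the two kernel arguments are placeholders for the real buffer/count.
CUresult launchBandwidthKernel(CUfunction hFunc, CUdeviceptr dBuf, unsigned int count)
{
    void *params[] = { &dBuf, &count };

    // Grid/block dimensions exactly as reported by the users.
    CUresult res = cuLaunchKernel(hFunc,
                                  13691407, 1, 1,   // grid:  x, y, z
                                  1024,     1, 1,   // block: x, y, z
                                  0,                // shared memory bytes
                                  0,                // stream
                                  params, NULL);

    if (res != CUDA_SUCCESS)
        printf("cuLaunchKernel failed with error %d\n", (int)res);
    return res;
}

The PTX source of the kernel follows: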

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Mar 15 02:52:22 2014 (1394848342)
// Cuda compilation tools, release 6.0, V6.0.1
//

.version 4.0
.target sm_20
.address_size 32

.visible .entry KernelBandwidth(
.param .u32 KernelBandwidth_param_0,
.param .u32 KernelBandwidth_param_1
)
{
.reg .pred %p<2>;
.reg .s32 %r<24>;
.reg .f32 %f<2>;

ld.param.u32 	%r2, [KernelBandwidth_param_0];
ld.param.u32 	%r3, [KernelBandwidth_param_1];
cvta.to.global.u32 	%r1, %r2;
mov.u32 	%r4, %ctaid.z;
mov.u32 	%r5, %nctaid.y;
mov.u32 	%r6, %ctaid.y;
mad.lo.s32 	%r7, %r4, %r5, %r6;
mov.u32 	%r8, %nctaid.x;
mov.u32 	%r9, %ctaid.x;
mad.lo.s32 	%r10, %r7, %r8, %r9;
mov.u32 	%r11, %ntid.z;
mov.u32 	%r12, %tid.z;
mad.lo.s32 	%r13, %r10, %r11, %r12;
mov.u32 	%r14, %ntid.y;
mov.u32 	%r15, %tid.y;
mad.lo.s32 	%r16, %r13, %r14, %r15;
mov.u32 	%r17, %ntid.x;
mov.u32 	%r18, %tid.x;
mad.lo.s32 	%r19, %r16, %r17, %r18;
rem.s32 	%r20, %r19, %r3;
shl.b32 	%r21, %r20, 4;
add.s32 	%r22, %r1, %r21;
ld.global.f32 	%f1, [%r22];

// ld.volatile.global.f32 %f1, [%r22];
// st.global.f32 [%r22], %f1;
ret;
}

My application uses the driver API, so the driver should JIT-compile this PTX into something that can execute on the GTX 970. This seems to fail, so I consider it a “compatibility bug”.
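In case it matters, the loading step itself is nothing special. A minimal sketch of how the PTX source is handed to the driver for JIT compilation follows; the function name is illustrative, but “KernelBandwidth” is the real entry name from the PTX above:

#include <cuda.h>

// "ptxSource" is the sm_20 PTX text shown above, as a NUL-terminated
// string. The driver JIT-compiles it for whatever device backs the
// current context (a GTX 970 in the failing case).
CUresult loadBandwidthKernel(const char *ptxSource,
                             CUmodule *module, CUfunction *func)
{
    CUresult res = cuModuleLoadData(module, ptxSource);
    if (res != CUDA_SUCCESS)
        return res;
    return cuModuleGetFunction(func, *module, "KernelBandwidth");
}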

Are bugs like these going to be fixed some day?

Recompiling this kernel for a higher compute capability, for example 3.5, gives different-looking PTX:

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Mar 15 02:52:22 2014 (1394848342)
// Cuda compilation tools, release 6.0, V6.0.1
//

.version 4.0
.target sm_35
.address_size 32

.weak .func (.param .b32 func_retval0) cudaMalloc(
.param .b32 cudaMalloc_param_0,
.param .b32 cudaMalloc_param_1
)
{
.reg .s32 %r<2>;

mov.u32 	%r1, 30;
st.param.b32	[func_retval0+0], %r1;
ret;

}

.weak .func (.param .b32 func_retval0) cudaFuncGetAttributes(
.param .b32 cudaFuncGetAttributes_param_0,
.param .b32 cudaFuncGetAttributes_param_1
)
{
.reg .s32 %r<2>;

mov.u32 	%r1, 30;
st.param.b32	[func_retval0+0], %r1;
ret;

}

.visible .entry KernelBandwidth(
.param .u32 KernelBandwidth_param_0,
.param .u32 KernelBandwidth_param_1
)
{
.reg .pred %p<2>;
.reg .s32 %r<24>;
.reg .f32 %f<2>;

ld.param.u32 	%r2, [KernelBandwidth_param_0];
ld.param.u32 	%r3, [KernelBandwidth_param_1];
cvta.to.global.u32 	%r1, %r2;
mov.u32 	%r4, %ctaid.z;
mov.u32 	%r5, %nctaid.y;
mov.u32 	%r6, %ctaid.y;
mad.lo.s32 	%r7, %r4, %r5, %r6;
mov.u32 	%r8, %nctaid.x;
mov.u32 	%r9, %ctaid.x;
mad.lo.s32 	%r10, %r7, %r8, %r9;
mov.u32 	%r11, %ntid.z;
mov.u32 	%r12, %tid.z;
mad.lo.s32 	%r13, %r10, %r11, %r12;
mov.u32 	%r14, %ntid.y;
mov.u32 	%r15, %tid.y;
mad.lo.s32 	%r16, %r13, %r14, %r15;
mov.u32 	%r17, %ntid.x;
mov.u32 	%r18, %tid.x;
mad.lo.s32 	%r19, %r16, %r17, %r18;
rem.s32 	%r20, %r19, %r3;
shl.b32 	%r21, %r20, 4;
add.s32 	%r22, %r1, %r21;
ld.global.f32 	%f1, [%r22];
ret;

}

The difference is minor. In any case, my application should not require recompiling its kernels?!

For now I have no choice but to offer multiple PTX versions so users can run the application.

I am not sure yet whether recompiling the kernel and generating new PTX would solve their problem, but according to this link it might:

http://stackoverflow.com/questions/26676419/maximum-blocks-number-on-a-gtx-titan

Seems like this is your problem:

I did cut these lines out of the PTX:

setp.neu.f32	%p1, %f1, 0f4640E400;
@%p1 bra 	BB0_2;

mov.u32 	%r23, 0;
st.global.u32 	[%r1], %r23;

BB0_2:

Perhaps this declaration at the top is causing problems:

.reg .pred 	%p<2>;

Perhaps the higher compute capability is checking for these predicate registers to be present. Just a hypothesis.

My response to allanmac:

I am aware of the dimension limitations per compute capability. Since the higher compute capabilities allow more blocks, I don’t see why a kernel compiled for compute capability 2.0 could not run on a device of higher compute capability using more blocks?

This seems more like a “run-time” limitation than a “compile-time” one?

Then again, the PTX is only “half” of the compiler's work; the just-in-time compiler has to complete the final step.

The PTX code does mention:

.target sm_20

However, the just-in-time compiler should be aware that it is running on a card of higher compute capability, for example at load time?

Perhaps that is part of the problem: perhaps the driver API does not yet know what kind of device it is running on when the kernel is loaded? Or perhaps this was done deliberately…
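For what it's worth, a context (and therefore a device) has to be current before cuModuleLoadData can be called, so the application can at least query the device at load time. A small sketch of my own (not anything the driver does internally):

#include <cuda.h>
#include <stdio.h>

// Query the compute capability of the device behind the current
// context, i.e. the device the PTX is about to be JIT-compiled for.
void printCurrentDeviceCapability(void)
{
    CUdevice dev;
    int major = 0, minor = 0;

    cuCtxGetDevice(&dev);
    cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);

    printf("Current device is compute capability %d.%d\n", major, minor);
}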

Whatever the case may be, I do not agree with limiting a kernel to the compute 2.0 limits just because it was written for compute capability 2.0; that does not make sense and is not future compatible.

Maybe there are more complex reasons why my application might have to limit the number of thread blocks to the compute 2.0 values, but at least for this simple kernel I cannot really see why, except maybe register allocation issues; and then again, this is just PTX code.

The PTX code is supposed to be future compatible, so it has to be compiled for higher compute capabilities anyway if it is to run on newer graphics cards?!

Now this silly setting/limitation is causing a launch failure? I find this very weird, since the failure happens at run time.

So the driver API and/or the just-in-time compiler should be able to tell that it is OK to lift this limitation?

Sometimes higher compute capabilities reduce certain limits; however, in this case the compiler/driver could check whether the higher capability allows more blocks and thus honor the API launch request.
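This is the kind of check I mean. The application could do it as well; here is a sketch of my own, with a hypothetical helper name, handling only the grid x-dimension:

#include <cuda.h>

// Returns 1 if the requested grid width fits on the given device,
// based on what the device itself reports, 0 otherwise.
int gridWidthFitsDevice(CUdevice dev, unsigned int gridWidth)
{
    int maxGridDimX = 0;
    cuDeviceGetAttribute(&maxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, dev);
    return (unsigned int)maxGridDimX >= gridWidth;   // 2^31-1 on Maxwell, 65535 on Fermi
}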

So again my question is: why cause a seemingly unnecessary launch failure?

I think the driver/just-in-time compiler might be following these restrictions a little too strictly?!

Then again, I can see how this might matter for a developer on a higher compute capability device who wants to test his kernels against a lower compute capability: he has to compile for the lower architecture, which basically has nothing to do with these block/grid dimensions. If he wants to test his kernel properly, he has to limit his block/grid dimensions at launch to the lower compute capability's constraints.

So I consider Table 12 to be a “run time” specification and limitation.

If compute capability 5.2 is detected at run time, then my launch parameters should be accepted even if the kernel was compiled for 2.x?!

Now, I have not read the documentation in a while; if it says these are compile-time AND run-time limitations, then it might be a different matter.

Perhaps you can point me/us towards where it says that these limitations apply at RUN TIME as well?

Whatever the case may be, it is easy to see how programmers can interpret these tables differently:
as COMPILE-TIME versus RUN-TIME limitations.

Since thread/grid dimensions have little to do with COMPILE TIME and much more with RUN TIME, it makes no sense to fail the launch. I think I have made my point very clear now.

But I will consult the manual anyway to see whether there is more information about it. Very weird, though.

OK, this issue has been resolved in another thread called “kernel compute capability versus device compute capability”.

It seems better to limit resources to the compute capability the kernel was written for at the time. Fortunately the dimensions are still quite large; however, launches might still fail in the future, which would be unfortunate. Then again, software should probably check the absolute limits/maximums and warn when they are exceeded.

I think a good solution is to compute a minimum specification from the compute capability specifications of the kernel versus the device, to guarantee that launches succeed.
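Something along these lines is what I have in mind. This is a sketch with a hypothetical helper name and only the grid x-dimension handled; the per-architecture numbers are the documented limits (65535 blocks in x for compute 2.x, 2^31-1 for 3.0 and later):

#include <cuda.h>

// Hypothetical helper: effective maximum grid width for a launch,
// taken as the stricter of the kernel's target architecture and the
// device it will actually run on.
unsigned int effectiveMaxGridDimX(int kernelTargetVersion,   // e.g. 20 for sm_20
                                  CUdevice dev)
{
    // Limit for the architecture the kernel was compiled for.
    unsigned int kernelLimit = (kernelTargetVersion < 30) ? 65535u : 2147483647u;

    // Limit reported by the actual device.
    int deviceLimit = 0;
    cuDeviceGetAttribute(&deviceLimit, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, dev);

    return (kernelLimit < (unsigned int)deviceLimit)
               ? kernelLimit : (unsigned int)deviceLimit;
}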

Unfortunately there is a big problem with my solution, and with CUDA in general.

It seems to be impossible to tell for which compute capability a kernel was compiled once the kernel has been loaded via the driver API.

So the question is basically:

Is there a driver API call that can be used to discover the compute capability for which a loaded kernel was compiled?

So far I have seen two fields of some interest:

PTX version
Binary version

However I doubt that these fields correspond directly to a compute capability version.
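For completeness, this is how I read those two fields back after loading the module (a sketch of my own; whether these values really map onto a compute capability version is exactly what I am unsure about):

#include <cuda.h>
#include <stdio.h>

// Read back the "PTX version" and "binary version" attributes of a
// loaded kernel through the driver API.
void printKernelVersions(CUfunction hFunc)
{
    int ptxVersion = 0, binaryVersion = 0;

    cuFuncGetAttribute(&ptxVersion,    CU_FUNC_ATTRIBUTE_PTX_VERSION,    hFunc);
    cuFuncGetAttribute(&binaryVersion, CU_FUNC_ATTRIBUTE_BINARY_VERSION, hFunc);

    printf("PTX version: %d, binary version: %d\n", ptxVersion, binaryVersion);
}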

THIS IS A BIG PROBLEM!

How am I supposed to apply the correct compute capability settings if my software cannot determine the compute capability version of a PTX kernel?!

(I will make a new thread about this to discuss it further, because this is a serious problem.)