PTX cvta and isspacep: unusual behavior with cvta.shared and isspacep.shared

I’m seeing some unusual behavior from the PTX instructions cvta and isspacep on a GeForce GTX 470.

cvta is a PTX instruction for converting pointers in the .global, .shared, and .local state spaces to generic addresses. Each of these segmented address spaces is mapped to its own non-overlapping region of a single unified address space.

cvta.to.<space> converts a generic pointer back to the indicated state space, with undefined results if the pointer refers to a different state space. isspacep.<space> sets a predicate register if a generic pointer belongs to the given state space, so PTX programs can query generic pointers before converting them.
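A minimal host-side C++ sketch of this interpretation may help pin it down. It only models the semantics described above and is not how the hardware implements them: the window bases and sizes are invented, and the names `Window`, `window_of`, `cvta`, `cvta_to`, and `isspacep` are hypothetical helpers, not CUDA APIs.

```cpp
#include <cassert>
#include <cstdint>

// State spaces discussed in this thread.
enum class Space { Global, Shared, Local };

// One contiguous region of the generic address space.
struct Window {
    std::uint64_t base;
    std::uint64_t size;
};

// HYPOTHETICAL layout: the real bases and sizes are chosen by the
// driver/hardware and are not given anywhere in this thread.
inline Window window_of(Space s) {
    switch (s) {
        case Space::Shared: return Window{0x01000000ull, 0x01000000ull};
        case Space::Local:  return Window{0x02000000ull, 0x01000000ull};
        default:            return Window{0x10000000ull, 0xF0000000ull};
    }
}

// Model of cvta.<space>: state-space address -> generic address.
inline std::uint64_t cvta(Space s, std::uint64_t addr) {
    return window_of(s).base + addr;
}

// Model of isspacep.<space>: does the generic address fall in that window?
inline bool isspacep(Space s, std::uint64_t generic) {
    const Window w = window_of(s);
    return generic >= w.base && generic - w.base < w.size;
}

// Model of cvta.to.<space>. PTX leaves the mismatched case undefined;
// this sketch reports it by returning false instead.
inline bool cvta_to(Space s, std::uint64_t generic, std::uint64_t& out) {
    if (!isspacep(s, generic)) return false;
    out = generic - window_of(s).base;
    return true;
}
```

Under this model, `isspacep(Space::Shared, cvta(Space::Shared, 0))` is true and the same pointer fails the `.local` and `.global` queries, which is the behavior the test program expects from the hardware.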

A simple test program illustrates my interpretation of these instructions (attached to this post, and also available at http://code.google.com/p/gpuocelot/source/…a/test/driver):

generic.cpp [host program]

generic.ptx [PTX kernel]

This uses the CUDA driver API to load a PTX kernel from a file. The host program prints the resulting predicate value for each combination of:

{global ptr, local ptr, shared ptr} x {isspacep.global, isspacep.local, isspacep.shared}

When I run this program on a GeForce GTX 470, I see several unusual results. Specifically, generic pointers to .shared test positive with isspacep.local, and generic pointers to .local test positive with isspacep.shared. More disturbingly, .local pointers do not test positive with isspacep.local, and .shared pointers do not test positive with isspacep.shared.
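The gap between expectation and observation can be tabulated with a short C++ sketch. It assumes the non-overlapping interpretation (a query is true only for the space the pointer came from) and records only the five outcomes stated in this post; every other cell is marked as unknown rather than guessed. The names `expected`, `reported`, and `count_mismatches` are hypothetical.

```cpp
#include <cassert>

// Index order for both the pointer kind and the isspacep query.
enum Space { Global = 0, Shared = 1, Local = 2 };

// Expected under the non-overlapping interpretation: true only when the
// query matches the space the pointer was converted from.
inline bool expected(Space pointer, Space query) {
    return pointer == query;
}

// Results stated in the post: 1 = true, 0 = false, -1 = not stated.
inline int reported(Space pointer, Space query) {
    if (pointer == Global && query == Global) return 1;
    if (pointer == Shared && query == Shared) return 0;
    if (pointer == Shared && query == Local)  return 1;
    if (pointer == Local  && query == Local)  return 0;
    if (pointer == Local  && query == Shared) return 1;
    return -1;  // combination not described in the post
}

// Counts the stated results that contradict the expectation.
inline int count_mismatches() {
    int mismatches = 0;
    for (int p = 0; p < 3; ++p) {
        for (int q = 0; q < 3; ++q) {
            const Space ps = static_cast<Space>(p);
            const Space qs = static_cast<Space>(q);
            const int r = reported(ps, qs);
            if (r >= 0 && (r == 1) != expected(ps, qs)) ++mismatches;
        }
    }
    return mismatches;
}
```

Of the five stated results, only the .global pointer with isspacep.global agrees with the expectation; the .shared and .local rows look as if the two spaces had been swapped.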

You should be able to compile this program using:

$ g++ generic.cpp -o GenericMemoryNative -lcuda

Output on my machine (Ubuntu 10.04 x64, CUDA 3.1, GeForce GTX470) with GenericMemoryNative:

$ ./GenericMemoryNative

%p1 - 0

%p2 - 0

%p4 - 1004

%p8 - 1008

Test: FAIL

$

The actual lines of PTX with the strange behavior are:

[codebox]mov.u64 %rd2, %rd1; // .global address
mov.u64 %rd3, $rs; // .shared address
mov.u64 %rd4, $rl; // .local address

cvta.global.u64 %rd2, %rd2;
cvta.shared.u64 %rd3, %rd3;
cvta.local.u64 %rd4, %rd4;

isspacep.global %p0, %rd2; // expect TRUE - program yields TRUE
isspacep.shared %p1, %rd3; // expect TRUE - program yields FALSE - ERROR??
isspacep.local %p2, %rd4; // expect TRUE - program yields FALSE - ERROR??
[/codebox]

Some of the other isspacep tests return TRUE for queries against mismatched address spaces, but those could be correct if address spaces overlap (for example, .local mapped into .shared or .global). The cases above, though, return FALSE, and this might result in incorrect program behavior. By my interpretation, isspacep.shared %p1, %rd3; and isspacep.local %p2, %rd4; should both return TRUE.

Does anyone see any problems with my interpretation of these instructions? Do you consider the test program to be valid?

Thanks for your help.
generic.zip (1.93 KB)

I think your code makes sense. There seems to be a problem in how isspacep is implemented. I reproduced the problem on my 470 GPU. I took your code, changed it to run on my 32-bit OS, simplified the example to just the ISSPACEP.SHARED test, and added loads and stores to shared memory, just to be sure I could see the actual registers holding the non-generic address being used. I then disassembled the elf.o that PTXAS produced for the PTX file to understand what is going on.

For a shared memory address, CVTA.SHARED.U32 is implemented as a computed value: the non-generic address plus a constant in c0[0]. (In the example the non-generic address is zero, which PTXAS knows, so it is not actually used explicitly.) CVTA.TO.SHARED.U32 is computed the other way around: generic address - c0[0]. ISSPACEP.SHARED is computed by masking the top byte of the generic address (the 0xff000000 mask in the disassembly below) and comparing the result against the constant in c0[4]. Since the result is wrong, either the mask-and-compare is not the right implementation or, more likely, the constants are not set up correctly for the block, probably because the GPU driver needs to be fixed. This seems like a bug.
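The implementation described above can be modeled on the host with a few lines of C++. This is only a sketch of the disassembled arithmetic: the actual values in c0[0] and c0[4] are not known from this thread, so they are passed in, and the names `Bank0`, `cvta_shared`, `cvta_to_shared`, and `isspacep_shared` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The two constant-bank-0 words the disassembly reads. Their real values
// are NOT known from this thread; callers supply hypothetical ones.
struct Bank0 {
    std::uint32_t shared_base;  // c0[0x0]: added by cvta.shared
    std::uint32_t shared_tag;   // c0[0x4]: compared by isspacep.shared
};

// Immediate loaded by "mov b32 $r0 0xff000000" in the disassembly.
const std::uint32_t kTopByteMask = 0xff000000u;

// cvta.shared.u32: non-generic shared address -> generic address.
inline std::uint32_t cvta_shared(const Bank0& c, std::uint32_t addr) {
    return addr + c.shared_base;
}

// cvta.to.shared.u32: generic address -> non-generic shared address.
inline std::uint32_t cvta_to_shared(const Bank0& c, std::uint32_t generic) {
    return generic - c.shared_base;
}

// isspacep.shared: mask the top byte and compare it against c0[0x4].
inline bool isspacep_shared(const Bank0& c, std::uint32_t generic) {
    return (generic & kTopByteMask) == c.shared_tag;
}
```

With constants that agree, for example a hypothetical `Bank0{0x01000000u, 0x01000000u}`, a converted shared pointer passes the query. If `shared_tag` does not match the top byte of `shared_base`, every converted shared pointer fails it, which is the kind of failure one would see if the constants were set up incorrectly.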

[codebox]$ cat generic.ptx
    .version 2.1
    .target sm_20

.entry genericmemory (
    .param .u32 __cudaparm_genericmemory_A)
{
    .reg .u32 %r<9>;
    .reg .u32 %rd<5>;
    .reg .pred %p<9>;
    .shared .u32 $rs;
    .local .u32 $rl;

$BB_01:
    ld.param.u32 %rd1, [__cudaparm_genericmemory_A];
    mov.u32 %rd3, $rs;              // move non-generic address of $rs to %rd3.
    st.shared.u32 [$rs], 1;         // Check actual access to the shared memory
    ld.shared.u32 %r8, [$rs];       // "
    st.shared.u32 [$rs], %r8;       // "
    cvta.shared.u32 %rd3, %rd3;     // Convert to generic address from shared.
    isspacep.shared %p1, %rd3;      // generic address is shared -- expect TRUE
    selp.u32 %r1, 1001, 0, %p1;
    st.global.u32 [%rd1+4], %r1;

$BB_02:
    exit;
}

$ nvdis elf.o
ELF File…
00000000: 2800440400005de4 mov b32 $r1 c1[0x100]
00000008: 1bfc000000001de2 mov b32 $r0 0xff000000
00000010: 1800000004011de2 mov b32 $r4 0x1 // load constant "1" for use later in the two "st" insts.
00000018: 280040008000dde4 mov b32 $r3 c0[0x20] // ld.param.u32 %rd1, [__cudaparm_genericmemory_A];
00000020: 6800400000009c03 and b32 $r2 $r0 c0[0] // ~cvta + ~isspacep
00000028: c900000003f11c85 st b32 s[$r63+0] $r4 // st.shared.u32 [$rs], 1;
00000030: c100000003f01c85 ld b32 $r0 s[$r63+0] // ld.shared.u32 %r8, [$rs];
00000038: 190e40001021dc03 set $p0 eq u32 $r2 c0[0x4] // isspacep.shared %p1, %rd3;
00000040: 4800c00040105d03 sub b32 $r1 $r1 0x10 // register $r1 not used!
00000048: 4800c0001030dc03 add b32 $r3 $r3 0x4 // compute %rd1+4
00000050: 2010c00fa7f09c04 selp b32 $r2 $r63 0x3e9 not $p0 // selp.u32 %r1, 1001, 0, %p1;
00000058: c900000003f01c85 st b32 s[$r63+0] $r0 // st.shared.u32 [$rs], %r8;
00000060: 9000000000309c85 st b32 wb g[$r3+0] $r2 // st.global.u32 [%rd1+4], %r1;
00000068: 8000000000001de7 exit
[/codebox]
