string constants truncated - nvcc bug? every 10th truncated, starting with 8th

Carsten_Scholtes · March 17, 2009, 11:23am

In nvcc, there appears to be a problem with the handling of more than about 8 string constants. The following program demonstates this:

[codebox]

#include<stdio.h>

#define CUDA_PRINT_BS 1000

#define N CUDA_PRINT_BS

device int pos;

device char *buf;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread, output stops early? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_txt(char *t) {

cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                          /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

cudaThreadSynchronize(); /* wait until no earlier requests left */

cudaMemcpy(got, out, N * sizeof(char), cudaMemcpyDeviceToHost);

got[N] = '\0';

puts(got);

cudaFree(out);

return 0;

}

[/codebox]

It can be translated with “nvcc -arch sm_13”. For simplicity, it utilizes only a single cuda thread. The first function copies a char to a buffer on the device and advances a corresponding buffer index. The second function uses the first to output strings to the buffer. The global function uses the second function to output multiple string constants.

Running the program in device-emulation mode produces the expected results. Valgrind has no complaints (except some “Conditional jump or move depends on uninitialised value(s)” within “/lib/ld-2.5.so”). Running it on the device produces the following output:

[codebox]

L ABC1

A ABC2

F ABC3

K ABC4

P ABC5

E ABC6

J ABC7

O A

D ABC9

I ABC0

N ABc1

C ABc2

H ABc3

M ABc4

B ABc5

G ABc6

L ABc7

A A

F ABc9

K Abc0

P Abc1

E Abc2

J Abc3

O Abc4

D Abc5

I Abc6

O A

N Abc7

C A

H Abc9

K Abc0

O A

A A

M abc8

B aBc0

G aBc9

L aBc8

A aBc7

F aBc6

K aBc5

P aBc4

E a

J aBc2

O aBc1

D ab

G ab51

L abz2

A abc3e

A abc4trzt

G abr55c1

O abe6r61

G abr7wc7

J a

O abz9wc1

C abe0rwc1

[/codebox]

[*] Starting with the 8th String constant (Which should be “ABC8”) every 10th new constant is truncated. (As I suspected alignment issues earlier, there is a letter preceeding each line of output. It reflects the last 4 Bits of the address of the string shown with ‘A’ meaning 0, …, ‘P’ meaning 15.)

[*] Specifying a truncated constant again (“ABC8” again after “Abc6”) reuses the truncated version.

[*] Looking at the cubin, constants appear to be allocated in chunks of 10.

[*] The problem may be related to incorrect lengths of strings. Here, the allocations for "ABC7 and “ABC6”:

consts {

				name	= __constant809

				segname = const

				segnum  = 0

				offset  = 57

				bytes   = 8

		mem {

				0x37434241 0x00000000

		}

}

consts {

				name	= __constant808

				segname = const

				segnum  = 0

				offset  = 52

				bytes   = 8

		mem {

				0x36434241 0x00000000

		}

}

Allthough offsets are only separated by 5, lengths are given with 8. Depending on the order of actual initialization, truncations might well be the result…

Is this a nvcc problem or am I doing something wrong?

(On a side node, I’d like to know why the atomicAdd could not be replaced with the line before it without changing program behaviour if only one thread is started.)

Experiments were conducted on a 4 way Dual Core AMD Opteron 270 (1GHz) system running openSUSE 11.1 64-Bit, CUDA 2.1:

CUDA hardware:

Card: nvidia GeForce GTX 280

Architecture: GT200 A2

PCI id: 0x5e1

GPU clock: 399.600 MHz

Bustype: PCI-Express

– Shader info –

Clock: 1296.000 MHz

Stream units: 240 (11111111b)

ROP units: 32 (11111111b)

– Memory info –

Amount: 1024 MB

Type: 512 bit DDR3

Clock: 297.000 MHz

– PCI-Express info –

Current Rate: 16X

Maximum rate: 16X

SPWorley · March 17, 2009, 11:48am

The mistake is that in your kernel, you’re passing a HOST pointer to a device function.
The host pointer is the one that’s defined by your string literal… when compiled, “hello” just becomes a const char * pointer to some fixed memory inside the program binary. That’s defined on the host only.

But, you ask, then why did some strings work?
The answer is likely the compiler is smart enough to unroll some loops and realize that “hello”[0] is equal to the constant ‘h’.
So it unrolls some of the computes, making code that does the right thing. Remaining code still accesses via pointer… but that pointer is garbage since it’s defined on the host.

So how do you define a constant character pointer on the device? You don’t. malloc it and pass it in as a kernel argument is the right answer, but that’s obviously quite unsatisfying to the convenience of just typing “hello”.

Sarnath · March 17, 2009, 11:58am

@SPWorley,

I dont know what you said is true or not. But looks like it is a GREAT catch!

How about

__device__ char helloworld[] = "Hello World";

Will this not work?

Carsten_Scholtes · March 17, 2009, 3:16pm

I don’t think this is a host pointer issue. Since the string constants are introduced within a device / global function, the compiler ought to know that device addresses were needed. Below you find an enhanced version of the demonstrator showing complete addresses of the strings to be printed. Additionally, a device “global” string, as kindly proposed by Sarnath, is printed. Also, the address of a string constant of the host is printed for comparison.

[codebox]

#include<stdio.h>

#define CUDA_PRINT_BS 1000

#define N CUDA_PRINT_BS

device int pos;

device char *buf;

device char helloworld = “Hello World”;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_pointer(void *p) {

int i;

int z;

int nz;

cuda_print_char('0');

cuda_print_char('x');

nz = 0;

for(i = 2 * sizeof(void *) - 1; i >= 0; --i) {

	z = (((char *)p - (char *)0)>>(4 * i)) & 15;

	if(z != 0) {nz = 1;}

	if(0 != nz) {

		if(z < 10) {

			cuda_print_char('0' + z);

		} else {

			cuda_print_char('a' - 10 + z);

		}

	}

}

if(0 == nz) {

	cuda_print_char('0');

}

}

device void cuda_print_txt(char *t) {

cuda_print_pointer(t);

//cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                            /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

cuda_print_txt(helloworld);

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

cudaThreadSynchronize(); /* wait until no earlier requests left */

cudaMemcpy(got, out, N * sizeof(char), cudaMemcpyDeviceToHost);

got[N] = '\0';

puts(got);

cudaFree(out);

printf("Hostpointer: %p\n", "dummy string constant");

return 0;

}

[/codebox]

This is the output of the program:

[codebox]

0x1b ABC1

0x20 ABC2

0x25 ABC3

0x2a ABC4

0x2f ABC5

0x34 A

0x39 ABC7

0x3e ABC8

0x43 ABC9

0x48 ABC0

0x4d ABc1

0x52 ABc2

0x57 ABc3

0x5c ABc4

0x61 ABc5

0x66 A

0x6b ABc7

0x70 ABc8

0x75 ABc9

0x7a Abc0

0x7f Abc1

0x84 Abc2

0x89 Abc3

0x8e Abc4

0x93 Abc5

0x98 A

0x3e ABC8

0x9d Abc7

0xa2 Abc8

0xa7 Abc9

0x7a Abc0

0x3e ABC8

0x70 ABc8

0xac abc8

0xb1 aBc0

0xb6 aBc9

0xbb aBc8

0xc0 aBc7

0xc5 aBc6

0xca a

0xcf aBc4

0xd4 aBc3

0xd9 aBc2

0xde aBc1

0xe3 ab

0xe6 ab51

0xeb abz2

0xf0 abc3e

0x0 abc4trzt

0xf6 abr55c1

0xfe abe6r61

0x106 abr7wc7

0x9 a

0x10e abz9wc1

0x12 abe0rwc1

0x1000800 Hello World

Hostpointer: 0x401068

[/codebox]

The three types of addresses are clearly distinguishable. The string constants as well as the device global string can be found within the related cubin file hinting at their intended presence within the cuda kernel on the device.

Comparing this output with the output of the original demonstrator, a few more hints on the nature of the problem can be observed:

[*] The first truncation has shifted from “ABC8” to “ABC6”. This may be due to the additional allocations for the device global string. Still, the distance between the errors remains at 10 string constants.

[*] Furthermore, the addresses of the strings appear to be distributed in a longest string first fashion. Still, the error occurs after the introduction of the 10th new string constant, independent from the the lengths of these constants. (Curiously, 0x0 appears to be a valid and used device address which might break some codes!)

Carsten_Scholtes · March 18, 2009, 6:31am

I have to correct myself:

Looking more closely at the outputs, the trucations occur not that regularly. Distances in these examples vary from about 9 new string constants to
about 12 new string constants between truncations.

Perhaps distances might be prolonged by specifying strings where “offset” difference and “bytes” are congruent, e.g. 7 character strings like “abr55c1”.

Sarnath · March 18, 2009, 6:58am

“pos” is not initialized to 0. Are such device variables initializd to 0 by default?

Also, it would be better if you could “zero” out the “out” array as well … Beacuse it might result in some junk to be printed as well.

Carsten_Scholtes · March 18, 2009, 7:25am

“pos” is initialized to 0 as the first action in the global function. Since I’m using only a single cuda thread I thought that should do.

Yes, junk might follow the intended output. Here is a cleaner version:

[codebox]

#include<stdio.h>

define CUDA_PRINT_BS 1000

define N CUDA_PRINT_BS

device int pos;

device char *buf;

device char helloworld = “Hello World”;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_pointer(void *p) {

int i;

int z;

int nz;

cuda_print_char('0');

cuda_print_char('x');

nz = 0;

for(i = 2 * sizeof(void *) - 1; i >= 0; --i) {

	z = (((char *)p - (char *)0)>>(4 * i)) & 15;

	if(z != 0) {nz = 1;}

	if(0 != nz) {

		if(z < 10) {

			cuda_print_char('0' + z);

		} else {

			cuda_print_char('a' - 10 + z);

		}

	}

}

if(0 == nz) {

	cuda_print_char('0');

}

}

device void cuda_print_txt(char *t) {

cuda_print_pointer(t);

//cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                            /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

cuda_print_txt(helloworld);

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

int *dpos;

int len;

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

cudaThreadSynchronize(); /* wait until no earlier requests left */

cudaGetSymbolAddress((void **)&dpos, "pos");

cudaMemcpy(&len, dpos, sizeof(int), cudaMemcpyDeviceToHost);

if((len < 0) || (len > N)) len = N;

cudaMemcpy(got, out, len * sizeof(char), cudaMemcpyDeviceToHost);

got[len] = '\0';

puts(got);

cudaFree(out);

printf("Hostpointer: %p\n", "dummy string constant");

return 0;

}

[/codebox]

The main problem remains: string constants are truncated.

Jamie_K · March 18, 2009, 4:00pm

Something strange is going on, with the same code I’m getting similar (but not identical) behavior on my 9800 GTX, and the compiler takes a very long time to compile the program.

I tried a couple things and I can’t explain the behavior. Something seems wrong here.

One thing that may be related, I wrote what should be equivalent code, which produces the same strange behavior, and in the process I got a warning: “Advisory: Cannot tell what pointer points to, assuming global memory space” Might be related, I’m not sure.

Edit: a little more investigation, it looks like it’s not related to synchronization or accessing the wrong kind of memory. I added code to print the length of the string and it looks like the strings are indeed not getting initialized correctly as Carsten suggested in the OP. Looks like a bug to me.

Jamie_K · March 18, 2009, 5:26pm

This works (and is more or less what the compiler would be expected to do anyway):

char *chardata = "ABC1