string constants truncated - nvcc bug? every 10th truncated, starting with 8th

In nvcc, there appears to be a problem with the handling of more than about 8 string constants. The following program demonstates this:

[codebox]

#include<stdio.h>

#define CUDA_PRINT_BS 1000

#define N CUDA_PRINT_BS

device int pos;

device char *buf;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread, output stops early? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_txt(char *t) {

cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                          /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

cudaThreadSynchronize(); /* wait until no earlier requests left */

cudaMemcpy(got, out, N * sizeof(char), cudaMemcpyDeviceToHost);

got[N] = '\0';

puts(got);

cudaFree(out);

return 0;

}

[/codebox]

It can be translated with “nvcc -arch sm_13”. For simplicity, it utilizes only a single cuda thread. The first function copies a char to a buffer on the device and advances a corresponding buffer index. The second function uses the first to output strings to the buffer. The global function uses the second function to output multiple string constants.

Running the program in device-emulation mode produces the expected results. Valgrind has no complaints (except some “Conditional jump or move depends on uninitialised value(s)” within “/lib/ld-2.5.so”). Running it on the device produces the following output:

[codebox]

L ABC1

A ABC2

F ABC3

K ABC4

P ABC5

E ABC6

J ABC7

O A

D ABC9

I ABC0

N ABc1

C ABc2

H ABc3

M ABc4

B ABc5

G ABc6

L ABc7

A A

F ABc9

K Abc0

P Abc1

E Abc2

J Abc3

O Abc4

D Abc5

I Abc6

O A

N Abc7

C A

H Abc9

K Abc0

O A

A A

M abc8

B aBc0

G aBc9

L aBc8

A aBc7

F aBc6

K aBc5

P aBc4

E a

J aBc2

O aBc1

D ab

G ab51

L abz2

A abc3e

A abc4trzt

G abr55c1

O abe6r61

G abr7wc7

J a

O abz9wc1

C abe0rwc1

[/codebox]

    [*] Starting with the 8th String constant (Which should be “ABC8”) every 10th new constant is truncated. (As I suspected alignment issues earlier, there is a letter preceeding each line of output. It reflects the last 4 Bits of the address of the string shown with ‘A’ meaning 0, …, ‘P’ meaning 15.)

    [*] Specifying a truncated constant again (“ABC8” again after “Abc6”) reuses the truncated version.

    [*] Looking at the cubin, constants appear to be allocated in chunks of 10.

    [*] The problem may be related to incorrect lengths of strings. Here, the allocations for "ABC7 and “ABC6”:

consts {

				name	= __constant809

				segname = const

				segnum  = 0

				offset  = 57

				bytes   = 8

		mem {

				0x37434241 0x00000000

		}

}

consts {

				name	= __constant808

				segname = const

				segnum  = 0

				offset  = 52

				bytes   = 8

		mem {

				0x36434241 0x00000000

		}

}

Allthough offsets are only separated by 5, lengths are given with 8. Depending on the order of actual initialization, truncations might well be the result…

Is this a nvcc problem or am I doing something wrong?

(On a side node, I’d like to know why the atomicAdd could not be replaced with the line before it without changing program behaviour if only one thread is started.)

Experiments were conducted on a 4 way Dual Core AMD Opteron 270 (1GHz) system running openSUSE 11.1 64-Bit, CUDA 2.1:

CUDA hardware:

Card: nvidia GeForce GTX 280

Architecture: GT200 A2

PCI id: 0x5e1

GPU clock: 399.600 MHz

Bustype: PCI-Express

– Shader info –

Clock: 1296.000 MHz

Stream units: 240 (11111111b)

ROP units: 32 (11111111b)

– Memory info –

Amount: 1024 MB

Type: 512 bit DDR3

Clock: 297.000 MHz

– PCI-Express info –

Current Rate: 16X

Maximum rate: 16X

The mistake is that in your kernel, you’re passing a HOST pointer to a device function.
The host pointer is the one that’s defined by your string literal… when compiled, “hello” just becomes a const char * pointer to some fixed memory inside the program binary. That’s defined on the host only.

But, you ask, then why did some strings work?
The answer is likely the compiler is smart enough to unroll some loops and realize that “hello”[0] is equal to the constant ‘h’.
So it unrolls some of the computes, making code that does the right thing. Remaining code still accesses via pointer… but that pointer is garbage since it’s defined on the host.

So how do you define a constant character pointer on the device? You don’t. malloc it and pass it in as a kernel argument is the right answer, but that’s obviously quite unsatisfying to the convenience of just typing “hello”.

@SPWorley,

I dont know what you said is true or not. But looks like it is a GREAT catch!

How about

__device__ char helloworld[] = "Hello World";

Will this not work?

I don’t think this is a host pointer issue. Since the string constants are introduced within a device / global function, the compiler ought to know that device addresses were needed. Below you find an enhanced version of the demonstrator showing complete addresses of the strings to be printed. Additionally, a device “global” string, as kindly proposed by Sarnath, is printed. Also, the address of a string constant of the host is printed for comparison.

[codebox]

#include<stdio.h>

#define CUDA_PRINT_BS 1000

#define N CUDA_PRINT_BS

device int pos;

device char *buf;

device char helloworld = “Hello World”;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_pointer(void *p) {

int i;

int z;

int nz;

cuda_print_char('0');

cuda_print_char('x');

nz = 0;

for(i = 2 * sizeof(void *) - 1; i >= 0; --i) {

	z = (((char *)p - (char *)0)>>(4 * i)) & 15;

	if(z != 0) {nz = 1;}

	if(0 != nz) {

		if(z < 10) {

			cuda_print_char('0' + z);

		} else {

			cuda_print_char('a' - 10 + z);

		}

	}

}

if(0 == nz) {

	cuda_print_char('0');

}

}

device void cuda_print_txt(char *t) {

cuda_print_pointer(t);

//cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                            /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

cuda_print_txt(helloworld);

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

cudaThreadSynchronize(); /* wait until no earlier requests left */

cudaMemcpy(got, out, N * sizeof(char), cudaMemcpyDeviceToHost);

got[N] = '\0';

puts(got);

cudaFree(out);

printf("Hostpointer: %p\n", "dummy string constant");

return 0;

}

[/codebox]

This is the output of the program:

[codebox]

0x1b ABC1

0x20 ABC2

0x25 ABC3

0x2a ABC4

0x2f ABC5

0x34 A

0x39 ABC7

0x3e ABC8

0x43 ABC9

0x48 ABC0

0x4d ABc1

0x52 ABc2

0x57 ABc3

0x5c ABc4

0x61 ABc5

0x66 A

0x6b ABc7

0x70 ABc8

0x75 ABc9

0x7a Abc0

0x7f Abc1

0x84 Abc2

0x89 Abc3

0x8e Abc4

0x93 Abc5

0x98 A

0x3e ABC8

0x9d Abc7

0xa2 Abc8

0xa7 Abc9

0x7a Abc0

0x3e ABC8

0x70 ABc8

0xac abc8

0xb1 aBc0

0xb6 aBc9

0xbb aBc8

0xc0 aBc7

0xc5 aBc6

0xca a

0xcf aBc4

0xd4 aBc3

0xd9 aBc2

0xde aBc1

0xe3 ab

0xe6 ab51

0xeb abz2

0xf0 abc3e

0x0 abc4trzt

0xf6 abr55c1

0xfe abe6r61

0x106 abr7wc7

0x9 a

0x10e abz9wc1

0x12 abe0rwc1

0x1000800 Hello World

Hostpointer: 0x401068

[/codebox]

The three types of addresses are clearly distinguishable. The string constants as well as the device global string can be found within the related cubin file hinting at their intended presence within the cuda kernel on the device.

Comparing this output with the output of the original demonstrator, a few more hints on the nature of the problem can be observed:

    [*] The first truncation has shifted from “ABC8” to “ABC6”. This may be due to the additional allocations for the device global string. Still, the distance between the errors remains at 10 string constants.

    [*] Furthermore, the addresses of the strings appear to be distributed in a longest string first fashion. Still, the error occurs after the introduction of the 10th new string constant, independent from the the lengths of these constants. (Curiously, 0x0 appears to be a valid and used device address which might break some codes!)

I have to correct myself:

Looking more closely at the outputs, the trucations occur not that regularly. Distances in these examples vary from about 9 new string constants to
about 12 new string constants between truncations.

Perhaps distances might be prolonged by specifying strings where “offset” difference and “bytes” are congruent, e.g. 7 character strings like “abr55c1”.

“pos” is not initialized to 0. Are such device variables initializd to 0 by default?

Also, it would be better if you could “zero” out the “out” array as well … Beacuse it might result in some junk to be printed as well.

“pos” is initialized to 0 as the first action in the global function. Since I’m using only a single cuda thread I thought that should do.

Yes, junk might follow the intended output. Here is a cleaner version:

[codebox]

#include<stdio.h>

define CUDA_PRINT_BS 1000

define N CUDA_PRINT_BS

device int pos;

device char *buf;

device char helloworld = “Hello World”;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_pointer(void *p) {

int i;

int z;

int nz;

cuda_print_char('0');

cuda_print_char('x');

nz = 0;

for(i = 2 * sizeof(void *) - 1; i >= 0; --i) {

	z = (((char *)p - (char *)0)>>(4 * i)) & 15;

	if(z != 0) {nz = 1;}

	if(0 != nz) {

		if(z < 10) {

			cuda_print_char('0' + z);

		} else {

			cuda_print_char('a' - 10 + z);

		}

	}

}

if(0 == nz) {

	cuda_print_char('0');

}

}

device void cuda_print_txt(char *t) {

cuda_print_pointer(t);

//cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                            /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

cuda_print_txt(helloworld);

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

int *dpos;

int len;

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

cudaThreadSynchronize(); /* wait until no earlier requests left */

cudaGetSymbolAddress((void **)&dpos, "pos");

cudaMemcpy(&len, dpos, sizeof(int), cudaMemcpyDeviceToHost);

if((len < 0) || (len > N)) len = N;

cudaMemcpy(got, out, len * sizeof(char), cudaMemcpyDeviceToHost);

got[len] = '\0';

puts(got);

cudaFree(out);

printf("Hostpointer: %p\n", "dummy string constant");

return 0;

}

[/codebox]

The main problem remains: string constants are truncated.

Something strange is going on, with the same code I’m getting similar (but not identical) behavior on my 9800 GTX, and the compiler takes a very long time to compile the program.

I tried a couple things and I can’t explain the behavior. Something seems wrong here.

One thing that may be related, I wrote what should be equivalent code, which produces the same strange behavior, and in the process I got a warning: “Advisory: Cannot tell what pointer points to, assuming global memory space” Might be related, I’m not sure.

Edit: a little more investigation, it looks like it’s not related to synchronization or accessing the wrong kind of memory. I added code to print the length of the string and it looks like the strings are indeed not getting initialized correctly as Carsten suggested in the OP. Looks like a bug to me.

This works (and is more or less what the compiler would be expected to do anyway):

char *chardata = "ABC1

char *chardata = “ABC1\0ABC2\0ABC3\0ABC4\0ABC5\0ABC6\0ABC7”;

cuda_print_txt(chardata);

cuda_print_txt(chardata+5);

cuda_print_txt(chardata+10);

cuda_print_txt(chardata+15);

cuda_print_txt(chardata+20);

cuda_print_txt(chardata+25);

cuda_print_txt(chardata+30);
ABC2

char *chardata = “ABC1\0ABC2\0ABC3\0ABC4\0ABC5\0ABC6\0ABC7”;

cuda_print_txt(chardata);

cuda_print_txt(chardata+5);

cuda_print_txt(chardata+10);

cuda_print_txt(chardata+15);

cuda_print_txt(chardata+20);

cuda_print_txt(chardata+25);

cuda_print_txt(chardata+30);
ABC3

char *chardata = “ABC1\0ABC2\0ABC3\0ABC4\0ABC5\0ABC6\0ABC7”;

cuda_print_txt(chardata);

cuda_print_txt(chardata+5);

cuda_print_txt(chardata+10);

cuda_print_txt(chardata+15);

cuda_print_txt(chardata+20);

cuda_print_txt(chardata+25);

cuda_print_txt(chardata+30);
ABC4

char *chardata = “ABC1\0ABC2\0ABC3\0ABC4\0ABC5\0ABC6\0ABC7”;

cuda_print_txt(chardata);

cuda_print_txt(chardata+5);

cuda_print_txt(chardata+10);

cuda_print_txt(chardata+15);

cuda_print_txt(chardata+20);

cuda_print_txt(chardata+25);

cuda_print_txt(chardata+30);
ABC5

char *chardata = “ABC1\0ABC2\0ABC3\0ABC4\0ABC5\0ABC6\0ABC7”;

cuda_print_txt(chardata);

cuda_print_txt(chardata+5);

cuda_print_txt(chardata+10);

cuda_print_txt(chardata+15);

cuda_print_txt(chardata+20);

cuda_print_txt(chardata+25);

cuda_print_txt(chardata+30);
ABC6

char *chardata = “ABC1\0ABC2\0ABC3\0ABC4\0ABC5\0ABC6\0ABC7”;

cuda_print_txt(chardata);

cuda_print_txt(chardata+5);

cuda_print_txt(chardata+10);

cuda_print_txt(chardata+15);

cuda_print_txt(chardata+20);

cuda_print_txt(chardata+25);

cuda_print_txt(chardata+30);
ABC7";

	cuda_print_txt(chardata);

	cuda_print_txt(chardata+5);

	cuda_print_txt(chardata+10);

	cuda_print_txt(chardata+15);

	cuda_print_txt(chardata+20);

	cuda_print_txt(chardata+25);

	cuda_print_txt(chardata+30);

Appending null characters sometimes makes the problem go away, (i haven’t fully characterized when or why):

cuda_print_txt("ABC1

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC2

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC3

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC4

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC5

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC6

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC7

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC8

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC9

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

	cuda_print_txt("ABC0

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");

cuda_print_txt(“ABC1\0\0”);

cuda_print_txt("ABC2\0\0");

cuda_print_txt("ABC3\0\0");

cuda_print_txt("ABC4\0\0");

cuda_print_txt("ABC5\0\0");

cuda_print_txt("ABC6\0\0");

cuda_print_txt("ABC7\0\0");

cuda_print_txt("ABC8\0\0");

cuda_print_txt("ABC9\0\0");

cuda_print_txt("ABC0\0\0");
");

Also, sometimes a null character appears in the output, despite the while(*t != ‘\0’) condition. I suspect this is because the optimizer “proves” that there will be at least one iteration due to the constant input, but when the constant gets mangled to a zero-length string instead, it writes the null character to the buffer.

Tim,
Any comments? 2 people have reported the same behaviour and feel it is a bug. THanks.

Hello… Any comments from NVIDIA?

Some new insights:

Inspired by the experiments of Jamie K (thank you for helping), I added a new global function “grab” that reads out the memory where the string constants are stored. (The start address is derived from the string showing the smallest address (0x0) in the call of “work”.) The contents are transferred to the host and displayed by “fhexdump”. This is the resulting additional output:

[codebox]

Dump of constant memory:

(nil): 61626334 74727a74 00000000 38343363 abc4 trzt … 843c

0x10: 38006162 65307277 63310041 42433100   8.ab e0rw c1.A BC1.

0x20: 41424332 00414243 33004142 43340041   ABC2 .ABC 3.AB C4.A

0x30: 42433500 00000036 00414243 37004142   BC5. ...6 .ABC 7.AB

0x40: 43380041 42433900 41424330 00414263   C8.A BC9. ABC0 .ABc

0x50: 31004142 63320041 42633300 41426334   1.AB c2.A Bc3. ABc4

0x60: 00414263 35000000 00360041 42633700   .ABc 5... .6.A Bc7.

0x70: 41426338 00414263 39004162 63300041   ABc8 .ABc 9.Ab c0.A

0x80: 62633100 41626332 00416263 33004162   bc1. Abc2 .Abc 3.Ab

0x90: 63340041 62633500 00000036 00416263   c4.A bc5. ...6 .Abc

0xa0: 37004162 63380041 62633900 61626338   7.Ab c8.A bc9. abc8

0xb0: 00614263 30006142 63390061 42633800   .aBc 0.aB c9.a Bc8.

0xc0: 61426337 00614263 36000000 00350061   aBc7 .aBc 6... .5.a

0xd0: 42633400 61426333 00614263 32006142   Bc4. aBc3 .aBc 2.aB

0xe0: 63310061 62006162 35310061 627a3200   c1.a b.ab 51.a bz2.

0xf0: 61626333 65006162 72353563 31006162   abc3 e.ab r55c 1.ab

0x100: 65367236 31006162 72377763 37006162 e6r6 1.ab r7wc 7.ab

0x110: 7a397763 31000000 00000000 00000000 z9wc 1… … …

[/codebox]

Here is the preceding output of the same call for reference:

[codebox]

0x1b ABC1

0x20 ABC2

0x25 ABC3

0x2a ABC4

0x2f ABC5

0x34 A

0x39 ABC7

0x3e ABC8

0x43 ABC9

0x48 ABC0

0x4d ABc1

0x52 ABc2

0x57 ABc3

0x5c ABc4

0x61 ABc5

0x66 A

0x6b ABc7

0x70 ABc8

0x75 ABc9

0x7a Abc0

0x7f Abc1

0x84 Abc2

0x89 Abc3

0x8e Abc4

0x93 Abc5

0x98 A

0x3e ABC8

0x9d Abc7

0xa2 Abc8

0xa7 Abc9

0x7a Abc0

0x3e ABC8

0x70 ABc8

0xac abc8

0xb1 aBc0

0xb6 aBc9

0xbb aBc8

0xc0 aBc7

0xc5 aBc6

0xca a

0xcf aBc4

0xd4 aBc3

0xd9 aBc2

0xde aBc1

0xe3 ab

0xe6 ab51

0xeb abz2

0xf0 abc3e

0x0 abc4trzt

0xf6 abr55c1

0xfe abe6r61

0x106 abr7wc7

0x9 a

0x10e abz9wc1

0x12 abe0rwc1

0x1000800 Hello World

Hostpointer: 0x413f80

[/codebox]

Looking at the dump, one can see that some of the constants are incomplete:

    [*] At 0x8, 0x33, 0x65, 0x97 and 0xc9, there are 4 Bytes 0x00 instead of only one.

    [*] Probably, these bytes overwrote the strings that were expected to reside in these places: The missing characters would fit in there exactly.

    [*] The amount of overwriting is consistent with the excerpt from the cubin in my first post. Apparently, the attrribute “bytes” of the “consts” definitions specifies the amount of memory to be initialized. These amounts appear to be rounded up to 4-Byte words. Still, the attribute “offset” is exact to the char. This way, consts definitions overlap by up the 3 bytes.

    [*] Initialization should probably be done in the order of increasing offset. Apparently, this is not always the case.

A further observation is that some kind of inlining / unrolling, as suggested by SPWorley, might have taken place: Although constant memory contains 0x00 at 0x34, the call of “work” delivers the initial ‘A’ actually expected at that address.

Here is the code generating the output above.

[codebox]

#include<stdio.h>

#include<ctype.h>

#define CUDA_PRINT_BS 1000

#define N CUDA_PRINT_BS

#define CUDACHECK() {\

cudaError_t e = cudaGetLastError();\

if(cudaSuccess != e) {\

	printf("Cuda error in file \"%s\" in line %i:\n%s\n",\

	       __FILE__, __LINE__, cudaGetErrorString(e));\

}\

}

device int pos;

device char *buf;

device char helloworld = “Hello World”;

device void cuda_print_char(char c) {

    int p;

//p = pos; ++pos; /* why not, with only 1 thread? */

    p = atomicAdd(&pos, 1);

//__syncthreads(); /* no change */

    if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;

}

device void cuda_print_pointer(void *p) {

int i;

int z;

size_t d;

cuda_print_char('0');

cuda_print_char('x');

d = ((char *)p - (char *)0);

for(i = 4 * (2 * sizeof(void *) - 1);

    ((d >> i) == 0) && (i > 0);

    i -= 4

);

for( ; i >= 0; i -= 4) {

	z = (d >> i) & 15;

	cuda_print_char('0' + z - (z >= 10) * ('0' - 'a' + 10));

}

}

device void cuda_print_txt(char *t) {

cuda_print_pointer(t);

//cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence  */

cuda_print_char(' ');                            /* on address (?) */

    while('\0' != *t) {

            cuda_print_char(*t);

            ++t;

    }

cuda_print_char('\n');

}

global void work(char *out){

pos = 0;

buf = out;

__syncthreads();

cuda_print_txt("ABC1");

cuda_print_txt("ABC2");

cuda_print_txt("ABC3");

cuda_print_txt("ABC4");

cuda_print_txt("ABC5");

cuda_print_txt("ABC6");

cuda_print_txt("ABC7");

cuda_print_txt("ABC8");

cuda_print_txt("ABC9");

cuda_print_txt("ABC0");

cuda_print_txt("ABc1");

cuda_print_txt("ABc2");

cuda_print_txt("ABc3");

cuda_print_txt("ABc4");

cuda_print_txt("ABc5");

cuda_print_txt("ABc6");

cuda_print_txt("ABc7");

cuda_print_txt("ABc8");

cuda_print_txt("ABc9");

cuda_print_txt("Abc0");

cuda_print_txt("Abc1");

cuda_print_txt("Abc2");

cuda_print_txt("Abc3");

cuda_print_txt("Abc4");

cuda_print_txt("Abc5");

cuda_print_txt("Abc6");

cuda_print_txt("ABC8");

cuda_print_txt("Abc7");

cuda_print_txt("Abc8");

cuda_print_txt("Abc9");

cuda_print_txt("Abc0");

cuda_print_txt("ABC8");

cuda_print_txt("ABc8");

cuda_print_txt("abc8");

cuda_print_txt("aBc0");

cuda_print_txt("aBc9");

cuda_print_txt("aBc8");

cuda_print_txt("aBc7");

cuda_print_txt("aBc6");

cuda_print_txt("aBc5");

cuda_print_txt("aBc4");

cuda_print_txt("aBc3");

cuda_print_txt("aBc2");

cuda_print_txt("aBc1");

cuda_print_txt("ab");

cuda_print_txt("ab51");

cuda_print_txt("abz2");

cuda_print_txt("abc3e");

cuda_print_txt("abc4trzt");

cuda_print_txt("abr55c1");

cuda_print_txt("abe6r61");

cuda_print_txt("abr7wc7");

cuda_print_txt("abw843c8");

cuda_print_txt("abz9wc1");

cuda_print_txt("abe0rwc1");

cuda_print_txt(helloworld);

}

global void grab(char *out, size_t len) {

char *sc = "abc4trzt"; /* the string constant found at 0x0 */

for(int i = 0; i < len; ++i) out[i] = sc[i];

}

void fhexdump(FILE *fp, char *cm, int len, char *p) {

    int i;

    char *hd = "0123456789abcdef";

while(len > 0) {

            fprintf(fp, "%8p: ", p);

            for(i = 0; i < 16; ++i) {

                    if(i < len) {

                            fprintf(fp, "%c%c", hd[(cm[i] >> 4) & 15],

                                                 hd[cm[i] & 15]);

                    } else {

                            fprintf(fp, "  ");

                    }

                    if(3 == i % 4) {

                            fprintf(fp, " ");

                    }

            }

            fprintf(fp, "  ");

            for(i = 0; i < 16; ++i) {

                    if(i < len) {

                            if(isprint(cm[i])) {

                                    fprintf(fp, "%c", cm[i]);

                            } else {

                                    fprintf(fp, "%c", '.');

                            }

                    } else {

                            fprintf(fp, "%c", ' ');

                    }

                    if(3 == i % 4) {

                            fprintf(fp, " ");

                    }

            }

            fprintf(fp, "%c", '\n');

            if(len >= 16) {

                    len -= 16;

            } else {

                    len = 0;

            }

            cm += 16;

            p += 16;

    }

}

int main(int argc, char *argv) {

char *out;

char got[N + 1];

int *dpos;

int len;

cudaSetDevice(0);

cudaMalloc((void**)&out, N * sizeof(char));

work<<<1, 1>>>(out);

CUDACHECK();

cudaThreadSynchronize(); /* wait until no earlier requests left */

CUDACHECK();

cudaGetSymbolAddress((void **)&dpos, "pos");

CUDACHECK();

cudaMemcpy(&len, dpos, sizeof(int), cudaMemcpyDeviceToHost);

CUDACHECK();

if((len < 0) || (len > N)) len = N;

cudaMemcpy(got, out, len * sizeof(char), cudaMemcpyDeviceToHost);

CUDACHECK();

got[len] = '\0';

puts(got);

printf("Hostpointer: %p\n", "dummy string constant");

printf("\nDump of __constant__ memory:\n");

grab<<<1, 1>>>(out, len);

CUDACHECK();

cudaThreadSynchronize(); /* wait until no earlier requests left */

CUDACHECK();

len = 288;

for(int i = 0; i < len; ++i) got[i] = '\0';

cudaMemcpy(got, out, len * sizeof(char), cudaMemcpyDeviceToHost);

CUDACHECK();

fhexdump(stdout, got, len, 0);

cudaFree(out);

CUDACHECK();

return 0;

}

[/codebox]

Starting with CUDA 2.3, this issue appears to be resolved.

While the truncations were still present with CUDA 2.2, the test program from my previous post in this thread works as expected with CUDA 2.3: Strings are not truncated anymore and also the dump does no more show zeros where other characters were expected.

This is GREAT to know! Thanks for coming back to inform this on the forums!

Thanks to NV!