In nvcc, there appears to be a problem with the handling of more than about 8 string constants. The following program demonstates this:
[codebox]
#include<stdio.h>
#define CUDA_PRINT_BS 1000
#define N CUDA_PRINT_BS
device int pos;
device char *buf;
device void cuda_print_char(char c) {
int p;
//p = pos; ++pos; /* why not, with only 1 thread, output stops early? */
p = atomicAdd(&pos, 1);
//__syncthreads(); /* no change */
if((0 <= p) && (p < CUDA_PRINT_BS)) buf[p] = c;
}
device void cuda_print_txt(char *t) {
cuda_print_char('A' + ((t - (char *)0) & 15)); /* no dependence */
cuda_print_char(' '); /* on address (?) */
while('\0' != *t) {
cuda_print_char(*t);
++t;
}
cuda_print_char('\n');
}
global void work(char *out){
pos = 0;
buf = out;
__syncthreads();
cuda_print_txt("ABC1");
cuda_print_txt("ABC2");
cuda_print_txt("ABC3");
cuda_print_txt("ABC4");
cuda_print_txt("ABC5");
cuda_print_txt("ABC6");
cuda_print_txt("ABC7");
cuda_print_txt("ABC8");
cuda_print_txt("ABC9");
cuda_print_txt("ABC0");
cuda_print_txt("ABc1");
cuda_print_txt("ABc2");
cuda_print_txt("ABc3");
cuda_print_txt("ABc4");
cuda_print_txt("ABc5");
cuda_print_txt("ABc6");
cuda_print_txt("ABc7");
cuda_print_txt("ABc8");
cuda_print_txt("ABc9");
cuda_print_txt("Abc0");
cuda_print_txt("Abc1");
cuda_print_txt("Abc2");
cuda_print_txt("Abc3");
cuda_print_txt("Abc4");
cuda_print_txt("Abc5");
cuda_print_txt("Abc6");
cuda_print_txt("ABC8");
cuda_print_txt("Abc7");
cuda_print_txt("Abc8");
cuda_print_txt("Abc9");
cuda_print_txt("Abc0");
cuda_print_txt("ABC8");
cuda_print_txt("ABc8");
cuda_print_txt("abc8");
cuda_print_txt("aBc0");
cuda_print_txt("aBc9");
cuda_print_txt("aBc8");
cuda_print_txt("aBc7");
cuda_print_txt("aBc6");
cuda_print_txt("aBc5");
cuda_print_txt("aBc4");
cuda_print_txt("aBc3");
cuda_print_txt("aBc2");
cuda_print_txt("aBc1");
cuda_print_txt("ab");
cuda_print_txt("ab51");
cuda_print_txt("abz2");
cuda_print_txt("abc3e");
cuda_print_txt("abc4trzt");
cuda_print_txt("abr55c1");
cuda_print_txt("abe6r61");
cuda_print_txt("abr7wc7");
cuda_print_txt("abw843c8");
cuda_print_txt("abz9wc1");
cuda_print_txt("abe0rwc1");
}
int main(int argc, char *argv) {
char *out;
char got[N + 1];
cudaSetDevice(0);
cudaMalloc((void**)&out, N * sizeof(char));
work<<<1, 1>>>(out);
cudaThreadSynchronize(); /* wait until no earlier requests left */
cudaMemcpy(got, out, N * sizeof(char), cudaMemcpyDeviceToHost);
got[N] = '\0';
puts(got);
cudaFree(out);
return 0;
}
[/codebox]
It can be translated with “nvcc -arch sm_13”. For simplicity, it utilizes only a single cuda thread. The first function copies a char to a buffer on the device and advances a corresponding buffer index. The second function uses the first to output strings to the buffer. The global function uses the second function to output multiple string constants.
Running the program in device-emulation mode produces the expected results. Valgrind has no complaints (except some “Conditional jump or move depends on uninitialised value(s)” within “/lib/ld-2.5.so”). Running it on the device produces the following output:
[codebox]
L ABC1
A ABC2
F ABC3
K ABC4
P ABC5
E ABC6
J ABC7
O A
D ABC9
I ABC0
N ABc1
C ABc2
H ABc3
M ABc4
B ABc5
G ABc6
L ABc7
A A
F ABc9
K Abc0
P Abc1
E Abc2
J Abc3
O Abc4
D Abc5
I Abc6
O A
N Abc7
C A
H Abc9
K Abc0
O A
A A
M abc8
B aBc0
G aBc9
L aBc8
A aBc7
F aBc6
K aBc5
P aBc4
E a
J aBc2
O aBc1
D ab
G ab51
L abz2
A abc3e
A abc4trzt
G abr55c1
O abe6r61
G abr7wc7
J a
O abz9wc1
C abe0rwc1
[/codebox]
[*] Starting with the 8th String constant (Which should be “ABC8”) every 10th new constant is truncated. (As I suspected alignment issues earlier, there is a letter preceeding each line of output. It reflects the last 4 Bits of the address of the string shown with ‘A’ meaning 0, …, ‘P’ meaning 15.)
[*] Specifying a truncated constant again (“ABC8” again after “Abc6”) reuses the truncated version.
[*] Looking at the cubin, constants appear to be allocated in chunks of 10.
[*] The problem may be related to incorrect lengths of strings. Here, the allocations for "ABC7 and “ABC6”:
consts {
name = __constant809
segname = const
segnum = 0
offset = 57
bytes = 8
mem {
0x37434241 0x00000000
}
}
consts {
name = __constant808
segname = const
segnum = 0
offset = 52
bytes = 8
mem {
0x36434241 0x00000000
}
}
Allthough offsets are only separated by 5, lengths are given with 8. Depending on the order of actual initialization, truncations might well be the result…
Is this a nvcc problem or am I doing something wrong?
(On a side node, I’d like to know why the atomicAdd could not be replaced with the line before it without changing program behaviour if only one thread is started.)
Experiments were conducted on a 4 way Dual Core AMD Opteron 270 (1GHz) system running openSUSE 11.1 64-Bit, CUDA 2.1:
CUDA hardware:
Card: nvidia GeForce GTX 280
Architecture: GT200 A2
PCI id: 0x5e1
GPU clock: 399.600 MHz
Bustype: PCI-Express
– Shader info –
Clock: 1296.000 MHz
Stream units: 240 (11111111b)
ROP units: 32 (11111111b)
– Memory info –
Amount: 1024 MB
Type: 512 bit DDR3
Clock: 297.000 MHz
– PCI-Express info –
Current Rate: 16X
Maximum rate: 16X