What's wrong with this? (malign-double?, passing integer array?, bug?)

Hello everyone,

I was having trouble with one of my kernel in my CFD code. I created this simple code that illustrates the problem I have.

[codebox]

#include <stdlib.h>

#include <stdio.h>

#include

#include

#include

#include <cuda.h>

#include <string.h>

global void kernel1(float *g_array_out, int *g_table, int dummy1, int dummy2 )

{

            // ints

            int j,index_l;

            int b_pid, g_pid;

            int tid = threadIdx.x;

//int k=0;

            for(int k =0; k < dummy1; k++)

            {

                    b_pid  =  k*blockDim.x + tid;

                    g_pid =  blockIdx.x*dummy1*blockDim.x + b_pid;

if(g_pid < dummy2)

                    {

                            index_l = g_table[g_pid];

                            //index_l = g_pid;

for(j = 0; j < 4; j++)

                            {

                                    g_array_out[index_l] = 1.;

                                    index_l +=  dummy2;

                            }

}

            }

}

int main()

{

    float *h_array_out= 0;

    float *d_array_out= 0;

int *d_table = 0;

    int *h_table = 0;

int block_size = 16;

    int grid_size = 1;

// allocate look_up table

    cudaMalloc((void**) &d_table, 16*sizeof(int));

    cudaMemset(d_table,0,         16*sizeof(int));

    h_table = (int*)malloc(       16*sizeof(int));

for(int g_pid=0;g_pid<16;g_pid++)

    {

            h_table[g_pid] = g_pid;

    }

cudaMemcpy(d_table, h_table, 16*sizeof(int), cudaMemcpyHostToDevice);

    free(h_table);

// allocate array_out

    cudaMalloc((void**) &d_array_out, sizeof(float)*4*16);

    cudaMemset(d_array_out,0,sizeof(float)*4*16);

if (d_array_out== 0 )

    {

            printf("couldn't allocate device memory\n");

            exit(1);

    }

kernel1<<<grid_size,block_size>>>(d_array_out,d_table,1,16);

h_array_out= (float*)malloc( 416sizeof(float));

    cudaMemcpy(h_array_out,d_array_out,sizeof(float)*4*16,cudaMe

mcpyDeviceToHost);

for(int i=0;i<4*16;i++)

    {

            printf("%f\n",h_array_out[i]);

    }

    free(h_array_out);

cudaFree(d_array_out);

    cudaFree(d_table);

}

[/codebox]

I would expect the output array (array_out) to be an array full of ones. However, when I compile with:

nvcc -arch=sm_13 -I/usr/local/cuda/include …/src/cuda_methods.cu -o cuda_exec

and execute the program, the “array_out” array has zeros in positions [16:47]. What is also really strange is that if I comment the outer loop in the kernel (since dummy1 is equal to 1 anyway), then I obtain what I expect: array_out full of ones. Also, if I replace the line: “index_l = d_table[g_pid]” by “index_l = g_pid”, I also get the good answer, even though d_table[g_pid] = g_pid. It seems like it’s something related to integer arrays not being dereferenced properly. I heard about the “-malign-double” option but that doesn’t seem to work when used with nvcc.

Do you guys have any idea what’s going on?

I would really appreciate your help.

Update: If I compile with the -m32 flag, then I get the results I expect. In the Cuda 3.1 release note, in the “Known Issues” section, I found the following:

When compiling with GCC, special care must be taken for structs that

contain 64-bit integers. This is because GCC aligns long longs

to a 4 byte boundary by default, while NVCC aligns long longs

to an 8 byte boundary by default. Thus, when using GCC to

compile a file that has a struct/union, users must give the

-malign-double

option to GCC. When using NVCC, this option is automatically

passed to GCC.

However, when I compile with the “–verbose” option, without the -m32 option, I don’t see the “malign-double” option being passed to gcc or g++. Here is the output

[codebox]

#$ SPACE=

#$ CUDART=cudart

#$ HERE=/usr/local/cuda/bin

#$ THERE=/usr/local/cuda/bin

#$ TARGET_SIZE=64

#$ TOP=/usr/local/cuda/bin/…

#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/…/lib:/usr/local/cuda/bin/…/extools/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib64:/usr/lib64/mpich2/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:

#$ PATH=/usr/local/cuda/bin/…/open64/bin:/usr/local/cuda/bin:/usr/local/tecplot/bin:/usr/local/cuda/bin:/usr/lib64/mpich2/bin:/home/pcasto/mpich2-install/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/pcasto/bin

#$ INCLUDES=“-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart”

#$ LIBRARIES= “-L/usr/local/cuda/bin/…/lib64” -lcudart

#$ CUDAFE_FLAGS=

#$ OPENCC_FLAGS=

#$ PTXAS_FLAGS=

#$ gcc -D__CUDA_ARCH__=130 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDACC__ -C -include “cuda_runtime.h” -m64 -o “/tmp/tmpxft_00000662_00000000-4_cuda_methods.cpp1.ii” “…/src/cuda_methods.cu”

#$ cudafe --m64 --gnu_version=40404 -tused --no_remove_unneeded_entities --gen_c_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.c” --stub_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.stub.c” --gen_device_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.gpu” --include_file_name “/tmp/tmpxft_00000662_00000000-3_cuda_methods.fatbin.c” “/tmp/tmpxft_00000662_00000000-4_cuda_methods.cpp1.ii”

#$ gcc -D__CUDA_ARCH__=130 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDACC__ -C -D__CUDA_FTZ -m64 -o “/tmp/tmpxft_00000662_00000000-5_cuda_methods.cpp2.i” “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.gpu”

#$ cudafe --m64 --gnu_version=40404 --c --gen_c_file_name “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.c” --stub_file_name “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.stub.c” --gen_device_file_name “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.gpu” --include_file_name “/tmp/tmpxft_00000662_00000000-3_cuda_methods.fatbin.c” “/tmp/tmpxft_00000662_00000000-5_cuda_methods.cpp2.i”

#$ gcc -D__CUDA_ARCH__=130 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDABE__ -D__CUDA_FTZ -m64 -o “/tmp/tmpxft_00000662_00000000-7_cuda_methods.cpp3.i” “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.gpu”

#$ filehash -s " " “/tmp/tmpxft_00000662_00000000-7_cuda_methods.cpp3.i” > “/tmp/tmpxft_00000662_00000000-8_cuda_methods.hash”

#$ nvopencc -TARG:compute_13 -m64 -CG:ftz=1 -CG:prec_div=0 -CG:prec_sqrt=0 “/tmp/tmpxft_00000662_00000000-7_cuda_methods.cpp3.i” -o “/tmp/tmpxft_00000662_00000000-2_cuda_methods.ptx”

#$ ptxas -arch=sm_13 -m64 “/tmp/tmpxft_00000662_00000000-2_cuda_methods.ptx” -o “/tmp/tmpxft_00000662_00000000-9_cuda_methods.sm_13.cubin”

#$ fatbin --key=“ffed6152a4eff03f” --source-name=“…/src/cuda_methods.cu” --usage-mode=" " --embedded-fatbin=“/tmp/tmpxft_00000662_00000000-3_cuda_methods.fatbin.c” “–image=profile=compute_13,file=/tmp/tmpxft_00000662_00000000-2_cuda_methods.ptx” “–image=profile=sm_13,file=/tmp/tmpxft_00000662_00000000-9_cuda_methods.sm_13.cubin”

#$ gcc -E -x c++ “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDACC__ -C -include “cuda_runtime.h” -m64 -o “/tmp/tmpxft_00000662_00000000-10_cuda_methods.cpp4.ii” “…/src/cuda_methods.cu”

#$ cudafe++ --m64 --gnu_version=40404 --parse_templates --gen_c_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.cpp” --stub_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.stub.c” “/tmp/tmpxft_00000662_00000000-10_cuda_methods.cpp4.ii”

#$ gcc -D__CUDA_ARCH__=130 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDA_FTZ -m64 -o “/tmp/tmpxft_00000662_00000000-11_cuda_methods.ii” “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.cpp”

#$ gcc -c -x c++ “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -fpreprocessed -m64 -o “/tmp/tmpxft_00000662_00000000-12_cuda_methods.o” “/tmp/tmpxft_00000662_00000000-11_cuda_methods.ii”

#$ g++ -m64 -o “cuda_methods” -Wl,–start-group “/tmp/tmpxft_00000662_00000000-12_cuda_methods.o” “-L/usr/local/cuda/bin/…/lib64” -lcudart -Wl,–end-group

[/codebox]

The code I posted is quite simple, yet I can’t get it to compile and give me the expected result. Has anyone tried to compile and run the code? I would be curious to see if you’re getting the same result. I would really appreciate any help.

Update: If I compile with the -m32 flag, then I get the results I expect. In the Cuda 3.1 release note, in the “Known Issues” section, I found the following:

When compiling with GCC, special care must be taken for structs that

contain 64-bit integers. This is because GCC aligns long longs

to a 4 byte boundary by default, while NVCC aligns long longs

to an 8 byte boundary by default. Thus, when using GCC to

compile a file that has a struct/union, users must give the

-malign-double

option to GCC. When using NVCC, this option is automatically

passed to GCC.

However, when I compile with the “–verbose” option, without the -m32 option, I don’t see the “malign-double” option being passed to gcc or g++. Here is the output

[codebox]

#$ SPACE=

#$ CUDART=cudart

#$ HERE=/usr/local/cuda/bin

#$ THERE=/usr/local/cuda/bin

#$ TARGET_SIZE=64

#$ TOP=/usr/local/cuda/bin/…

#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/…/lib:/usr/local/cuda/bin/…/extools/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib64:/usr/lib64/mpich2/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:

#$ PATH=/usr/local/cuda/bin/…/open64/bin:/usr/local/cuda/bin:/usr/local/tecplot/bin:/usr/local/cuda/bin:/usr/lib64/mpich2/bin:/home/pcasto/mpich2-install/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/pcasto/bin

#$ INCLUDES=“-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart”

#$ LIBRARIES= “-L/usr/local/cuda/bin/…/lib64” -lcudart

#$ CUDAFE_FLAGS=

#$ OPENCC_FLAGS=

#$ PTXAS_FLAGS=

#$ gcc -D__CUDA_ARCH__=130 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDACC__ -C -include “cuda_runtime.h” -m64 -o “/tmp/tmpxft_00000662_00000000-4_cuda_methods.cpp1.ii” “…/src/cuda_methods.cu”

#$ cudafe --m64 --gnu_version=40404 -tused --no_remove_unneeded_entities --gen_c_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.c” --stub_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.stub.c” --gen_device_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.gpu” --include_file_name “/tmp/tmpxft_00000662_00000000-3_cuda_methods.fatbin.c” “/tmp/tmpxft_00000662_00000000-4_cuda_methods.cpp1.ii”

#$ gcc -D__CUDA_ARCH__=130 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDACC__ -C -D__CUDA_FTZ -m64 -o “/tmp/tmpxft_00000662_00000000-5_cuda_methods.cpp2.i” “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.gpu”

#$ cudafe --m64 --gnu_version=40404 --c --gen_c_file_name “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.c” --stub_file_name “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.stub.c” --gen_device_file_name “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.gpu” --include_file_name “/tmp/tmpxft_00000662_00000000-3_cuda_methods.fatbin.c” “/tmp/tmpxft_00000662_00000000-5_cuda_methods.cpp2.i”

#$ gcc -D__CUDA_ARCH__=130 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDABE__ -D__CUDA_FTZ -m64 -o “/tmp/tmpxft_00000662_00000000-7_cuda_methods.cpp3.i” “/tmp/tmpxft_00000662_00000000-6_cuda_methods.cudafe2.gpu”

#$ filehash -s " " “/tmp/tmpxft_00000662_00000000-7_cuda_methods.cpp3.i” > “/tmp/tmpxft_00000662_00000000-8_cuda_methods.hash”

#$ nvopencc -TARG:compute_13 -m64 -CG:ftz=1 -CG:prec_div=0 -CG:prec_sqrt=0 “/tmp/tmpxft_00000662_00000000-7_cuda_methods.cpp3.i” -o “/tmp/tmpxft_00000662_00000000-2_cuda_methods.ptx”

#$ ptxas -arch=sm_13 -m64 “/tmp/tmpxft_00000662_00000000-2_cuda_methods.ptx” -o “/tmp/tmpxft_00000662_00000000-9_cuda_methods.sm_13.cubin”

#$ fatbin --key=“ffed6152a4eff03f” --source-name=“…/src/cuda_methods.cu” --usage-mode=" " --embedded-fatbin=“/tmp/tmpxft_00000662_00000000-3_cuda_methods.fatbin.c” “–image=profile=compute_13,file=/tmp/tmpxft_00000662_00000000-2_cuda_methods.ptx” “–image=profile=sm_13,file=/tmp/tmpxft_00000662_00000000-9_cuda_methods.sm_13.cubin”

#$ gcc -E -x c++ “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDACC__ -C -include “cuda_runtime.h” -m64 -o “/tmp/tmpxft_00000662_00000000-10_cuda_methods.cpp4.ii” “…/src/cuda_methods.cu”

#$ cudafe++ --m64 --gnu_version=40404 --parse_templates --gen_c_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.cpp” --stub_file_name “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.stub.c” “/tmp/tmpxft_00000662_00000000-10_cuda_methods.cpp4.ii”

#$ gcc -D__CUDA_ARCH__=130 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -D__CUDA_FTZ -m64 -o “/tmp/tmpxft_00000662_00000000-11_cuda_methods.ii” “/tmp/tmpxft_00000662_00000000-1_cuda_methods.cudafe1.cpp”

#$ gcc -c -x c++ “-I/usr/local/cuda/bin/…/include” “-I/usr/local/cuda/bin/…/include/cudart” -I. -fpreprocessed -m64 -o “/tmp/tmpxft_00000662_00000000-12_cuda_methods.o” “/tmp/tmpxft_00000662_00000000-11_cuda_methods.ii”

#$ g++ -m64 -o “cuda_methods” -Wl,–start-group “/tmp/tmpxft_00000662_00000000-12_cuda_methods.o” “-L/usr/local/cuda/bin/…/lib64” -lcudart -Wl,–end-group

[/codebox]

The code I posted is quite simple, yet I can’t get it to compile and give me the expected result. Has anyone tried to compile and run the code? I would be curious to see if you’re getting the same result. I would really appreciate any help.

Hi pcasto,

Thanks for your report. I’ll try to reproduce this issue and report a bug if it’s not already fixed.

–Cliff

Hi pcasto,

Thanks for your report. I’ll try to reproduce this issue and report a bug if it’s not already fixed.

–Cliff