Problems when I use -O3


I have some problems when I use -O3 in my CUDA program. I have reduced the problem to this code:

This is the main file:


#include <stdio.h>

#include <stdlib.h>

#include “”

int main(){

dim3 grid (10);

dim3 block (128);

double * array_d, * array_h;

int i;

array_h = (double *) malloc (10*sizeof(double));

cudaMalloc((void **)&array_d, 10*sizeof(double));

for (i=0; i<100; i++){

  prueba_kernel<<<grid, block>>>(array_d, i);

  cudaMemcpy(array_h, array_d, 10*sizeof(double), cudaMemcpyDeviceToHost);

  printf("IT: %d, %lf\n", i, array_h[0]);

}


cudaFree((void *)array_d);

free((void *) array_h);

return 0;

}



This is the kernel file:


__global__ void prueba_kernel(double * array_d, int i){

unsigned int gid = (blockIdx.x * blockDim.x + threadIdx.x);

if (gid < 10){

  array_d[gid] = i;

}

}




And I use this to compile:


nvcc -O3 -arch sm_13 -o prueba


The result is (when I execute the kernel several times):


IT: 0, 99.000000

IT: 1, 1.000000

IT: 2, 2.000000

IT: 3, 3.000000

IT: 98, 98.000000

IT: 99, 99.000000


I am using gcc (Debian 4.3.4-6) 4.3.4 and CUDA Toolkit 3.0 beta.

Any help?

Thank you so much!

I can confirm what you are seeing but I can’t explain why. The optimization flags are making no difference to the PTX the compiler is generating, which is really weird…

Thank you!

Yes, it’s really strange… Maybe it’s due to gcc…
I tried unrolling the first iteration by hand, and then the problem goes away, but this workaround doesn’t work if the loop is more complicated :(
Can nobody explain to us what’s happening?

It works fine for me on both CUDA 2.3 and 3.0 on RHEL4.

What gcc version do you use?

Must be an Ubuntu-specific gcc thing…

On RHEL4, gcc 3.4.6.
I also tried RHEL5 with gcc 4.1.2; it works fine there too.

You could try this:

  1. remove the include of the kernel
  2. nvcc -O3 -arch sm_13 -c
  3. gcc -O3 prueba.c prueba_kernel.o -L/usr/local/cuda/lib64 -lcudart

Thank you! But gcc can’t find some things:

prueba.c:6: error: ‘dim3’ undeclared (first use in this function)

prueba.c:18: error: ‘cudaMemcpyDeviceToHost’ undeclared (first use in this function)

You will need to include cuda_runtime.h.
You cannot use the <<< >>> syntax in a .c file, so you will need to move all the kernel launches to the .cu file or use the driver API.
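
For illustration, here is a rough sketch of what moving the launch into the .cu file could look like. The wrapper name launch_prueba and its signature are made up for this example (they are not from the original program); the kernel is the one posted above. The wrapper is declared extern "C" so that code compiled by gcc as plain C can link against it.

```cuda
// Kernel file, compiled by nvcc. The wrapper below is a hypothetical helper.
#include <cuda_runtime.h>

__global__ void prueba_kernel(double *array_d, int i)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < 10) {
        array_d[gid] = i;
    }
}

// Hypothetical wrapper: the <<< >>> launch stays in the .cu file, and the
// C code only sees an ordinary function with C linkage.
extern "C" cudaError_t launch_prueba(double *array_d, int i)
{
    dim3 grid(10);
    dim3 block(128);

    prueba_kernel<<<grid, block>>>(array_d, i);

    // Report launch-time failures (e.g. bad arguments) to the caller.
    return cudaGetLastError();
}
```

On the C side, prueba.c would then include cuda_runtime.h, declare `cudaError_t launch_prueba(double *array_d, int i);`, and call it inside the loop in place of the <<< >>> launch.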

Ok, now it works, but… I can’t make these kinds of changes in my real program!!
If it works now, is the problem in nvcc? Hmmm, the program is quite different now. Maybe the compiler does a different optimization… I’m using a function with the kernel launch in the .cu file.

What do you think?

I confirm the problem on Debian x64 with gcc 4.3.4 and CUDA Toolkit 3.0 beta.
When the kernel is launched for the first time, cudaSetupArgument ends up being called with some ridiculous value in the ‘offset’ argument (0x7fffffffe1d8), and the launch fails with the error “invalid argument” (please check the return values!).
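
As a minimal sketch of what checking the return values could look like for one iteration of the loop: the helper names check and run_iteration are invented for this example, while cudaGetLastError, cudaGetErrorString and cudaMemcpy are regular CUDA Runtime calls. On the affected setup, the first launch would presumably be reported as failed instead of silently printing a stale value.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void prueba_kernel(double *array_d, int i)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < 10) {
        array_d[gid] = i;
    }
}

// Hypothetical helper: print a message and return 0 if a runtime call failed.
static int check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 0;
    }
    return 1;
}

// Hypothetical helper: one loop iteration with every return value checked.
// Returns 1 on success, 0 on the first failure.
int run_iteration(double *array_d, double *array_h, int i)
{
    prueba_kernel<<<dim3(10), dim3(128)>>>(array_d, i);

    // A kernel launch returns nothing itself; launch errors are picked up here.
    if (!check(cudaGetLastError(), "kernel launch"))
        return 0;

    // cudaMemcpy waits for the kernel and also reports errors from it.
    if (!check(cudaMemcpy(array_h, array_d, 10 * sizeof(double),
                          cudaMemcpyDeviceToHost), "cudaMemcpy"))
        return 0;

    printf("IT: %d, %lf\n", i, array_h[0]);
    return 1;
}
```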

Looks like a problem between some recent gcc optimizations and the CUDA Runtime. Maybe some ABI mismatch, or just a gcc bug?..


Does “volatile dim3 …” help?

If I install another gcc version, how can I use it with nvcc? Is there an environment variable?

Thank you.