Program works in emulation mode but not in release mode

Hello, I have recently started programming in CUDA, and to test its performance I wrote a simple program to compare how fast it runs on my regular processor versus my GTX 280. After finishing the program, I ran it in emulation mode and it gave the desired results. I then ran it in release mode and it didn't do any math at all, always giving me "0 meters". I searched the internet and found out there was something I had to change in Visual Studio 2008 for it to use double precision, since my code uses doubles. Under "project properties -> CUDA" I set the GPU Architecture Compile Name to 1.3 (virtual) Arch, and the GPU Architecture Code Name to 1.3 (hardware) code. Now my code runs, but not as expected and not the way it works in emulation mode; it prints numbers that don't make much sense to me. Below is my code.
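For reference, that Visual Studio setting corresponds to passing the compute_13/sm_13 architecture flags to nvcc directly (the file names below are placeholders, not my actual project):

```shell
# Double precision requires compute capability 1.3 or higher
# (the GTX 280 is 1.3). Without these flags, nvcc of this era
# targets 1.0 and demotes doubles to floats, with only a warning.
nvcc -arch compute_13 -code sm_13 -o myprog myprog.cu
```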

As mentioned, I'm using Visual Studio 2008 on Windows XP with my NVIDIA GeForce GTX 280.

What I do is basically the following loop on the CPU:

[codebox]double d = 0.0, t = 0.0;

while (1){
    t += 1.0;
    d += (1.0/t);

    printf("\nWalked %lf meters in %.0lf steps\n", d, t);
}[/codebox]


In CUDA, I created the following program, which is basically the same loop but shows the result every 200 steps (which I will probably change later if I get it to work XD). By the way, I didn't change many of the function names I got from the examples and tutorials, so bear with it XD:

[codebox]// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <signal.h>

#define NUMBLOCKS 10
#define NUMTHREADSPERBLOCKS 20 // definition was missing from my paste; 10 blocks * 20 threads = 200 steps per launch

#define REAL double

// pointers for host memory
REAL *h_a;
REAL *h_b;

// pointers for device memory
REAL *d_a;
REAL *d_b;

// define grid and block size
int numBlocks = NUMBLOCKS;
int numThreadsPerBlock = NUMTHREADSPERBLOCKS;
size_t memSize = numBlocks * numThreadsPerBlock * sizeof(REAL);

// current step number (declaration was missing from my paste)
REAL k;

FILE *fp;

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char *msg);

__global__ void myFirstKernel(REAL* d_a, REAL* d_b)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_b[0] += ((1.0)/d_a[idx]);
}

// copy the running total back from the device and print it
void sair()
{
    cudaMemcpy(h_b, d_b, sizeof(REAL), cudaMemcpyDeviceToHost);
    printf("\nWalked %lf meters in %.0lf steps\n", h_b[0], k);
}

// Program main
int main( int argc, char** argv)
{
    cudaMallocHost((void **)&h_a, memSize);
    cudaMallocHost((void **)&h_b, sizeof(REAL));
    cudaMalloc((void **)&d_a, memSize);
    cudaMalloc((void **)&d_b, sizeof(REAL));

    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);

    k = 1;
    *h_b = 0.0;

loop:
    // fill h_a so each thread gets its own step number
    for (int i = 0; i < numBlocks; i++)
        for (int j = 0; j < numThreadsPerBlock; j++)
            h_a[i * numThreadsPerBlock + j] = k + (i * numThreadsPerBlock + j);

    cudaMemcpy(d_b, h_b, sizeof(REAL), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    myFirstKernel<<<dimGrid,dimBlock>>>(d_a, d_b);

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel execution");

    sair();
    system("PAUSE"); //added to check result, take out to check performance

    k += (numBlocks*numThreadsPerBlock);
    goto loop;

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}[/codebox]

PS: Sorry if the code doesn't come out right, I haven't learned how to properly use this forum's formatting yet :P

In case the problem is in how release mode is compiled, this is the command line Visual Studio seems to use:

[codebox]"C:\CUDA\bin\nvcc.exe" -ccbin "c:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -I"C:\CUDA\include" -I"C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" -O2 -D_CONSOLE -arch compute_13 -code sm_13 --host-compilation C -c -m 32 -o "Release\CUDAWinApp1.obj" -odir "Release" -ext none -int real "c:\Computacao\CUDAWinApp1\CUDAWinApp1\CUDAWinApp1.vcproj"[/codebox]


If anyone can please help me, I would greatly appreciate it. Thanks in advance.

Bumping this up, can anyone please help me? I really have no idea what to try next.

In case anyone is interested, I solved the problem: it was a race condition on the memory access to the d_b[0] variable, since every thread was doing an unsynchronized read-modify-write on the same location. I changed it so every thread calculates its value into its own element of a vector, d_b[idx], and then I sum it all up at the time of printing to screen. If anyone has any idea how to optimize this code for performance, I would be grateful. Also, how do I know how many threads and blocks to use to maximize performance, or do I just play with it until I find a good value?
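For anyone hitting the same issue, a sketch of the fix described above (the names here are mine, not the original code): each thread writes its partial result to its own slot of d_b, so no two threads ever touch the same memory location, and the host sums the vector after copying it back.

```cuda
// Each thread writes 1/step into its own element of d_b;
// there is no shared read-modify-write, hence no race.
__global__ void stepKernel(const double* d_a, double* d_b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_b[idx] = 1.0 / d_a[idx];
}

// Host side: copy the whole vector back and accumulate it there.
double sumPartials(const double* d_b, double* h_partial, int n)
{
    cudaMemcpy(h_partial, d_b, n * sizeof(double),
               cudaMemcpyDeviceToHost);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += h_partial[i];
    return total;
}
```

Summing on the host is the simplest correct version; a parallel reduction on the device would be the usual next optimization once the per-launch vector gets large.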