kernel execution and related questions

Hello, I recently started exploring CUDA for a simulation program we are developing for the cerebellum. I wrote a very simple little prototype to get familiar with the programming model and compilation process. The big problem that I have right now is that the kernel doesn’t seem to execute at all.

Here’s the code for the prototype:

[codebox]/*
 * main_mini.cu
 *
 * Created on: Dec 4, 2009
 *     Author: wen
 */

#include <iostream>
#include <cstring> //for memset
#include <cuda.h>
#include <cuda_runtime.h>
#include <time.h>

using namespace std;

//simple addition kernel
__global__ void test(float *a, float *b)
{
	int i=threadIdx.x;
	a[i]=a[i]+b[i];
}

int main(int argc, char **argv)
{
	//device memory pointers
	float *dA;
	float *dB;

	//host memory for initialization
	float hA[1048576];
	float hB[1048576];

	//initialize host memory
	for(float i=0; i<1048576; i++)
	{
		hA[(int)i]=1/(i+1);
		hB[(int)i]=1/((i+1)*(i+1));
	}

	//allocate device memory
	cudaSetDevice(0);
	cudaMalloc((void **)&dA, 1048576*sizeof(float));
	cudaMalloc((void **)&dB, 1048576*sizeof(float));

	//copy over host memory to device memory
	cudaMemcpy(dA, hA, 1048576*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dB, hB, 1048576*sizeof(float), cudaMemcpyHostToDevice);

	//reset host memory to 0
	memset(hA, 0, 1048576*sizeof(float));
	memset(hB, 0, 1048576*sizeof(float));

	//execution loop
	for(int i=0; i<20; i++)
	{
		test<<<1, 1048576>>>(dA, dB);
		cudaThreadSynchronize(); //is this necessary?

		//fetch memory from device to examine
		cudaMemcpy(hA, dA, 1048576*sizeof(float), cudaMemcpyDeviceToHost);
		cout<<hA[0]<<" "<<hA[1]<<endl; //output

		//reset host memory to 0
		memset(hA, 0, 1048576*sizeof(float));
		cout<<hA[0]<<" "<<hA[1]<<endl;
	}
}
[/codebox]

Basically, I set up a simple kernel that adds two arrays together. I allocate the device memory, copy the initial values over from the host, call the kernel, and then copy the device memory back to see what the values are. This is the output I get:

[codebox]1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
…[/codebox]

Since the values are not changed, it seems that the kernel didn’t execute at all. This is the compile command I used to compile the code:

nvcc main_mini.cu -o test -Xlinker -stack:100000000

I’m probably missing something obvious and simple, but I can’t figure out what’s going on. Could someone help?

On a related question: how do I make sure that a kernel has finished execution before the host code moves on? I read somewhere in the docs that launching kernels is async, so control is returned immediately.

I’m running Windows XP, using a GTX 275.

Thanks a lot.

Your execution configuration is invalid:

test<<<1, 1048576>>>(dA, dB);

From section A.1.1 of the programming guide:

The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, respectively;

Your thread block has 1048576 threads, far more than the maximum of 512, so the launch fails and the kernel never runs.
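One way to stay within that limit is to spread the 1048576 elements over many blocks of 512 threads and compute a global index in the kernel. The following is only a minimal sketch, not the original poster's final code: the names `check` and `add` are hypothetical, the block size of 512 is taken from the guide excerpt quoted above, and the host arrays are moved to the heap so no enlarged stack is needed. Checking `cudaGetLastError()` right after the launch is what would have reported the invalid configuration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Same addition as the original kernel, but indexed across many blocks.
__global__ void add(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard a partial last block
        a[i] = a[i] + b[i];
}

int main()
{
    const int n = 1048576;
    const int threadsPerBlock = 512;  // assumed per-block limit from the guide excerpt
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 2048 blocks
    const size_t bytes = n * sizeof(float);

    // Heap allocation instead of two 4 MB arrays on the stack.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    if (!hA || !hB) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; i++) {
        float x = (float)(i + 1);
        hA[i] = 1.0f / x;
        hB[i] = 1.0f / (x * x);
    }

    float *dA, *dB;
    check(cudaMalloc((void **)&dA, bytes), "cudaMalloc dA");
    check(cudaMalloc((void **)&dB, bytes), "cudaMalloc dB");
    check(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice), "copy hA");
    check(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice), "copy hB");

    add<<<blocks, threadsPerBlock>>>(dA, dB, n);
    check(cudaGetLastError(), "kernel launch");  // catches an invalid configuration

    check(cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost), "copy back");
    printf("%g %g\n", hA[0], hA[1]);  // after one launch: 2 0.75

    cudaFree(dA);
    cudaFree(dB);
    free(hA);
    free(hB);
    return 0;
}
```

With the original `<<<1, 1048576>>>` configuration, the same `check(cudaGetLastError(), ...)` call would stop the program with an error message instead of silently printing the unchanged values.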

Oh ok, that makes sense. I need to read the documentation more closely. Thanks a lot.
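The related question from the first post, about making sure a kernel has finished before the host moves on, is not answered in the thread. A minimal sketch of the usual approach follows, assuming a CUDA 4.0 or later toolkit, where `cudaDeviceSynchronize()` replaces the now deprecated `cudaThreadSynchronize()` used in the original code. The kernel name `fill` and the sizes are hypothetical. A kernel launch does return immediately, but a plain `cudaMemcpy` on the default stream waits for the preceding kernel, so the explicit synchronization in the original loop is not needed for correct results; it is mainly useful for timing and for picking up errors that occur while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes the same value into every element.
__global__ void fill(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = value;
}

int main()
{
    const int n = 1024;
    int *dOut;
    if (cudaMalloc((void **)&dOut, n * sizeof(int)) != cudaSuccess)
        return 1;

    // The launch is asynchronous: control returns to the host right away.
    fill<<<(n + 255) / 256, 256>>>(dOut, n, 7);

    // Launch-time problems (such as a bad configuration) show up here.
    cudaError_t launchErr = cudaGetLastError();

    // Blocks the host until the kernel has finished; errors raised
    // during execution are returned from this call.
    cudaError_t syncErr = cudaDeviceSynchronize();

    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
        return 1;
    }

    // This copy would also have waited for the kernel on its own.
    int hOut[n];
    if (cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        return 1;

    printf("%d %d\n", hOut[0], hOut[n - 1]);  // 7 7

    cudaFree(dOut);
    return 0;
}
```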