Hello, I recently started exploring CUDA for a simulation program we are developing for the cerebellum. I wrote a very simple little prototype to get familiar with the programming model and compilation process. The big problem that I have right now is that the kernel doesn’t seem to execute at all.
Here’s the code for the prototype:
[codebox]/*
 * main_mini.cu
 *
 *  Created on: Dec 4, 2009
 *      Author: wen
 */
#include <cuda.h>
#include <cuda_runtime.h>
#include <time.h>
#include <string.h>   //memset
#include <iostream>   //cout, endl
using namespace std;
//simple addition kernel
__global__ void test(float *a, float *b)
{
    int i=threadIdx.x;
    a[i]=a[i]+b[i];
}
int main(int argc, char **argv)
{
    //device memory pointers
    float *dA;
    float *dB;
    //host memory for initialization
    float hA[1048576];
    float hB[1048576];
    //initialize host memory
    for(float i=0; i<1048576; i++)
    {
        hA[(int)i]=1/(i+1);
        hB[(int)i]=1/((i+1)*(i+1));
    }
    //allocate device memory
    cudaSetDevice(0);
    cudaMalloc((void **)&dA, 1048576*sizeof(float));
    cudaMalloc((void **)&dB, 1048576*sizeof(float));
    //copy over host memory to device memory
    cudaMemcpy(dA, hA, 1048576*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, 1048576*sizeof(float), cudaMemcpyHostToDevice);
    //reset host memory to 0
    memset(hA, 0, 1048576*sizeof(float));
    memset(hB, 0, 1048576*sizeof(float));
    //execution loop
    for(int i=0; i<20; i++)
    {
        test<<<1, 1048576>>>(dA, dB);
        cudaThreadSynchronize(); //is this necessary?
        //fetch memory from device to examine
        cudaMemcpy(hA, dA, 1048576*sizeof(float), cudaMemcpyDeviceToHost);
        cout<<hA[0]<<" "<<hA[1]<<endl; //output
        //reset host memory to 0
        memset(hA, 0, 1048576*sizeof(float));
        cout<<hA[0]<<" "<<hA[1]<<endl;
    }
}
[/codebox]
Basically, I set up a simple kernel that adds two arrays together. I allocate the device memory, copy the initial values over from the host, call the kernel, and then copy the device memory back to the host to see what the values are. This is the output I get:
[codebox]1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
1 0.5
0 0
…[/codebox]
Since the values never change from one iteration to the next, it seems that the kernel didn’t execute at all. This is the command I used to compile the code:
nvcc main_mini.cu -o test -Xlinker -stack:100000000
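One thing worth checking here is whether the launch itself is being rejected. A kernel launch does not return an error code directly, but `cudaGetLastError()` reports launch failures such as an invalid configuration, and each block is limited to a few hundred threads (512 on compute capability 1.x cards such as the GTX 275), far below the 1048576 requested above. The following is only a minimal sketch of that check, assuming the per-block limit is what is involved; the helpers `blocksFor` and `checkCuda`, the extra `n` parameter, and the block size of 256 are illustrative choices and not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same addition kernel, but indexed across many blocks instead of one.
// The bounds check guards the last block when n is not a multiple of blockDim.x.
__global__ void test(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = a[i] + b[i];
}

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: print a message if a CUDA call failed; true on success.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1048576;
    const int threadsPerBlock = 256; // illustrative; must not exceed the device limit

    float *dA = NULL;
    float *dB = NULL;
    if (!checkCuda(cudaMalloc((void **)&dA, n * sizeof(float)), "cudaMalloc dA")) return 1;
    if (!checkCuda(cudaMalloc((void **)&dB, n * sizeof(float)), "cudaMalloc dB")) return 1;
    if (!checkCuda(cudaMemset(dA, 0, n * sizeof(float)), "cudaMemset dA")) return 1;
    if (!checkCuda(cudaMemset(dB, 0, n * sizeof(float)), "cudaMemset dB")) return 1;

    test<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dA, dB, n);

    // An invalid launch configuration is reported here, not by the launch statement.
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;

    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

Placing the same `cudaGetLastError()` check after the original `test<<<1, 1048576>>>(dA, dB)` launch should show whether that launch is failing before any thread runs.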
I’m probably missing something obvious and simple, but I can’t figure out what’s going on. Could someone help?
A related question: how do I make sure that a kernel has finished executing before the host code moves on? I read somewhere in the docs that kernel launches are asynchronous, so control returns to the host immediately.
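As a rough illustration of the synchronization question, the sketch below launches a kernel, blocks the host until it has finished, and only then copies the result back. It assumes a toolkit recent enough to provide `cudaDeviceSynchronize()` (CUDA 4.0 or later); on older toolkits `cudaThreadSynchronize()`, as used in the program above, plays the same role. A blocking `cudaMemcpy` issued after a kernel on the default stream also waits for that kernel, so the explicit call is mainly useful for timing and for surfacing execution errors. The kernel `addOne` and the helper `runAndVerify` are hypothetical names used only for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: add 1 to every element.
__global__ void addOne(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 1.0f;
}

// Hypothetical helper: run the kernel on n zeros and check every result is 1.
bool runAndVerify(int n)
{
    std::vector<float> h(n, 0.0f);
    float *d = NULL;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256; // illustrative choice
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(d, n); // returns to the host immediately

    cudaError_t launchErr = cudaGetLastError();      // launch configuration errors
    cudaError_t syncErr   = cudaDeviceSynchronize(); // blocks until the kernel is done
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
        cudaFree(d);
        return false;
    }

    // Safe to read the results now: the kernel has completed.
    cudaMemcpy(&h[0], d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < n; i++)
        if (h[i] != 1.0f) return false;
    return true;
}

int main()
{
    printf("%s\n", runAndVerify(1024) ? "kernel completed" : "kernel did not complete");
    return 0;
}
```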
I’m running Windows XP, using a GTX 275.
Thanks a lot.