GPU execution question

Hello, I am trying to get acquainted with CUDA before changing any of my existing code to use GPU acceleration. I started by buying a mediocre card (GTX 550 Ti) and writing a simple program that I thought would load the GPU, so that I could watch its temperature over time and compare its performance to a CPU-only program. The program defines a 3D image and has the GPU do some meaningless calculations on each pixel of the image. I print the value before and after sending the image to the GPU. I also added a loop on both the CPU and the GPU side to ensure a high degree of loading. What I’ve noticed is that over time the device code seems to stop executing: the value of a pixel is calculated correctly for the first 200 or so iterations of the CPU loop, but after that it no longer changes from its initialization value.

Also, I have noticed that my computer freezes during device execution and unfreezes during CPU execution. For example, if I type while device code is running, the letters do not appear until the current kernel call terminates.

Just wondering if anyone can fill me in on what’s going on. Code below, thanks.

#include<iostream>

using namespace std;

__global__ void calcs(float *image,int N_2D){

	int bid=blockIdx.x;

	int tid=threadIdx.x;

	int iter_size=N_2D/blockDim.x;

	//Loop just to load the processor

	for(unsigned int j=0;j<1000;++j){

		//Loop over this thread's pixels/elements

		for(unsigned int i_dev=0;i_dev<iter_size;++i_dev){

			//Just calculate some stuff and set the current pixel/element equal to it

			image[bid*N_2D+tid*iter_size+i_dev]=((i_dev)*21.45)/2.3451;

		}

	}

}

int main(){

	//X and Y dimensions

	const int n_dim=512;

	//Z dimension

	const int z_dim=190;

	//Pixels in one z element

	int N_2D=n_dim*n_dim;

	//3D image and device container

	float * image=new float[n_dim*n_dim*z_dim];

	float * image1;

	//Get device properties so we know max number of threads

	cudaDeviceProp prop;

	cudaGetDeviceProperties(&prop,0);

	//Loop just to make sure the program runs for a while

	for(unsigned int i=0;i<1000;++i){

		cout<<"iteration :"<<i<<endl;

		//Initialize image array to i

		for(unsigned int z=0;z<z_dim;++z){

			for(unsigned int y=0;y<n_dim;++y){

				for(unsigned int x=0;x<n_dim;++x){

					image[x+y*n_dim+z*n_dim*n_dim]=static_cast<float>(i);

				}

			}

		}

		//Allocate memory on the device for image

		cudaMalloc((void**)&image1,sizeof(float)*n_dim*n_dim*z_dim);

		//Copy image to the device

		cudaMemcpy(image1,image,sizeof(float)*n_dim*n_dim*z_dim,cudaMemcpyHostToDevice);

		//Execute device function with number of blocks = z_dim and maximum number of threads

		calcs<<<z_dim,prop.maxThreadsPerBlock>>>(image1,N_2D);

		//Copy result back to host

		cudaMemcpy(image,image1,sizeof(float)*n_dim*n_dim*z_dim,cudaMemcpyDeviceToHost);

		//Free device memory

		cudaFree(image1);

		//Output one result to check consistency

		cout<<"result: "<<image[3]<<endl;

	}

	//Release the host image and report success

	delete[] image;

	return 0;

}
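
One thing the code above does not do is check the return value of any CUDA call, so a failed allocation, copy, or kernel launch would go unnoticed and the host buffer would simply keep its initialization value. Below is a minimal sketch of how the same test could be instrumented. The macro name CUDA_CHECK, the kernel name calcs_checked, the variable d_image, and the choice of 10 iterations are my own assumptions for illustration and are not part of CUDA or of the code above. It also allocates the device buffer once, outside the loop, and uses float literals in the kernel, both of which differ from the original.

```cuda
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

// Hypothetical helper macro (my own name, not part of CUDA):
// print the CUDA error string and exit if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",         \
                         __FILE__, __LINE__,                          \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Same per-pixel work as calcs above; each block handles one z slice
// and each thread handles a contiguous run of iter_size pixels.
__global__ void calcs_checked(float *image, int N_2D) {
    int iter_size = N_2D / blockDim.x;
    int base = blockIdx.x * N_2D + threadIdx.x * iter_size;
    for (int j = 0; j < 1000; ++j) {
        for (int i = 0; i < iter_size; ++i) {
            image[base + i] = (i * 21.45f) / 2.3451f;
        }
    }
}

int main() {
    const int n_dim = 512, z_dim = 190;
    const int N_2D = n_dim * n_dim;
    const size_t count = static_cast<size_t>(N_2D) * z_dim;
    const size_t bytes = count * sizeof(float);

    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));

    float *image = new float[count];
    float *d_image = NULL;
    // Allocate once, outside the loop (a change from the original code).
    CUDA_CHECK(cudaMalloc((void **)&d_image, bytes));

    // 10 iterations is an arbitrary choice for this sketch.
    for (int i = 0; i < 10; ++i) {
        for (size_t k = 0; k < count; ++k) image[k] = static_cast<float>(i);
        CUDA_CHECK(cudaMemcpy(d_image, image, bytes, cudaMemcpyHostToDevice));

        calcs_checked<<<z_dim, prop.maxThreadsPerBlock>>>(d_image, N_2D);
        CUDA_CHECK(cudaGetLastError());       // launch or configuration errors
        CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

        CUDA_CHECK(cudaMemcpy(image, d_image, bytes, cudaMemcpyDeviceToHost));
        std::cout << "iteration " << i << " result: " << image[3] << std::endl;
    }

    CUDA_CHECK(cudaFree(d_image));
    delete[] image;
    return 0;
}
```

If a call is failing or the kernel is being terminated, this version should stop at the first failing call and print the reason instead of silently returning the initial value. My understanding is that a GPU which also drives the display can have long-running kernels killed by the driver's watchdog timer, which might relate to the freezing, but I have not confirmed that this is what is happening here.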