Hello, I am trying to get acquainted with CUDA before changing any of my existing code to use GPU acceleration. I started by buying a mediocre card (GTX 550 Ti) and writing a simple program that I thought would load the GPU, so that I could watch its temperature over time and compare its performance to a CPU-only program. The program defines a 3D image and has the GPU do some meaningless calculations on each pixel of the image. I print the value before and after sending the image to the GPU. I also added a loop on both the CPU side and the GPU side to ensure a high degree of loading. What I’ve noticed is that over time the device code seems to stop executing. For example, the value of a pixel is calculated correctly for the first 200 or so iterations of the CPU loop, but after that it no longer changes from its initialization value.
Also, I have noticed that my computer freezes during device execution and unfreezes during CPU execution. For example, if I type during device execution, the letters do not appear until the current device code terminates.
Just wondering if anyone can fill me in on what’s going on. Code below, thanks.
#include <iostream>
using namespace std;

__global__ void calcs(float *image, int N_2D){
    int bid = blockIdx.x;
    int tid = threadIdx.x;
    int iter_size = N_2D/blockDim.x;
    //Loop just to load the processor
    for(unsigned int j = 0; j < 1000; ++j){
        //Loop over this thread's pixels/elements
        for(unsigned int i_dev = 0; i_dev < iter_size; ++i_dev){
            //Just calculate some stuff and set the current pixel/element equal to it
            image[bid*N_2D + tid*iter_size + i_dev] = ((i_dev)*21.45)/2.3451;
        }
    }
}

int main(){
    //X and Y dimensions
    const int n_dim = 512;
    //Z dimension
    const int z_dim = 190;
    //Pixels in one z element
    int N_2D = n_dim*n_dim;
    //3D image and device container
    float * image = new float[n_dim*n_dim*z_dim];
    float * image1;
    //Get device properties so we know max number of threads
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    //Loop just to make sure the program runs for a while
    for(unsigned int i = 0; i < 1000; ++i){
        cout<<"iteration :"<<i<<endl;
        //Initialize image array to i
        for(unsigned int z = 0; z < z_dim; ++z){
            for(unsigned int y = 0; y < n_dim; ++y){
                for(unsigned int x = 0; x < n_dim; ++x){
                    image[x + y*n_dim + z*n_dim*n_dim] = static_cast<float>(i);
                }
            }
        }
        //Allocate memory on the device for image
        cudaMalloc((void**)&image1, sizeof(float)*n_dim*n_dim*z_dim);
        //Copy image to the device
        cudaMemcpy(image1, image, sizeof(float)*n_dim*n_dim*z_dim, cudaMemcpyHostToDevice);
        //Execute device function with number of blocks = z_dim and maximum number of threads
        calcs<<<z_dim, prop.maxThreadsPerBlock>>>(image1, N_2D);
        //Copy result back to host
        cudaMemcpy(image, image1, sizeof(float)*n_dim*n_dim*z_dim, cudaMemcpyDeviceToHost);
        //Free device memory
        cudaFree(image1);
        //Output results to ensure consistency
        cout<<"result: "<<image[3]<<endl;
    }
    return 1;
}