General CUDA program structure questions

Hi there,

as I've just started using CUDA, I have a few general questions that most of the literature doesn't answer.

  1. Is the general structure of a CUDA/C project a C file (host) that calls the .cu file with the kernels (device), plus a header file?
  2. Is there a special order in which to build/compile the different files? I would like to use Visual Studio or Eclipse for programming.
  3. Does someone have a very simple example of the three files for me, just to get familiar with the structure? Maybe a small calculation problem.

Thank you very much in advance

greetz Markus

Have a look at the SDK examples. You should find everything you need there.

The whole compilation process is also described in the programming guide and in the nvcc manual.
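
To your first and third questions: yes, that three-file split (host file, kernel file, shared header) is a common layout. Here is a minimal sketch, untested and with hypothetical file names; the extern "C" wrapper keeps the CUDA syntax out of the plain C host file:

/* kernels.h - shared header: declares the host-callable launcher */
#ifndef KERNELS_H
#define KERNELS_H

#ifdef __cplusplus
extern "C"
#endif
void launch_add_one(float* h_data, int n);

#endif

// kernels.cu - device code plus a thin launcher, compiled by nvcc
#include "kernels.h"

__global__ void add_one(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] += 1.0f;
}

extern "C" void launch_add_one(float* h_data, int n)
{
    // Copy input to the device, run the kernel, copy the result back
    float* d_data;
    size_t size = n * sizeof(float);
    cudaMalloc((void**)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}

/* main.c - plain C host code, compiled by gcc or cl */
#include <stdio.h>
#include "kernels.h"

int main(void)
{
    float data[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
    launch_add_one(data, 4);
    printf("%f %f %f %f\n", data[0], data[1], data[2], data[3]);
    return 0;
}

To your second question: the order only matters at link time. nvcc compiles the .cu file, your C compiler compiles the host file, and the two objects are linked against the CUDA runtime, roughly like this (the library path is an assumption for a default Linux install):

nvcc -c kernels.cu
gcc -c main.c
gcc main.o kernels.o -o demo -L/usr/local/cuda/lib64 -lcudart

Visual Studio and Eclipse automate exactly these steps through the nvcc build rule.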

Thanks, I had a look at the programming guide and I found this example (matrix multiplication, Chapter 3):

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

Do I just have to create a .cu file containing the above code and compile it with a Makefile, or do I need more files? Sorry for the very stupid question, but I am really struggling with this.

thanks Markus

  1. Download CMake.
  2. Look in the CMake documentation for FindCUDA.cmake
  3. Follow the examples therein.

Much easier than trying to write your own makefile/IDE project files! Don't even look at the SDK samples; their makefiles are indecipherable.
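
Also note that the programming-guide listing has no main(), so by itself it won't run; you need a small driver that fills A and B and calls MatMul(). A hypothetical sketch, appended to the same .cu file:

#include <stdlib.h>

int main(void)
{
    int n = 2 * BLOCK_SIZE;  // dimensions must be multiples of BLOCK_SIZE
    Matrix A, B, C;
    A.width = A.height = B.width = B.height = C.width = C.height = n;
    A.elements = (float*)malloc(n * n * sizeof(float));
    B.elements = (float*)malloc(n * n * sizeof(float));
    C.elements = (float*)malloc(n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) {
        A.elements[i] = 1.0f;
        B.elements[i] = 2.0f;
    }
    MatMul(A, B, C);  // every element of C should come back as 2 * n
    free(A.elements); free(B.elements); free(C.elements);
    return 0;
}

With FindCUDA, the whole build then reduces to a few lines of CMake (project and file names are hypothetical):

cmake_minimum_required(VERSION 2.8)
project(matmul)
find_package(CUDA REQUIRED)
cuda_add_executable(matmul matmul.cu)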

Thank you, I will do so.

I have no idea if the GCC 4.4.4 bug still exists in the newer versions of the toolkit, but when it comes to program structure I have found that keeping the .cu file to a bare minimum avoids it. Do not #include any STL headers there; just put the kernels and a function to call those kernels in it. That has worked for me so far and makes sense from a structural point of view, as shown in the sketch below.
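
Concretely, that split might look like this (hypothetical file and function names; main.cpp is compiled by the host compiler and is free to use the STL):

// square.cu - no STL includes, only the kernel and a wrapper
__global__ void square_kernel(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= d[i];
}

extern "C" void square_on_device(float* d, int n)
{
    // d is a device pointer; the caller owns the allocation
    square_kernel<<<(n + 127) / 128, 128>>>(d, n);
}

// main.cpp - all the STL usage lives here
#include <vector>
#include <iostream>
#include <cuda_runtime.h>

extern "C" void square_on_device(float* d, int n);

int main()
{
    std::vector<float> v(8, 3.0f);
    size_t bytes = v.size() * sizeof(float);
    float* d;
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, &v[0], bytes, cudaMemcpyHostToDevice);
    square_on_device(d, (int)v.size());
    cudaMemcpy(&v[0], d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    std::cout << v[0] << std::endl;  // expect 9
    return 0;
}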

CMake is great, and after a year of writing my own makefiles I've finally started using it. But I must say, the CUDA SDK makefiles are perhaps the best I've ever seen!

Edit:

I should add that FindCUDA.cmake doesn't support CUTIL by default (as, I believe, NVIDIA doesn't encourage its use outside of the SDK examples), although you can get CMake to include it for you with a bit of extra work, as described here.
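
The rough idea is a few extra CMake lines like the following; the paths and library names here are assumptions and varied between SDK versions:

# After find_package(CUDA); CUDA_SDK_ROOT_DIR is a variable
# FindCUDA.cmake tries to detect, but it is not guaranteed to be set.
include_directories(${CUDA_SDK_ROOT_DIR}/common/inc)
find_library(CUDA_CUT_LIBRARY
             NAMES cutil cutil64 cutil_x86_64
             PATHS ${CUDA_SDK_ROOT_DIR}/lib ${CUDA_SDK_ROOT_DIR}/common/lib)
target_link_libraries(matmul ${CUDA_CUT_LIBRARY})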
