General CUDA program structure questions

Markuss.london · September 12, 2010, 2:47pm

Hi there,

as I just started using CUDA, I have got a few general questions, which most of the literature didn’t tell me.

Is the general structure of a CUDA/C project a C-file (host) that calls the CU-file with the kernels (device) and a header file?
Is there a special order to build/compile the different files ? I would like to use visual studio or eclipse to program.
does someone have a very simple example for me for the three files, just to get familiar with the structure?Maybe like a small calculation problem.

Thank you very much in advance

greetz Markus

Magorath · September 13, 2010, 7:38am

Have a look in the SDK examples. You should find everything you need there.

The whole compilation process is also described in the programming guide and in the nvcc manual.

Magorath · September 13, 2010, 7:38am

Have a look in the SDK examples. You should find everything you need there.

The whole compilation process is also described in the programming guide and in the nvcc manual.

Markuss.london · September 13, 2010, 12:16pm

Thanks, I had a look at the programming guide and i found this example ( Matrix multiplication Chapter 3)

// Matrices are stored in row-major order: 

// M(row, col) = *(M.elements + row * M.width + col) 

typedef struct { 

int width; 

int height; 

float* elements; 

} Matrix; 

// Thread block size 

#define BLOCK_SIZE 16 

// Forward declaration of the matrix multiplication kernel 

__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code 

// Matrix dimensions are assumed to be multiples of BLOCK_SIZE 

void MatMul(const Matrix A, const Matrix B, Matrix C) 

{ 

// Load A and B to device memory 

Matrix d_A; 

d_A.width = A.width; d_A.height = A.height; 

size_t size = A.width * A.height * sizeof(float); 

cudaMalloc(&d_A.elements, size); 

cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice); 

Matrix d_B; 

d_B.width = B.width; d_B.height = B.height; 

size = B.width * B.height * sizeof(float); 

cudaMalloc(&d_B.elements, size); 

cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice); 

// Allocate C in device memory 

Matrix d_C; 

d_C.width = C.width; d_C.height = C.height; 

size = C.width * C.height * sizeof(float); 

cudaMalloc(&d_C.elements, size); 

// Invoke kernel 

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 

dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y); 

MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 

// Read C from device memory 

cudaMemcpy(C.elements, Cd.elements, size, cudaMemcpyDeviceToHost);

// Free device memory 

cudaFree(d_A.elements); 

cudaFree(d_B.elements); 

cudaFree(d_C.elements); 

}

// Matrix multiplication kernel called by MatMul() 

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) 

{ 

// Each thread computes one element of C 

// by accumulating results into Cvalue 

float Cvalue = 0; 

int row = blockIdx.y * blockDim.y + threadIdx.y; 

int col = blockIdx.x * blockDim.x + threadIdx.x; 

for (int e = 0; e < A.width; ++e) 

Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col]; 

C.elements[row * C.width + col] = Cvalue; 

}

Do I just have to create a .cu file with the above code in and compile it with a Make file ? Or do I need more files ? sorry for the very stupid question but I am really struggling with this.

thanks Markus

Markuss.london · September 13, 2010, 12:16pm

Thanks, I had a look at the programming guide and i found this example ( Matrix multiplication Chapter 3)

// Matrices are stored in row-major order: 

// M(row, col) = *(M.elements + row * M.width + col) 

typedef struct { 

int width; 

int height; 

float* elements; 

} Matrix; 

// Thread block size 

#define BLOCK_SIZE 16 

// Forward declaration of the matrix multiplication kernel 

__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code 

// Matrix dimensions are assumed to be multiples of BLOCK_SIZE 

void MatMul(const Matrix A, const Matrix B, Matrix C) 

{ 

// Load A and B to device memory 

Matrix d_A; 

d_A.width = A.width; d_A.height = A.height; 

size_t size = A.width * A.height * sizeof(float); 

cudaMalloc(&d_A.elements, size); 

cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice); 

Matrix d_B; 

d_B.width = B.width; d_B.height = B.height; 

size = B.width * B.height * sizeof(float); 

cudaMalloc(&d_B.elements, size); 

cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice); 

// Allocate C in device memory 

Matrix d_C; 

d_C.width = C.width; d_C.height = C.height; 

size = C.width * C.height * sizeof(float); 

cudaMalloc(&d_C.elements, size); 

// Invoke kernel 

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 

dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y); 

MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 

// Read C from device memory 

cudaMemcpy(C.elements, Cd.elements, size, cudaMemcpyDeviceToHost);

// Free device memory 

cudaFree(d_A.elements); 

cudaFree(d_B.elements); 

cudaFree(d_C.elements); 

}

// Matrix multiplication kernel called by MatMul() 

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) 

{ 

// Each thread computes one element of C 

// by accumulating results into Cvalue 

float Cvalue = 0; 

int row = blockIdx.y * blockDim.y + threadIdx.y; 

int col = blockIdx.x * blockDim.x + threadIdx.x; 

for (int e = 0; e < A.width; ++e) 

Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col]; 

C.elements[row * C.width + col] = Cvalue; 

}

Do I just have to create a .cu file with the above code in and compile it with a Make file ? Or do I need more files ? sorry for the very stupid question but I am really struggling with this.

thanks Markus

MisterAnderson42 · September 13, 2010, 12:25pm

Download CMake.
Look in the CMake documentation for FindCUDA.cmake
Follow the examples therein.

Much easier than trying to write your own makefile/IDE project files! Don’t even look at the SDK samples, their makefiles are indecipherable.

MisterAnderson42 · September 13, 2010, 12:25pm

Download CMake.
Look in the CMake documentation for FindCUDA.cmake
Follow the examples therein.

Much easier than trying to write your own makefile/IDE project files! Don’t even look at the SDK samples, their makefiles are indecipherable.

Markuss.london · September 13, 2010, 12:36pm

Thank you will do so

Markuss.london · September 13, 2010, 12:36pm

Thank you will do so

Antagonistic · September 14, 2010, 8:09am

No idea if the GCC 4.4.4 bug still exists in the newer versions of the toolkit, but when it comes with program structure I have found that by keeping the .cu file bare minimum, you avoid it. Do not #include any stl files, just include the kernels and a function to call those kernels. Worked for me so far and makes sense on a structure viewpoint.

Antagonistic · September 14, 2010, 8:09am

No idea if the GCC 4.4.4 bug still exists in the newer versions of the toolkit, but when it comes with program structure I have found that by keeping the .cu file bare minimum, you avoid it. Do not #include any stl files, just include the kernels and a function to call those kernels. Worked for me so far and makes sense on a structure viewpoint.

zeus13i · September 15, 2010, 11:20pm

CMake is great and after 1 year of writting my own makefiles I’ve finally started using it but, I must say, the CUDA SDK makefiles are perhaps the best I’ve ever seen!

Edit:

I should add that FindCUDA.cmake doesn’t support CUTILS by default (as, I believe, NVIDIA doesn’t encourage their use outside of the SDK examples), although you can get CMake to include it for you with a bit of extra work as described here.

zeus13i · September 15, 2010, 11:20pm

CMake is great and after 1 year of writting my own makefiles I’ve finally started using it but, I must say, the CUDA SDK makefiles are perhaps the best I’ve ever seen!

Edit:

I should add that FindCUDA.cmake doesn’t support CUTILS by default (as, I believe, NVIDIA doesn’t encourage their use outside of the SDK examples), although you can get CMake to include it for you with a bit of extra work as described here.

Topic		Replies	Views
CMake and cuPrintf CUDA Programming and Performance	7	15905	February 5, 2010
C++ with classes example? Does anyone know of one/have one? CUDA Programming and Performance	13	7433	April 30, 2009
CMAKE'ing a CUDA static library of kernels issues CUDA Programming and Performance	7	1060	October 6, 2023
CUDA 9.0 on MacOSX High Sierra - samples compile, cmake projects don't CUDA Setup and Installation	5	1532	September 9, 2019
CUDA C language compatibility CUDA Programming and Performance	11	3701	July 15, 2009
How do I CUDA on Windows? CUDA Programming and Performance	12	6224	November 8, 2016
Building Cross-Platform CUDA Applications with CMake Technical Blog	79	4007	October 27, 2021
CMake build script for Cuda. FindCuda.cmake script available. CUDA Programming and Performance	44	60096	January 16, 2011
How did the CUDA experts get started with CUDA programming? CUDA Programming and Performance	6	4179	October 2, 2023
CUDA using Eclipse CDT, CMake and Qt Setup Eclipse to use CMake,CUDA,Qt CUDA Programming and Performance	9	14178	February 24, 2008

General CUDA program structure questions

Related topics