Small question about function call

844280040 · April 7, 2020, 6:30am

Hi, I just started with CUDA 10.2 and am working through the example (0_simple\VectorAdd). I added a printf to the kernel to see what happens, but it seems like the main function never calls "__global__ void vectorAdd".

I am using Visual Studio 2019 to open the .sln file, and it builds, but the printf does not seem to work.

Thank you!

njuffa · April 7, 2020, 8:42am

Make sure you have a call to cudaDeviceSynchronize() at the end of your program. The output of printf() is typically buffered, which is also true of CUDA’s device-side printf. cudaDeviceSynchronize() causes the buffer to be flushed, just like one would use fflush(stdout) for the same purpose for host-side printf.

Since you haven’t shown the entire program, there may be other issues with the code that we can’t see.
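As a minimal sketch of the advice above (not the poster's program): the kernel name helloKernel and the 1x4 launch configuration are made up for illustration. Only cudaGetLastError(), cudaDeviceSynchronize() and cudaGetErrorString() are actual CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread prints its global index.
__global__ void helloKernel()
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    printf("thread %d\n", i);
}

int main()
{
    // The launch is asynchronous: control returns to the host right away.
    helloKernel<<<1, 4>>>();

    // Report errors from the launch itself (e.g. a bad configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Wait for the kernel to finish. This is also the point at which the
    // device-side printf buffer is flushed to the host's stdout.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    return 0;
}
```

If the cudaDeviceSynchronize() call is removed, the program may exit before any device output reaches the console. The Programming Guide also lists blocking cudaMemcpy() calls among the operations that flush this buffer, so a program that copies results back after the launch would normally see the output as well.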

844280040 · April 7, 2020, 1:23pm

Thank you njuffa. I don't quite understand how to call cudaDeviceSynchronize().

The code is as follows:

/**
 * Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand, exit
#include <math.h>     // fabs

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>
/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    printf("%d \n", i);

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}
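A side note on the kernel above: with 50000 elements and 256 threads per block, the launch uses 196 blocks, so the printf runs in 50176 threads. Device-side printf writes into a fixed-size buffer, and a flood of output can be hard to read or may overwrite earlier lines. One possible way to keep the debug output manageable is sketched below. The kernel name vectorAddDebug, the maxPrints parameter, the CHECK macro and the use of managed memory are assumptions made for brevity and are not part of the original sample.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t e_ = (call);                                  \
        if (e_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(e_));                      \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical variant of vectorAdd: only threads with index below
// maxPrints print, so the output stays short.
__global__ void vectorAddDebug(const float *A, const float *B, float *C,
                               int numElements, int maxPrints)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];

        if (i < maxPrints)
        {
            printf("i = %d, C[i] = %f\n", i, C[i]);
        }
    }
}

int main()
{
    const int numElements = 50000;  // same size as the sample
    const int maxPrints = 8;        // arbitrary choice for illustration
    const size_t size = numElements * sizeof(float);

    // Query the size of the device-side printf buffer.
    size_t fifoBytes = 0;
    CHECK(cudaDeviceGetLimit(&fifoBytes, cudaLimitPrintfFifoSize));
    printf("Device printf buffer: %zu bytes\n", fifoBytes);

    // Managed memory keeps the sketch short; the sample uses
    // cudaMalloc plus explicit cudaMemcpy instead.
    float *A = NULL, *B = NULL, *C = NULL;
    CHECK(cudaMallocManaged(&A, size));
    CHECK(cudaMallocManaged(&B, size));
    CHECK(cudaMallocManaged(&C, size));

    for (int i = 0; i < numElements; ++i)
    {
        A[i] = (float)i;
        B[i] = 2.0f * (float)i;
    }

    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAddDebug<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, numElements, maxPrints);
    CHECK(cudaGetLastError());

    // Wait for the kernel and flush the device-side printf output.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(A));
    CHECK(cudaFree(B));
    CHECK(cudaFree(C));
    return 0;
}
```

If more output is really needed, the buffer can be enlarged before the first kernel launch with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, newSize).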

rs277 · April 8, 2020, 1:03am

Have a look at this section of the Programming Guide:

844280040 · April 8, 2020, 5:34am

Thank you rs277. I can't believe it, but after I restarted my computer it works now.
