CUDA hangups

Hello,

I want to use the GPU of the TK1 board to calculate the absolute values of complex numbers. When I run the kernel in a loop, the average execution time is about 300 µs (pretty slow anyway…). But sometimes the GPU needs several milliseconds; the worst I have seen was 100 ms for a single run.
I am using CUDA Toolkit 6.5 (cross-compiling).
The NVCC version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Wed_Aug_27_10:36:36_CDT_2014
Cuda compilation tools, release 6.5, V6.5.16

Compiling is done by hand:

/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -g -O3 -ccbin arm-linux-gnueabihf-g++ -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -M -o "main.d" "main.cu"
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -g -O3 -ccbin arm-linux-gnueabihf-g++ -gencode arch=compute_32,code=sm_32 --compile --relocatable-device-code=false --target-cpu-architecture ARM -m32 -x cu -o "main.o" "main.cu"
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -L/opt/cuda-6.5-linaro/lib --cudart shared -Xlinker --unresolved-symbols=ignore-in-object-files --relocatable-device-code=false -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -ccbin arm-linux-gnueabihf-g++ -link -o "CUDAHang" main.o

This is the test code main.cu:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define ARRAY_SIZE 	4088
#define NUM_RUNS	2000

int imin(int a, int b)
{
    return (a < b ? a : b);
}

const int _threadsPerBlock = 1024;
const int _blocksPerGrid = imin(32, (ARRAY_SIZE/2+_threadsPerBlock-1)/_threadsPerBlock);

typedef unsigned long long timestamp_t;
inline timestamp_t get_timestamp() {
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_usec + (timestamp_t) now.tv_sec * 1000000;
}

// Calculate the (squared) absolute values of the complex numbers in the input array.
__global__ void absolute_values(float * d_out, float * d_in){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
	if(idx < ARRAY_SIZE/2)
	{
		float img = d_in[idx*2];
		float real = d_in[idx*2+1];
		d_out[idx] = img*img + real*real;
	}
}

int main(int argc, char ** argv)
{
	// Assume there is only one device
	cudaError_t err = cudaSetDevice(0);
	if (err != cudaSuccess) 
	{
	  printf("Cuda Error: %d - %s\n", err, cudaGetErrorString(err));
	  return -1;
	}
	cudaDeviceReset();

	// Generate host input and output array
    float h_in[ARRAY_SIZE];		// Array with complex numbers
	float h_out[ARRAY_SIZE/2];	// Array with absolute values (half the size)
    for (int i = 0; i < ARRAY_SIZE; i++)
        h_in[i] = float(i);

    // Declare GPU memory pointers
    float * d_in;
    float * d_out;

    // Allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_SIZE * sizeof(float));
    cudaMalloc((void**) &d_out, ARRAY_SIZE/2 * sizeof(float));

    // Transfer input array to GPU
    cudaMemcpy(d_in, h_in, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice);

	timestamp_t before;
	int current = 0;
	int minimum = 0;
	int maximum = 0;
	int average = 0;
	// Launch the kernel with threads only
	printf("Blocks: %d a %d threads for %d values.\n", _blocksPerGrid, _threadsPerBlock, ARRAY_SIZE);
	for(int i = 0; i < NUM_RUNS; i++)
	{
		before = get_timestamp();
		

		absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
		
		///>>> Synchronization can be done by copying data or by an explicit call... hangups in both cases
		//cudaDeviceSynchronize();
		cudaMemcpy(h_out, d_out, ARRAY_SIZE/2 * sizeof(float), cudaMemcpyDeviceToHost);

		// Some statistics
		current = (get_timestamp() - before);
		if(current > minimum*3)
		{
    		if(i != 0)	printf(">>> ");
			printf("Run %d took %d us (three times minimum).\n", i, current);
		}
		if(current < minimum || i == 0)	minimum = current;
		if(current > maximum || i == 0)	maximum = current;
		average += current;
	}

    // Copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_SIZE/2 * sizeof(float), cudaMemcpyDeviceToHost);

	printf("Minimum: %d us, Maximum: %d us, Average: %d us.\n", minimum, maximum, average/NUM_RUNS);

    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

Thanks for help!

Hi,

I tried to profile the kernel part without the memcpy (since the memcpy may be related to I/O control).
Although it sometimes really takes longer, the average execution time is acceptable and close to the minimal case.

Minimum: 22 us, Maximum: 1646 us, Average: 35 us.

absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
current = (get_timestamp() - before);

///>>> Synchronization can be done by copying data or by an explicit call... hangups in both cases
//cudaDeviceSynchronize();
cudaMemcpy(h_out, d_out, ARRAY_SIZE/2 * sizeof(float), cudaMemcpyDeviceToHost);

Hello,

Thanks for your profiling work. I am wondering why some runs need significantly more time than the average (50x)? I need to evaluate data constantly, so this random delay could be critical for me.
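For reference, here is a sketch of how the kernel could be timed on the device itself with CUDA events (assuming the same kernel and device pointers as in the main.cu above). Since the events are recorded on the GPU, host-side scheduling noise would be excluded from the measurement:

```cuda
// Sketch: device-side timing with CUDA events, using the kernel and
// device pointers from the main.cu above. cudaEventElapsedTime reports
// GPU time, so host scheduling jitter is excluded from the measurement.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.1f us\n", ms * 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```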

Hi,

In this case, CUDA is busy accessing global memory but only does a little work per thread.
This makes the I/O per thread much larger than the work per thread, which may lead to a hangup while waiting for I/O.
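If the goal is to increase the work per thread, a grid-stride variant of the kernel (just a sketch, not benchmarked on the TK1) lets a smaller launch process several complex samples per thread:

```cuda
// Grid-stride sketch: each thread handles multiple complex samples,
// so a small launch (e.g. a few blocks of 256 threads) still covers
// the whole array while doing more arithmetic per thread.
__global__ void absolute_values_strided(float *d_out, const float *d_in, int n_complex)
{
    for (int idx = threadIdx.x + blockIdx.x * blockDim.x;
         idx < n_complex;
         idx += blockDim.x * gridDim.x)
    {
        float img  = d_in[idx * 2];
        float real = d_in[idx * 2 + 1];
        d_out[idx] = img * img + real * real;
    }
}

// Launch example: absolute_values_strided<<<2, 256>>>(d_out, d_in, ARRAY_SIZE/2);
```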

I also profiled main.cu with tegrastats:

sudo ./tegrastats

RAM 1225/3995MB (lfb 480x4MB) cpu [98%,0%,0%,100%]@1734 EMC 1%@1600 AVP 1%@80 NVDEC 268 MSENC 268 GR3D 23%@998 EDP limit 1734
GPU utilization is not high, which may indicate the GPU is waiting for other jobs.

Additionally, you can run deviceQuery to check the GPU memory clock rate.

ubuntu@tegra-ubuntu:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery

My TX1:

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3995 MBytes (4188778496 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            72 MHz (0.07 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit

May I know your use case?
Do the complex numbers vary each time, or is there any reused data in your use case?

Thanks.

As far as I know, the TK1 GPU shares its memory with the system memory, so especially when using unified memory the access should be fast.

I need to evaluate data from a software-defined radio, so the complex values change with each data packet. Because of the high-speed acquisition there are at most 80 µs to evaluate the whole data set (and there is more work to do than just calculating the absolute values). The CPU needs about 14 µs for this task.
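For reference, the CPU version is essentially a plain loop over the interleaved samples (a sketch; like the kernel above it computes the squared magnitude, without a sqrtf):

```c
#include <stddef.h>

/* Scalar CPU version of the same computation as the kernel:
 * squared magnitude of interleaved complex samples
 * (like the kernel above, no sqrtf is applied). */
void absolute_values_cpu(float *out, const float *in, size_t n_complex)
{
    for (size_t i = 0; i < n_complex; i++) {
        float img  = in[i * 2];
        float real = in[i * 2 + 1];
        out[i] = img * img + real * real;
    }
}
```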

Here's my deviceQuery output:

Device 0: "GK20A"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.2
  Total amount of global memory:                 1927 MBytes (2020806656 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Clock rate:                                852 MHz (0.85 GHz)
  Memory Clock rate:                             924 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0

To use memory shared between the CPU and GPU, have a look at this:
http://arrayfire.com/zero-copy-on-tegra-k1/

Thanks for the link. I already wrote a little program to compare zero-copy vs. unified memory vs. standard memory. In fact, those hangups can be seen no matter how the data is managed.

Hi,

The hangup is caused by memory-related work, even with shared memory.

But I found something strange when profiling the kernel code:
https://drive.google.com/open?id=0B-fFMM_3Dj9JNmdMVVM5SHF0dkE
This file can be opened with nvvp, which is installed by JetPack 2.3.1 with the option “CUDA Toolkit for Ubuntu 14.04”.

The execution time looks stable, with a (min, max) duration of (20.365 us, 21.875 us).
Could you help to profile your kernel code also?

Please use nvvp for profiling; it is located at /usr/local/cuda/bin:

$ nvvp -o /home/ubuntu/test.prof /path/to/your/code
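If nvvp itself is not available on the device, the command-line profiler nvprof from the same CUDA Toolkit can also write a timeline file that nvvp on the host can import (the paths here are just placeholders):

```shell
/usr/local/cuda/bin/nvprof -o /home/ubuntu/test.prof /path/to/your/code
```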

Hi,

Is there any way to get nvvp without JetPack? I am not using the TK1 evaluation board and had no need for JetPack yet.
Should I run the nvvp command on the host or on the device?

Okay, I installed JetPack. When I try to open the test.prof file, nvvp complains that it cannot find libcuinj.so…

NVVP is a tool contained in the CUDA Toolkit and located at ‘/usr/local/cuda/bin’.
You should have it if you already installed the CUDA Toolkit.

NVVP can run on both the host and the device, but it looks like host profiling has some issues in your environment.
Could you try to profile the kernel code on the device directly?

nvvp -o /home/ubuntu/test.prof /path/to/your/code

Sorry for the inconvenience.

I cannot find nvvp in cuda-repo-l4t-r21.3-6-5-prod_6.5-42_armhf.deb, which I use to get the cross-compiled libraries for my system…

Please run NVVP directly on the device.

$ cd /usr/local/cuda/libnvvp
$ nvvp -o /home/ubuntu/test.prof /path/to/your/code

nvvp is only available on my host machine (Xubuntu 14.04) because I installed JetPack. As mentioned before, I am using cuda-repo-l4t-r21.3-6-5-prod_6.5-42_armhf.deb to get all the cross-compiled libraries I need, and this package does not include nvvp.
So, where can I get a cross-compiled version of nvvp to run directly on the device?

Opening test.prof with nvvp on the host machine fails with “Unsupported file type.”

NVVP should be installed by default with JetPack. We will investigate this issue. Sorry for the inconvenience.

For the hang-up issue, I have rewritten your code with zero-copy and also switched to the CUDA timer for better profiling.

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <helper_timer.h>
#define ARRAY_SIZE 	4088
#define NUM_RUNS	50

int imin(int a, int b)
{
    return (a < b ? a : b);
}

const int _threadsPerBlock = 256;
const int _blocksPerGrid = imin(32, (ARRAY_SIZE/2+_threadsPerBlock-1)/_threadsPerBlock);

StopWatchInterface *timer = NULL;

void startTimer()
{
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
}

void endTimer(const char*str)
{
// cudaThreadSynchronize();  // without a sync here, only the asynchronous kernel launch is timed
    sdkStopTimer(&timer);
    float elapsed_time = sdkGetTimerValue(&timer);
    printf("[%f] - %s\n", elapsed_time, str);
}

float getEndTimer(void)
{
    //cudaThreadSynchronize();
    sdkStopTimer(&timer);
    return sdkGetTimerValue(&timer);
}


// Calculate the (squared) absolute values of the complex numbers in the input array.
__global__ void absolute_values(float * d_out, float * d_in){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if(idx < ARRAY_SIZE/2)
    {
        float img = d_in[idx*2];
        float real = d_in[idx*2+1];
        d_out[idx] = img*img + real*real;
    }
}

int main(int argc, char ** argv)
{
    // Assume there is only one device
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) 
    {
        printf("Cuda Error: %d - %s\n", err, cudaGetErrorString(err));
        return -1;
    }
    cudaDeviceReset();
    sdkCreateTimer(&timer);

    // Generate host input and output array
    float* h_in = NULL;
    float* h_out = NULL;

    cudaHostAlloc( (void **)&h_in, ARRAY_SIZE * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc( (void **)&h_out, ARRAY_SIZE/2 * sizeof(float), cudaHostAllocMapped);

    for (int i = 0; i < ARRAY_SIZE; i++)
        h_in[i] = float(i);

    // Declare GPU memory pointers
    float * d_in;
    float * d_out;

    // Get device pointer from host memory. No allocation or memcpy
    cudaHostGetDevicePointer((void **)&d_in,  (void *) h_in , 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *) h_out, 0);

    float current = 0;
    float minimum = 0;
    float maximum = 0;
    float average = 0;
    printf("Blocks: %d a %d threads for %d values.\n", _blocksPerGrid, _threadsPerBlock, ARRAY_SIZE);

    startTimer();
    absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
    endTimer("first call");

    for(int i = 0; i < NUM_RUNS; i++)
    {
        startTimer();
        absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
        current = getEndTimer();
    		
        // Some statistics
        if( i==0 )
            maximum = minimum = current;

        if(current > minimum*3)
        {
            if(i != 0) printf(">>> ");
            printf("Run %d took %.2f us (three times minimum).\n", i, current*1000);
        }
        if(current < minimum || i == 0) minimum = current;
        if(current > maximum || i == 0) maximum = current;
        average += current;
    }

    printf("Minimum: %.2f us, Maximum: %.2f us, Average: %.2f us.\n", minimum*1000, maximum*1000, average/NUM_RUNS*1000);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);

    return 0;
}

This was tested on a TX1; please remember to adapt the compile commands for the TK1.

/usr/local/cuda-8.0/bin/nvcc -ccbin g++ -INVIDIA_CUDA-8.0_Samples/common/inc -m64 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o test.o -c topic_981452.cu
/usr/local/cuda-8.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o test test.o  -L/usr/lib/"nvidia-361" -lGL -lGLU -lX11 -lglut

Results are much more stable.

ubuntu@tegra-ubuntu:~$ ./test
Blocks: 8 a 256 threads for 4088 values.
[0.211000] - first call
Minimum: 20.00 us, Maximum: 31.00 us, Average: 21.94 us.

Hi qojote,

Have you tried the suggestion? Any improvement?

Thanks

Hi,

I will test your code as soon as my TK1 is working again with the newest BSP.

Thanks for your help and patience.

Hi qojote,

Have you managed to test our code on your TK1?
Any result could be shared?

Thanks

Hi,

Finally I managed to find some time to test your code.
Unfortunately, the code does not compile at all:

/opt/gcc-linaro/gcc-linaro-5.2.1/arm-linux-gnueabihf/include/c++/5.2.1/bits/locale_classes.h(789) (col. 20): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r343/r343_00/drivers/compiler/edg/EDG_4.8/src/cp_gen_be.c", line 8408

1 catastrophic error detected in the compilation of "/tmp/tmpxft_00003922_00000000-4_main2.cpp4.ii".
Compilation aborted.
Aborted (core dumped)

This is the command that fails:

/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc -g -O3 -ccbin arm-linux-gnueabihf-g++ -gencode arch=compute_32,code=sm_32 --compile --relocatable-device-code=false --target-cpu-architecture ARM -m32 -x cu -o "main2.o" "main2.cu"

I have seen this error before when I tried to use Thrust (https://devtalk.nvidia.com/default/topic/981454/cuda-on-tk1-and-thrust/).