CUDA hangups

Hello,

I want to use the GPU of the TK1 board to calculate the absolute values of complex numbers. When I run the kernel in a loop, the average execution time is about 300 µs (pretty slow anyway…). But sometimes the GPU needs several milliseconds; the worst I have seen was 100 ms for a single run.
I am using CUDA Toolkit 6.5 (cross-compiling).
The NVCC version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Wed_Aug_27_10:36:36_CDT_2014
Cuda compilation tools, release 6.5, V6.5.16

Compiling is done by hand:

/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -g -O3 -ccbin arm-linux-gnueabihf-g++ -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -M -o "main.d" "main.cu"
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -g -O3 -ccbin arm-linux-gnueabihf-g++ -gencode arch=compute_32,code=sm_32 --compile --relocatable-device-code=false --target-cpu-architecture ARM -m32 -x cu -o "main.o" "main.cu"
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -L/opt/cuda-6.5-linaro/lib --cudart shared -Xlinker --unresolved-symbols=ignore-in-object-files --relocatable-device-code=false -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -ccbin arm-linux-gnueabihf-g++ -link -o "CUDAHang" main.o

This is the test code main.cu:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define ARRAY_SIZE 	4088
#define NUM_RUNS	2000

int imin(int a, int b)
{
    return (a < b ? a : b);
}

const int _threadsPerBlock = 1024;
const int _blocksPerGrid = imin(32, (ARRAY_SIZE/2+_threadsPerBlock-1)/_threadsPerBlock);

typedef unsigned long long timestamp_t;
inline timestamp_t get_timestamp() {
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_usec + (timestamp_t) now.tv_sec * 1000000;
}

// Calculate the (squared) absolute values of the complex numbers in the input array.
__global__ void absolute_values(float * d_out, float * d_in){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
	if(idx < ARRAY_SIZE/2)
	{
		float img = d_in[idx*2];
		float real = d_in[idx*2+1];
		d_out[idx] = img*img + real*real;
	}
}

int main(int argc, char ** argv)
{
	// Assume there is only one device
	cudaError_t err = cudaSetDevice(0);
	if (err != cudaSuccess) 
	{
	  printf("Cuda Error: %d - %s\n", err, cudaGetErrorString(err));
	  return -1;
	}
	cudaDeviceReset();

	// Generate host input and output array
    float h_in[ARRAY_SIZE];		// Array with complex numbers
	float h_out[ARRAY_SIZE/2];	// Array with absolute values (half the size)
    for (int i = 0; i < ARRAY_SIZE; i++)
        h_in[i] = float(i);

    // Declare GPU memory pointers
    float * d_in;
    float * d_out;

    // Allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_SIZE * sizeof(float));
    cudaMalloc((void**) &d_out, ARRAY_SIZE/2 * sizeof(float));

    // Transfer input array to GPU
    cudaMemcpy(d_in, h_in, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice);

	timestamp_t before;
	int current = 0;
	int minimum = 0;
	int maximum = 0;
	int average = 0;
	// Launch the kernel with threads only
	printf("Blocks: %d a %d threads for %d values.\n", _blocksPerGrid, _threadsPerBlock, ARRAY_SIZE);
	for(int i = 0; i < NUM_RUNS; i++)
	{
		before = get_timestamp();
		

		absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
		
		///>>> Synchronization can be done by copying data or by an explicit call... hangups in both cases
		//cudaDeviceSynchronize();
		cudaMemcpy(h_out, d_out, ARRAY_SIZE/2 * sizeof(float), cudaMemcpyDeviceToHost);

		// Some statistics
		current = (get_timestamp() - before);
		if(current > minimum*3)
		{
    		if(i != 0)	printf(">>> ");
			printf("Run %d took %d us (three times minimum).\n", i, current);
		}
		if(current < minimum || i == 0)	minimum = current;
		if(current > maximum || i == 0)	maximum = current;
		average += current;
	}

    // Copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_SIZE/2 * sizeof(float), cudaMemcpyDeviceToHost);

	printf("Minimum: %d us, Maximum: %d us, Average: %d us.\n", minimum, maximum, average/NUM_RUNS);

    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

Thanks for help!

Hi,

I tried to profile the kernel part without the memcpy (since the memcpy may be related to I/O control).
Although it sometimes really takes longer, the average execution time is acceptable and close to the minimal case.

Minimum: 22 us, Maximum: 1646 us, Average: 35 us.

absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
current = (get_timestamp() - before);

///>>> Synchronization can be done by copying data or by an explicit call... hangups in both cases
//cudaDeviceSynchronize();
cudaMemcpy(h_out, d_out, ARRAY_SIZE/2 * sizeof(float), cudaMemcpyDeviceToHost);

Hello,

Thanks for your profiling work. I am wondering why some runs need significantly more time than the average (50x)? I need to evaluate data constantly, so this random delay could be critical for me.
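For reference, here is a sketch of how the kernel could be timed on the device itself with CUDA events (assuming the same kernel and device pointers as in the main.cu above). Since the events are recorded on the GPU, host-side scheduling noise would be excluded from the measurement:

```cuda
// Sketch: device-side timing with CUDA events, using the kernel and
// device pointers from the main.cu above. cudaEventElapsedTime reports
// GPU time, so host scheduling jitter is excluded from the measurement.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.1f us\n", ms * 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```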

Hi,

In this case, CUDA is busy accessing global memory but only does a little work per thread.
This makes the I/O per thread much larger than the work per thread, which may lead to a hangup while waiting for I/O.
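If the goal is to increase the work per thread, a grid-stride variant of the kernel (just a sketch, not benchmarked on the TK1) lets a smaller launch process several complex samples per thread:

```cuda
// Grid-stride sketch: each thread handles multiple complex samples,
// so a small launch (e.g. a few blocks of 256 threads) still covers
// the whole array while doing more arithmetic per thread.
__global__ void absolute_values_strided(float *d_out, const float *d_in, int n_complex)
{
    for (int idx = threadIdx.x + blockIdx.x * blockDim.x;
         idx < n_complex;
         idx += blockDim.x * gridDim.x)
    {
        float img  = d_in[idx * 2];
        float real = d_in[idx * 2 + 1];
        d_out[idx] = img * img + real * real;
    }
}

// Launch example: absolute_values_strided<<<2, 256>>>(d_out, d_in, ARRAY_SIZE/2);
```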

I also profiled main.cu with tegrastats:

sudo ./tegrastats

RAM 1225/3995MB (lfb 480x4MB) cpu [98%,0%,0%,100%]@1734 EMC 1%@1600 AVP 1%@80 NVDEC 268 MSENC 268 GR3D 23%@998 EDP limit 1734
GPU utilization is not high, which may indicate the GPU is waiting for other jobs.

Additionally, you can run deviceQuery to check the GPU memory clock rate.

ubuntu@tegra-ubuntu:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery

My TX1:

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3995 MBytes (4188778496 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            72 MHz (0.07 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit

May I know your use case?
Do the complex numbers vary each time, or is there any reused data in your use case?

Thanks.

As far as I know, the TK1 GPU shares its memory with the system memory, so especially when using unified memory the access should be fast.

I need to evaluate data from a software-defined radio, so the complex values change with each data packet. Because of the high-speed acquisition there are at most 80 µs to evaluate the whole data set (and there is more work to do than just calculating the absolute values). The CPU needs about 14 µs for this task.
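For reference, the CPU version is essentially a plain loop over the interleaved samples (a sketch; like the kernel above it computes the squared magnitude, without a sqrtf):

```c
#include <stddef.h>

/* Scalar CPU version of the same computation as the kernel:
 * squared magnitude of interleaved complex samples
 * (like the kernel above, no sqrtf is applied). */
void absolute_values_cpu(float *out, const float *in, size_t n_complex)
{
    for (size_t i = 0; i < n_complex; i++) {
        float img  = in[i * 2];
        float real = in[i * 2 + 1];
        out[i] = img * img + real * real;
    }
}
```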

Here's my deviceQuery output:

Device 0: "GK20A"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.2
  Total amount of global memory:                 1927 MBytes (2020806656 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Clock rate:                                852 MHz (0.85 GHz)
  Memory Clock rate:                             924 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0

To use memory shared between the CPU and GPU, have a look at this:
http://arrayfire.com/zero-copy-on-tegra-k1/

Thanks for the link. I already wrote a little program to compare zero-copy vs. unified memory vs. standard memory. In fact, those hangups can be seen no matter how the data is managed.

Hi,

The hangup is caused by memory-related work, even with shared memory.

But I found something strange when profiling the kernel code:
https://drive.google.com/open?id=0B-fFMM_3Dj9JNmdMVVM5SHF0dkE
This file can be opened with nvvp, which is installed by JetPack 2.3.1 with the option “CUDA Toolkit for Ubuntu 14.04”.

The execution time looks stable, with a (min, max) duration of (20.365 us, 21.875 us).
Could you help to profile your kernel code also?

Please use nvvp for profiling; it is located at /usr/local/cuda/bin:

$ nvvp -o /home/ubuntu/test.prof /path/to/your/code
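If nvvp itself is not available on the device, the command-line profiler nvprof from the same CUDA Toolkit can also write a timeline file that nvvp on the host can import (the paths here are just placeholders):

```shell
/usr/local/cuda/bin/nvprof -o /home/ubuntu/test.prof /path/to/your/code
```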

Hi,

Is there any way to get nvvp without JetPack? I am not using the TK1 evaluation board and had no need for JetPack yet.
Should I run the nvvp command on the host or on the device?

Okay, I installed JetPack. When I try to open the test.prof file, nvvp complains that it cannot find libcuinj.so…

NVVP is a tool contained in the CUDA Toolkit and located at ‘/usr/local/cuda/bin’.
You should have it if you already installed the CUDA Toolkit.

NVVP can run on both the host and the device, but it looks like host profiling has some issues in your environment.
Could you try to profile the kernel code on the device directly?

nvvp -o /home/ubuntu/test.prof /path/to/your/code

Sorry for the inconvenience.

I cannot find nvvp in cuda-repo-l4t-r21.3-6-5-prod_6.5-42_armhf.deb, which I use to get the cross-compiled libraries for my system…

Please run NVVP directly on the device.

$ cd /usr/local/cuda/libnvvp
$ nvvp -o /home/ubuntu/test.prof /path/to/your/code

nvvp is only available on my host machine (Xubuntu 14.04) because I installed JetPack. As mentioned before, I am using cuda-repo-l4t-r21.3-6-5-prod_6.5-42_armhf.deb to get all the cross-compiled libraries I need, and this package does not include nvvp.
So, where can I get a cross-compiled version of nvvp to run directly on the device?

Opening test.prof with nvvp on the host machine fails with “Unsupported file type.”

NVVP should be installed by default with JetPack. We will investigate this issue. Sorry for the inconvenience.

For the hang-up issue, I have rewritten your code with zero-copy and also switched to the CUDA timer for better profiling.

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <helper_timer.h>
#define ARRAY_SIZE 	4088
#define NUM_RUNS	50

int imin(int a, int b)
{
    return (a < b ? a : b);
}

const int _threadsPerBlock = 256;
const int _blocksPerGrid = imin(32, (ARRAY_SIZE/2+_threadsPerBlock-1)/_threadsPerBlock);

StopWatchInterface *timer = NULL;

void startTimer()
{
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
}

void endTimer(const char*str)
{
// cudaThreadSynchronize();  // without a sync here, only the asynchronous kernel launch is timed
    sdkStopTimer(&timer);
    float elapsed_time = sdkGetTimerValue(&timer);
    printf("[%f] - %s\n", elapsed_time, str);
}

float getEndTimer(void)
{
    //cudaThreadSynchronize();
    sdkStopTimer(&timer);
    return sdkGetTimerValue(&timer);
}


// Calculate the (squared) absolute values of the complex numbers in the input array.
__global__ void absolute_values(float * d_out, float * d_in){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if(idx < ARRAY_SIZE/2)
    {
        float img = d_in[idx*2];
        float real = d_in[idx*2+1];
        d_out[idx] = img*img + real*real;
    }
}

int main(int argc, char ** argv)
{
    // Assume there is only one device
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) 
    {
        printf("Cuda Error: %d - %s\n", err, cudaGetErrorString(err));
        return -1;
    }
    cudaDeviceReset();
    sdkCreateTimer(&timer);

    // Generate host input and output array
    float* h_in = NULL;
    float* h_out = NULL;

    cudaHostAlloc( (void **)&h_in, ARRAY_SIZE * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc( (void **)&h_out, ARRAY_SIZE/2 * sizeof(float), cudaHostAllocMapped);

    for (int i = 0; i < ARRAY_SIZE; i++)
        h_in[i] = float(i);

    // Declare GPU memory pointers
    float * d_in;
    float * d_out;

    // Get device pointer from host memory. No allocation or memcpy
    cudaHostGetDevicePointer((void **)&d_in,  (void *) h_in , 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *) h_out, 0);

    float current = 0;
    float minimum = 0;
    float maximum = 0;
    float average = 0;
    printf("Blocks: %d a %d threads for %d values.\n", _blocksPerGrid, _threadsPerBlock, ARRAY_SIZE);

    startTimer();
    absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
    endTimer("first call");

    for(int i = 0; i < NUM_RUNS; i++)
    {
        startTimer();
        absolute_values<<<_blocksPerGrid, _threadsPerBlock>>>(d_out, d_in);
        current = getEndTimer();
    		
        // Some statistics
        if( i==0 )
            maximum = minimum = current;

        if(current > minimum*3)
        {
            if(i != 0) printf(">>> ");
            printf("Run %d took %.2f us (three times minimum).\n", i, current*1000);
        }
        if(current < minimum || i == 0) minimum = current;
        if(current > maximum || i == 0) maximum = current;
        average += current;
    }

    printf("Minimum: %.2f us, Maximum: %.2f us, Average: %.2f us.\n", minimum*1000, maximum*1000, average/NUM_RUNS*1000);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);

    return 0;
}

This was tested on a TX1; please remember to adapt the compile commands for the TK1.

/usr/local/cuda-8.0/bin/nvcc -ccbin g++ -INVIDIA_CUDA-8.0_Samples/common/inc -m64 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o test.o -c topic_981452.cu
/usr/local/cuda-8.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o test test.o  -L/usr/lib/"nvidia-361" -lGL -lGLU -lX11 -lglut

Results are much more stable.

ubuntu@tegra-ubuntu:~$ ./test
Blocks: 8 a 256 threads for 4088 values.
[0.211000] - first call
Minimum: 20.00 us, Maximum: 31.00 us, Average: 21.94 us.

Hi qojote,

Have you tried the suggestion? Any improvement?

Thanks

Hi,

I will test your code as soon as my TK1 is working again with the newest BSP.

Thanks for your help and patience.

Hi qojote,

Have you managed to test our code on your TK1?
Any result could be shared?

Thanks

Hi,

Finally I managed to find some time to test your code.
Unfortunately, the code does not compile at all:

/opt/gcc-linaro/gcc-linaro-5.2.1/arm-linux-gnueabihf/include/c++/5.2.1/bits/locale_classes.h(789) (col. 20): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r343/r343_00/drivers/compiler/edg/EDG_4.8/src/cp_gen_be.c", line 8408

1 catastrophic error detected in the compilation of "/tmp/tmpxft_00003922_00000000-4_main2.cpp4.ii".
Compilation aborted.
Aborted (core dumped)

This is the command that fails:

/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc -g -O3 -ccbin arm-linux-gnueabihf-g++ -gencode arch=compute_32,code=sm_32 --compile --relocatable-device-code=false --target-cpu-architecture ARM -m32 -x cu -o "main2.o" "main2.cu"

I have seen this error before when I tried to use Thrust (https://devtalk.nvidia.com/default/topic/981454/cuda-on-tk1-and-thrust/).