Does CUDA automatically unroll loops?

OK - a bit of background first. Essentially, what I am trying to do is write a simple program which measures how long it takes CUDA to access a value stored in constant memory. In order to do this, I have written the code below, which creates the constant memory value, writes it to the device, and then launches a kernel in which each thread loops X times, reading the value from constant memory into a local variable.

However, the problem I'm having is that no matter how many times I loop inside the kernel, the results remain the same. I have to conclude that either the loop is being unrolled by NVCC or there's an error in my code somewhere that I haven't caught. I'm posting this here in the hope that someone looking at my code with a fresh pair of eyes might either spot the error or be able to tell me what is happening in the kernel.

I’ve tried varying the value of numKernelLoops from 10 to 10000000000000000 but have seen no difference in the timings. I understand that there is some caching involved in using constant memory, but I can’t believe that the device would iterate through that many loops without some increase in the time taken.

I’d be very grateful for any advice.

My code in full:

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>

//////////////////////////////////////
////////// GLOBAL VARIABLES //////////
//////////////////////////////////////

#define numKrytenBlocks 1
#define numKrytenThreads 512
#define numDrygalskiBlocks 16
#define numDrygalskiThreads 512
#define numKernelLoops 1000000

__constant__ int c;

////////////////////////////////
////////// PROTOTYPES //////////
////////////////////////////////

__host__ void runTest(char hostInd);
__global__ void runkernel();

int main(int argc, char *argv[])
{
  ////////// CHECK COMMAND LINE ARGS ARE CORRECT //////////
  //There should be 1 argument indicating if using Kryten or Drygalski
  if (argc != 2)
  {
    printf("\n**** ERROR - MISSING COMMAND LINE ARGUMENT ****\n\n");
    printf("Usage: %s <host indicator (K or D)>\n\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  //Populate host variable with argv[1]
  char *hostarg = argv[1];
  char hostInd = hostarg[0];

  //Call the runTest function, passing in host variable
  runTest(hostInd);

  return EXIT_SUCCESS;
}

__host__ void runTest(char hostInd)
{
  ////////// VARIABLES //////////
  //The number of blocks
  int blocks;
  //The number of threads
  int threads;
  //The cuda events for timing the kernel
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  //The time taken
  float elapsedTime;

  ////////// SET BLOCKS & THREADS VALUES //////////
  if (hostInd == 'K')
  {
    blocks = numKrytenBlocks;
    threads = numKrytenThreads;
  }
  if (hostInd == 'D')
  {
    blocks = numDrygalskiBlocks;
    threads = numDrygalskiThreads;
  }

  ////////// CREATE CONSTANT MEMORY VARIABLE //////////
  int temp_const = 42;
  CUDA_SAFE_CALL(cudaMemcpyToSymbol("c", &temp_const, sizeof(int), 0, cudaMemcpyHostToDevice));

  ////////// START TIMING //////////
  CUDA_SAFE_CALL(cudaEventRecord(start, 0));

  ////////// LAUNCH THE KERNEL //////////
  runkernel<<<blocks, threads>>>();

  ////////// STOP TIMING //////////
  CUDA_SAFE_CALL(cudaEventRecord(stop, 0));
  CUDA_SAFE_CALL(cudaEventSynchronize(stop));
  CUDA_SAFE_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));

  printf("%.8f ms\n", elapsedTime);
}

__global__ void runkernel(void)
{
  ////////// VARIABLES //////////
  //The loop counter
  long i;
  //The local variable
  int local;
  //A TEST VARIABLE
  int test = 10;

  ////////// LOOP THE LOOP //////////
  //Inside the loop read the value in the constant memory variable into the local variable
  for (i = 0; i < numKernelLoops; i++)
  {
    local = c;
  }
}

The compiler contains unrollers, which are controlled by heuristics, and thus may or may not kick in for any given piece of code. If you need tighter control over unrolling, try

#pragma unroll N

where ‘N’ is the unroll count. #pragma unroll 1 will keep a loop rolled. I would suggest using cuobjdump to look at the machine code generated to assess whether the generated code does what you want it to do, as PTX is just an intermediate format.
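For illustration, here is a minimal sketch of how the pragma sits directly above the loop it applies to (the kernel name, loop bound, and buffer here are made up for the example, not taken from your code):

__global__ void scaleArray(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread handles four consecutive elements; the pragma asks the
    // compiler to unroll this short inner loop by a factor of 4.
    #pragma unroll 4
    for (int j = 0; j < 4; j++)
    {
        int i = idx * 4 + j;
        if (i < n)
            data[i] *= 2.0f;
    }
}

To see what was actually generated, something along the lines of cuobjdump -sass <your binary> dumps the final machine code (SASS), which is what you want to inspect rather than the PTX.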


Hi njuffa,

Thanks for your reply on this - I tried using

#pragma unroll 1

but this didn’t seem to make any difference. I haven’t used cuobjdump before, so I will need to get to grips with that before I can analyse the generated code as you suggest. However, something else has come to mind and I was wondering if you could advise on it? In the code I posted in my OP, I don’t actually do anything with the value that I am reading from constant memory inside the loop in the kernel. It has been suggested to me that the compiler may be identifying that this value is not being used and is optimising away the loop as a result - could that be the case?

Your kernel code will be compiled to a null stub; there will be no loop at all. None of the calculations in the loop are written to memory, and the compiler is smart enough to recognise that the loop and its contents are redundant. Because of this it will optimize the entire thing away as dead code.
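For what it's worth, here is a minimal sketch (not your original kernel) of one way to defeat that: actually use the value read from constant memory and write the result out to global memory so the loop has an observable effect. The sink buffer is hypothetical and would need to be allocated with cudaMalloc and passed in at launch.

__global__ void runkernel(int *sink)
{
    int local = 0;

    for (long i = 0; i < numKernelLoops; i++)
    {
        local += c;   // use the value read from constant memory
    }

    // Writing the accumulated result out gives the loop a visible side
    // effect, so the compiler can no longer remove it as dead code.
    sink[blockIdx.x * blockDim.x + threadIdx.x] = local;
}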


Thanks very much for confirming that for me - that’s a huge help.