Does CUDA automatically unroll loops?

OK - a bit of background first. Essentially what I am trying to do is write a simple program which measures the length of time it takes CUDA to access a value stored in constant memory. In order to do this I have written the code below, which declares the constant memory variable, copies a value to the device and then launches a kernel in which each thread loops X times, reading the value from constant memory into a local variable.

However the problem that I’m having is that no matter how many times I loop inside the kernel, the results seem to remain the same. I have to conclude that either the loop is being unrolled by NVCC or there’s an error in my code somewhere and I haven’t caught it. I’m posting this here in the hope that someone looking at my code with a fresh pair of eyes might either spot the error or be able to tell me what is happening in the kernel.

I’ve tried varying the value of numKernelLoops from 10 to 10000000000000000 but have seen no difference in the timings. I understand that there is some caching involved in using constant memory, but I can’t believe that the device would iterate through that many loops without some increase in the time taken.

I’d be very grateful for any advice.

My code in full:

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>

////////// GLOBAL VARIABLES //////////

#define numKrytenBlocks 1
#define numKrytenThreads 512
#define numDrygalskiBlocks 16
#define numDrygalskiThreads 512
#define numKernelLoops 1000000

__constant__ int c;

////////// PROTOTYPES //////////

__host__ void runTest(char hostInd);
__global__ void runkernel(void);

int main(int argc, char *argv[])
{
    //Check command line args are correct - should be 1 indicating if using Kryten or Drygalski
    if (argc != 2)
    {
        printf("\n**** ERROR - MISSING COMMAND LINE ARGUMENT ****\n\n");
        printf("Usage: %s <host indicator (K or D)>\n\n", argv[0]);
        return EXIT_FAILURE;
    }

    //Populate host variable with argv[1]
    char *hostarg = argv[1];
    char hostInd = hostarg[0];

    //Call the runTest function, passing in host variable
    runTest(hostInd);

    return EXIT_SUCCESS;
}

__host__ void runTest(char hostInd)
{
    ////////// VARIABLES //////////

    //The number of blocks
    int blocks = 0;

    //The number of threads
    int threads = 0;

    //The cuda events for timing the kernel
    cudaEvent_t start, stop;

    //The time taken
    float elapsedTime;

    ////////// SET BLOCKS & THREADS VALUES //////////

    if (hostInd == 'K')
    {
        blocks = numKrytenBlocks;
        threads = numKrytenThreads;
    }

    if (hostInd == 'D')
    {
        blocks = numDrygalskiBlocks;
        threads = numDrygalskiThreads;
    }

    //Create the timing events before recording them
    CUDA_SAFE_CALL(cudaEventCreate(&start));
    CUDA_SAFE_CALL(cudaEventCreate(&stop));

    int temp_const = 42;
    CUDA_SAFE_CALL(cudaMemcpyToSymbol("c", &temp_const, sizeof(int), 0, cudaMemcpyHostToDevice));

    ////////// START TIMING //////////

    CUDA_SAFE_CALL(cudaEventRecord(start, 0));

    ////////// LAUNCH THE KERNEL //////////

    runkernel<<<blocks, threads>>>();

    ////////// STOP TIMING //////////

    CUDA_SAFE_CALL(cudaEventRecord(stop, 0));

    //Wait for the stop event to complete before reading the elapsed time
    CUDA_SAFE_CALL(cudaEventSynchronize(stop));
    CUDA_SAFE_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));

    printf("%.8f ms\n", elapsedTime);
}

__global__ void runkernel(void)
{
    ////////// VARIABLES //////////

    //The loop counter
    long i;

    //The local variable
    int local;

    int test = 10;

    ////////// LOOP THE LOOP //////////

    //Inside the loop read the value in the constant memory variable into the local variable
    for (i = 0; i < numKernelLoops; i++)
    {
        local = c;
    }
}

The compiler contains unrollers, which are controlled by heuristics, and thus may or may not kick in for any given piece of code. If you need tighter control over unrolling, try

#pragma unroll N

where ‘N’ is the unroll count. #pragma unroll 1 will keep a loop rolled. I would suggest using cuobjdump to look at the machine code generated to assess whether the generated code does what you want it to do, as PTX is just an intermediate format.
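For reference, the pragma goes immediately before the loop it is meant to control; a minimal sketch, using a loop count and constant variable like the ones in the code above:

```cuda
__constant__ int c;

__global__ void runkernel(void)
{
    int local;

    // An unroll count of 1 disables unrolling for this loop only;
    // without the pragma, the compiler's heuristics decide whether
    // and how far to unroll.
    #pragma unroll 1
    for (long i = 0; i < 1000000; i++)
    {
        local = c;
    }
}
```

Note that the pragma only influences unrolling; it does not stop other optimizations from acting on the loop.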

Hi njuffa,

Thanks for your reply on this - I tried using

#pragma unroll 1

but this didn’t seem to make any difference. I haven’t used cuobjdump before, so I will need to get to grips with that before I can analyse the generated code as you suggest. However, something else has come to mind and I was wondering if you could advise on it? In the code I posted in my OP, I don’t actually do anything with the value that I read from constant memory inside the kernel loop. It has been suggested to me that the compiler may notice that this value is never used and optimise the loop away as a result - could that be the case?

Your kernel code will be compiled to a null stub; there will be no loop at all. Nothing calculated in the loop is ever written to memory, and the compiler is smart enough to recognise that the loop and its contents are redundant, so it optimises the entire thing away as dead code.
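One common way to keep such a timing loop alive is to make its result observable, for example by writing it out to global memory. A hedged sketch of how the kernel above could be changed (the `out` buffer is an assumption here, not part of the original code, and would need to be allocated with cudaMalloc and passed in at launch):

```cuda
#define numKernelLoops 1000000

__constant__ int c;

__global__ void runkernel(int *out)
{
    int local = 0;

    //Accumulate rather than just assign, so every iteration
    //contributes to the final value...
    for (long i = 0; i < numKernelLoops; i++)
    {
        local += c;
    }

    //...and write the result to global memory. Because the value is
    //now observable outside the kernel, the compiler cannot discard
    //the loop as dead code.
    out[blockIdx.x * blockDim.x + threadIdx.x] = local;
}
```

Checking the generated machine code with cuobjdump, as suggested above, is still the only way to be sure the loop survived.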

Thanks very much for confirming that for me, that’s a huge help