How to efficiently modify a very big device array using a CUDA kernel?

Hi

I want to modify a very big device array of size N*C*H*W efficiently. I tried the approach below, but it is taking a lot of time according to nvprof. Can someone suggest a better solution?

__global__ void modifyArray( float* data_d, float a, float b, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float c = (255.0f * 128.0f) / a / b;
    if (i < noOfElements)
    {
        data_d[i] = data_d[i] / c;
    }
}



Calling this kernel as below:

modifyArray<<<((N*C*H*W)/1024) + 1, 1024>>>( data_d, a, b, N*C*H*W);

Here I am using 1024 threads per block.

I am not very experienced with choosing thread and block configurations, which might be what makes the difference here.
Can anyone please suggest how to reduce the time this kernel takes in nvprof?

Thanks in advance!

__global__ void modifyArray( float* data_d, float factor, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < noOfElements)
    {
        data_d[i] *= factor;
    }
}

modifyArray<<<((N*C*H*W)/1024) + 1, 1024>>>( data_d , (b*a)/(255.0*128.0), N*C*H*W);

Your kernel at this point should be memory-bound.

The profiler should be telling you that your code is memory bound. With modern processors (in particular GPUs), this is often the case. Moving around data is expensive, doing computation is cheap, both in terms of execution time and in terms of energy expended. So what can you do? Increase the amount of computation performed per unit of data moved. Presumably there are other processing steps in your application besides simply scaling all the elements of this array by a particular factor. Examine how you can combine those with the scaling step, with a goal of minimizing overall data movement.

In terms of second-order effects, very large thread blocks rarely make the best use of hardware resources. This is due to the granularity at which various resources are allocated. A rule of thumb is to start with a block size of between 128 and 256 threads, and only change that if there is a solid indication that doing so is beneficial (usually that indication is given by appropriate experiments). In terms of optimizing use of the load/store unit resources of a GPU, it often pays off to make each memory access as wide as possible, in particular if all of the data is actually used. So instead of using an array of floats, consider using an array of float4.

Note that the Best Practices Guide that ships with CUDA will touch on these kinds of issues, so consider consulting it frequently.

Picking up njuffa's float4 suggestion, and adding a grid-stride loop:

__global__ void modifyArray( float4* data_d, float factor, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (i < noOfElements)
    {
        float4 temp = data_d[i];          // one 16-byte load
        temp.x *= factor;
        temp.y *= factor;
        temp.z *= factor;
        temp.w *= factor;
        data_d[i] = temp;                 // one 16-byte store
        i += gridDim.x * blockDim.x;      // grid-stride step
    }
}

modifyArray<<<512, 1024>>>( (float4 *)data_d , (b*a)/(255.0*128.0), N*C*H*W/4);

This assumes N*C*H*W is whole-number divisible by 4, and that data_d has been allocated by e.g. cudaMalloc (without further pointer adjustment), which guarantees the 16-byte alignment needed for float4 access.
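If the element count is not a multiple of 4, one possible approach (a sketch, not from this thread; the kernel and parameter names are illustrative) is to process the bulk with float4 and let a single thread clean up the remaining 0-3 scalar elements:

```
// Sketch: vectorized bulk plus a scalar tail for counts not divisible by 4.
// data4 and data alias the same allocation; the two loops touch disjoint ranges.
__global__ void modifyArrayWithTail(float4* data4, float* data, float factor,
                                    int noOfVec4, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // bulk: vectorized grid-stride loop over noOfVec4 = noOfElements / 4 elements
    for (; i < noOfVec4; i += gridDim.x * blockDim.x)
    {
        float4 t = data4[i];
        t.x *= factor; t.y *= factor; t.z *= factor; t.w *= factor;
        data4[i] = t;
    }
    // tail: at most 3 leftover scalars, handled by one thread
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        for (int j = noOfVec4 * 4; j < noOfElements; ++j)
            data[j] *= factor;
    }
}
```

A launch would then look like modifyArrayWithTail<<<512, 256>>>((float4 *)data_d, data_d, factor, n/4, n); with n the total element count.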

You can “shmoo” (vary and time) the 512 and 1024 launch parameters to see what is fastest on your GPU.
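One way to run that shmoo (a sketch; the candidate grid and block sizes below are arbitrary choices, not recommendations from this thread) is to time each configuration with CUDA events:

```
// Sketch: time the float4 kernel for several grid/block configurations.
#include <cstdio>
#include <cuda_runtime.h>

void shmoo(float4* data_d, float factor, int noOfVec4)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int grids[]  = {128, 256, 512, 1024};
    int blocks[] = {128, 256, 512, 1024};
    for (int g : grids)
        for (int b : blocks)
        {
            cudaEventRecord(start);
            modifyArray<<<g, b>>>(data_d, factor, noOfVec4);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("grid %4d  block %4d : %.3f ms\n", g, b, ms);
        }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

In practice you would also want to run each configuration a few times and take the best or median timing, since the first launch includes one-time overheads.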

This kernel will probably be memory bandwidth bound. Instead of doing this scaling operation by itself, see if your overall workflow will allow you to tack it on to the end of the previous kernel, or add it to the beginning of the next one. This is called “fusion” of operations, and aims to minimize the number of loads and stores to complete a particular processing sequence.
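As a hedged illustration of that fusion idea (the follow-on computation here is hypothetical, not from this thread): instead of a separate scaling pass, fold the multiply into the kernel that consumes the data, so each element is loaded and stored only once.

```
// Sketch: fuse the scaling into the next processing step.
// "nextStep" is a placeholder for whatever your pipeline actually does
// with the scaled values.
__device__ float nextStep(float x)   // hypothetical follow-on computation
{
    return x > 0.0f ? x : 0.0f;      // e.g. a ReLU-style clamp
}

__global__ void fusedKernel(float* data_d, float factor, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < noOfElements)
    {
        float x = data_d[i] * factor;  // the scaling step, fused in
        data_d[i] = nextStep(x);       // the "next kernel's" work
    }
}
```

This halves the memory traffic compared with running the scale and the next step as two separate kernels.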

Thank you very much txbob and njuffa for your valuable suggestions.
The time spent in the kernel is now reduced, thanks to your suggestions.