How to efficiently modify a very big device array using a CUDA kernel?

Hi

I want to modify a very big device array of size N*C*H*W efficiently. I tried the approach below, but it is taking a lot of time according to nvprof. Can someone suggest a better solution?

__global__ void modifyArray( float* data_d, float a, float b, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float c = (255.0f * 128.0f) / a / b;
    if (i < noOfElements)
    {
        data_d[i] = data_d[i] / c;
    }
}



Calling this kernel as below:

modifyArray<<<((N*C*H*W)/1024) + 1, 1024>>>( data_d, a, b, N*C*H*W);

Here I am using 1024 threads per block.

I am not very experienced with choosing thread and block configurations, which might be what makes the difference here.
Can anyone please suggest how to reduce the time this kernel takes in nvprof?

Thanks in advance!

__global__ void modifyArray( float* data_d, float factor, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < noOfElements)
    {
        data_d[i] *= factor;
    }
}

modifyArray<<<((N*C*H*W)/1024) + 1, 1024>>>( data_d , (b*a)/(255.0*128.0), N*C*H*W);

Your kernel at this point should be memory-bound.

The profiler should be telling you that your code is memory bound. With modern processors (in particular GPUs), this is often the case. Moving around data is expensive, doing computation is cheap, both in terms of execution time and in terms of energy expended. So what can you do? Increase the amount of computation performed per unit of data moved. Presumably there are other processing steps in your application besides simply scaling all the elements of this array by a particular factor. Examine how you can combine those with the scaling step, with a goal of minimizing overall data movement.

In terms of second-order effects, very large thread blocks rarely make the best use of hardware resources. This is due to the granularity at which various resources are allocated. A rule of thumb is to start with a block size of between 128 and 256 threads, and only change that if there is a solid indication that doing so is beneficial (usually that indication is given by appropriate experiments). In terms of optimizing use of the load/store unit resources of a GPU, it often pays off to make each memory access as wide as possible, in particular if all of the data is actually used. So instead of using an array of floats, consider using an array of float4.

Note that the Best Practices Guide that ships with CUDA will touch on these kinds of issues, so consider consulting it frequently.

Picking up njuffa's float4 suggestion, and adding a grid-stride loop:

__global__ void modifyArray( float4* data_d, float factor, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (i < noOfElements)
    {
        float4 temp = data_d[i];          // one 16-byte load
        temp.x *= factor;
        temp.y *= factor;
        temp.z *= factor;
        temp.w *= factor;
        data_d[i] = temp;                 // one 16-byte store
        i += gridDim.x * blockDim.x;      // grid-stride step
    }
}

modifyArray<<<512, 1024>>>( (float4 *)data_d , (b*a)/(255.0*128.0), N*C*H*W/4);

This assumes N*C*H*W is whole-number divisible by 4, and that data_d has been allocated by e.g. cudaMalloc (without further pointer adjustment), which guarantees the 16-byte alignment needed for float4 access.
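If the element count is not a multiple of 4, one possible approach (a sketch, not from this thread; the kernel and parameter names are illustrative) is to process the bulk with float4 and let a single thread clean up the remaining 0-3 scalar elements:

```
// Sketch: vectorized bulk plus a scalar tail for counts not divisible by 4.
// data4 and data alias the same allocation; the two loops touch disjoint ranges.
__global__ void modifyArrayWithTail(float4* data4, float* data, float factor,
                                    int noOfVec4, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // bulk: vectorized grid-stride loop over noOfVec4 = noOfElements / 4 elements
    for (; i < noOfVec4; i += gridDim.x * blockDim.x)
    {
        float4 t = data4[i];
        t.x *= factor; t.y *= factor; t.z *= factor; t.w *= factor;
        data4[i] = t;
    }
    // tail: at most 3 leftover scalars, handled by one thread
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        for (int j = noOfVec4 * 4; j < noOfElements; ++j)
            data[j] *= factor;
    }
}
```

A launch would then look like modifyArrayWithTail<<<512, 256>>>((float4 *)data_d, data_d, factor, n/4, n); with n the total element count.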

You can “shmoo” (vary and time) the 512 and 1024 launch parameters to see what is fastest on your GPU.
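One way to run that shmoo (a sketch; the candidate grid and block sizes below are arbitrary choices, not recommendations from this thread) is to time each configuration with CUDA events:

```
// Sketch: time the float4 kernel for several grid/block configurations.
#include <cstdio>
#include <cuda_runtime.h>

void shmoo(float4* data_d, float factor, int noOfVec4)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int grids[]  = {128, 256, 512, 1024};
    int blocks[] = {128, 256, 512, 1024};
    for (int g : grids)
        for (int b : blocks)
        {
            cudaEventRecord(start);
            modifyArray<<<g, b>>>(data_d, factor, noOfVec4);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("grid %4d  block %4d : %.3f ms\n", g, b, ms);
        }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

In practice you would also want to run each configuration a few times and take the best or median timing, since the first launch includes one-time overheads.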

This kernel will probably be memory bandwidth bound. Instead of doing this scaling operation by itself, see if your overall workflow will allow you to tack it on to the end of the previous kernel, or add it to the beginning of the next one. This is called “fusion” of operations, and aims to minimize the number of loads and stores to complete a particular processing sequence.
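As a hedged illustration of that fusion idea (the follow-on computation here is hypothetical, not from this thread): instead of a separate scaling pass, fold the multiply into the kernel that consumes the data, so each element is loaded and stored only once.

```
// Sketch: fuse the scaling into the next processing step.
// "nextStep" is a placeholder for whatever your pipeline actually does
// with the scaled values.
__device__ float nextStep(float x)   // hypothetical follow-on computation
{
    return x > 0.0f ? x : 0.0f;      // e.g. a ReLU-style clamp
}

__global__ void fusedKernel(float* data_d, float factor, int noOfElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < noOfElements)
    {
        float x = data_d[i] * factor;  // the scaling step, fused in
        data_d[i] = nextStep(x);       // the "next kernel's" work
    }
}
```

This halves the memory traffic compared with running the scale and the next step as two separate kernels.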

Thank you very much txbob and njuffa for your valuable suggestions.
The time spent in the kernel is now reduced, thanks to your suggestions.