Newbie question about applying a scalar function to a matrix

Hi.

I’m a beginner with CUDA, and have what I believe to be a rather simple problem, but I haven’t been able to find many clues.

I have a reasonably large matrix (say 2 000 * 6 000 values), and would like to replace each value with the result of a scalar function (say tanh(x), where x is the current value).

I managed to get it to work, but it’s rather slow. I guess the memory bandwidth is the limiting factor. Is there any way to speed this up? I have considered trying to copy to private or shared memory, but I’m not sure how that should be done, or if it would be of any benefit.

I’d be happy for any suggestions!

/Javerberg

Kernel:
__global__ void DevFinalizeFinalLayerOutput(float* c, unsigned int width, unsigned int height)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < width * height)        // grid-stride loop over all elements
    {
        c[tid] = tanhf(c[tid]);
        tid += blockDim.x * gridDim.x;
    }
}

Host:
int FinalizeFinalLayerOutput(float* dev, unsigned int width, unsigned int height)
{
    const unsigned int noOfthreads = 128;
    const unsigned int maxBlocks = 128;
    unsigned int noOfBlocks = (width + noOfthreads - 1) / noOfthreads;
    if (noOfBlocks > maxBlocks)
        noOfBlocks = maxBlocks;
    DevFinalizeFinalLayerOutput<<<noOfBlocks, noOfthreads>>>(dev, width, height);
    return 0;
}

Can you say what times you are getting, how you are doing the measurement (CUDA timers, OS timers, etc), and what CUDA operations (initializing the device, memory allocation, etc) are being done?

With a kernel like this, there is no point in using shared memory since you are streaming the data directly from global memory in coalesced order and writing it back out again.

One thing I have recently learned (and I’ve been doing this for 5+ years) is that if you’re doing an operation on a ‘float’ and it doesn’t rely on neighbor values, you almost always get faster times by operating on multiple values per thread, at the cost of a slight increase in register usage. If your register usage isn’t high already (which is definitely the case here), cast the input pointer to float4* and compute four values at once (sometimes with the overloaded math functions from cutil_math.h, which you should copy into your own header and tweak as needed, for example as below or with complex types; don’t rely on nVidia to keep supporting it, since it’s already unsupported). You may also need to adjust the launch call and the endpoint thread checks; my assumption here is that you can guarantee a size that is a multiple of four, or the endpoint checking gets a little ugly.

Note the device function in this case isn’t ‘overloaded’; it just makes the code below a little cleaner. You could certainly forgo it and hard-code the math right in the kernel. You could also merge several of the kernel lines together, but in the end the compiler will essentially do this for you anyway.

I still don’t understand why this helps as much as it does. I understand amortizing the function overhead and management into fewer thread blocks and fetch calls, but I assumed 4 32-bit fetches were the same as 1 128-bit fetch. Profiling shows that this doesn’t appear to be the case.

This has worked well for me, but only on ‘simple’ arithmetic (mul/adds). The additional ops of the transcendental functions (like you have here) may not help you as much, but if you’re I/O limited (which you most likely are), it’s worth a try.

The changes are in the code below. Haven’t checked for accuracy; caveat lector.

inline __device__ float4 tanhf4(float4 input) {
    return make_float4(tanhf(input.x), tanhf(input.y), tanhf(input.z), tanhf(input.w));
}

Kernel:
__global__ void DevFinalizeFinalLayerOutput(float* c, unsigned int width, unsigned int height)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while ((tid << 2) < width * height)     // each thread handles 4 floats
    {
        float4* newptr = (float4*)c;        // reinterpret as float4 (requires 16-byte alignment)
        float4 oldval = newptr[tid];
        newptr[tid] = tanhf4(oldval);

        tid += blockDim.x * gridDim.x;
    }
}

Host:
int FinalizeFinalLayerOutput(float* dev, unsigned int width, unsigned int height)
{
    const unsigned int noOfthreads = 128;
    const unsigned int maxBlocks = 128;
    unsigned int noOfBlocks = ((width >> 2) + noOfthreads - 1) / noOfthreads;
    if (noOfBlocks > maxBlocks)
        noOfBlocks = maxBlocks;
    DevFinalizeFinalLayerOutput<<<noOfBlocks, noOfthreads>>>(dev, width, height);
    return 0;
}

The hardware needs to track each outstanding load. The hardware structures used for that have limited capacity. Therefore, one wide access can be more efficient than multiple narrow accesses, as less of the hardware structure is used for the same amount of load data.

Care needs to be exercised when casting a pointer of a narrow type to a pointer of a wider type. A float has a four-byte alignment requirement, whereas a float4 has a 16-byte alignment requirement. The GPU requires all accesses to be aligned to natural boundaries.
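For illustration only (a sketch of the idea, not code from this thread): a launch-time check along these lines can decide whether the float4 path is safe, with a fallback to the scalar kernel otherwise. The function name is a placeholder.

#include <stdint.h>

// Hypothetical helper: decide whether the float4 kernel can be used for this buffer.
// Pointers returned by cudaMalloc are well aligned, but offset pointers may not be.
bool CanUseFloat4Path(const float* dev, unsigned int width, unsigned int height)
{
    bool aligned16   = ((uintptr_t)dev % 16) == 0;        // float4 needs 16-byte alignment
    bool multipleOf4 = ((width * height) % 4) == 0;       // avoids a ragged tail
    return aligned16 && multipleOf4;
}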

Thank you for your response!

I have used the Nsight profiler tool, and noticed that the occupancy was only 6 percent, and that only one active warp was used. For a beginner like myself that sounds strange.
In my test I first do a matrix multiplication (101 x 153) * (153 x 300 000). The result is, obviously, a matrix of size (101 x 300 000). The multiplication only takes about 25 ms (using cuBLAS), but applying tanh to the result takes about 190 ms. This also sounds strange to me.

On the other hand, the main bottleneck is (or so I believe) data transfer to and from the card. The total time for all kernel calls is 500 ms, but the total time including transfer is almost 2 seconds. I will try to time the data transfer, as soon as I figure out a good way to do it.
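A minimal way to time a single transfer with CUDA events might look like the following sketch (untested; hostBuf, devBuf, and bytes are placeholder names, not code from this thread):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: measure one host-to-device copy with CUDA events, in milliseconds.
float TimeHostToDeviceCopy(void* devBuf, const void* hostBuf, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until the copy has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}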

I’m not much of a C programmer, and only wrote a small C DLL (maybe 200 lines of code) to act as a wrapper. The main application is written in .NET.

/Javerberg

Cool! I still have to do some adaptations (my code above was somewhat simplified), but my preliminary tests show a huge difference: down from 190 ms to 100 ms.

Thanks!

/Javerberg

What GPU are you using, and is tanh() invoked on single-precision or double precision data?

As others have already pointed out, this portion of your code is likely memory bound. I did a quick experiment on a K20c. Single-precision tanh() executes at around 30 GigaOperations/sec. Requiring four bytes of input and four bytes of output per element, the update would need 240 GB/sec of usable bandwidth. The K20c, with ECC turned on, provides around 150 GB/sec of usable bandwidth (when using 64-bit accesses). Thus the matrix update would clearly be memory-throughput limited on this GPU.

The only scenario I currently envision where the matrix update would be compute limited is if you use double-precision tanh() on a GPU with reduced double-precision throughput, e.g. consumer card or sm_1x device.

For the stated matrix size of 30.3 million elements, 190 ms sounds very high, and the reported occupancy seems way too low. Try the following configuration: let each thread handle four elements, configure the thread block size to be 128 threads, then use as many thread blocks as needed to cover the 30.3 million elements (about 59,180 thread blocks, which can still be handled by a 1D grid).

What is included in the 190ms? Does this include transferring data between the host and the device? In that case you would be limited by the much lower PCIe bandwidth (currently limited to 6 GB/s on many platforms). The idea of GPU acceleration is to keep data resident on the GPU as long as possible, and to minimize host/device transfers.

Here is some code illustrating the approach I advocated above. Lots of blocks, lots of threads, and not much work done per thread. You could modify it to use float4 if desired, or try different thread counts per block (block sizes of 256 and 384 are probably worth trying depending on your platform).

#define func(x) tanh(x)

template <typename T>
__global__ void apply_kernel (T *c, int len)
{
    int totalThreads = gridDim.x * blockDim.x;
    int blockStart = blockDim.x * blockIdx.x;
    for (int i = blockStart + threadIdx.x; i < len; i += totalThreads) {
        c[i] = func (c[i]);
    }
}

void apply_func (float* c, int width, int height)
{
    dim3 dimBlock;
    int len = width * height; // assume rows are stored contiguously
    dimBlock.x = 128;         // heuristic
    int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520; // fit it into a 1D grid
    dim3 dimGrid (threadBlocks);
    apply_kernel<float><<<dimGrid,dimBlock>>>(c, len);
}
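If you want to try the float4 variant mentioned above, a rough, untested sketch could look like this (my own assumption: the pointer comes straight from cudaMalloc and is therefore 16-byte aligned; the tail code covers lengths that are not a multiple of four):

// Sketch only: float4 version of the grid-stride kernel above.
__global__ void apply_kernel_f4 (float *c, int len)
{
    int totalThreads = gridDim.x * blockDim.x;
    int vecLen = len / 4;                        // number of whole float4 elements
    float4 *c4 = reinterpret_cast<float4 *>(c);

    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < vecLen; i += totalThreads) {
        float4 v = c4[i];
        v.x = tanhf (v.x);
        v.y = tanhf (v.y);
        v.z = tanhf (v.z);
        v.w = tanhf (v.w);
        c4[i] = v;
    }
    // Tail: the remaining 0-3 elements are handled as scalars by the first few threads
    int tail = 4 * vecLen + blockDim.x * blockIdx.x + threadIdx.x;
    if (tail < len) {
        c[tail] = tanhf (c[tail]);
    }
}

Launching it would follow the same idea as apply_func above, except the block count would be based on roughly len/4 work items.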

Good call on that. I tend to gloss over those things 'cause I’m usually dealing with 2D structures, and the pitch requirement is much more restrictive than the float4/uint4/int4/etc. requirements.

@MattWarmuth

In order to understand the difference between 32-bit and 128-bit access you have to understand the memory access patterns.

For all use cases assume the data array is 128 byte aligned.

  1. array of structures accessed as 4 32-bit accesses
  2. structure of arrays accessed as 4 32-bit accesses
  3. array of structures accessed as 1 128-bit access
Number of Transactions to Load 128-bits per Thread
CASE 1 16 transactions
CASE 2 4  transactions
CASE 3 4  transactions

CASE 2 and CASE 3 will give the best performance. CASE 1 will likely have a high L1 cache hit rate on Fermi; however, Kepler does not cache global memory loads in L1.

CASE 1
__global__ void aos32(float* data)
{
    int lane = __laneid();          // returns index in warp (0-31)
    float4 temp;

    temp.x = data[lane * 4 + 0];    // 4 transactions
    temp.y = data[lane * 4 + 1];    // 4 transactions
    temp.z = data[lane * 4 + 2];    // 4 transactions
    temp.w = data[lane * 4 + 3];    // 4 transactions
                                    // 16 transactions total
}

CASE 2
__global__ void soa32(float* data)
{
    int lane = __laneid();          // returns index in warp (0-31)
    float4 temp;

    temp.x = data[lane + 0];        // 1 transaction
    temp.y = data[lane + 32];       // 1 transaction
    temp.z = data[lane + 64];       // 1 transaction
    temp.w = data[lane + 96];       // 1 transaction
                                    // 4 transactions total
}

CASE 3 aos128
__global__ void load128(float4* data)
{
    int lane = __laneid();          // returns index in warp (0-31)
    float4 temp;
    temp = data[lane];              // 4 transactions
}

Table of Transactions

Each cell contains the byte offset accessed by the thread during that transaction.

CASE 1
Transactions
Lane    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
0       0               4               8               12
1       16              20              24              28
2       32              36              40              44
3       48              52              56              60
4       64              68              72              76
5       80              84              88              92
6       96              100             104             108
7       112             116             120             124
8           128             132             136             140
9           144             148             152             156
10          160             164             168             172
11          176             180             184             188
12          192             196             200             204
13          208             212             216             220
14          224             228             232             236
15          240             244             248             252
16              256             260             264             268
17              272             276             280             284
18              288             292             296             300
19              304             308             312             316
20              320             324             328             332
21              336             340             344             348
22              352             356             360             364
23              368             372             376             380
24                  384             388             392             396
25                  400             404             408             412
26                  416             420             424             428
27                  432             436             440             444
28                  448             452             456             460
29                  464             468             472             476
30                  480             484             488             492
31                  496             500             504             508

CASE 2
Transactions
Lane    0   1   2   3
0       0   128 256 384
1       4   132 260 388
2       8   136 264 392
3       12  140 268 396
4       16  144 272 400
5       20  148 276 404
6       24  152 280 408
7       28  156 284 412
8       32  160 288 416
9       36  164 292 420
10      40  168 296 424
11      44  172 300 428
12      48  176 304 432
13      52  180 308 436
14      56  184 312 440
15      60  188 316 444
16      64  192 320 448
17      68  196 324 452
18      72  200 328 456
19      76  204 332 460
20      80  208 336 464
21      84  212 340 468
22      88  216 344 472
23      92  220 348 476
24      96  224 352 480
25      100 228 356 484
26      104 232 360 488
27      108 236 364 492
28      112 240 368 496
29      116 244 372 500
30      120 248 376 504
31      124 252 380 508

CASE 3
Transactions
Lane    0   1   2   3
0       0
1       16
2       32
3       48
4       64
5       80
6       96
7       112
8           128
9           144
10          160
11          176
12          192
13          208
14          224
15          240
16              256
17              272
18              288
19              304
20              320
21              336
22              352
23              368
24                  384
25                  400
26                  416
27                  432
28                  448
29                  464
30                  480
31                  496

I understand the reduced number of transactions, and that may have consequences on resources as njuffa stated previously, but I’m still a little fuzzy on what exactly a ‘transaction’ is. Is it a logical record in the CUDA hardware to keep track of a piece of information, or an actual memory bus transaction?

My assumption was that if I had 4 32-bit transactions, the hardware would be smart enough to pack those into one bus transaction if, for example, the device had a 128-bit bus. But after all this I’m guessing that’s not the case: the device truly does 4 32-bit transactions and ‘wastes’ 96 bits per bus transaction. If this is indeed the case, I think nVidia should broadcast that point a little more clearly (maybe they do and I just haven’t noticed/seen it). Specifically, they should say (if this statement is true): ‘If you only do 32-bit transactions per thread, you will not achieve full I/O utilization.’ That is, a card rated at say 80 GB/sec will only yield 20 GB/sec for 32-bit transactions.

Again, I don’t know if what I stated in the previous paragraph is true, but if so I wish I had known it years ago. As it stands, I’m glad I know it now, and am very thankful to the folks at nVidia for making a great product with some fantastic development tools. I started doing this when it first came out with the 8000 series, and it was kinda cool, but I didn’t have a problem that suited it at the time. Now, however, I’m trying to do real-time HD image processing, and it gets hard pushing that many pixels around in real time. I only wish my company would pony up for an HD-SDI ingest card. :)

Wow!

I tried your code, and got 100 percent occupancy, and the execution time went down to 4 ms. Quite an improvement! I have a lot to learn… It’s actually so fast now that I’m starting to suspect something is wrong. :-)

I will adapt this to my needs and check for correctness over the weekend. The adaptation should be rather simple; the only difference is that the rightmost column should always be set to 1.
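Purely as an untested sketch of that adaptation (assuming the matrix is stored row-major with contiguous rows, as in the code above), the per-element branch could look like:

// Sketch only: same grid-stride loop, but elements in the rightmost column
// (column index width-1, assuming row-major storage) are forced to 1 instead.
__global__ void apply_kernel_lastcol (float *c, int width, int len)
{
    int totalThreads = gridDim.x * blockDim.x;
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < len; i += totalThreads) {
        if ((i % width) == (width - 1)) {
            c[i] = 1.0f;             // rightmost column set to 1
        } else {
            c[i] = tanhf (c[i]);
        }
    }
}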

To answer your other questions: I use single precision on a low- to mid-range consumer card (650 Ti). This is a hobby project, and if I get it to work I’m ready to buy a faster card, but the professional GPGPU cards are out of my budget. On the other hand, it would be fun to try one of Amazon’s GPU instances.

To time the kernel call I use the Nsight profiler tool (an extension to Visual Studio). As far as I understand, this only includes the kernel call, and not the data transfer.

Thanks again!