My aim is to split a 1280x960 image of ints into small 4x4 regions and find the maximum pixel value in each 4x4 region (with a margin around the edge of the overall image). There is a second processing stage on each 4x4 region but I omit that for now. I hoped CUDA (running on a Jetson TX1) would help speed this up…
My initial design is as follows:
Split the image up into 16x30 overlapping blocks of 96x38 pixels. So 480 blocks in total – each processing 3648 pixels.
Then I used 20x8 threads (160 in total) to load the 3648 pixels into shared memory, i.e. 3648 x 4 = 14592 bytes. Each block can then be split into 160 4x4 regions, so each thread finds the maximum in its own 4x4 region.
I thought this would work well because the memory loads can be coalesced (96 is a multiple of 32), all 160 threads can be used for both loading and reducing with hardly any sitting idle for long, and the total thread count is a multiple of 32.
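In case it helps to be concrete, here is a simplified sketch of that kernel layout (not my actual code: the names, the 80x32 tile stride implied by the 16x30 grid, and the lack of margin/boundary handling are just for illustration):

```
// Simplified sketch of the design described above (illustrative only).
// Each block loads a 96x38 tile into shared memory, then each of the
// 20x8 threads reduces its own 4x4 region to a maximum.
#define TILE_W    96
#define TILE_H    38
#define THREADS_X 20
#define THREADS_Y 8

__global__ void maxPool4x4(const int *src, int *dst, int srcPitch, int dstPitch)
{
    __shared__ int tile[TILE_H][TILE_W];            // 3648 ints = 14592 bytes

    // Top-left corner of this block's (overlapping) tile in the source image.
    // The margin offset and edge clamping are omitted here.
    const int tileX = blockIdx.x * (THREADS_X * 4); // 80-pixel stride
    const int tileY = blockIdx.y * (THREADS_Y * 4); // 32-pixel stride

    // 160 threads cooperatively load the 3648 pixels (coalesced row by row).
    const int tid      = threadIdx.y * THREADS_X + threadIdx.x;
    const int nThreads = THREADS_X * THREADS_Y;
    for (int i = tid; i < TILE_W * TILE_H; i += nThreads) {
        int y = i / TILE_W;
        int x = i % TILE_W;
        tile[y][x] = src[(tileY + y) * srcPitch + (tileX + x)];
    }
    __syncthreads();

    // Each thread finds the maximum of its own 4x4 region.
    const int rx = threadIdx.x * 4;
    const int ry = threadIdx.y * 4;
    int m = tile[ry][rx];
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx)
            m = max(m, tile[ry + dy][rx + dx]);

    // One result per 4x4 region.
    dst[(blockIdx.y * THREADS_Y + threadIdx.y) * dstPitch
        + blockIdx.x * THREADS_X + threadIdx.x] = m;
}
```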
I have this design running, and a reference CPU version running as well. Unfortunately, the CPU version is twice as fast as the GPU version… A bit of profiling reports low occupancy, which might be down to using too much shared memory per block. Could anyone perhaps elaborate on this, or suggest a better design?
The Jetson TX1 uses the same physical memory to serve both CPU and GPU, I would think? And this appears to be a task that is bound by memory throughput since there is very little computation. Assuming the CPU and the GPU access the memory with the same throughput (I don’t know whether that is true, worth measuring with the STREAM benchmark), I would expect no speedup from doing the processing on the GPU.
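If you want a quick ballpark figure for device memory throughput without setting up STREAM, something along these lines should do (a rough sketch using CUDA event timing; the buffer size and repetition count are arbitrary):

```
// Rough device-to-device bandwidth check (sketch, not a rigorous benchmark).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 * 1024 * 1024;     // 64 MB per buffer (arbitrary)
    const int    reps  = 20;

    int *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy reads and writes 'bytes', hence the factor of 2.
    double gbps = 2.0 * bytes * reps / (ms * 1e-3) / 1e9;
    printf("approx. device memory throughput: %.1f GB/s\n", gbps);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```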
You don’t say how low the occupancy is, but given the shared memory usage per thread block, it is certainly going to limit occupancy somewhat. Try cutting shared memory usage per thread block in half. Also, double check the access patterns for global and shared memory access. The profiler can help you determine how efficient these accesses are (coalescing, bank conflicts).
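You can also get a quick read on how much the shared memory allocation constrains occupancy with the occupancy API. Here is a rough sketch: the dummy kernel is only a stand-in (its register usage will differ from your real kernel), and the 160 threads / 14592 bytes per block are taken from your description, so treat the numbers as indicative only.

```
// Sketch: query how many blocks can be resident per SM for a given amount
// of dynamic shared memory per block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *out)
{
    extern __shared__ int tile[];       // dynamic shared memory
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int blockSize = 160;                      // 20x8 threads
    const size_t smemSizes[2] = {14592, 7296};      // current usage vs. half

    for (int i = 0; i < 2; ++i) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummyKernel, blockSize, smemSizes[i]);
        printf("%u bytes smem/block -> %d resident blocks/SM (%d threads/SM)\n",
               (unsigned)smemSizes[i], blocksPerSM, blocksPerSM * blockSize);
    }
    return 0;
}
```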
I guess I’m a bit confused now.
Perhaps wrongly, I assumed the CPU had its own RAM and the GPU had its own memory: global (which any block and thread can access) and shared (which is shared by the threads within a block).
Anyhow, my timings don’t include copying memory from host to device (which I had assumed was a transfer between CPU RAM and GPU global memory). However, the timings do include the copies from what I had assumed was GPU global memory into GPU shared memory (which is then shared between each block’s threads).
Please could you explain how you think it is set up?
So as a relative newbie, I find the profiler a bit confusing tbh.
So the CUDA Application Analysis reports:
Low memcpy/compute overlap (0%),
low kernel concurrency (0%),
low memcpy throughput (1.073 GB/s avg, for memcpys accounting for 90.5% of all memcpy time).
The complete utilization graph looks to be down at about 30% throughout.
When I examine individual kernels (I only have one here):
Efficiency:
Global load efficiency = 100%
Global store efficiency = 15.8% (!)
Shared efficiency = 37.8% (!)
Warp execution efficiency = 71.5% (!)
Non-predicated warp execution efficiency = 70% (!)
Occupancy:
Achieved = 29.1% (!)
Theoretical = 46.9%
Limiter = Shared Memory
When I perform kernel analysis it says:
Kernel performance is bound by instruction and memory latency.
There is then a bar chart that shows the compute utilisation at less than 10% (the legend shows most of this 10% consists of arithmetic operations) and memory utilisation about 5%.
Perhaps you would be kind enough to suggest what this shows?
As I said, the kernel should be bound by memory throughput: only 15 max() calls, which map directly to IMAX instructions, are needed to find the maximum of 4x4 = 16 pixels, so I see no reason to worry about computation.
Global load efficiency is important, and at 100% it is as good as it gets. Stores are typically “fire and forget”, which should apply to your scenario (they are only used for writing out the final results), so I wouldn’t worry about the low store efficiency for now; revisit the store access pattern later.
The occupancy looks worryingly low, and it is caused by the shared memory usage per block being too high. Cut it in half.
I don’t know why there are memcpy() calls in the code. The Jetson TX1 has unified physical memory, does it not? I haven’t used one of these parts, but given the unified physical memory I do not see the need for copying; that just adds overhead.
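I have not tested this on a TX1 myself, so treat it as an untested sketch, but on a device with unified physical memory you can typically hand the same allocation to both sides instead of copying, e.g. with managed memory (zero-copy pinned memory via cudaHostAlloc is another option). The placeholder kernel below just computes the 4x4 maximum straight from global memory; it stands in for your real kernel.

```
// Sketch: use managed memory so CPU and GPU touch the same allocation,
// removing the explicit cudaMemcpy calls.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void regionMax4x4(const int *src, int *dst, int srcW, int outW, int outH)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= outW || oy >= outH) return;

    int m = src[(oy * 4) * srcW + ox * 4];
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx)
            m = max(m, src[(oy * 4 + dy) * srcW + ox * 4 + dx]);
    dst[oy * outW + ox] = m;
}

int main()
{
    const int W = 1280, H = 960;
    const int outW = W / 4, outH = H / 4;

    int *image = 0, *result = 0;
    cudaMallocManaged(&image,  W * H       * sizeof(int));
    cudaMallocManaged(&result, outW * outH * sizeof(int));

    // Fill the image directly from CPU code; no cudaMemcpy needed.
    for (int i = 0; i < W * H; ++i)
        image[i] = i % 255;

    dim3 block(20, 8);
    dim3 grid(16, 30);                 // 16*20 = 320, 30*8 = 240 outputs
    regionMax4x4<<<grid, block>>>(image, result, W, outW, outH);
    cudaDeviceSynchronize();           // required before the CPU reads 'result'

    printf("first output value: %d\n", result[0]);

    cudaFree(image);
    cudaFree(result);
    return 0;
}
```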
Working with the profiler will be a feedback loop. It suggests a bottleneck, you change the code to eliminate the bottleneck, now you have a new bottleneck, etc.