GPIO Bit-Bang Speed Increase


Using the gpio_set_value() APIs in a kernel module, I still observe that toggling the pins on the TX2 is much slower than on the TX1. Additionally, the APIs appear to be slower than userspace mmap. It would be useful to know what has changed between the TX1 and TX2 GPIOs, and whether there is a better way to toggle GPIOs quickly or whether we are actually limited to the speed observed.

Hey Jerry.

That is what I mentioned trying in post #4 of this thread, to no avail.

Here is the loop I integrated, along with a sample application which wrote my file contents to a kernel buffer. My write_data_to_pins function is bound to the file_ops write: function.

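int write_data_to_pins(void)
{
	int i = 0;
	int j = 0;
	char c;
	int gpioWrite = 0;

	for (i = 0; i < fpga_data.size; i++) {
		/* NOTE: the buffer name was lost from the original post ("c =[i];");
		 * "fpga_data.data" is a placeholder for the kernel buffer. */
		c = fpga_data.data[i];
		for (j = 0; j < 8; j++) {
			/* Present the current least significant bit on the data line. */
			gpioWrite = c & 0x01;
			if (gpioWrite)
				gpio_set_value(gpioData, 1);
			else
				gpio_set_value(gpioData, 0);

			/* Pulse the clock: rising edge, shift to the next bit, falling edge. */
			gpio_set_value(gpioDclk, 1);
			c = c >> 1;

			gpio_set_value(gpioDclk, 0);
			printk("i: %i\n", i);
		}
	}
	return 1;
}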

Doing this resulted in the same throughput (if not slower) on the data. akmal.ali mentioned that he had the same result.

Hi all,

We will investigate internally. Could you please also share the evaluation results for TX1 and TX2?
A side-by-side comparison would be best: the same code toggling GPIO pins on both boards, showing the TX1 with much faster results.

For the kernel APIs, I added the following to a kernel module to simply test toggle speed.

static const int            FpgaProgramGpioNumTx1   = 187;
static const int            FpgaProgramGpioNumTx2   = 388;

/* NOTE: the function name was lost from the original post;
 * "TestGpioToggleSpeed" is a placeholder. */
static void TestGpioToggleSpeed(void)
{
    int i;
    int FpgaProgramGpioNum = 0;

    /* Request whichever GPIO number exists on this board. */
    if( 0 == gpio_request( FpgaProgramGpioNumTx1, "FpgaCLK" ) )
        FpgaProgramGpioNum  = FpgaProgramGpioNumTx1;
    else if( 0 == gpio_request( FpgaProgramGpioNumTx2, "FpgaCLK" ) )
        FpgaProgramGpioNum  = FpgaProgramGpioNumTx2;
    printk(KERN_ALERT "GPIO: %d \n",FpgaProgramGpioNum  );

    if( 0 != FpgaProgramGpioNum)
    {
        printk(KERN_ALERT "GPIO: Start Bitbanging\n" );
        /* 20 million low/high cycles; the elapsed time gives the toggle rate. */
        for (i = 0 ; i < 20 * 1000 * 1000 ; ++i)
        {
            gpio_set_value(FpgaProgramGpioNum , 0);
            gpio_set_value(FpgaProgramGpioNum , 1);
        }
        printk(KERN_ALERT "GPIO: Finish Bitbanging! \n" );
    }
}


On the TX1, this runs in 10 seconds indicating 2MHz toggle speed.
On the TX2, this runs in 59 seconds indicating 0.33MHz toggle speed.

For the userspace mmapped registers, I can toggle at 11 MHz on the TX1, but only at 0.5 MHz on the TX2.

In addition, for the userspace code, the time taken by the TX1 correlates directly with CPU frequency, but the time taken by the TX2 to toggle the pins is invariant to CPU frequency.


Glad to hear you are going to try to replicate this internally. Let me know if I can provide any additional details, but akmal.ali seems to have provided enough. According to my signal analyzer, I am getting speeds similar to what he has measured.

I personally never worked with bit-banging on the TX1 like akmal.ali has. I have only worked with the pins on the TX2, and I am not getting the performance I expect.

Also, in case it is relevant, I am using the following pins for the bit-bang, which differ from akmal.ali's pins.

GPIO3_PI.05 == 389
GPIO3_PB.04 == 332

Hi all,

The delay is due to the two register writes in the TX2 code, compared to the TX1. The control register only needs to be updated when setting the direction to output; it does not need to be rewritten every time the value changes.

You need the following change in kernel/nvidia to get the required performance results.

--- a/drivers/gpio/gpio-tegra186.c
+++ b/drivers/gpio/gpio-tegra186.c
@@ -1079,7 +1079,6 @@ static void tegra_gpio_set(struct gpio_chip *chip, unsigned offset, int value)
        u32 val = (value) ? 0x1 : 0x0;
        tegra_gpio_writel(tgi, val, offset, GPIO_OUT_VAL_REG);
-       tegra_gpio_writel(tgi, 0, offset, GPIO_OUT_CTRL_REG);
 static int tegra_gpio_get(struct gpio_chip *chip, unsigned offset)
@@ -1129,6 +1128,7 @@ static int tegra_gpio_direction_output(struct gpio_chip *chip, unsigned offset,
        int ret;
        tegra_gpio_set(chip, offset, value);
+       tegra_gpio_writel(tgi, 0, offset, GPIO_OUT_CTRL_REG);
        set_gpio_direction_mode(chip, offset, 1);
        tegra_gpio_enable(tgi, offset);
        ret = pinctrl_gpio_direction_output(chip->base + offset);


Having tried the change, I find the userspace mmapped code to be unaffected.
The kernel-space code is slightly faster: 46 seconds instead of 59 seconds, i.e. 0.43 MHz compared to 0.33 MHz previously.

However, toggling GPIO pins on the TX1 is much faster (2 MHz from kernel space and 11 MHz from userspace).

It does seem like simply reading or writing the GPIO registers on the TX2 is slower.

Just wondering, what performance are you getting on the TX2?


Any updates on this?

We are investigating the gap in performance. The memory speeds are similar on both boards. However, there is a small increase in performance when the board is set to max-performance mode.

Can you try running your application pinned to the A57 cores only?

taskset 0x6


I assume you mean the Denver cores, as the Denver cores are cores 1 and 2, corresponding to a mask of 0x6. Regardless, I have tried taskset with 0x01, 0x02, and 0x04, and have changed the scheduling policy of the thread to SCHED_RR. None of this changed the speed at which I can toggle GPIOs. The speed is the same whether I use the Denver cores or the A57 cores.


We have compared GPIO bit-toggling performance between TX1 and TX2 and see no difference. Attached are the gpio_perf_debugfs script, which toggles the bit value, and the results we observed.
In the gpio_perf_debugfs script, you will need to provide the gpio and count values. For example:
On TX1: gpio=187 count=100000
On TX2: gpio=387 count=100000
You can use any GPIO number which is not being used by a driver.


#!/bin/bash
# Toggle a GPIO through sysfs and report the achieved toggle rate.

echo "GPIO Performance Analysis Test"
echo -e "******************************\n"

# Example values from the post: gpio=187 on TX1, gpio=387 on TX2.
gpio=187
count=100000

echo $gpio > /sys/class/gpio/export
if [ $? -eq 0 ]; then
        echo "Exported GPIO-$gpio"
else
        echo "Failed to export GPIO-$gpio"
        echo $gpio > /sys/class/gpio/unexport
        echo "Unexported GPIO-$gpio"
        echo $gpio > /sys/class/gpio/export
        echo "Exported GPIO-$gpio"
fi

echo out > /sys/class/gpio/gpio$gpio/direction

echo "Started Bit-Banging"
for i in $(seq 1 $count)
do
        # Write straight to the value file; a trailing "> /dev/null" would
        # take over stdout and the value would never reach the GPIO.
        echo 1 > /sys/class/gpio/gpio$gpio/value
        echo 0 > /sys/class/gpio/gpio$gpio/value
done
echo "Ended Bit-Banging"
echo -e "\tToggled $i times"
echo -e "\tTotal Time taken $SECONDS"
if [ $SECONDS -eq 0 ]; then
        echo "Increase count"
else
        echo -e "frequency of the toggle $(( $i / $SECONDS ))"
fi
echo $gpio > /sys/class/gpio/unexport
echo -e "Unexported GPIO-$gpio\n"
echo "******************************"

Thanks & Regards,

Bit_banging_results.txt (4.48 KB)

Your script uses sysfs to toggle GPIOs, so the toggle speed is limited by the sysfs overhead.
The script you are using only shows frequencies of ~15 kHz, whereas the difference I am seeing involves writing to memory-mapped registers. On the TX1 I can toggle at 11 MHz. On the TX2 I can only toggle at 0.5 MHz.

Given that I can toggle GPIOs at 0.5 MHz, a test which only goes up to ~15 kHz doesn't show the difference.

I am also invested in this. Can NVIDIA please provide clarity?


Any updates on this? It would be good to know whether you expect that GPIO toggle speeds can be increased, or whether the observed performance is the actual device limit.

Sorry for the late response.
The HW design is quite different between TX1 and TX2. We are looking into an alternative approach to help you out.
Can you give more details on your use case?
What are you trying to achieve with bit-banging, and what are the requirements?

The use case for GPIO bit-banging is using the GPIO pins to serially program an FPGA. This is a startup-time problem for our product. On the TX1, programming the FPGA using this method takes 6 seconds and is fast enough. On the TX2, programming the FPGA using the same method takes 140 seconds, roughly 20 times slower.

The requirement is to perform this bit-banging as fast as possible. On the TX1, this means bit-banging at ~10 MHz. On the TX2, we are currently limited to 0.5 MHz. It would be good to be able to bit-bang at a speed similar to the TX1.

More details:
We are programming a Xilinx FPGA using Slave Serial mode. This involves toggling a clock line and a data line to load the configuration bitstream into the FPGA.

Thanks for the info.
While we are checking this, can you try the clock increase below to get a better result?

root@tegra-ubuntu:/sys/kernel/debug/bpmp/debug/clk/axi_cbb# cat rate

root@tegra-ubuntu:/sys/kernel/debug/bpmp/debug/clk/axi_cbb# echo 409600000 > rate
or whatever is the max rate

root@tegra-ubuntu:/sys/kernel/debug/bpmp/debug/clk/axi_cbb# echo 1 > mrq_rate_locked


Thanks for checking this,

Applying the clock increase has indeed improved performance.
Time to program FPGA before change: 136 seconds.
Time to program FPGA after change: 87 seconds.

echo 409600000 > /sys/kernel/debug/bpmp/debug/clk/axi_cbb/rate 

echo 1 > /sys/kernel/debug/bpmp/debug/clk/axi_cbb/mrq_rate_locked

Are there any drawbacks of applying this change that we should be aware of?


This increases the clock rate of the backbone clock.