GPIO Bit-Bang Speed Increase

Hey. I’m bit-banging an external device over GPIO pins and am wondering whether anyone has tips for speeding up the process. The device can only be programmed over this 2-pin interface.

The bit-bang protocol is a simple, never-ending 2-pin clock + data line. It is similar to I2C, except that it runs indefinitely as a write, so there are no ACKs or anything.

I send the contents of a file bit by bit over this 2 line transfer.

Currently I am doing so by memory mapping the GPIO address space:

TegraMainGpio_X2RegBase   = (uint8_t *)mmap(
							NULL,
							0x10000,
							PROT_WRITE|PROT_READ,
							MAP_SHARED,
							FD,
							0x2210000);

I then map the output registers for each of my pins and write to them in this main loop:

while (file.good())
{
	unsigned char c = file.get();
	if (file.good())
	{
		// Shift the byte out LSB first, one clock pulse per bit.
		for (int i = 0; i < 8; i++)
		{
			gpioWrite = c & 0x1;
			if (gpioWrite > 0)
			{
				*data_value_OutReg = 1;
			}
			else
			{
				*data_value_OutReg = 0;
			}
			*data_clk_OutReg = 1;	// rising clock edge
			c = c >> 1;

			*data_clk_OutReg = 0;	// falling clock edge
		}
	}
	count++;
	if (count % 10000 == 0)
		printf("Count: %i\n", count);
}
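
If the register stores themselves turn out to be the slow part, one thing that might be worth trying is to write the data register only when the bit value actually changes, so that runs of identical bits cost two stores per bit instead of three. What follows is a minimal sketch under stated assumptions, not a tested fix for the TX2: `bitbang_port`, `bitbang_send` and the `writes` counter are hypothetical names, the two pointers stand in for the mmapped output registers (assumed here to be 32 bits wide), and the file is assumed to have been read into a buffer beforehand so that no file I/O or printf sits between clock edges. The LSB-first framing matches the loop above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two mmapped output registers. */
typedef struct {
    volatile uint32_t *data_out; /* data line output register (assumed width) */
    volatile uint32_t *clk_out;  /* clock line output register (assumed width) */
    unsigned long writes;        /* number of register stores issued */
} bitbang_port;

/* Shift `len` bytes out LSB first, skipping redundant data-line stores. */
static void bitbang_send(bitbang_port *p, const uint8_t *buf, size_t len)
{
    int last = -1; /* unknown line state: forces the first data store */

    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        for (int j = 0; j < 8; j++) {
            int bit = c & 0x1;
            if (bit != last) {
                /* Only touch the data register when the level changes. */
                *p->data_out = (uint32_t)bit;
                p->writes++;
                last = bit;
            }
            *p->clk_out = 1; /* rising clock edge */
            *p->clk_out = 0; /* falling clock edge */
            p->writes += 2;
            c >>= 1;
        }
    }
}
```

For the two-byte buffer {0x00, 0xFF} this issues 34 stores instead of the 48 that an unconditional data write would need. How much it helps on real data depends on how often adjacent bits differ.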

Unfortunately my 5.5MB file takes just under 5 minutes to program, so it seems I’m getting a throughput of about 18KB/s. I assumed that mmap would be the fastest way to access the GPIO pins. I’ve tried using open/read/write with the sysfs GPIO interface and it was slower.
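
The throughput figure can be turned into a rough per-store cost. The sketch below assumes 5.5MB means 5.5e6 bytes, rounds "just under 5 minutes" to 300 seconds, and counts three register stores per bit (one data, two clock) as in the loop above; the function names are made up for illustration. The result is an upper bound on the store latency, since it also includes file reads and loop overhead.

```c
/* Bytes per second for `bytes` bytes transferred in `seconds` seconds. */
static double bytes_per_second(double bytes, double seconds)
{
    return bytes / seconds;
}

/* Average nanoseconds per register store at `writes_per_bit` stores per bit. */
static double ns_per_write(double bytes_per_s, int writes_per_bit)
{
    double writes_per_s = bytes_per_s * 8.0 * writes_per_bit;
    return 1e9 / writes_per_s;
}
```

With those inputs, `bytes_per_second(5.5e6, 300.0)` is about 18333 bytes/s (roughly 147 kbit/s), and `ns_per_write(18333.3, 3)` comes out at about 2273 ns, i.e. a little over 2 microseconds per store.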

I already looked at this post but it didn’t help my situation at all.

I tried changing some of the mmap parameters to no avail. Tried setting it to MAP_PRIVATE but then the pins wouldn’t toggle.
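
The MAP_PRIVATE result is expected: a private mapping is copy-on-write, so stores land in a private copy of the page and never reach the underlying object, which for a register mapping means they never reach the GPIO hardware. A minimal sketch that shows the difference on an ordinary scratch file rather than on the real GPIO mapping (the helper names `store_via_mapping` and `make_scratch_file` and the /tmp path are made up for illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store `value` into the first byte of `fd` through a mapping created with
 * `flags` (MAP_SHARED or MAP_PRIVATE), then read that byte back from the
 * file itself. Returns the byte read from the file, or -1 on error.
 */
static int store_via_mapping(int fd, int flags, uint8_t value)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uint8_t *p = mmap(NULL, page, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = value;
    munmap(p, page);

    uint8_t on_file = 0;
    if (pread(fd, &on_file, 1, 0) != 1)
        return -1;
    return on_file;
}

/* Create an unlinked, zero-filled scratch file that is one page long. */
static int make_scratch_file(void)
{
    char path[] = "/tmp/mapdemoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)sysconf(_SC_PAGESIZE)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On a fresh scratch file, `store_via_mapping(fd, MAP_PRIVATE, 0xAA)` reads back 0x00 from the file, while `store_via_mapping(fd, MAP_SHARED, 0x55)` reads back 0x55. Only the shared mapping propagates the store.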

Anyone have any ideas? Am I going about this the wrong way?

Hi, I also have the same problem on the TX2 for a similar purpose. On the TX1, the same code to toggle GPIO pins is much faster.
See:
https://devtalk.nvidia.com/default/topic/1041993/bitbanging-gpio-lines-on-tx2/?offset=3#5285567

Could anyone from Nvidia comment on the difference observed and/or the best way to set/clear gpio pins?

Hey Akmal.ali

Yeah, I actually referenced your thread for help on memory mapping the pins (so thank you). I’ve also made sure my pointers are volatile and everything.

My actual loop and code (sans the memory mapping part, of course) was originally written for a very old TI DaVinci processor, where it performed much faster. That, together with your experience of the TX1 running the same code faster, indicates to me that the code should be fine and that it’s something specific to the TX2.

Hopefully Nvidia can give some additional insight.

Not that I thought it would help all that much, but I did attempt to do this in kernel space as opposed to user space with mmap, using the Linux GPIO functions from “#include <linux/gpio.h>”.

I got the same results in speed so both must be using the same path.

Additionally, it’s worth noting that when running the GPIO bit-bang, tegrastats reports:

CPU [0%@1996,100%@2035,0%@2035,0%@1996,0%@1997,0%@1996]

with that single core being used to the max.

I played around with moving the task onto different cores using taskset, but they all behave the same (maxing at 100%) with the same speed.

My operation is obviously serial since it is a bit-bang operation, so multiple cores won’t help. But obviously something is bottlenecking with the GPIOs. Is there any way to get lower-level access to them?

hi all,

Could you please refer to Topic 1009932? There are gpio_get_value() and gpio_set_value() APIs in the kernel driver to control GPIOs. Thanks.

@JerryChang

Using the gpio_set_value() APIs in a kernel module, I still observe that toggling the pins on the TX2 is much slower than on the TX1. Additionally, the APIs appear to be slower than userspace mmap. It would be useful to know what has changed between the TX1 and TX2 GPIOs, and whether there is a better way to toggle GPIOs quickly or whether we are actually limited to the speed observed.

Hey Jerry.

That is what I mentioned trying in post #4 on this thread, to no avail.

Here is the loop I integrated, along with a sample application which wrote my file contents over to a kernel buffer. My write_data_to_pins function is bound to the file_ops write: function.

int write_data_to_pins(void)
{
	int i=0;
	int j=0;
	char c;
	int gpioWrite=0;
	for (i=0; i<fpga_data.size; i++)
  	{
		c = fpga_data.data[i];
		for(j=0; j<8;j++)
		{

			gpioWrite = c & 0x01;

			if(gpioWrite>0)
			{
				gpio_set_value(gpioData, 1);
			}
			else
			{
				gpio_set_value(gpioData, 0);
			}

			gpio_set_value(gpioDclk, 1);
			c=c>>1;	//bitshift right the char.

			gpio_set_value(gpioDclk, 0);
		}
		
		if(i%10000==0)
		{
			printk("i: %i\n",i);	
		}
	}
	return 1;
}

Doing this resulted in the same throughput (if not slower). akmal.ali mentioned that he had the same result.

hi all,

We will investigate internally. Could you please also share the evaluation results for TX1 and TX2?
It would be better to have a side-by-side comparison showing the same code toggling GPIO pins, with TX1 giving much faster results.
Thanks

For the kernel APIs, I added the following to a kernel module to simply test toggle speed.

static const int            FpgaProgramGpioNumTx1   = 187;
static const int            FpgaProgramGpioNumTx2   = 388;

static void
gpioBitBangTest(
    void
    )
{
    int i;
    int FpgaProgramGpioNum = 0;

    /* Try the TX1 GPIO number first, then fall back to the TX2 one. */
    if( 0 == gpio_request( FpgaProgramGpioNumTx1, "FpgaCLK" ) )
    {
        FpgaProgramGpioNum  = FpgaProgramGpioNumTx1;
    }
    else if( 0 == gpio_request( FpgaProgramGpioNumTx2, "FpgaCLK" ) )
    {
        FpgaProgramGpioNum  = FpgaProgramGpioNumTx2;
    }
    printk(KERN_ALERT "GPIO: %d \n",FpgaProgramGpioNum  );

    if( 0 != FpgaProgramGpioNum)
    {
        printk(KERN_ALERT "GPIO: Start Bitbanging\n" );
        /* 20 million low/high cycles; the elapsed time gives the toggle rate. */
        for (i = 0 ; i < 20 * 1000 * 1000 ; ++i)
        {
            gpio_set_value(FpgaProgramGpioNum , 0);
            gpio_set_value(FpgaProgramGpioNum , 1);
        }
        printk(KERN_ALERT "GPIO: Finish Bitbanging! \n" );

        gpio_free(FpgaProgramGpioNum);
    }
}

On the TX1, this runs in 10 seconds indicating 2MHz toggle speed.
On the TX2, this runs in 59 seconds indicating 0.33MHz toggle speed.

With the userspace mmapped registers, I can toggle at 11MHz on the TX1, but only at 0.5MHz on the TX2.

In addition, for the userspace code, the time taken by the TX1 correlates directly with CPU frequency, but the time taken by the TX2 to toggle the pins is invariant to CPU frequency.
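
For anyone who wants to reproduce the userspace numbers, a timing harness along these lines could be used. It is a sketch only and not necessarily how the figures above were measured: `measure_toggle_mhz` and `elapsed_s` are hypothetical names, and `reg` is assumed to point at a 32-bit mmapped output register (any volatile word works for a dry run). One iteration is one full low/high cycle, the same convention as the kernel test above, where 20,000,000 iterations in 10 seconds gives 2MHz.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC timestamps. */
static double elapsed_s(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/*
 * Drive `reg` low then high `iterations` times and return the toggle
 * rate in MHz. Use a large iteration count so the elapsed time is
 * well above the clock resolution.
 */
static double measure_toggle_mhz(volatile uint32_t *reg, unsigned long iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iterations; i++) {
        *reg = 0;
        *reg = 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)iterations / elapsed_s(&start, &end) / 1e6;
}
```

Running it at a few different CPU frequencies would show whether the toggle rate scales with the clock (as reported for the TX1) or stays flat (as reported for the TX2).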

All,

Glad to hear you are going to try to replicate this internally. Let me know if I can provide any additional details, but akmal.ali seems to have provided enough. According to my signal analyzer, I am getting similar speeds to what he has measured.

I personally never worked with bit-banging on the TX1 like akmal.ali has. I have only worked with the pins on the TX2 and am not getting the performance I expect.

Also, in case it is relevant, I am using the following pins for the bit-bang, which differ from akmal.ali’s pins.

GPIO3_PI.05 == 389
GPIO3_PB.04 == 332

Hi all,

The delay comes from the TX2 code performing two register writes per value change, unlike the TX1 code. The control register value needs to be updated only when setting the direction to output. It does not need to be overwritten every time the value is changed.

You need the following change in kernel/nvidia to get the required perf results.

--- a/drivers/gpio/gpio-tegra186.c
+++ b/drivers/gpio/gpio-tegra186.c
@@ -1079,7 +1079,6 @@ static void tegra_gpio_set(struct gpio_chip *chip, unsigned offset, int value)
        u32 val = (value) ? 0x1 : 0x0;
 
        tegra_gpio_writel(tgi, val, offset, GPIO_OUT_VAL_REG);
-       tegra_gpio_writel(tgi, 0, offset, GPIO_OUT_CTRL_REG);
 }
 
 static int tegra_gpio_get(struct gpio_chip *chip, unsigned offset)
@@ -1129,6 +1128,7 @@ static int tegra_gpio_direction_output(struct gpio_chip *chip, unsigned offset,
        int ret;
 
        tegra_gpio_set(chip, offset, value);
+       tegra_gpio_writel(tgi, 0, offset, GPIO_OUT_CTRL_REG);
        set_gpio_direction_mode(chip, offset, 1);
        tegra_gpio_enable(tgi, offset);
        ret = pinctrl_gpio_direction_output(chip->base + offset);

@mantravadi_karthik

Having tried the change, I find the userspace mmapped code to be unaffected.
The kernel-space code is slightly faster: 46 seconds instead of 59, i.e. 0.43MHz compared to 0.33MHz previously.

However, toggling GPIO pins on the TX1 is still much faster (2MHz from kernel space and 11MHz from userspace).

It does seem like simply reading/writing the gpio registers on the TX2 is slower.

Just wondering about what performance you are getting on the TX2?

@Nvidia

Any updates on this?

@akmal.ali,
We are investigating the gap in performance. The memory speeds are similar on both boards. However, there is a small increase in performance when running jetson_clock.sh, which sets the board to max-performance mode.

Can you try running your application attached to A57 core only?

taskset 0x6

@bbasu

I assume you mean the Denver cores, as those are cores 1 and 2, corresponding to a mask of 0x6. Regardless, I have tried taskset with masks 0x01, 0x02 and 0x04, and have changed the priority of the thread to SCHED_RR. I haven’t found this to change the speed at which I can toggle GPIOs, however. Speed is the same whether I use the Denver cores or the A57 cores.
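
For reference, the same pinning and scheduling setup can be done from inside the program rather than through taskset. A minimal sketch using the Linux-specific affinity calls; `pin_to_cpu` and `use_round_robin` are hypothetical helper names, and SCHED_RR normally needs root or CAP_SYS_NICE, so the second call may simply fail with EPERM:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU; returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Request the SCHED_RR policy at `priority` (1..99) for the calling thread. */
static int use_round_robin(int priority)
{
    struct sched_param sp = { .sched_priority = priority };

    return sched_setscheduler(0, SCHED_RR, &sp);
}
```

Calling `pin_to_cpu(1)` before the bit-bang loop corresponds to launching the program under `taskset 0x2`.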

Hi,

We have compared GPIO bit toggling performance between TX1 & TX2 and are seeing no difference in their performance. Attached are the gpio_perf_debugfs script, which toggles the bit value, and the results we saw.
For the gpio_perf_debugfs script, you will need to provide gpio and count values. For example:
On TX1: gpio=187 count=100000
On TX2: gpio=387 count=100000
You can use any GPIO number which is not being used by a driver.
gpio_perf_debugfs.sh

#!/bin/bash

echo "GPIO Performance Analysis Test"
echo -e "******************************\n"

gpio=$1
count=$2

echo $gpio > /sys/class/gpio/export
if [ $? -eq 0 ]; then
        echo "Exported GPIO-$gpio"
else
        echo "Failed to export GPIO-$gpio"
        echo $gpio > /sys/class/gpio/unexport
        echo "Unexported GPIO-$gpio"
        echo $gpio > /sys/class/gpio/export
        echo "Exported GPIO-$gpio"
fi

echo out > /sys/class/gpio/gpio$gpio/direction

SECONDS=0
echo "Started Bit-Banging"
for i in $(seq 1 $count)
        do
                # Drive the pin high, then low, through the sysfs value file.
                echo 1 > /sys/class/gpio/gpio$gpio/value
                echo 0 > /sys/class/gpio/gpio$gpio/value
        done
echo "Ended Bit-Banging"
echo -e "\tToggled $i times"
echo -e "\tTotal Time taken $SECONDS"
if [ $SECONDS -eq 0 ]; then
        echo "Increase count"
else
        echo -e "frequency of the toggle $(( $i / $SECONDS ))" 
fi
echo $gpio > /sys/class/gpio/unexport
echo -e "Unexported GPIO-$gpio\n"
echo "******************************"

Thanks & Regards,
Shubhi

Bit_banging_results.txt (4.48 KB)

Your script uses sysfs to toggle GPIOs, so the toggle speed is limited by the sysfs overhead.
The script only shows frequencies of ~15 kHz, whereas the difference I am seeing involves writing to memory-mapped registers. On the TX1 I can toggle at 11 MHz. On the TX2 I can only toggle at 0.5 MHz.

Given that I can toggle GPIOs at 0.5 MHz, a test which only goes up to ~15 kHz doesn’t show the difference.
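
To illustrate where the sysfs cost comes from: every edge is at least one system call, and in the shell version each echo also opens and closes the value file. The sketch below keeps a file descriptor open and rewrites the level at offset 0. It is an illustration only: `sysfs_set` and `sysfs_toggle` are hypothetical names, `fd` is assumed to be an already-open descriptor for /sys/class/gpio/gpioN/value, and it is assumed that the value attribute accepts a single '0' or '1' character written this way.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Write one level ('0' or '1') at offset 0 of an already-open value file. */
static int sysfs_set(int fd, int level)
{
    char ch = level ? '1' : '0';

    return pwrite(fd, &ch, 1, 0) == 1 ? 0 : -1;
}

/* Toggle high then low `count` times; returns 0 on success, -1 on error. */
static int sysfs_toggle(int fd, unsigned long count)
{
    for (unsigned long i = 0; i < count; i++) {
        if (sysfs_set(fd, 1) != 0 || sysfs_set(fd, 0) != 0)
            return -1;
    }
    return 0;
}
```

Even with the descriptor kept open this still pays for a kernel entry on every edge, so it would not be expected to approach the memory-mapped rates discussed above.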

I’m also invested in this. Can NVIDIA please provide clarity?

@NVIDIA

Any updates on this? It would be good to know whether you expect that GPIO toggle speeds can be increased, or whether the observed performance is the actual device limit.