Bad interaction on nvcsi clock with two different sensors on the same TX2 board

Hello,

I have written drivers for several different sensors, and I can connect two different sensors to the same carrier board equipped with a TX2 running kernel 4.9 from l4t-32.3.1 (jetpack-4.3). Let’s say I have a type A sensor, a type B sensor and a type C sensor. When I connect a type A sensor and a type B sensor, I can successfully run this simple acquisition test, which exercises one sensor after the other:

for d in /dev/video*
do
        echo $d
        v4l2-ctl -d $d -C sensor_type
        # iterate over every frame size advertised for the GREY format
        v4l2-ctl -d $d --list-framesizes=GREY | while read tag discrete size
        do
                if [ "$tag" != "Size:" ]
                then
                        continue
                fi
                # split "WIDTHxHEIGHT" into two positional parameters
                set -- $(echo $size | tr x ' ')
                WIDTH=${1}
                HEIGHT=${2}
                # take the upper fps bound of the reported interval range,
                # expressed as an integer fraction n/1000
                FRAMERATE=$(v4l2-ctl -d $d --list-frameintervals=width=${WIDTH},height=${HEIGHT},pixelformat=GREY | sed -n \
                        -e 's,.*-\([0-9]*\)\.\([0-9]*\) fps.*,\1\2/1000,p')
                echo WIDTH=$WIDTH
                echo HEIGHT=$HEIGHT
                echo FRAMERATE=$FRAMERATE
                TRUNCATED_FRAMERATE=$(echo "scale=3; $FRAMERATE" | bc)
                v4l2-ctl -d $d -p $TRUNCATED_FRAMERATE
                sleep 1
                # grab 256 frames at this size with 3 mmap'ed buffers
                v4l2-ctl -d $d --stream-mmap=3 --stream-count=256 --set-fmt-video=width=${WIDTH},height=${HEIGHT},pixelformat=GREY
        done
done

When I connect a type A and a type B sensor, both answer perfectly. But if, with exactly the same kernel and dtb, I connect a type C and a type B sensor, type C acquisition works perfectly while type B acquisition fails with the following error messages:

[ 1718.801357] tegra-vi4 15700000.vi: PXL_SOF syncpt timeout! err = -11                                                                    
[ 1718.807897] tegra-vi4 15700000.vi: tegra_channel_error_recovery: attempting to reset the capture channel

Here are the dtb properties of the different sensor types:

/* type A - only one mode */
                csi_pixel_bit_depth = "12";
                line_length = "4112";
                active_w = "4112";
                active_h = "3008";
                pix_clk_hz = "539902310"; /* 4112 * 3008 * 29.1 * 12 / 8 */
                max_framerate = "29.1";
/* type B - only one mode */
                csi_pixel_bit_depth = "12";
                line_length = "4112";
                active_w = "4112";
                active_h = "3008";
                pix_clk_hz = "434148250"; /* 4112 * 3008 * 23.4 * 12 / 8 */
                max_framerate = "23.4";
/* type C mode 0 */
                csi_pixel_bit_depth = "12";
                line_length = "1936";
                active_w = "1936";
                active_h = "1096";
                pix_clk_hz = "190967040"; /* 1936 * 1096 * 60 * 12 / 8 */
                max_framerate = "60";

/* type C mode 1 */
                csi_pixel_bit_depth = "12";
                line_length = "2464";
                active_w = "2464";
                active_h = "2056";
                pix_clk_hz = "271283443"; /* 2464 * 2056 * 35.7 * 12 / 8 */
                max_framerate = "35.7";
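
All the pix_clk_hz values above follow the formula in the inline comments, active_w * active_h * framerate * bit depth / 8. As a quick sanity check with bc (the same tool as in the test script), here for type B:

echo "scale=1; 4112 * 3008 * 23.4 * 12 / 8" | bc    # 434148249.6, rounded to 434148250 above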

However, if I run the following script before running my test script, the test succeeds:

# lock the vi, isp and nvcsi rates against later MRQ requests, then raise them to their maximum
echo 1 > /sys/kernel/debug/bpmp/debug/clk/vi/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/isp/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/nvcsi/mrq_rate_locked
for f in vi isp nvcsi
do
        max_rate=$(cat /sys/kernel/debug/bpmp/debug/clk/$f/max_rate)
        current_rate=$(cat /sys/kernel/debug/bpmp/debug/clk/$f/rate)
        echo $f current ${current_rate} max ${max_rate}
        echo ${max_rate} > /sys/kernel/debug/bpmp/debug/clk/$f/rate
done
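
A quick way to confirm that the boost stuck (mrq_rate_locked keeps later rate requests from lowering the clocks again) is to read the rates back:

for f in vi isp nvcsi
do
        echo $f now at $(cat /sys/kernel/debug/bpmp/debug/clk/$f/rate)
done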

Is there something wrong in my device tree, or is it rather a bug in the way the kernel uses the device tree?

Please refer to the topic below to check whether pix_clk_hz is correct, and you can try adding serdes_pix_clk_hz to the device tree:

https://devtalk.nvidia.com/default/topic/1073028

The line_length and pix_clk_hz seem to be correct, and sufficient to make the type B sensor work: when I remove the other sensor, the type B sensor becomes /dev/video0 and delivers its frames without error. I surmise that somewhere in the drivers provided by nvidia, the vi, the nvcsi or something else that I do not know is initialised using the properties of the first sensor only, and is not updated when using the second sensor.

Nothing specific is done for the first sensor. The video nodes follow the registration sequence, which depends on the I2C bus and address.
If boosting the clocks makes the problem go away, that tells us bandwidth is causing the problem.
Please also confirm that the following properties are correct:

	num_csi_lanes = <4>;
	max_lane_speed = <1500000>;
	min_bits_per_pixel = <10>;
	vi_peak_byte_per_pixel = <2>;

Each sensor is connected to 4 CSI lanes, and all my sensors are RAW12 sensors, so here is my device-tree config:

num_csi_lanes = <8>;
min_bits_per_pixel = <12>;
max_lane_speed = <1500000>;
vi_peak_byte_per_pixel = <2>;
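
Since the pix_clk_hz values already include the *12/8 bits-to-bytes factor, the aggregate bit rate of a sensor is pix_clk_hz * 8. For the fastest sensor (type A) spread over its 4 lanes, that stays below max_lane_speed, assuming that property is expressed in kbit/s:

echo "539902310 * 8 / 4" | bc    # 1079804620 bit/s per lane, below the 1.5 Gbit/s lane limit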

I also enabled and added debugging in the kernel and saw that the difference between the working and the non-working situation is in a call to clk_bpmp_set_rate:

@@ -2369,7 +2358,7 @@
vi4_channel_start_streaming calling tegra_channel_set_stream(true)
nvhost_vi4_aggregate_constraints
clk_bpmp_set_rate: vi(51): 115200000
-clk_bpmp_set_rate: nvcsi(180): 75000000 # works
+clk_bpmp_set_rate: nvcsi(180): 37500000 # fails
clk_bpmp_set_rate: isp(50): 243200000
nvcsi 150c0000.nvcsi: csi4_start_streaming port_idx=0, lanes=4
nvcsi 150c0000.nvcsi: csi4_stream_init

Looking further, I also saw that in the non-working case nothing is added to the nvcsi clock in ‘tegra_camera_update_clknbw’.

I continued searching and I think I have found that the nvcsi clock is set from the ‘info->phy_pixel_rate’ field, which itself is set in the ‘update_platform_data’ function in the file ‘drivers/video/tegra/camera/tegra_camera_platform.c’, probably from data related to the first sensor only, and not changed afterwards. Can you have a look and propose a patch?

Can you print it out and launch one and then two sensors to check? Also, please trace tegra_camera_update_isobw().

Here already are the failing (.ko) and successful (.ok) outputs of ‘dmesg’, with plenty of messages that I added; you’ll see the differences by comparing the two files. Both are with two sensors, but with a different ‘first’ sensor: dmesg-8v.ok.loggz (22.0 KB), dmesg-8v.ko.loggz (22.3 KB). Note that the setting of ‘info->phy_pixel_rate’ happens at boot, long before I begin to use the sensors.

Here are the same logs, with DEBUG now defined for the compilation of tegra_camera_update_isobw. For completeness, here are all the debugging patches that I added (added_debug.patch.loggz, 6.1 KB), and here are the logs: dmesg-9v.ok.loggz (22.2 KB), dmesg-9v.ko.loggz (22.5 KB).

Hello ShaneCCC,

do you need more info?

The nvcsi clock should change according to the value of pix_clk_hz/serdes_pix_clk_hz.

The value of pix_clk_hz differs with the type of sensor. The values are read from the DT by sensor_common_parse_signal_props, which is called for each mode of the sensor by sensor_common_init_sensor_properties, itself called by camera_common_initialize, which is called by the _probe function of my driver, as in the drivers provided by nvidia. After that, in the probe function, the driver calls v4l2_spi_subdev_init, then tegra_media_entity_init and finally v4l2_async_register_subdev. That part works perfectly, and the nvcsi clock is set to a rate matching the pix_clk_hz value of the first sensor; but thereafter, the pix_clk_hz value specific to a sensor is never used to configure the nvcsi clock, although it is used to calculate a settle time. That’s the problem!
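
For reference, the relevant part of my probe function looks like this (a condensed sketch with error handling and private structures elided as ‘...’; the mysensor names are placeholders):

static int mysensor_probe(struct spi_device *spi)
{
        ...
        /* parses the signal properties (pix_clk_hz, ...) of every mode,
         * via sensor_common_init_sensor_properties() */
        err = camera_common_initialize(common_data, "mysensor");
        ...
        v4l2_spi_subdev_init(&priv->subdev, spi, &mysensor_subdev_ops);
        ...
        tegra_media_entity_init(&priv->subdev.entity, ...);
        ...
        return v4l2_async_register_subdev(&priv->subdev);
}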

What’s max_pixel_rate in your DT settings?
Please check the file …/kernel/nvidia/drivers/video/tegra/camera/tegra_camera_platform.c for any clue.

There is none at the moment, but the drivers do not complain. What should it be? Reading tegra_camera_platform.c, I see that max_pixel_rate is only used to compute isp_iso_bw, but my test does not use the ISP, as I use only v4l2.

The problem is really that the nvcsi clock is not set to the value it should have. I have traced all the calls to clk_bpmp_set_rate for the nvcsi clock, together with the surrounding functions, and I see that for /dev/video1, where the nvcsi clock should be set to 75000000 according to pix_clk_hz in the DT, it is set to only 37500000, a value that works for /dev/video0 and that seems to have been computed only once, though I do not know where. However, if /dev/video0 is a sensor that also requires 75000000, then /dev/video1 works.
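
To observe this from user space, the same debugfs node as in the boost script above can be polled while the test script streams:

watch -n 0.5 cat /sys/kernel/debug/bpmp/debug/clk/nvcsi/rate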

I have attached a trimmed-down and annotated log: dmesg-summary.loggz (2.1 KB)

I didn’t trace all of the code either, but I found that some users set max_lane_speed much higher to work around (WAR) it.
Could you give it a try?

Thanks

Hello ShaneCCC,

I prefer a fix to a WAR, so I have continued my analysis; here it is. In the file drivers/video/tegra/camera/tegra_camera_platform.c, the function calculate_and_set_device_clock contains the following lines:

        u64 active_pr = info->active_pixel_rate;
        u64 phy_pr = info->phy_pixel_rate;
        ...
        u64 final_pr = (cdev->use_max) ? phy_pr : active_pr;
        bool set_clk = true;

        switch (cdev->hw_type) {
        case HWTYPE_CSI:
                ...
                nr = max_depth * final_pr * overhead;
                dr = bus_width * 100;
                ...
                break;
                ...
        }

        /* avoid rounding errors by adding dr to nr */
        clk_rate = (nr + dr) / dr;

One sees that the clock rate depends on final_pr, which is either info->active_pixel_rate or info->phy_pixel_rate, depending on the value of cdev->use_max. The name of that variable seems to imply that the original programmer expected phy_pixel_rate to be bigger than active_pixel_rate. When entering here for the nvcsi clock of my second (bigger) sensor, I have seen that use_max is ‘true’ (use_max is set to true at boot time in nvcsi_probe in drivers/video/tegra/host/nvcsi/nvcsi.c), but that phy_pixel_rate is lower than active_pixel_rate. If I use the value of active_pixel_rate instead of phy_pixel_rate to compute clk_rate, my second sensor streams successfully.
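
To make the failure concrete, here is a minimal user-space sketch of that selection with my numbers plugged in (the pixel rates come from the DT above; which type C mode ends up in phy_pixel_rate is my assumption):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* second sensor to stream: type B */
        uint64_t active_pr = 434148250;  /* its own pix_clk_hz */
        uint64_t phy_pr    = 190967040;  /* stale value from the first (type C) sensor */
        int use_max = 1;                 /* as set once in nvcsi_probe() */

        /* the kernel picks phy_pr here, although it is the smaller rate */
        uint64_t final_pr = use_max ? phy_pr : active_pr;
        printf("final_pr = %llu, too low for the type B sensor\n",
               (unsigned long long)final_pr);
        return 0;
}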

I have also seen that at boot time, update_platform_data (in the file drivers/video/tegra/camera/tegra_camera_platform.c) is called twice to add cdev->pixel_rate to info->phy_pixel_rate. I assume the original programmers expected info->phy_pixel_rate to be the sum of the maximum pixel rates of all the sensors, but when update_platform_data is called, cdev->pixel_rate is 0, and thus nothing is added to info->phy_pixel_rate. I did not check where info->phy_pixel_rate is finally set to a non-zero value, but it happens only once, also in update_platform_data, when starting to work with the first sensor. There thus seems to be a BUG either in update_platform_data, or in when and where it is called, because it should never be called with cdev->pixel_rate = 0.

I have made my own fix, which currently solves the problem for me:

diff --git a/drivers/video/tegra/host/nvcsi/nvcsi.c b/drivers/video/tegra/host/nvcsi/nvcsi.c
index fff1ad3..fc05b23 100644
--- a/drivers/video/tegra/host/nvcsi/nvcsi.c
+++ b/drivers/video/tegra/host/nvcsi/nvcsi.c
@@ -234,7 +234,9 @@ static int nvcsi_probe(struct platform_device *dev)
 
        csi_info.pdev = dev;
        csi_info.hw_type = HWTYPE_CSI;
+#if 0
        csi_info.use_max = true;
+#endif
        csi_info.bus_width = CSI_BUS_WIDTH;
        csi_info.lane_num = NUM_LANES;
        csi_info.pg_clk_rate = PG_CLK_RATE;

Can you validate that patch, or propose a better one that would fix update_platform_data or the way/place/time it is called?

Best regards

PS: Is there an easier way than this forum to discuss bugs and submit patches, e.g. a mailing list?

I think you can use this solution for your case in the short term.
Apart from partners, we all currently discuss Jetson problems on this forum.

What are the conditions to become a partner?

Hello ShaneCCC,

Adding more debugging, I finally discovered that the clock rates after the first one are ignored, because of this strange test in ‘update_platform_data’ in drivers/video/tegra/camera/tegra_camera_platform.c:

                if (info->total_sensor_lanes <= info->num_device_lanes)
                        info->phy_pixel_rate += cdev->pixel_rate;

The culprit commit is:

commit 4f1f6523d569a1840ee3d7eb14265e8084cb010c
Author: Bhushan Rayrikar <brayrikar@nvidia.com>
Date:   Wed Jul 18 13:00:42 2018 -0700

    camera: fix csi clock calculations

    We were overclocking CSI by aggregating pixel rate for
    sensors on different bricks. This is not needed as all the
    CSI bricks get the same clock in parallel.

    Bug 1888833

    Change-Id: Iba5bd447494107e0c9cb9e5951d6580044a7ae75
    Signed-off-by: Bhushan Rayrikar <brayrikar@nvidia.com>
    Reviewed-on: https://git-master.nvidia.com/r/1781101
    Reviewed-by: Automatic_Commit_Validation_User
    Reviewed-by: Vincent Chung <vincentc@nvidia.com>
    Reviewed-by: Ian Kaszubski <ikaszubski@nvidia.com>
    GVS: Gerrit_Virtual_Submit
    Reviewed-by: Jon Mayo <jmayo@nvidia.com>
    Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
    Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>

While the comment is probably right, and we should avoid overclocking CSI, the implementation should select the maximum needed clock rate of all the active sensors, not merely the needed clock rate for the first registered sensor.

Unfortunately my previously proposed fix overclocks the CSI :(

Here is another patch attempt that seems to fit my needs, using the max of all cdev->pixel_rate values instead of the first cdev->pixel_rate; it does not, however, handle the case of deregistering sensors.

diff --git a/drivers/video/tegra/camera/tegra_camera_platform.c b/drivers/video/tegra/camera/tegra_camera_platform.c
index 81aebfb..faa83c7 100644
--- a/drivers/video/tegra/camera/tegra_camera_platform.c
+++ b/drivers/video/tegra/camera/tegra_camera_platform.c
@@ -664,17 +693,22 @@ static void update_platform_data(struct tegra_camera_dev_info *cdev,
                if (cdev->bpp > 2)
                        info->ppc_divider = 2;
                if (cdev->sensor_type != SENSORTYPE_NONE)
                        info->total_sensor_lanes += cdev->lane_num;
+#if 0
                if (info->total_sensor_lanes <= info->num_device_lanes)
                        info->phy_pixel_rate += cdev->pixel_rate;
+#else
+               /* set phy_pixel_rate to max of all cdev->pixel_rate */
+               if (info->phy_pixel_rate < cdev->pixel_rate)
+                       info->phy_pixel_rate = cdev->pixel_rate;
+#endif
                if (info->max_pixel_depth < cdev->pixel_bit_depth)
                        info->max_pixel_depth = cdev->pixel_bit_depth;
                if (info->memory_latency < cdev->memory_latency)
                        info->memory_latency = cdev->memory_latency;
        } else {
                if (info->total_sensor_lanes <= info->num_device_lanes)
                        info->phy_pixel_rate -= cdev->pixel_rate;
                info->total_sensor_lanes -= cdev->lane_num;
        }
 }
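
Handling deregistration cleanly would mean recomputing the maximum over the remaining devices instead of subtracting; a rough sketch of such a helper, assuming a walkable device list (the list head and member names here are hypothetical):

/* Hypothetical helper: recompute phy_pixel_rate as the maximum over all
 * still-registered devices; it would also be called from the unregister
 * branch of update_platform_data(). */
static void recalc_phy_pixel_rate(struct tegra_camera_info *info)
{
        struct tegra_camera_dev_info *cdev;
        u64 max_pr = 0;

        list_for_each_entry(cdev, &info->devices, device_node) {
                if (cdev->pixel_rate > max_pr)
                        max_pr = cdev->pixel_rate;
        }

        info->phy_pixel_rate = max_pr;
}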

Can you submit that patch to Bhushan Rayrikar for comments?

This change may not work when launching multiple cameras at the same time.
The total_sensor_lanes should always be bigger than num_device_lanes; what is your total_sensor_lanes?