Mellanox Grid Director 4036E won't boot.

I have been asked to look at the 4036E named above. This is my first time working with Mellanox switches.

No warning LEDs; all are green. Power supply and fans are OK.

The switch starts to boot, then crashes at varying points in the boot sequence.

I am seeing a ‘Warning - Bad CRC’ before the switch decides to boot from the secondary flash.

The boot sequence creates 9 MTD partitions on the NOR flash.

When it gets to the NAND device line, it scans for bad blocks.

Then it creates 1 MTD partition on the NAND.

Later the kernel reports an access to a bad area.

Then we just get a call trace and an instruction dump, and the loading process halts.

After that, the line connection no longer responds.

I suspect a bad/faulty NAND flash chip.

Does anyone have any suggestions? Is the chip replaceable? Should I try reflashing the firmware?

I am not currently at that site. I will visit on Sunday, copy the full configuration, and post back here.

I would appreciate any suggestions or ideas.

Many thanks.

Switchman.
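The flash layout described above can be sanity-checked from the partition lines in the boot log posted further down this thread. The following is only a minimal sketch: the helper names (`parse_partitions`, `find_gaps`) are made up for illustration, and the only input is the nine NOR partition lines exactly as they appear in the log. It checks that the regions are contiguous and prints each size.

```python
import re

# Partition lines as printed in the boot log in this thread
# (NOR flash device "4cc000000.nor_flash").
BOOT_LOG = """\
0x00000000-0x001e0000 : “kernel”
0x001e0000-0x00200000 : “dtb”
0x00200000-0x01dc0000 : “ramdisk”
0x01dc0000-0x01fa0000 : “safe-kernel”
0x01fa0000-0x01fc0000 : “safe-dtb”
0x01fc0000-0x03b80000 : “safe-ramdisk”
0x03b80000-0x03f60000 : “config”
0x03f60000-0x03fa0000 : “u-boot env”
0x03fa0000-0x04000000 : “u-boot”
"""

# Accept both straight and typographic quotes, since forum software
# often converts one into the other.
PART_LINE = re.compile(
    r'0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s*:\s*[“"]([^”"]+)[”"]'
)


def parse_partitions(text):
    """Return a list of (name, start, end) tuples, in log order."""
    return [
        (m.group(3), int(m.group(1), 16), int(m.group(2), 16))
        for m in PART_LINE.finditer(text)
    ]


def find_gaps(parts):
    """Return (previous, next) name pairs where two regions do not abut."""
    return [(a[0], b[0]) for a, b in zip(parts, parts[1:]) if a[2] != b[1]]


parts = parse_partitions(BOOT_LOG)
for name, start, end in parts:
    size_kib = (end - start) // 1024
    print(f"{name:<14} {start:#010x}-{end:#010x} {size_kib:>6} KiB")

total_bytes = parts[-1][2] - parts[0][1]
print("gaps:", find_gaps(parts))
print("total:", total_bytes // (1024 * 1024), "MiB")
```

For the log in this thread, the nine regions abut with no gaps and end at 0x04000000, which is 64 MiB and agrees with the "FLASH: 64 MB" line U-Boot prints. That suggests the NOR partition map itself is intact, though it says nothing about the contents of each partition.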

Legacy equipment out of warranty.

Are we able to purchase a service contract or is this equipment unsupported?

Many thanks.

No unfortunately that does not work.

Running run flash_self_safe forces the switch to boot from the secondary flash, and we get exactly the same error output.

From U-Boot I attempted to download an image by TFTP. The file transfer begins, but then the switch outputs an error and boots into the same failure.

I opened the unit and there are 4 red LEDs, so I suspect a hardware failure.

The LEDs are as follows:

Top row of LEDs (next to RAM module, below the chassis fans)

D104 + D105 are RED

LEDs in bottom right of chassis

CPLD 2 R643 - D87 RED

CPLD 4 R645 - D89 RED

The LEDs turn red shortly after power is applied.

Do you know what may have failed? Are the failed components replaceable? This is a legacy unit, and another 7 switches may have a similar problem.

Kind regards.

Rav.

You should contact the sales team.

In U-Boot, try running the command:

run flash_self_safe

This will bring the 4036 to the primary kernel where you can recover using “update software”

I had saved the end of the output from one boot attempt; a second boot attempt follows at the bottom:

============================================================================

Intel/Sharp Extended Query Table at 0x010A

Intel/Sharp Extended Query Table at 0x010A

Intel/Sharp Extended Query Table at 0x010A

Intel/Sharp Extended Query Table at 0x010A

Using buffer write method

Using auto-unlock on power-up/resume

cfi_cmdset_0001: Erase suspend on write enabled

cmdlinepart partition parsing not available

RedBoot partition parsing not available

Creating 9 MTD partitions on “4cc000000.nor_flash”:

0x00000000-0x001e0000 : “kernel”

0x001e0000-0x00200000 : “dtb”

0x00200000-0x01dc0000 : “ramdisk”

0x01dc0000-0x01fa0000 : “safe-kernel”

0x01fa0000-0x01fc0000 : “safe-dtb”

0x01fc0000-0x03b80000 : “safe-ramdisk”

0x03b80000-0x03f60000 : “config”

0x03f60000-0x03fa0000 : “u-boot env”

0x03fa0000-0x04000000 : “u-boot”

NAND device: Manufacturer ID: 0x20, Chip ID: 0xda (ST Micro NAND 256MiB 3,3V 8-bit)

Scanning device for bad blocks

Creating 1 MTD partitions on “4e0000000.ndfc.nand”:

0x00000000-0x10000000 : “log”

i2c /dev entries driver

IBM IIC driver v2.1

ibm-iic(): using standard (100 kHz) mode

ibm-iic(): using standard (100 kHz) mode

i2c-2: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 0)

i2c-3: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 1)

i2c-4: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 2)

i2c-5: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 3)

rtc-ds1307 6-0068: rtc core: registered ds1338 as rtc0

rtc-ds1307 6-0068: 56 bytes nvram

i2c-6: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 4)

i2c-7: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 5)

i2c-8: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 6)

i2c-9: Virtual I2C bus (Physical bus i2c-0, multiplexer 0x70 port 7)

pca954x 0-0070: registered 8 virtual busses for I2C switch pca9548

TCP cubic registered

NET: Registered protocol family 10

lo: Disabled Privacy Extensions

IPv6 over IPv4 tunneling driver

sit0: Disabled Privacy Extensions

ip6tnl0: Disabled Privacy Extensions

NET: Registered protocol family 17

RPC: Registered udp transport module.

RPC: Registered tcp transport module.

rtc-ds1307 6-0068: setting system clock to 2000-01-18 01:06:09 UTC (948157569)

RAMDISK: Compressed image found at block 0

VFS: Mounted root (ext2 filesystem) readonly.

Freeing unused kernel memory: 172k init

init started: BusyBox v1.12.2 (2011-01-03 14:13:22 IST)

starting pid 15, tty ‘’: ‘/etc/rc.d/rcS’

mount: no /proc/mounts

Mounting /proc and /sys

Mounting filesystems

Loading module Voltaire

Empty flash at 0x0cdcf08c ends at 0x0cdcf800

Starting crond:

Starting telnetd:

ibsw-init.sh start…

Tue Jan 18 01:06:42 UTC 2000

INSTALL FLAG 0x0

starting syslogd & klogd …

Starting ISR: Unable to handle kernel paging request for data at address 0x0000001e

Faulting instruction address: 0xc00ec934

Oops: Kernel access of bad area, sig: 11 [#1]

Voltaire

Modules linked in: ib_is4(+) ib_umad ib_sa ib_mad ib_core memtrack Voltaire

NIP: c00ec934 LR: c00ec930 CTR: 00000000

REGS: d7bdfd10 TRAP: 0300 Not tainted (2.6.26)

MSR: 00029000 <EE,ME> CR: 24000042 XER: 20000000

DEAR: 0000001e, ESR: 00000000

TASK = d7b9c800[49] ‘jffs2_gcd_mtd9’ THREAD: d7bde000

GPR00: 00000001 d7bdfdc0 d7b9c800 00000000 000000d0 00000003 df823040 0000007f

GPR08: 22396d59 d9743920 c022de58 00000000 24000024 102004bc c026b9a0 c026b910

GPR16: c026b954 c026b630 c026b694 c022b790 d8938150 d8301000 c022b758 d7bdfe30

GPR24: 00000000 0000037c d8301400 00000abf d9743d80 00000000 d8938158 df823000

NIP [c00ec934] jffs2_get_inode_nodes+0xb6c/0x1020

LR [c00ec930] jffs2_get_inode_nodes+0xb68/0x1020

Call Trace:

[d7bdfdc0] [c00ec758] jffs2_get_inode_nodes+0x990/0x1020 (unreliable)

[d7bdfe20] [c00ece28] jffs2_do_read_inode_internal+0x40/0x9e8

[d7bdfe90] [c00ed838] jffs2_do_crccheck_inode+0x68/0xa4

[d7bdff00] [c00f1ed8] jffs2_garbage_collect_pass+0x160/0x664

[d7bdff50] [c00f36c8] jffs2_garbage_collect_thread+0xf0/0x118

[d7bdfff0] [c000bdb8] kernel_thread+0x44/0x60

Instruction dump:

7f805840 409c000c 801d0004 48000008 801d0008 2f800000 409effdc 2f9d0000

40be0010 48000180 4802ba05 7c7d1b78 7fa3eb78 2f800000 409effec

---[ end trace b57e19dd3d61c6af ]---

ib_is4 0000:81:00.0: ep0_dev_name 0000:81:00.0

Unable to handle kernel paging request for data at address 0x00000034

Faulting instruction address: 0xc002f3b0

Oops: Kernel access of bad area, sig: 11 [#2]

Voltaire

Modules linked in: is4_cmd_driver ib_is4 ib_umad ib_sa ib_mad ib_core memtrack Voltaire

NIP: c002f3b0 LR: c002fb00 CTR: c00f3a10

REGS: df8a3de0 TRAP: 0300 Tainted: G D (2.6.26)

MSR: 00021000 CR: 24544e88 XER: 20000000

DEAR: 00000034, ESR: 00000000

TASK = df88e800[8] ‘pdflush’ THREAD: df8a2000

GPR00: c002fb00 df8a3e90 df88e800 00000001 d7b9c800 d7b9c800 00000000 00000001

GPR08: 00000001 00000000 24544e22 00000002 00004b1a 67cfb19f 1ffef400 00000000

GPR16: 1ffe42d8 00000000 1ffebfa4 00000000 00000000 00000004 c0038778 c0261ac4

GPR24: 00000001 c02f0000 00000000 d7b9c800 00000001 d7b9c800 00000000 d8301400

NIP [c002f3b0] prepare_signal+0x1c/0x1a4

LR [c002fb00] send_signal+0x28/0x214

Call Trace:

[df8a3e90] [c0021bb8] check_preempt_wakeup+0xd8/0x110 (unreliable)

[df8a3eb0] [c002fb00] send_signal+0x28/0x214

[df8a3ed0] [c002fe40] send_sig_info+0x28/0x48

[df8a3ef0] [c00f35c4] jffs2_garbage_collect_trigger+0x3c/0x50

[df8a3f00] [c00f3a40] jffs2_write_super+0x30/0x5c

[df8a3f10] [c007340c] sync_supers+0x80/0xd0

[df8a3f30] [c0054dc8] wb_kupdate+0x48/0x150

[df8a3f90] [c0055434] pdflush+0x104/0x1a4

[df8a3fe0] [c00387c4] kthread+0x4c/0x88

[df8a3ff0] [c000bdb8] kernel_thread+0x44/0x60

Instruction dump:

80010034 bb810020 7c0803a6 38210030 4e800020 9421ffe0 7c0802a6 bf810010

90010024 7c9d2378 83c4034c 7c7c1b78 <801e0034> 70090008 40820100 2f83001f

---[ end trace b57e19dd3d61c6af ]---

------------[ cut here ]------------

Badness at kernel/exit.c:965

NIP: c00273f0 LR: c000a03c CTR: c013b2b4

REGS: df8a3cb0 TRAP: 0700 Tainted: G D (2.6.26)

MSR: 00021000 CR: 24544e22 XER: 20000000

TASK = df88e800[8] ‘pdflush’ THREAD: df8a2000

GPR00: 00000001 df8a3d60 df88e800 0000000b 00002d73 ffffffff c013e13c c02eb620

GPR08: 00000001 00000001 00002d73 00000000 24544e84 67cfb19f 1ffef400 00000000

GPR16: 1ffe42d8 00000000 1ffebfa4 00000000 00000000 00000004 c0038778 c0261ac4

GPR24: 00000001 c02f0000 00000000 d7b9c800 df8a3de0 0000000b df88e800 0000000b

NIP [c00273f0] do_exit+0x24/0x5ac

LR [c000a03c] kernel_bad_stack+0x0/0x4c

Call Trace:

[df8a3d60] [00002d41] 0x2d41 (unreliable)

[df8a3da0] [c000a03c] kernel_bad_stack+0x0/0x4c

[df8a3dc0] [c000ef90] bad_page_fault+0xb8/0xcc

[df8a3dd0] [c000c4c8] handle_page_fault+0x7c/0x80

[df8a3e90] [c0021bb8] check_preempt_wakeup+0xd8/0x110

[df8a3eb0] [c002fb00] send_signal+0x28/0x214

[df8a3ed0] [c002fe40] send_sig_info+0x28/0x48

[df8a3ef0] [c00f35c4] jffs2_garbage_collect_trigger+0x3c/0x50

[df8a3f00] [c00f3a40] jffs2_write_super+0x30/0x5c

[df8a3f10] [c007340c] sync_supers+0x80/0xd0

[df8a3f30] [c0054dc8] wb_kupdate+0x48/0x150

[df8a3f90] [c0055434] pdflush+0x104/0x1a4

[df8a3fe0] [c00387c4] kthread+0x4c/0x88

[df8a3ff0] [c000bdb8] kernel_thread+0x44/0x60

Instruction dump:

bb61000c 38210020 4e800020 9421ffc0 7c0802a6 bf010020 90010044 7c7f1b78

7c5e1378 800203e0 3160ffff 7d2b0110 <0f090000> 54290024 8009000c 5409012f

U-Boot 1.3.4.32 (Feb 6 2011 - 10:18:30)

CPU: AMCC PowerPC 460EX Rev. B at 666.666 MHz (PLB=166, OPB=83, EBC=83 MHz)

Security/Kasumi support

Bootstrap Option E - Boot ROM Location EBC (16 bits)

Internal PCI arbiter disabled

32 kB I-Cache 32 kB D-Cache

Board: 4036QDR - Voltaire 4036 QDR Switch Board

I2C: ready

DRAM: 512 MB (ECC enabled, 333 MHz, CL3)

FLASH: 64 MB

NAND: 256 MiB

*** Warning - bad CRC, using default environment

MAC Address: 00:08:F1:20:52:E8

PCIE1: successfully set as root-complex

PCIE: Bus Dev VenId DevId Class Int

01 00 15b3 bd34 0c06 00

Net: ppc_4xx_eth0

Type run flash_nfs to mount root filesystem over NFS

Hit any key to stop autoboot: 0

=> run flash_nfs

Booting kernel from Legacy Image at fc000000 …

Image Name: Linux-2.6.26

Image Type: PowerPC Linux Kernel Image (gzip compressed)

Data Size: 1406000 Bytes = 1.3 MB

Load Address: 00000000

Entry Point: 00000000

Verifying Checksum … OK

Uncompressing Kernel Image … OK
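For anyone triaging a similar capture, here is a minimal sketch that pulls the key facts out of each kernel Oops in a saved console log: the Oops number, the reason, the task that was running, and the function at the faulting instruction. The function name `summarize_oopses` and the regular expressions are illustrative assumptions based only on the line formats visible in the log above. The sample input is copied from that log.

```python
import re

# Patterns matched against the line formats seen in the pasted log.
OOPS = re.compile(r"Oops: (.+?), sig: (\d+) \[#(\d+)\]")
TASK = re.compile(r"TASK = \S+ [‘']([^’']+)[’']")
NIP = re.compile(r"NIP \[([0-9a-f]+)\] ([^+\s]+)")

# A few lines copied from the boot log in this thread.
SAMPLE = """\
Oops: Kernel access of bad area, sig: 11 [#1]
NIP: c00ec934 LR: c00ec930 CTR: 00000000
TASK = d7b9c800[49] ‘jffs2_gcd_mtd9’ THREAD: d7bde000
NIP [c00ec934] jffs2_get_inode_nodes+0xb6c/0x1020
Oops: Kernel access of bad area, sig: 11 [#2]
TASK = df88e800[8] ‘pdflush’ THREAD: df8a2000
NIP [c002f3b0] prepare_signal+0x1c/0x1a4
"""


def summarize_oopses(log_text):
    """Return one dict per 'Oops:' line, filled from the lines after it."""
    results = []
    current = None
    for line in log_text.splitlines():
        m = OOPS.search(line)
        if m:
            current = {
                "number": int(m.group(3)),
                "reason": m.group(1),
                "task": None,
                "function": None,
            }
            results.append(current)
            continue
        if current is None:
            continue  # nothing to attach to before the first Oops
        m = TASK.search(line)
        if m and current["task"] is None:
            current["task"] = m.group(1)
        m = NIP.search(line)
        if m and current["function"] is None:
            current["function"] = m.group(2)
    return results


summary = summarize_oopses(SAMPLE)
for entry in summary:
    print(entry)
```

On the log above this reports Oops #1 in task jffs2_gcd_mtd9 at jffs2_get_inode_nodes and Oops #2 in pdflush at prepare_signal, so both faults are on JFFS2 code paths. If MTD devices are numbered in creation order (an assumption, not something the log states), mtd9 would be the single NAND "log" partition created after the nine NOR partitions, which would be consistent with the suspicion of a NAND problem raised earlier in the thread.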

In this case, I think you need to RMA the switch, if it is still under warranty.

Thank you for the assistance.

This call can now be closed.

Hello Rav,

Mellanox does not sell service contracts for 4036E switches anymore. The product is at the EOL stage.

For more information, please refer to our EOL info page at: http://www.mellanox.com/page/eol

Sorry we couldn’t assist you.

Thanks