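This post was last edited by Juzi丶 on 2021-10-8 22:43
Adding a reply to cover the switch's initial setup and commonly used settings.
Both power supplies must be connected to power.
Have PuTTY ready, or whatever terminal tool you prefer.
The front panel has two RJ45 ports:
the upper one is the ETH management port,
the lower one is the Console port.
First, connect the upper ETH management port to your router; it will then obtain an IP address via DHCP.
The first boot requires a connection to the lower Console port.
Then configure the serial session as shown in the figure below. For the USB-to-Console cable, I recommend one with an FTDI chip.
Once the messages stop scrolling on screen, press Enter and the console login will appear.
Log in to the console with the default username and password, both admin.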
- NVIDIA Onyx Switch Management
- switch-xxxxxx login: admin
- Password:
- Number of total successful connections since last 1 days: 0
- Your password has been changed
- NVIDIA Switch
The first prompt asks whether you want to run the initial setup:
- Do you want to use the wizard for initial configuration?
Enter yes.
Then follow the wizard and enter some basic information:
- Step 1: Hostname?
- Step 2: Use DHCP on mgmt0 interface?
- Step 3: Enable IPv6?
- Step 4: Update time?
- Step 5: Enable password hardening?
- Step 6: Admin password (Must be typed)?
- Step 6: Confirm admin password?
- Step 7: Monitor password (Must be typed)?
- Step 7: Confirm monitor password?
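Password hardening rejects simple passwords: if you answer yes, the two passwords that follow must each contain uppercase letters, lowercase letters, digits, and symbols.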
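If you want to check a candidate password before typing it into the wizard, here is a minimal Python sketch of the rule as described above. It only tests the four character classes mentioned in this post; the function name is made up for illustration, and any further rules Onyx may enforce (minimum length, history, and so on) are not covered.

```python
import string


def missing_character_classes(password):
    """Return the character classes from the post's description that `password` lacks."""
    checks = {
        "uppercase": any(c.isupper() for c in password),
        "lowercase": any(c.islower() for c in password),
        "digit": any(c.isdigit() for c in password),
        # Assumption: "symbol" means ASCII punctuation.
        "symbol": any(c in string.punctuation for c in password),
    }
    return [name for name, present in checks.items() if not present]


# The default password is missing three of the four classes.
print(missing_character_classes("admin"))
# A hypothetical password that covers all four classes gives an empty list.
print(missing_character_classes("Sw1tch#Lab"))
```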
Here is my own session as an example:
- NVIDIA Onyx Switch Management
- switch-xxxxxx login: admin
- Password:
- Number of total successful connections since last 1 days: 0
- Your password has been changed
- NVIDIA Switch
- Configuration wizard
- Do you want to use the wizard for initial configuration?
- Step 1: Hostname? [switch-xxxxxx]
- Step 2: Use DHCP on mgmt0 interface? [yes]
- Step 3: Enable IPv6? [yes] no
- Step 4: Update time? [2021/10/08 03:57:40]
- Step 5: Enable password hardening? [yes] no
- Step 6: Admin password (Must be typed)?
- Step 6: Confirm admin password?
- Step 7: Monitor password (Must be typed)?
- Step 7: Confirm monitor password?
- You have entered the following information:
- 1. Hostname: switch-xxxxxx
- 2. Use DHCP on mgmt0 interface: yes
- 3. Enable IPv6: no
- 4. Update time: 2021/10/08 03:57:59
- 5. Enable password hardening: no
- 6. Admin password (Must be typed): (CHANGED)
- 7. Monitor password (Must be typed): (CHANGED)
- To change an answer, enter the step number to return to.
- Otherwise hit <enter> to save changes and exit.
- Choice:
- Zero-touch is disabled
- Configuration changes saved.
- To return to the wizard from the CLI, enter the "configuration jump-start"
- command from configure mode. Launching CLI...
- switch-xxxxxx [standalone: master] >
Next, enter configuration mode. To change any setting from the CLI you must first enter enable and then configure terminal.
- switch-xxxxxx [standalone: master] >
- switch-xxxxxx [standalone: master] > enable
- switch-xxxxxx [standalone: master] # configure terminal
- switch-xxxxxx [standalone: master] (config) #
Then, before anything else, unlock the modules:
- switch-xxxxxx [standalone: master] (config) # fae cable-stamping-unlock 100g_lr4
- switch-xxxxxx [standalone: master] (config) # fae cable-stamping-unlock 40g_lr4
- switch-xxxxxx [standalone: master] (config) # fae cable-stamping-unlock eth_100g
- switch-xxxxxx [standalone: master] (config) # fae cable-stamping-unlock eth_sfp_25g
No change is saved automatically. In the CLI you have to enter a command to save; in the web UI, use SAVE in the upper-right corner or the save icon.
- switch-xxxxxx[standalone: master] (config) # configuration write
- switch-xxxxxx[standalone: master] (config) #
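The fans spin down to 20% (6000~7000 RPM) after about 25 minutes.
After power-on the fan speed is 60%, and it drops by 10% roughly every 5 minutes.
The initial configuration above set the management port to obtain its IP via DHCP, so we can check the management port status with the following command: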
- switch-xxxxxx [standalone: master] (config) # show interfaces mgmt0 brief
- Interface mgmt0 status:
- Comment :
- VRF : mgmt
- Admin up : yes
- Link up : yes
- DHCP running : yes
- IP address : 10.0.0.181
- Netmask : 255.0.0.0
- IPv6 enabled : no
- Speed : 1000Mb/s (auto)
- Duplex : full (auto)
- Interface type : ethernet
- Interface source: bridge
- Bonding master : vrf_mgmt
- MTU : 1500
- HW address : xx:xx:xx:xx:xx:xx
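You can see that the IP obtained is 10.0.0.181.
From this point you can log in to the console over SSH using that IP instead of the Console port.
As before, log in with your username and password, then enter enable and configure terminal to get into configuration mode.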
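If you would rather pull the address out of the status output with a script, the listing is plain "key : value" text. The following is a rough Python sketch, assuming the output has the same shape as the listing above; `parse_status` is a name I made up, and the sample text is a shortened copy of that listing.

```python
import ipaddress


def parse_status(text):
    """Turn 'key : value' lines into a dict, skipping lines without a value."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("- ").strip()
        # Split at the first colon only, so MAC addresses stay intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


sample = """\
Interface mgmt0 status:
Admin up : yes
Link up : yes
DHCP running : yes
IP address : 10.0.0.181
Netmask : 255.0.0.0
HW address : xx:xx:xx:xx:xx:xx
"""

status = parse_status(sample)
# Combine address and netmask into a single interface object.
iface = ipaddress.ip_interface(f"{status['IP address']}/{status['Netmask']}")
print(status["IP address"], iface.network)
```

The header line and empty fields such as Comment are dropped because they carry no value.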
25G/100G links require FEC to be configured.
Here is an example:
- # Disable auto-negotiation on the port and force the port speed to 25G
- switch-xxxxxx [standalone: master] (config) # interface ethernet 1/7 speed 25G no-autoneg force
- # Set the FEC mode; the completion list shows three options (RS, FC, and no FEC). RS is chosen here
- switch-xxxxxx [standalone: master] (config) # interface ethernet 1/7 fec-override
- fc-fec no-fec rs-fec
- switch-xxxxxx [standalone: master] (config) # interface ethernet 1/7 fec-override rs-fec force
- switch-xxxxxx [standalone: master] (config) #
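When several ports need the same treatment, the two commands above can be generated rather than typed by hand. This is just a sketch that reuses the exact command forms from the listing above; `fec_commands` is a hypothetical helper, and port 1/8 is only there for illustration. It prints the commands and does not talk to the switch.

```python
FEC_MODES = ("rs-fec", "fc-fec", "no-fec")  # the three options offered by fec-override


def fec_commands(port, speed="25G", fec="rs-fec"):
    """Build the speed and FEC commands for one port, in the form used in the post."""
    if fec not in FEC_MODES:
        raise ValueError(f"unknown FEC mode: {fec}")
    return [
        f"interface ethernet {port} speed {speed} no-autoneg force",
        f"interface ethernet {port} fec-override {fec} force",
    ]


# 1/7 is the port from the example; 1/8 is a made-up second port.
for port in ("1/7", "1/8"):
    for command in fec_commands(port):
        print(command)
```

Paste the printed lines into a session that is already in configuration mode, and remember to save afterwards.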
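If the computer has a Mellanox NIC, you can install the driver and MFT from the official website and then use mlxlink to verify the link mode.
The MFT tools must be run as administrator.
mlxlink.bat -d mt4117_pciconf0 queries the first port of the NIC
mlxlink.bat -d mt4117_pciconf0.1 queries the second port of the NIC
Here is an example (Windows):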
- Windows PowerShell
- Copyright (C) Microsoft Corporation. All rights reserved.
- Try the new cross-platform PowerShell https://aka.ms/pscore6
- PS C:\Windows\system32> cmd
- Microsoft Windows [Version 10.0.19044.1237]
- (c) Microsoft Corporation. All rights reserved.
- C:\Windows\system32>cd C:\Program Files\Mellanox\WinMFT
- C:\Program Files\Mellanox\WinMFT>
- C:\Program Files\Mellanox\WinMFT>mlxfwmanager.exe
- Querying Mellanox devices firmware ...
- Device #1:
- ----------
- Device Type: ConnectX4LX
- Part Number: MCX4121A-ACU_Ax
- Description: ConnectX-4 Lx EN network interface card; 25GbE dual-port SFP28; PCIe3.0 x8; UEFI Enabled; tall bracket
- PSID: MT_0000000266
- PCI Device Name: mt4117_pciconf0
- Base MAC:
- Versions: Current Available
- FW 14.31.1014 N/A
- PXE 3.6.0403 N/A
- UEFI 14.24.0013 N/A
- Status: No matching image found
- C:\Program Files\Mellanox\WinMFT>mlxlink.bat -d mt4117_pciconf0.1
- Operational Info
- ----------------
- State : Active
- Physical state : LinkUp
- Speed : 25GbE
- Width : 1x
- FEC : Standard RS-FEC - RS(528,514)
- Loopback Mode : No Loopback
- Auto Negotiation : ON
- Supported Info
- --------------
- Enabled Link Speed : 0x38007013 (25G,10G,1G)
- Supported Cable Speed : 0x38007013 (25G,10G,1G)
- Troubleshooting Info
- --------------------
- Status Opcode : 0
- Group Opcode : N/A
- Recommendation : No issue was observed.
- C:\Program Files\Mellanox\WinMFT>
As you can see, the port's FEC is already running in RS mode (negotiated automatically based on the switch's settings).
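To make that check repeatable, you could save the mlxlink output to a file and test the two relevant fields. A small Python sketch follows, assuming the output looks like the listing above; the helper names are mine, and the sample is a trimmed copy of that listing.

```python
import re


def mlxlink_field(output, name):
    """Return the value of the line whose label is exactly `name`, or None."""
    pattern = rf"^[-\s]*{re.escape(name)}\s*:\s*(.+?)\s*$"
    match = re.search(pattern, output, re.MULTILINE)
    return match.group(1) if match else None


def runs_rs_fec(output):
    """True when the link is active and the FEC field mentions RS-FEC."""
    state = mlxlink_field(output, "State")
    fec = mlxlink_field(output, "FEC")
    return state == "Active" and fec is not None and "RS-FEC" in fec


sample = """\
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 25GbE
FEC : Standard RS-FEC - RS(528,514)
"""

print(mlxlink_field(sample, "Speed"), runs_rs_fec(sample))
```

Anchoring the label at the start of the line keeps "State" from matching the "Physical state" line.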
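Use mlxcables to view the module model, transmit and receive optical power, and other information.
mlxcables.bat -d mt4117_pciconf0_cable_0 queries the first port of the NIC
mlxcables.bat -d mt4117_pciconf0_cable_1 queries the second port of the NIC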
- C:\Program Files\Mellanox\WinMFT>mlxcables.bat -d mt4117_pciconf0_cable_1 -q
- Querying Cables ....
- Cable #1:
- ---------
- Cable name : mt4117_pciconf0_cable_1
- >> No FW data to show
- -------- Cable EEPROM --------
- Identifier : SFP/SFP+/SFP28 (03h)
- Technology : Transceiver
- Compliance : Unspecified
- OUI : 0xac4afe
- Vendor : Hisense
- Serial number : UBU9C083728
- Part number : LTF1325-BH1
- Revision : A
- Temperature : N/A
- Length : 0 m
- C:\Program Files\Mellanox\WinMFT>mlxcables.bat -d mt4117_pciconf0_cable_1 -DDM
- Cable DDM:
- ----------
- Temperature : 52C
- Voltage : 3.2639V
- RX Power : -1.1351dBm
- TX Power : -2.1120dBm
- TX Bias : 53.8300mA
- ----- Flags -----
- Temperature:
- Alarm high : 0
- Warning high : 0
- Warning low : 0
- Alarm low : 0
- Voltage:
- Alarm high : 0
- Warning high : 0
- Warning low : 0
- Alarm low : 0
- RX/TX Power and TX Bias:
- RX Power alarm high : 0
- RX Power warning high: 0
- RX Power warning low : 0
- RX Power alarm low : 0
- TX Power alarm high : 0
- TX Power warning high: 0
- TX Power warning low : 0
- TX Power alarm low : 0
- TX Bias alarm high : 0
- TX Bias warning high : 0
- TX Bias warning low : 0
- TX Bias alarm low : 0
- ----- Thresholds -----
- Temperature high alarm threshold : 95C
- Temperature high warning threshold : 85C
- Temperature low warning threshold : -40C
- Temperature low alarm threshold : -50C
- Voltage high alarm threshold : 3.6300V
- Voltage high warning threshold: 3.4650V
- Voltage low warning threshold: 3.1350V
- Voltage low alarm threshold: 2.9700V
- RX Power high alarm threshold : 5.0000dBm
- RX Power high warn threshold : 2.0000dBm
- RX Power low warn threshold : -10.5012dBm
- RX Power low alarm threshold : -13.4969dBm
- TX Power high alarm threshold : 5.0000dBm
- TX Power high warn threshold : 2.0000dBm
- TX Power low warn threshold : -7.0006dBm
- TX Power low alarm threshold : -10.0000dBm
- TX Bias high alarm threshold : 110.0000mA
- TX Bias high warn threshold : 100.0000mA
- TX Bias low warn threshold : 1.0000mA
- TX Bias low alarm threshold : 1.0000mA
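The DDM readings only mean something relative to the thresholds, so here is a short Python sketch that classifies a reading against the four thresholds and converts dBm to milliwatts (mW = 10^(dBm/10)). The function names are my own, and treating a value strictly beyond a threshold as a violation is an assumption on my part; the numbers are copied from the output above.

```python
def dbm_to_mw(dbm):
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (dbm / 10)


def classify(value, high_alarm, high_warn, low_warn, low_alarm):
    """Return 'alarm', 'warning' or 'ok' for one reading."""
    if value > high_alarm or value < low_alarm:
        return "alarm"
    if value > high_warn or value < low_warn:
        return "warning"
    return "ok"


rx_power = -1.1351  # dBm, from the DDM output
tx_power = -2.1120  # dBm, from the DDM output

print("RX", classify(rx_power, 5.0, 2.0, -10.5012, -13.4969), round(dbm_to_mw(rx_power), 3), "mW")
print("TX", classify(tx_power, 5.0, 2.0, -7.0006, -10.0), round(dbm_to_mw(tx_power), 3), "mW")
print("Temperature", classify(52, 95, 85, -40, -50))
```

All three readings from the example land in the ok range, which matches the all-zero flags in the output.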
For other switch interface settings, see the official documentation:
https://docs.mellanox.com/display/Onyxv393202/Ethernet+Interfaces
https://docs.mellanox.com/displa ... +Interface+Commands
RoCE documentation
https://docs.mellanox.com/pages/viewpage.action?pageId=56986516
Chassis management
https://docs.mellanox.com/display/Onyxv393202/Chassis+Management
If you notice anything abnormal, you can capture the logs to see the details.
If the log contains errors like the following:
- Oct 7 04:28:57 switch-xxxxxx temp_control[8490]: [tc.NOTICE]: Read all qsfp temperatures properly:[false], changing dynamic ambient mode
-
- Oct 7 04:28:57 switch-xxxxxx temp_control[8490]: [tc.NOTICE]: Dynamic ambient usage: Enabled. NOT all qsfps were read properly
-
- Oct 7 04:28:57 switch-xxxxxx temp_control[8490]: [tc.NOTICE]: minimum chassis fan speed - previous [20%] current [50%] after reading ambient temperature of [33.50 C]
-
- Oct 7 04:28:57 switch-xxxxxx temp_control[8490]: [tc.NOTICE]: Fan:[/MGMT/FAN1/f1], interval:[57] ,in affected area, max temperatures: ASIC:[48] X86:[33] QSFP_CABLE:[0] , Updating fan speed from:[20%] to:[50%]
Note these two lines:
Read all qsfp temperatures properly:[false], changing dynamic ambient mode
Dynamic ambient usage: Enabled. NOT all qsfps were read properly
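They mean that temp_control did not read the modules' temperature data correctly.
Even if running
show interfaces ethernet x/x transceiver diagnostics
does return the module temperature,
chassis management will still raise the minimum fan speed to 40% (11000~12000 RPM).
If you run into this, replace the module, or use official modules
(or wait for NVIDIA to have a change of heart and fix it, or switch to Cumulus Linux or SONiC, or maybe your environment lets you simply ignore the higher fan speed)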
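To spot this condition in a captured log without reading every line, something like the Python sketch below could work. It matches only the message formats quoted above; the `[true]` variant of the first message and the function name are assumptions, since the post only shows the `[false]` case.

```python
import re

# Patterns taken from the temp_control messages quoted in the post.
QSFP_READ = re.compile(r"Read all qsfp temperatures properly:\[(true|false)\]")
FAN_CHANGE = re.compile(r"Updating fan speed from:\[(\d+)%\] to:\[(\d+)%\]")


def scan_temp_control(lines):
    """Count failed QSFP temperature reads and collect fan speed changes."""
    failed_reads = 0
    fan_changes = []
    for line in lines:
        if "temp_control" not in line:
            continue
        read = QSFP_READ.search(line)
        if read and read.group(1) == "false":
            failed_reads += 1
        change = FAN_CHANGE.search(line)
        if change:
            fan_changes.append((int(change.group(1)), int(change.group(2))))
    return failed_reads, fan_changes


log = [
    "Oct 7 04:28:57 switch-xxxxxx temp_control[8490]: [tc.NOTICE]: Read all qsfp temperatures properly:[false], changing dynamic ambient mode",
    "Oct 7 04:28:57 switch-xxxxxx temp_control[8490]: [tc.NOTICE]: Fan:[/MGMT/FAN1/f1], interval:[57] ,in affected area, max temperatures: ASIC:[48] X86:[33] QSFP_CABLE:[0] , Updating fan speed from:[20%] to:[50%]",
]

failed_reads, fan_changes = scan_temp_control(log)
print(failed_reads, fan_changes)
```

A non-zero count of failed reads together with an upward fan change is the pattern described above.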