Cisco images failure after few minutes

bgp-lu
Posts: 5
Joined: Thu Jan 09, 2020 7:26 pm

Post by bgp-lu » Wed May 11, 2022 7:17 pm

Hello there,

I'm trying to bring up a topology that contains several nodes. Until now, I had never tried to bring up all of the nodes at once. I already know that the hardware is limited.

Server specs:
Hypervisor: VMware ESXi, 6.5.0, 5969303
Model: UCSC-C220-M4S
Type of processor: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Logical Processors: 32
NIC: 4
Total memory: 128 GB

Images used:
XRv9-k9full 7.6.1 x8
XRv-6.6.2 x2
CSR1000v 17.03.05 x4
NE40E V800R011C00SPC607B607 x15
vMX 18.4R1.8 x1

The NE40E, vMX and XRv work flawlessly, but the XRv9 and CSR1000v run into problems a couple of minutes after initializing. The XRv9 crashes, or appears to shut itself down, and the CSR1000v logs CPU-related messages and freezes for a few minutes before coming back.

XRv9

0/RP0/ADMIN0:May 11 18:10:25.150 UTC: vm_manager[3262]: %INFRA-VM_MANAGER-3-MSG_HEARTBEAT_FAILURE : VM default-sdr--1 failed to maintain heartbe 
0/RP0/ADMIN0:May 11 18:10:25.169 UTC: sdr_mgr[3216]: %SM-SDR_MANAGER-3-MSG_VM_RELOAD_ON_HB_FAILURE : Info :SDR NM : VM Reload on HB failure, sdr 
0/RP0/ADMIN0:May 11 18:10:25.170 UTC: sdr_mgr[3216]: %SM-SDR_MANAGER-3-MSG_VM_UNGRACEFUL_RELOAD_TOO_OFTEN : Info :sdr default-sdr vm_id 1 ungrac 
[18:10:44.777] Sending KILL signal to processmgr..
[18:10:44.777] Sending KILL signal to ds..
PM disconnect successStopping OpenBSD Secure Shell server: sshdinitctl: Unknown instance: 
The audit system is disabled
Stopping system message bus: dbus.
Stopping random number generator daemon.
Stopping system log daemon...0
Stopping kernel log daemon...0
Stopping internet superserver: xinetd.
Stopping crond: OK
Stopping rpcbind daemon...
done.
Libvirt not initialized for container instance
Deconfiguring network interfaces... done.
Sending all processes the KILL signal...
Unmounting remote filesystems...
Deactivating swap...
Unmounting local filesystems...
Connection closed by foreign host.
Wed May 11 18:11:24 UTC 2022 (/opt/cisco/hostos/bin/xr_con_telnet_wrapper.sh): XR console connection lost to port 9001

CSR1000v

*May 11 18:24:10.581: %PLATFORM-4-ELEMENT_WARNING: R0/0: smand: RP/0: 5-Minute Load Average value 9.49 exceeds warning level 8.00.
*May 11 18:24:44.445: %EVENTLIB-3-CPUHOG: R0/0: hman: undefined: 1311ms, Traceback=1#08ca21ba637c850b75436450ffff3b6d   c:7FA1A2665000+37370 c:7FA1A2665000+15BC9C :564DD7383000+2CDCA :564DD7383000+2D518 :564DD7383000+49343 uipeer:7FA1ACCD2000+3F6A9 uipeer:7FA1ACCD2000+1ED06 evlib:7FA1AE2F7000+9145 evlib:7FA1AE2F7000+9A9C orchestrator_lib:7FA1A94CC000+CE31 orchestrator_lib:7FA1A94CC000+CDB4
*May 11 18:24:44.472: %EVENTLIB-3-CPUHOG: R0/0: hman: undefined: 1135ms, Traceback=1#08ca21ba637c850b75436450ffff3b6d   c:7FA1A2665000+37370 c:7FA1A2665000+EACA4 c:7FA1A2665000+7BCFB c:7FA1A2665000+7BE9D c:7FA1A2665000+6FFA2 procmib_lib:7FA1A7581000+6472 :564DD7383000+4FAB4 evlib:7FA1AE2F7000+9145 evlib:7FA1AE2F7000+9A9C orchestrator_lib:7FA1A94CC000+CE31 orchestrator_lib:7FA1A94CC000+CDB4
*May 11 18:25:00.072: %EVENTLIB-3-CPUHOG: R0/0: smd: write asyncon 0x55df3a8908e8: 136ms, Traceback=1#aacc8f6f6ff3ee394cf2c4311553234a   c:7F38368EF000+37370 pthread:7F3836AAF000+117FA bipc:7F384D54A000+5192 evutil:7F385A7E4000+9CD2 evlib:7F385B6CB000+8D8E evlib:7F385B6CB000+9A9C orchestrator_lib:7F385B4A7000+CE31 orchestrator_lib:7F385B4A7000+CDB4 luajit:7F3837461000+7C696 luajit:7F3837461000+35C44 luajit:7F3837461000+BFF9

Has anyone tried using newer Cisco XRv9 and CSR1000v images?
Is it possible that the problems are related to storage? (The EVE VM is located on a VMware datastore, not on local storage.)
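
For anyone comparing runs, here is a minimal Python sketch that pulls the load-average warnings and CPUHOG stall times out of a saved console capture. The regular expressions are based only on the CSR1000v lines quoted above, the sample text is abbreviated (tracebacks omitted), and the function name `summarize` is made up for this example.

```python
import re

# "5-Minute Load Average" warning: reported value and the warning threshold.
LOAD_RE = re.compile(
    r"5-Minute Load Average value (?P<value>\d+\.\d+) "
    r"exceeds warning level (?P<limit>\d+\.\d+)"
)
# CPUHOG entry: process name, then the stall duration in milliseconds.
CPUHOG_RE = re.compile(
    r"%EVENTLIB-3-CPUHOG: \S+ (?P<proc>\w+): .*?: (?P<ms>\d+)ms"
)

# Abbreviated copies of the log lines from the post (tracebacks left out).
SAMPLE = """\
*May 11 18:24:10.581: %PLATFORM-4-ELEMENT_WARNING: R0/0: smand: RP/0: 5-Minute Load Average value 9.49 exceeds warning level 8.00.
*May 11 18:24:44.445: %EVENTLIB-3-CPUHOG: R0/0: hman: undefined: 1311ms,
*May 11 18:24:44.472: %EVENTLIB-3-CPUHOG: R0/0: hman: undefined: 1135ms,
*May 11 18:25:00.072: %EVENTLIB-3-CPUHOG: R0/0: smd: write asyncon 0x55df3a8908e8: 136ms,
"""


def summarize(log_text):
    """Return (load warnings, CPUHOG events) found in the log text."""
    loads = [
        (float(m["value"]), float(m["limit"]))
        for m in LOAD_RE.finditer(log_text)
    ]
    hogs = [(m["proc"], int(m["ms"])) for m in CPUHOG_RE.finditer(log_text)]
    return loads, hogs


loads, hogs = summarize(SAMPLE)
for value, limit in loads:
    print(f"load average {value} above warning level {limit}")
for proc, ms in hogs:
    print(f"{proc} stalled for {ms} ms")
```

Stalls of more than a second in the node's own processes would be consistent with the node being starved of host CPU or memory, though the log alone does not prove which.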

I need to run some tests with IS-IS (migration from OSPF), the MVPN control plane and, if possible, SR-MPLS.

Regards

Uldis (UD)
Posts: 5067
Joined: Wed Mar 15, 2017 4:44 pm
Location: London

Re: Cisco images failure after few minutes

Post by Uldis (UD) » Fri May 13, 2022 6:40 am

How many CPU are assigned for your EVE in total?

Show EVE CLI output

eve-info

bgp-lu
Posts: 5
Joined: Thu Jan 09, 2020 7:26 pm

Re: Cisco images failure after few minutes

Post by bgp-lu » Fri May 13, 2022 1:00 pm

Hello Uldis,

here is the output for that command:

---------------Packages Installed----------------
ii eve-ng 2.0.3-112
ii eve-ng-addons-ostinato-drone 2.0.3-61
ii eve-ng-dynamips 2.0.2-2
ii eve-ng-guacamole 2.0.3-112
ii eve-ng-qemu 2.0.5-24
ii eve-ng-schema 2.0.6-14
ii eve-ng-vpcs 1.0-eve-ng
ii linux-headers-4.9.40-eve-ng-ukms+ 4.9.40-eve-ng-ukms-brctl
ii linux-image-4.20.17-eve-ng-ukms+ 4.20.17-eve-ng-ukms-brctl

---------------Hostname--------------------------
   Static hostname: cochambre
    Virtualization: vmware
  Operating System: Ubuntu 16.04.7 LTS
            Kernel: Linux 4.20.17-eve-ng-ukms+
      Architecture: x86-64
---------------Disk Usage------------------------
Filesystem                    Size  Used Avail Use% Mounted on
udev                           61G     0   61G   0% /dev
tmpfs                          13G   24M   13G   1% /run
/dev/mapper/eve--ng--vg-root  228G  126G   93G  58% /
tmpfs                          61G     0   61G   0% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
tmpfs                          61G     0   61G   0% /sys/fs/cgroup
/dev/sda1                     472M  118M  330M  27% /boot

---------------CPU Info--------------------------
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             16
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               2394.230
BogoMIPS:              4788.46
Virtualization:        VT-x
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15
NUMA node1 CPU(s):     16-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat

---------------Memory Info-----------------------
              total        used        free      shared  buff/cache   available
Mem:           121G         27G         89G         34M        4.9G         92G
Swap:          8.0G          0B        8.0G

---------------Nic Info--------------------------
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master pnet0 state UP mode DEFAULT group default qlen 1000

---------------IP Info---------------------------
*        State: n/a

---------------Bridge Info-----------------------
pnet0           8000.0050568adc47       no              eth0
pnet1           8000.000000000000       no
pnet2           8000.000000000000       no
pnet3           8000.000000000000       no
pnet4           8000.000000000000       no
pnet5           8000.000000000000       no
pnet6           8000.000000000000       no
pnet7           8000.000000000000       no
pnet8           8000.000000000000       no
pnet9           8000.000000000000       no

---------------H/W Accel-------------------------
INFO: /dev/kvm exists
KVM acceleration can be used

bgp-lu
Posts: 5
Joined: Thu Jan 09, 2020 7:26 pm

Re: Cisco images failure after few minutes

Post by bgp-lu » Thu May 26, 2022 3:16 pm

Any clue?

My intention is to deploy this topology on the EVE community version and, if it works, migrate it to the Pro version.

On the other hand, I'll try to improve the hardware specs, knowing that I'm currently limited.

Uldis (UD)
Posts: 5067
Joined: Wed Mar 15, 2017 4:44 pm
Location: London

Re: Cisco images failure after few minutes

Post by Uldis (UD) » Fri May 27, 2022 8:30 am


hugodantas2909
Posts: 5
Joined: Sun May 01, 2022 1:25 am

Re: Cisco images failure after few minutes

Post by hugodantas2909 » Tue May 31, 2022 8:04 pm

I have the same problem, but it only happens with XR-9Kv. I don't have any problems with the CSR1Kv and XRv images. It is exactly the same error (many lost heartbeats).
The problem is that when it happens I lose the whole uncommitted configuration, and when the device comes back it shows only the loopback and mgmt interfaces; all the others disappear.

Hypervisor: VMware ESXi, 7.0.2, 17867351
Model: UCSC-C220-M5SX
Processor Type: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Logical Processors: 96
Memory: 368GB

I have dedicated 80 logical CPUs to EVE and 320GB of RAM - 800GB of disk space.

stsfred
Posts: 9
Joined: Sun May 28, 2017 4:19 pm

Re: Cisco images failure after few minutes

Post by stsfred » Thu Jun 02, 2022 5:38 pm

Disable UKSM and watch RAM consumption. With 110 GB of RAM allocated to EVE and xrv9k v7.3.x, only 5 nodes run fine. Whenever I start the 6th node, I get the same error messages after a few minutes and then the nodes restart randomly. With 6.5.x, it was possible to run 6 nodes. Allocate 20 GB of RAM for xrv9k nodes, maybe more depending on the features you use.

I didn't have this problem with csr1000v, though, when 4 GB of RAM was allocated to each node.
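
As a rough cross-check of the figures above against the topology from the first post, here is a small Python sketch of the RAM budget. It reads the 20 GB suggestion as a per-node figure, which is an assumption, and it treats every allocation as fully resident (worst case, no page sharing). Only the images with RAM figures mentioned in this thread are counted; the NE40E, XRv and vMX nodes would add to the total.

```python
# Per-node RAM in GB: 20 GB for xrv9k (read as per node, an assumption)
# and 4 GB for csr1000v, both taken from the post above.
PER_NODE_GB = {"xrv9k": 20, "csr1000v": 4}

# Node counts from the first post in the thread.
NODE_COUNT = {"xrv9k": 8, "csr1000v": 4}

# "Mem: total" reported by eve-info earlier in the thread.
HOST_RAM_GB = 121


def ram_demand_gb(per_node, counts):
    """Sum per-node RAM times node count over all listed images."""
    return sum(per_node[name] * count for name, count in counts.items())


demand = ram_demand_gb(PER_NODE_GB, NODE_COUNT)
shortfall = max(0, demand - HOST_RAM_GB)
print(f"demand: {demand} GB, host: {HOST_RAM_GB} GB, shortfall: {shortfall} GB")
```

Under these assumptions the eight xrv9k nodes alone ask for more RAM than the EVE VM has, before the other 22 nodes are counted.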

hugodantas2909
Posts: 5
Joined: Sun May 01, 2022 1:25 am

Re: Cisco images failure after few minutes

Post by hugodantas2909 » Thu Jun 02, 2022 10:01 pm

Just some feedback: I disabled UKSM and it looks like the problem is solved. I ran 25 XR-9Kv nodes on version 6.5.3 and labbed for 4 hours without any reboot or "VM losing heartbeat" error.

Of the 320 GB dedicated to EVE-NG, memory usage was around 51%. I turned on 10 more XR-9Kv and 12 CSR1Kv. Memory usage went up to 72%, but CPU rose to 97% and the lab started to become slow. After 30 minutes, I shut down these last 22 devices, but I didn't have any reboots during the time that I had 49 devices up.

The conclusion is that disabling UKSM is the key to solving this issue. You can disable it from the GUI, and the Cookbook shows how to do it, although it recommends keeping it enabled.
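
To confirm from the EVE shell that the change took effect, a small Python sketch like the one below could read the UKSM switch. The sysfs path `/sys/kernel/mm/uksm/run` is an assumption based on how the UKSM kernel patch usually exposes its control file, and `uksm_state` is a name made up for this example; check the Cookbook for the supported procedure.

```python
from pathlib import Path

# Assumed sysfs location of the UKSM on/off flag; verify on your own host.
UKSM_RUN = Path("/sys/kernel/mm/uksm/run")


def uksm_state(run_file=UKSM_RUN):
    """Return 'enabled', 'disabled', or 'unavailable' for the given flag file."""
    try:
        value = Path(run_file).read_text().strip()
    except OSError:
        # File missing or unreadable: kernel without UKSM, or wrong path.
        return "unavailable"
    # Assumption: "0" means off, any other value means on.
    return "disabled" if value == "0" else "enabled"


print(f"UKSM is {uksm_state()}")
```

The function only reads the flag and never writes to it, so it is safe to run while a lab is up.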

The only issue I still had was with the startup configuration. Some devices didn't load their interface configuration from the startup config and I had to apply it manually.

Thanks for helping with this.
