Unified MPLS: Advanced Scaling for Core and Edge Networks



Service Providers (SPs) are striving to become 'Experience Providers' while offering an ever-wider range of residential and commercial services. Many SPs have to build agile Next-Gen Networks (NGN) that can optimally deliver the 'Any Play' promise.

As networks continue to get bigger, fatter and richer, some of the conventional wisdom of designing IP/MPLS networks is no longer sufficient.

This session introduces a 'Cisco Validated Design' for building Next-Gen Networks' Core and Edge. It briefly discusses the technologies integral to such a design and focuses on their implementation using IOS-XR platforms (CRS-1/3/X and ASR 9000). It looks at the scaling designs and properties of IP, MPLS, the IGP and BGP, as well as the protection mechanisms IP/LDP FRR and MPLS-TE FRR.

This is intended to cover:
+ Unicast routing + MPLS design
+ Fast Restoration
+ Topology Dependency
+ Test Results
+ Case Study

+ Networks becoming larger
- Quad-play (Video, Voice, Data & Mobility)
- Merger & Acquisition
- Growth
+ Exponential bandwidth consumption
- Business Services
- Mobile
+ MPLS in the Access
- Seamless MPLS
+ BGP ASN consolidation
- Single ASN offering to customers

NGN Requirements:
+ Large Network
- 2000+ routers, say
+ Multi-Play Services Anywhere in network
- Service Instantiation happens anywhere
+ End-to-End Visibility
- v4/v6 Uni/Multicast based Services
+ Fast Convergence or Restoration
- Closer to Zero loss, the better

+ Scale & Performance

Solution Overview:
+ Unicast Routing + MPLS - Divide & Conquer
1. Isolate IGP domains
2. Connect IGP domains using BGP
+ Fast Restoration - Leverage FRR
+ Topological Consideration - Choose it right
1. PoP Design
2. ECMP vs. Link-Bundling
+ Services - Scale

Routing + MPLS Design

Must Provide:
+ PE-to-PE Routes (and Label Switched Paths)
- PE needs /32 routes to other PEs
- PE placement shouldn't matter
+ Single BGP ASN

Conventional Wisdom Says:
+ Advertise infrastructure (e.g. PE) routes in IGP
+ Advertise infrastructure (e.g. PE) labels in LDP
+ Segment IGP domains (i.e. ISIS L1/L2 or OSPF Areas)

Conventional Wisdom Not Good Enough:
+ Large IGP database size a concern
- For fast(er) convergence
+ Large IGP domain a concern
- For Network Stability
+ Large LDP database a concern

'Divide & Conquer' - Game Plan:
+ Disconnect & Isolate IGP domains
- No more end-to-end IGP view
+ Leverage BGP for infrastructure (i.e. PE) routes
- Also for infrastructure (i.e. PE) labels

'Divide & Conquer' - End Result:


Example - 'PE31' Reachability:
+ Control Plane Flow - RIB/FIB Table View
+ Data Plane Flow - PE11 to PE31 Traffic View

Divide & Conquer - Summary:
1. IGP is restricted to carry only internal routes
- Non-zero or L1 area carries only routes for that area
- Backbone carries only backbone routes (the ISIS backbone would carry both L1 and L2 routes, since L1->L2 (or L1->L1) redistribution cannot be avoided yet; OSPF non-zero->zero area redistribution can be avoided)
2. PE redistributes its loopback into IGP as well as iBGP+Label
3. PE peers with its local ABRs using iBGP+label
- ABRs act as Route-reflectors
- ABRs reflect only infrastructure (i.e. PE) routes
- RRs also in the backbone
4. ABR, as RR, changes the BGP Next-hop to itself
- On every advertised BGP route
5. PEs separately peer using iBGP for Services (VPN, say)
- Dedicated RRs for IPv4/6, VPNv4/6, L2VPN, etc.
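As an illustrative sketch of steps 2-4 (IOS-XR-style syntax; the AS number, addresses and exact command forms are assumptions to be verified against your software release), the ABR side might look like:

```
! Hypothetical IOS-XR sketch: ABR acting as an inline RR for labeled unicast
router bgp 100
 address-family ipv4 unicast
  allocate-label all            ! advertise MPLS labels with the BGP routes
 !
 neighbor 10.0.0.31             ! a local PE loopback
  remote-as 100
  address-family ipv4 labeled-unicast
   route-reflector-client       ! ABR reflects infrastructure (PE) routes
   next-hop-self                ! ABR sets the BGP next hop to itself
```

Label stacking then lets a PE reach any remote PE through the ABRs without an end-to-end IGP view.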

Divide & Conquer - End Result:


Example - 'L3VPN Services'
+ PE11 sends L3VPN traffic for an L3VPN prefix "A" to PE31

+ Higher Network scale is attainable
- 1000s of routers
+ BGP and MPLS Label Stacking are key

Fast Restoration:
+ Business Services demanding faster restoration
- Against link or node failures
+ "Service Differentiator" for many operators
+ Faster Restoration is driving towards 0 loss
- ~50ms restoration may be good enough for many

- Requirements influence Complexity and Cost
+ Fast Restoration is optimal with "Local Protection"
- pre-compute and pre-install alternate path
- no need for remote nodes to know about the failure

+ Fast Restoration of Services i.e. BGP Prefixes
- BGP Prefix Independent Convergence (PIC)
+ Fast Restoration of BGP next-hops i.e. IGP Prefixes
+ Fast Convergence (FC) of IP routing protocols is key and still required

Fast Convergence

IGP Prefixes:
+ Remember that FRR is intended for temporary restoration
+ Fast Convergence (FC) is key for IP routing protocols
+ Faster the routing convergence, faster the permanent restoration



IPFRR: Network Availability and Simplicity

Fast Reroute Requirements:

Convergence: Impact of Outage on Video


Video artifacts. With a slice error, we can see the image (a) as a viewer would see it and (b) with the parts in error highlighted. With a blocking or pixelization error (c), the effect occurs when a loss occurs in either an I- or P-frame. (Source material copyright the Society of Motion Picture Television Engineers.)


If we have a 10 ms outage, the best we can expect is a 33 ms disruption, because we lose only a B-frame; the worst case is when we get unlucky and lose an I-frame. The worst-case loss for low motion is longer than for high motion, because high-motion content carries more frequent I-frames.



+ Assume a flow from A to B
+ T1: when L dies, the best path is impacted
- loss of traffic
+ T2: when the traffic reaches the destination again through the computed next-best path
- If fast reroute technologies are used, this may happen well before network convergence
- Once the network converges, a next best path is computed
+ Loss of Connectivity: T2 - T1, called "convergence" hereafter
+ Traffic can be restored long before the convergence time if fast reroute technology is used

Fast Convergence & Fast Reroute:
+ Minimize network downtime/traffic loss
- "Classical" Convergence > 1 sec.
- Fast Convergence < 1 sec.
- Fast Re-Route < 50-100 msec.
+ Support all types (Link, Node or SRLG) of IP/MPLS restoration mechanisms.
+ Keep it simple and straight.
+ Keep it cost effective (both capex/opex)

Classical and Fast Convergence:
+ Detection (link or node aliveness, routing updates received)
+ Walk through routing DB's
+ State propagation (routing updates sent)
+ Compute primary path & label
+ Download to HW FIB
+ Switch to newer path

Fast Reroute - path precomputed; additional actions beyond Classical and Fast Convergence:
+ Pre-compute repair path (offline calculation)
+ Download to HW FIB (offline calculation)
+ Switch to repair path (on failure)
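The precompute-then-switch idea above can be sketched as a toy FIB entry that carries both a primary and a pre-installed repair next hop (names are illustrative only, not any vendor data structure):

```python
# Toy FIB entry with a pre-installed repair path (illustrative only).
# The repair next hop is computed and downloaded *before* any failure,
# so switching over is a single flag flip, not a route recomputation.

class FibEntry:
    def __init__(self, prefix, primary_nh, repair_nh):
        self.prefix = prefix
        self.primary_nh = primary_nh    # used in steady state
        self.repair_nh = repair_nh      # precomputed offline, kept in HW
        self.primary_up = True

    def next_hop(self):
        # Forwarding decision: O(1) regardless of table size.
        return self.primary_nh if self.primary_up else self.repair_nh

    def link_down(self):
        # Triggered by fast failure detection (e.g. BFD or loss of light).
        self.primary_up = False


entry = FibEntry("10.0.0.31/32", primary_nh="P1", repair_nh="P2")
assert entry.next_hop() == "P1"
entry.link_down()                 # local protection kicks in
assert entry.next_hop() == "P2"   # traffic takes the repair path
```

Remote nodes need not learn about the failure for traffic to keep flowing; routing convergence later installs the permanent path.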

BGP PIC (Prefix Independent Convergence):
+ What is it, and why?

+ PIC is the ability to restore forwarding without resorting to per prefix operations.
+ Loss Of Connectivity does not increase as network grows (one problem less).

BGP Recursion:
+ show ip route
+ show ip cef

Non Optimal: Flat FIB:
+ Each BGP FIB entry has its own local Outgoing Interface (OIF) information
+ Forwarding plane must directly recurse on local OIF information
+ FIB changes can take a long time, dependent on the number of prefixes

Right Architecture: Hierarchical FIB:
+ Pointer indirection between BGP and IGP entries allows immediate leveraging of IGP convergence, and immediate update of the multipath BGP pathlist at IGP convergence
+ Only the parts of the FIB actually affected by a change need to be touched
+ Used in newer IOS and IOS-XR (all platforms), enables Prefix Independent Convergence

Failure in the Core:
+ Address failures "in the core" where the recursive BGP path stays intact.
- Failures covered are P-PE link or P node failures that trigger a change of the IGP path to the BGP next-hop.


+ IGP convergence on PE1 leads to a modification of the RIB path to PE3.
- BGP Dataplane Convergence is finished assuming the new path to the BGP nhop is leveraged immediately

Hierarchical FIB:
+ FIB Leaf: group of prefixes
+ BGP Path-List: list of best ECMP BGP nhops and list of alternate BGP nhops
+ IGP Path-List: list of ECMP IGP paths
+ Adjacency: OIF and immediate nhop

+ As soon as the IGP converges (~200 ms), the IGP PL memory is updated, and hence all child BGP PLs leverage the new path immediately
+ Optimum convergence, Optimum Load-Balancing, Excellent Robustness
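The pointer indirection can be sketched in a few lines: many BGP prefixes share one IGP path-list object, so IGP convergence touches a single shared object rather than walking every prefix (structure and names are illustrative, not any platform's internals):

```python
# Sketch of hierarchical-FIB pointer indirection (illustrative only).
# All BGP entries hold a *pointer* to a shared IGP path-list; when the
# IGP converges, only that one object is rewritten.

class IgpPathList:
    def __init__(self, paths):
        self.paths = paths            # ECMP set of IGP next hops

class BgpEntry:
    def __init__(self, prefix, igp_pl):
        self.prefix = prefix
        self.igp_pl = igp_pl          # pointer, not a copy

# 100k VPN prefixes all recurse on the same BGP next hop (towards PE3).
pl_to_pe3 = IgpPathList(paths=["P1", "P2"])
fib = [BgpEntry(f"10.{i % 256}.0.0/24", pl_to_pe3) for i in range(100_000)]

# IGP convergence after a core failure: one object update...
pl_to_pe3.paths = ["P2"]

# ...and every dependent prefix immediately uses the new path.
assert all(e.igp_pl.paths == ["P2"] for e in fib)
```

This is why the loss of connectivity is prefix independent: the update cost does not grow with the number of BGP routes.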

PE Node Failure:
+ Addresses a change in the BGP path
+ i.e. a change to a different BGP next-hop due to a PE node failure, which normally would require network wide BGP best-path re-computation and path withdrawing


+ BGP Dataplane Convergence is kicked in on PE1 and immediately redirects the packets via PE4 using a pre-calculated alternate (repair) path.

+ PE1 has primary and backup path
- Primary via PE3
- Backup via PE4 best external route

+ IGP propagates loss of PE3's /32 host route across the core to remote PEs

+ PE1 detects loss of PE3's /32 host route in IGP
- CEF immediately swaps forwarding destination label from PE3 to PE4 using backup path
+ BGP NHT sends a "delete" notification to BGP which triggers BGP Control-Plane Convergence
- BGP on PE1 computes a new bestpath later, choosing PE4

PE-CE Link Failure:
+ PE1 and PE3 are the reacting points
+ Enhancement to the MPLS VPN BGP Local Convergence feature
+ Improvement by calculating a backup/alternate path in advance
+ When primary link PE3 - CE2 fails:
- Data Plane: The traffic is sent to the backup/alternate path
- Control Plane: PE1 is expected to converge to start using PE4's label to send traffic to 110.x.0.0/24

+ PE3 has primary and backup path
- Primary via directly connected PE3-CE2 link
- Backup via PE4 best external route
+ What happens when PE3-CE2 link fails?

+ CEF (via BFD or link layer mechanism) detects PE3-CE2 link failure
+ CEF immediately swaps to repair path label
- Traffic shunted to PE4 and across PE4-CE2 link

+ PE3 withdraws route via PE3-CE2 link
+ Update propagated to remote PE routers

+ BGP on remote PEs selects new best path
+ New best path is via PE4
+ Traffic flows directly to PE4 instead of via PE3

Loop Free Alternate (LFA) Key Concepts

Why Not Just Use Fast Convergence:
+ ISIS/OSPF and CEF can be very fast!
- 200 ms on high-end platforms can be achieved.
+ But...
- It runs at the process level
- Does not guarantee a time limit
- Performance depends on tuning and platform implementation

What is an LFA?
+ Stands for Loop Free Alternate
- A node other than the primary next hop
+ Provides local protection for unicast traffic in pure IP (and MPLS/LDP) networks in event of a single failure, whether link, node, or shared risk link group (SRLG)
+ Traffic is redirected to the LFA almost immediately after failure
+ An LFA takes its forwarding decision without knowledge of the failure
- LFA must not use the failed element to forward the traffic
- LFA must not use the protecting node to forward traffic
- LFA must not cause loop
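The "must not cause a loop" rule is the basic loop-free condition of RFC 5286: a neighbour N of source S is an LFA for destination D iff dist(N, D) < dist(N, S) + dist(S, D). Given an all-pairs shortest-path distance table, it can be checked directly (the toy topologies below are made up):

```python
# Basic loop-free condition (RFC 5286): neighbour N of S is an LFA for
# destination D iff its own shortest path to D does not come back via S.

def is_lfa(dist, n, s, d):
    return dist[n][d] < dist[n][s] + dist[s][d]

# Toy triangle: S--N, S--D, N--D, all cost 1.
dist = {
    "S": {"S": 0, "N": 1, "D": 1},
    "N": {"S": 1, "N": 0, "D": 1},
    "D": {"S": 1, "N": 1, "D": 0},
}
assert is_lfa(dist, "N", "S", "D")       # N reaches D without going via S

# Degenerate case: N's only path to D runs back through S, so using N
# as an alternate would loop the traffic S -> N -> S -> ...
dist2 = {
    "S": {"S": 0, "N": 1, "D": 1},
    "N": {"S": 1, "N": 0, "D": 2},       # N->D only via S (1 + 1)
    "D": {"S": 1, "N": 2, "D": 0},
}
assert not is_lfa(dist2, "N", "S", "D")
```

The IGP evaluates this per destination at SPF time, which is why the alternate can be pre-installed before any failure.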

Per-Link LFA Protection:
+ Goal is to bypass failed link and reach primary node via alternative way
+ Main idea: we know a good path exists from the primary node to all destinations; so if we can bypass the failed link and deliver traffic to the router that was the next hop of the primary path before the failure, we know that router can forward it onward

Per Link LFA Limitations:
Per-Link LFA Does Not Work in Some Cases




Scope of Orchestration:

VMS 1.0.2 Services:


VMS 2.0 Services (Added):
+ 4000 Series in CPU
+ Intrusion Prevention (IPSv)

VMS 2.1 Services (Added): Cloud VPN "as a Service"

VMS 2.2 Services:


Delivering services to the branch:

Today's approaches: Rack and Stack
+ Best in breed
+ Customer choice
+ Modular build-out
+ Environmental (space/power/wiring)
+ Onsite + complex installation
+ Truck rolls

Integrated Branch Solution:
+ Fully integrated solution
+ No truck roll
+ Simpler environmental
+ Reduced customer choice
+ Upfront hardware investment
+ Software inter-dependencies

What is vBranch Orchestration:
+ Centrally orchestrated branch-level NFV solution
+ Central portal Infrastructure
+ NFV orchestrator - NCS
+ VNF EMS / NMS / Controller - choice
+ Elastic Services Controller @ branch
- GUI + local life-cycle management
+ x86 capability at the branch


Customer Experience in Brief:


Self-Service User and Operator Portals - Customizable:


Cisco Virtual Managed Services
Cloud VPN and Cloud MPLS Packages:


Application Policy Model and Instantiation:


All forwarding in the fabric is managed through the application network profile
+ IP addresses are fully portable anywhere within the fabric
+ Security and forwarding are fully decoupled from any physical or virtual network attributes
+ Devices autonomously update the state of the network based on configured policy requirements

Cisco ACI Introduces Logical Network Provisioning of Stateless Hardware:




Infrastructure Language:
+ IP Address
+ Subnets
+ Firewalls
+ Quality of Service
+ Load Balancer
+ Access Lists

App Language:
+ Application Tier Policy and Dependencies
+ Security Requirements
+ Service Level Agreement
+ Application Performance
+ Compliance
+ Geo Dependencies

APIC-EM: Common Policy Model from Branch to Data Center:


Ultra Service Platform: From Physical to Virtualized Mobile Networks:


Agile Carrier Ethernet - ACE:


Minimal but "Sufficient" distributed control plane on network nodes
Centralized intelligence on the SDN service controller

+ Transport: Autonomic self-deployed and self-protected, dynamic, ECMPs, flexible traffic engineering
+ Services: SDN + BGP for service, programmable

+ Autonomic Networking
- Virtual Out of Band Channel Autonomic Control Plane
- Secure & Zero Touch deployment
- Auto IP / IP unnumbered
+ Segment Routing
- Reduced Protocols
- Application Integration
- Simplified TE
+ SDN Orchestration
- NSO / Tail-F for Service and static Label provisioning
- XRv for central control plane
- Open SDN Controller and WAE as add-ons for SR TE

Autonomic Networking: Secure, Plug-n-Play:
+ Plug-n-Play: a new node uses IPv6 link-local addresses to build adjacencies with existing nodes; no initial configuration is required
+ Secure: a new node is authenticated using its ID, and then builds encrypted tunnels with its adjacent nodes
+ Always-on VOOB: consistent reachability between the controller and network devices over a Virtual Out-Of-Band management VRF. Even with user misconfiguration, the VOOB will still remain up

Transport Evolution with Segment Routing (SR):
+ Application Enabled Forwarding
- Each engineered application flow is mapped on a path
- A path is expressed as an ordered list of segments
- The network maintains segments
+ Simple: fewer protocols, less protocol interaction, less state
- No requirement for RSVP, LDP
+ Scale: smaller label databases, fewer TE LSPs
- Leverage MPLS services & hardware
+ Forwarding based on Labels with simple ISIS/OSPF extension
+ 50msec FRR service level guarantees
+ Leverage multi-services properties of MPLS

Millions of application flows ->
A path is mapped on a list of segments ->
The network only maintains segments
No application state

The state is no longer in the network but in the packet
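As a toy illustration of this (pure pseudo-topology, no real protocol machinery), the path rides in the packet as an ordered segment list, while transit nodes hold only shortest-path state towards segment endpoints:

```python
# Sketch: with Segment Routing the path state is carried in the packet
# as an ordered segment list; transit nodes keep no per-flow state and
# simply forward towards the top segment.

def shortest_next_hop(topology, here, target):
    # Stand-in for the IGP: topology[here][target] is the next hop
    # towards 'target' along the shortest path.
    return topology[here][target]

def traverse(topology, ingress, segment_list):
    stack = list(segment_list)        # "label stack" in the packet header
    node, hops = ingress, [ingress]
    while stack:
        if node == stack[0]:
            stack.pop(0)              # reached this segment's endpoint
            continue
        node = shortest_next_hop(topology, node, stack[0])
        hops.append(node)
    return hops

# Toy network A-B-C-D plus a direct A-D shortcut; steer the flow via C.
topo = {
    "A": {"C": "B", "D": "D"},
    "B": {"C": "C"},
    "C": {"D": "D"},
}
# Segment list [C, D]: traffic-engineer through C, then go to D.
assert traverse(topo, "A", ["C", "D"]) == ["A", "B", "C", "D"]
```

With segment list ["D"] the same flow would take the direct shortcut; only the ingress changes state, not the network.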

ACE Transport: Unified MPLS with Segment Routing
Unified MPLS with SR


50 ms Switch-over Time:
Since we work on the IP protocol, the first place we go looking is the RFCs.

RFC 3469, "Framework for Multi-Protocol Label Switching (MPLS)-based Recovery", states:

Fastest MPLS recovery is assumed to be achieved with protection switching and may be viewed as the MPLS LSR switch completion time that is comparable to, or equivalent to, the 50 ms switch-over completion time of the SONET layer.

In other words, the fastest protection switching should complete within 50 ms, on par with SONET.

Are there other drivers saying IP must switch within 50 ms? So far, I have not seen any.
Strictly speaking, the RFC that is the standard for MPLS-TP explicitly references G.841, but some will say that most of us do not use MPLS-TP, so the generic MPLS RFC, RFC 3469, comes first.

So let's look at ITU-T G.841: Types and characteristics of SDH network protection architectures.

It states in several places that various detection actions must complete within 50 ms, otherwise further mechanisms are triggered in cascade. My guess is that the 50 ms switch requirement exists to prevent those mechanisms from being triggered unnecessarily, which would significantly impact services.

On the meaning of the term, it says the following (three excerpts from different sections, focusing on the "switch completion time", which must not exceed 50 ms):

“ switch completion time: The interval from the decision to switch to the completion
of the bridge and switch operation at a switching node initiating the bridge request. “

“ Protection switch completion time excludes the detection time necessary to initiate
the protection switch and the hold-off time. ”

“Switch time – In a ring with no extra traffic, all nodes in the idle state (no detected failures,
no active automatic or external commands, and receiving only Idle K-bytes), and with less
than 1200 km of fiber, the switch (ring and span) completion time for a failure on a single
span shall be less than 50 ms.”

In summary, the 50 ms excludes the detection time and the hold-off timer. What may interest us most here is the detection time, for which we might use BFD or the fault-propagation mechanism of DWDM equipment, for example; per the standard, that time is not counted within the 50 ms.

In practice, however, acceptance tests measured with a tester usually count the loss duration, which includes the detection time.

It is not wrong to set whatever target you like, but it is wrong to claim that target is mandated by the standard.
So which mechanism delivers a switch-over time (also called switch time or switch completion time) within 50 ms?

The answer is FRR. So what is FRR? Does it have to be an RSVP-TE tunnel?

FRR really rests on a simple principle: find a backup next hop in advance. On carrier-grade equipment, that backup is downloaded into the forwarding hardware, ready to be activated at any moment; if you wait for the failure before finding a new next hop and updating, the switch will not be fast (slower than 50 ms, using the SONET/SDH spec as the yardstick).

Simply doing FRR, whether BGP FRR, IP FRR or LDP FRR, matches the SONET/SDH standard on switch-over time.

But if detection takes 3 minutes before switching, then even a switch completion time under 50 ms, or even 1 ms, is not OK, is it?

Looking into SONET/SDH detection times, SDH detection may take up to 10 ms or more, so detecting and completing the switch may take 60 ms or longer in total.


"When a fault occurs the node is allowed 10mS to detect the failure and 50mS to make the switch. This is standard for all SONET systems."


According to GR-253 and G.841, a network element is required to detect AIS and initiate an APS within 10 ms. B2 errors should be detected according to a defined algorithm, and more than 10 ms is allowed. This means that the entire time for both failure detection and traffic restoration may be 60 ms or more (10 ms or more detect time plus 50 ms switch time).
For mobile networks today, one important protocol is SCTP, the protocol used to carry signalling traffic. An SCTP outage is a fairly serious issue.

SCTP failure detection works as follows:

(Reference: rimmon-essentials.blogspot.com/2008/10/sctp-failure-detection-time.html)

SCTP's multi-homing failure detection time depends on three tunable parameters:

RTO.min (minimum retransmission timeout)
RTO.max (maximum retransmission timeout), and
Path.Max.Retrans (threshold number of consecutive timeouts that must be exceeded to detect failure).

RFC2960 recommends these values:

RTO.min - 1 second
RTO.max - 60 seconds
Path.Max.Retrans - 5 attempts per destination address

If the timer expires for the destination address, set RTO = RTO * 2 ("back off the timer").
The maximum value discussed (RTO.max) may be used to provide an upper bound to this doubling operation.

Since Path.Max.Retrans = 5 attempts, this translates to a failure detection time of at least 63 seconds (1 + 2 + 4 + 8 + 16 + 32).
In the worst-case scenario, taking the maximum of 60 seconds, the failure detection time is 360 seconds (6 × 60).

In another example, where the following parameters are used,

RTO.min - 100ms
RTO.max - 400ms
Path.Max.Retrans - 4 attempts

Max. failure detection time = (1 + PMR)* RTO.max = 5*400 = 2,000ms
Min. failure detection time = 100 + 200 + 400 + 400 + 400 = 1,500ms
Cr: P'Pae@Nokia
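The detection-time arithmetic above can be reproduced with a small sketch (times in milliseconds; `sctp_detection_time_ms` is an illustrative helper, not part of any SCTP stack):

```python
# SCTP multi-homing failure detection time from the three tunable
# parameters described above. Times are in milliseconds.

def sctp_detection_time_ms(rto_min, rto_max, path_max_retrans):
    attempts = path_max_retrans + 1          # initial send + retransmits
    rto, min_total = rto_min, 0
    for _ in range(attempts):
        min_total += rto
        rto = min(rto * 2, rto_max)          # exponential back-off, capped
    max_total = attempts * rto_max           # every RTO already at the cap
    return min_total, max_total

# RFC 2960 defaults: RTO.min = 1 s, RTO.max = 60 s, Path.Max.Retrans = 5
assert sctp_detection_time_ms(1_000, 60_000, 5) == (63_000, 360_000)

# Tuned example from the text: 100 ms / 400 ms / 4 attempts
assert sctp_detection_time_ms(100, 400, 4) == (1_500, 2_000)
```

The defaults thus give a detection time between roughly one and six minutes, which is why signalling deployments tune these parameters aggressively.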


Best Practices to Deploy High-Availability in Service Provider Edge and Aggregation Architectures:
Network Level Resiliency:
  • Network Design Resiliency:
    APS, GEC
  • Event Dampening
  • Fast Convergence:
    iSPF Optimisation (OSPF, IS-IS)
    BGP Optimisation
    Fast BGP Convergence
  • Graceful Restart (MBGP, OSPF, RSVP, LDP)
  • ECMP, Anycast, dual RR
  • MPLS High Availability
  • LDP Graceful Restart
  • BFD (Bidirectional Forwarding Detection)
  • MPLS FRR Path Protection
  • MoFRR
  • IP FRR
  • Pseudowire Redundancy
  • Spanning Tree (MST, PVRSTP...)
  • IEEE 802.3ad (LACP)
  • ...
BFD Protocol Overview:
  • Accelerates convergence by running fast keepalives in a consistent, standardised mechanism across routing protocols
  • Lightweight hello protocol
  • Neighbours exchange hello packets at negotiated regular intervals
  • Configurable transmit and receive time intervals
  • Unicast packets, even on shared media
  • No discovery mechanism
  • BFD sessions are established by the clients e.g. OSPF, IS-IS, EIGRP, BGP, ...
  • Client hello packets transmitted independently
BFD Operation Modes:
  • Async Mode:
    • (no echo), periodic control packets sent
    • Neighbour declared dead if no packet is received within the detection period
  • Echo Mode:
    • Session established using async control session
    • When echo is negotiated, echo packets sent at negotiated rate, used for failure detection
    • Control packets sent at low rate
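As a hedged illustration (classic IOS-style syntax; the interface name and timer values are assumptions to verify on your platform), enabling BFD for an IGP might look like the following, where the detection time is roughly the negotiated interval times the multiplier:

```
! Hypothetical IOS-style sketch -- verify syntax against your release
interface GigabitEthernet0/0
 bfd interval 50 min_rx 50 multiplier 3   ! detection ~ 3 x 50 ms = 150 ms
 ip ospf bfd                              ! OSPF registers as a BFD client
```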
  • orhanergun.net/2017/08/bfd-is-not-a-fast-convergence-mechanism
  • ccie-in-2-months.blogspot.com/2013/12/bfd-hints.html