Planning and Design Service Provider Network

PlAwAnSaI · Apr 24, 2011

"There are lots of peer technical CCIE tracks, but CCDE was the only one that matched my strategic design role. It is unique, vendor-neutral course, based on universal design principles and technologies that can be applied to solving comprehensive business needs."
Tony Brown, Enterprise Systems Architect - Verizon

VRRP:
Transparent default gateway redundancy
The virtual IP address can also be a real interface address (that router becomes the address owner)
IETF standard, so use VRRP when you need multivendor interoperability
Preempt is enabled by default
Default hello (advertisement) timer: 1 second
VRRP uses one virtual IP address and one virtual MAC address for gateway functionality.
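A minimal Python sketch of the gateway side of VRRP, assuming RFC 5798 behavior for IPv4: the virtual MAC is 00-00-5E-00-01-{VRID}, the router that owns the virtual IP runs at priority 255, and otherwise the highest priority wins, with ties going to the highest primary IP address. The class, router names, and addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class VrrpRouter:
    name: str            # hypothetical router name
    primary_ip: str      # the router's own (real) interface address
    priority: int = 100  # VRRP default priority


def virtual_mac(vrid: int) -> str:
    """IPv4 VRRP virtual MAC address: 00-00-5E-00-01-{VRID} (RFC 5798)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1-255")
    return f"00-00-5E-00-01-{vrid:02X}"


def elect_master(routers, virtual_ip: str) -> VrrpRouter:
    """The owner of the virtual IP uses priority 255; otherwise the highest
    priority wins, and ties go to the highest primary IP address."""
    def rank(r: VrrpRouter):
        prio = 255 if r.primary_ip == virtual_ip else r.priority
        return (prio, ipaddress.ip_address(r.primary_ip))
    return max(routers, key=rank)


group = [VrrpRouter("R1", "192.0.2.2", 110), VrrpRouter("R2", "192.0.2.3", 110)]
print(virtual_mac(10))                        # 00-00-5E-00-01-0A
print(elect_master(group, "192.0.2.1").name)  # R2: equal priority, higher IP
print(elect_master(group, "192.0.2.2").name)  # R1: owns the virtual IP
```

This models only the election outcome. Because preemption is on by default, a higher-priority router that comes online later takes over the master role; with preemption disabled, the current master would keep it.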

Business Drives Technology:

Cost
- Capital (CAPEX)
- Operational (OPEX)
Flexibility/Agility
- Changes in Business
- Manageable

Security

Modularity:

Modularity
- The first key concept in network design
- Can be horizontal or vertical

Resilience
- Decouples devices in different modules
- Decreases MTTR
Manageability
- Repeatability

Security Observation Points:

Data Exfiltration
Unusual Traffic Patterns
Failed Sign-ins, Other Signs...
Observe
- Observe what's going on
- Know what's normal
Orient
- Understand the context
- Understand the intent
Decide
- Evaluate possible courses of action
- Decide what action to take

Act
- Execute the chosen course of action

Enabling OODA with Design:

"Chokepoints" between modules...
Make great observation points
- To understand "normal"
- To orient to real targets/goals
Make great decision points
- To limit which services will be impacted by actions
- To determine where to act and how

Make great action points
- To pre-stage policy
- To focus policy at specific, identifiable points

Applications and Availability:

Traffic Engineering
- Ample Bandwidth
- Minimal Delay
Low Jitter
- Consistent Path
- Fast Convergence

Key Resilience Terms:

MTBF: Mean Time Between Failures
MTTR: Mean Time to Repair
- Time until traffic is flowing
- Time until network is "as designed"
Reliability
- "9's of availability"
a = uptime / (uptime + downtime (as measured))
- 525,600 minutes in a year
  4 minutes downtime
  525,596 minutes uptime
  a = 525,596 / (525,596 + 4)
  a = 0.99999239...
  a ≈ 99.999%
- 99.999% ("five nines"):
  Downtime/year: 5.26 minutes
  Downtime/month: 25.9 seconds
  Downtime/week: 6.05 seconds
- 99.9999% ("six nines"):
  Downtime/year: 31.5 seconds
  Downtime/month: 2.59 seconds
  Downtime/week: 604.8 milliseconds
- 99.99999% ("seven nines"):
  Downtime/year: 3.15 seconds
  Downtime/month: 259.2 milliseconds
  Downtime/week: 60.48 milliseconds
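The availability formula and the downtime figures above can be reproduced with a few lines of Python. This sketch assumes a 365-day year (525,600 minutes, as in the worked example) and a 30-day month, the convention behind the five- and six-nines monthly figures; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 525_600         # 365-day year, as in the worked example
MINUTES_PER_MONTH = 30 * 24 * 60   # assumed 30-day month
MINUTES_PER_WEEK = 7 * 24 * 60


def availability(uptime: float, downtime: float) -> float:
    """a = uptime / (uptime + downtime), both in the same unit."""
    return uptime / (uptime + downtime)


def downtime_budget_s(a: float, period_minutes: float) -> float:
    """Downtime (in seconds) allowed over one period at availability a."""
    return (1 - a) * period_minutes * 60


print(availability(525_596, 4))  # about 0.9999924, i.e. roughly 99.999%
for nines in (5, 6, 7):
    a = 1 - 10 ** -nines
    year, month, week = (downtime_budget_s(a, p) for p in
                         (MINUTES_PER_YEAR, MINUTES_PER_MONTH, MINUTES_PER_WEEK))
    print(f"{nines} nines: {year:.2f} s/year, {month:.3f} s/month, {week * 1000:.1f} ms/week")
```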
a(proj) = time period / (time period + downtime (proj))
downtime(proj) = (time period / MTBF) * MTTR
MTTR = downtime (as measured) / number of failures
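A sketch of the projection formulas above, using hypothetical numbers: a device with a 50,000-hour MTBF and a 4-hour MTTR over a 365-day year (8,760 hours). The time units only need to be consistent; hours are an arbitrary choice here.

```python
def projected_downtime(period_h: float, mtbf_h: float, mttr_h: float) -> float:
    """downtime(proj) = (time period / MTBF) * MTTR"""
    return (period_h / mtbf_h) * mttr_h


def projected_availability(period_h: float, mtbf_h: float, mttr_h: float) -> float:
    """a(proj) = time period / (time period + downtime(proj))"""
    down = projected_downtime(period_h, mtbf_h, mttr_h)
    return period_h / (period_h + down)


def measured_mttr(total_downtime: float, failures: int) -> float:
    """MTTR = downtime (as measured) / number of failures"""
    return total_downtime / failures


YEAR_H = 8760  # hours in a 365-day year
print(projected_downtime(YEAR_H, 50_000, 4))      # about 0.70 hours per year
print(projected_availability(YEAR_H, 50_000, 4))  # about 0.99992
print(projected_availability(YEAR_H, 50_000, 2))  # halving MTTR halves projected downtime
print(measured_mttr(12, 3))                       # 4.0
```

Projected downtime scales linearly with MTTR, which is why reducing MTTR pays off directly in availability.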

Failure:

Discover:
- How long does it take to discover the failure?
- Protocol neighbor liveness
Report:
- How long does it take to spread the news?
Calculate:
- How long does it take to find a new path?
Install:
- How long does it take to change to the new path?
- Protocol interaction with the RIB/FIB
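One way to reason about the four phases is as a serial time budget. This sketch assumes the phases add up one after another (in practice they can overlap, for example with incremental FIB updates), and every timing in it is a hypothetical value for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    discover_ms: float   # failure detection (hellos, BFD, or physical layer)
    report_ms: float     # spreading the news (flooding)
    calculate_ms: float  # finding the new path (e.g. SPF)
    install_ms: float    # changing to the new path (RIB/FIB update)

    def total_ms(self) -> float:
        return self.discover_ms + self.report_ms + self.calculate_ms + self.install_ms

    def dominant_phase(self) -> str:
        phases = {"discover": self.discover_ms, "report": self.report_ms,
                  "calculate": self.calculate_ms, "install": self.install_ms}
        return max(phases, key=phases.get)


hello_based = ConvergenceBudget(discover_ms=30_000, report_ms=50, calculate_ms=100, install_ms=200)
bfd_based = ConvergenceBudget(discover_ms=150, report_ms=50, calculate_ms=100, install_ms=200)
for label, budget in (("hello-based", hello_based), ("BFD-based", bfd_based)):
    print(label, budget.total_ms(), "ms, dominated by", budget.dominant_phase())
```

With slow hello-based detection, discovery dwarfs everything else; once detection drops to BFD-like speeds, the remaining phases (here, installation) become the bottleneck.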

Protocol:

Protocol Hellos => Protocol Process => Fast
BFD Hellos => BFD Process => Protocol Process => Faster
Physical link down (loss of signal) => Interface Phy/Processor => Forwarding Plane/RIB => Protocol Process => Fastest
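Much of the gap between protocol hellos and BFD is detection time. A minimal sketch of both calculations follows; the BFD formula follows RFC 5880 asynchronous mode, while the 10-second hellos with 4x (OSPF-style) and 3x (IS-IS-style) multipliers are commonly seen defaults used here as assumptions, so check them on your platform.

```python
def hello_detection_s(hello_interval_s: float, multiplier: int) -> float:
    """Protocol hellos: a neighbor is declared down after roughly `multiplier`
    hello intervals without hearing from it (the hold or dead time)."""
    return hello_interval_s * multiplier


def bfd_detection_ms(local_required_min_rx_ms: float,
                     remote_desired_min_tx_ms: float,
                     remote_detect_mult: int) -> float:
    """BFD asynchronous mode (RFC 5880): the remote Detect Mult times the
    agreed interval, i.e. the larger of our required RX and its desired TX."""
    agreed_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_ms


print(hello_detection_s(10, 4))      # OSPF-style 10 s hello, 40 s dead time
print(hello_detection_s(10, 3))      # IS-IS-style 10 s hello x 3
print(bfd_detection_ms(50, 50, 3))   # 150 ms
print(bfd_detection_ms(300, 50, 3))  # 900 ms: the slower side sets the pace
```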

Failure Domains:

Modularity
- The first key concept in network design
- Can be horizontal or vertical
Resilience
- Failure Domains = Broadcast Segment
- Decouples devices in different modules
- Decreases MTTR

Hiding Information:
Five ways to hide information in the control plane

Aggregation
Summarization
Filtering
Virtualization
Caching
There are others, but they are not widely deployed
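Summarization is the easiest of these to see in code. This sketch, using Python's standard `ipaddress` module and a hypothetical configured summary of 10.1.0.0/22, shows that losing one covered /24 leaves the advertised set unchanged, so the failure is hidden from the rest of the network.

```python
import ipaddress

# Hypothetical summary configured at the edge of a flooding domain.
SUMMARY = ipaddress.ip_network("10.1.0.0/22")


def advertised_outside(components):
    """What the rest of the network sees: the summary while any covered
    component is reachable, plus any prefixes the summary does not cover."""
    nets = [ipaddress.ip_network(c) for c in components]
    covered = [n for n in nets if n.subnet_of(SUMMARY)]
    uncovered = [n for n in nets if not n.subnet_of(SUMMARY)]
    return ([SUMMARY] if covered else []) + sorted(uncovered)


before = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24", "172.16.5.0/24"]
after = [p for p in before if p != "10.1.2.0/24"]  # one component fails
print(advertised_outside(before))
print(advertised_outside(after))  # same list: the failure is hidden
```

The flip side: traffic for the failed /24 is still attracted toward the summary and dropped there.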

Leaky Abstraction:

IP Address carried in the packet payload
IP address as a host identifier
Dropped IP packets in a TCP stream
Tunnel failure on underlay failure
Jitter on control plane path change

Design Patterns:

Put each in a single layer:

Forwarding:
Carry traffic between modules, topological areas, geographical regions, etc.
Aggregation:
Combine lots of smaller links into a smaller number of larger links; provide paths for engineering and virtualization
Policy:
Engineer traffic and control access
- Aggregation
- Traffic Engineering
- Source-based Routing
- Service Chaining
- Load Balancing
- What do all of these have in common?
- They each have the potential to increase stretch
Control plane policy is anything that modifies the forwarding path (potentially away from the shortest path) to achieve a specific goal
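Stretch can be made concrete as the cost of the path traffic actually takes divided by the cost of the shortest path (some texts use the difference instead of the ratio). The topology, link costs, and the idea of steering through a firewall at C are hypothetical.

```python
import heapq


def shortest_cost(graph, src, dst):
    """Dijkstra: cost of the shortest path from src to dst."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist[node]:
            continue
        for nbr, cost in graph[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return float("inf")


def path_cost(graph, path):
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def stretch(graph, path):
    """Cost of the path actually used relative to the shortest path."""
    return path_cost(graph, path) / shortest_cost(graph, path[0], path[-1])


# Hypothetical topology: A-B-D is shortest; policy steers A-C-D (e.g. via a firewall at C).
graph = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 20},
    "D": {"B": 10, "C": 20},
}
print(stretch(graph, ["A", "B", "D"]))  # 1.0
print(stretch(graph, ["A", "C", "D"]))  # 1.5
```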
Admittance:
Attach users, control access, classify traffic, terminate virtual overlays
- Attach users
- Control access
- Classify traffic
- Terminate virtual overlays
- Includes any edge security policies
- Includes any edge QoS policies
- Towards a host, for instance...
  - Unicast RPF filtering
  - MAC Address filtering
  - AAA controls
  - Quality of service marking
- Towards an external network
  - Bogon filtering
  - Unicast RPF (if possible)
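Two of the admittance checks listed above, sketched in Python: a bogon check on the source address, and unicast RPF in strict and loose form. The bogon list is a partial, hand-picked snapshot of special-use IPv4 ranges, and the FIB, interface names, and prefixes are hypothetical; a production filter should track a maintained bogon list and rely on the platform's own uRPF.

```python
import ipaddress

# Partial, hand-picked snapshot of special-use IPv4 ranges (not exhaustive).
BOGONS = [ipaddress.ip_network(p) for p in (
    "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
    "169.254.0.0/16", "172.16.0.0/12", "192.0.2.0/24", "192.168.0.0/16",
    "198.18.0.0/15", "198.51.100.0/24", "203.0.113.0/24",
    "224.0.0.0/4", "240.0.0.0/4")]


def is_bogon(src: str) -> bool:
    """Towards an external network: flag sources that should never appear."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in BOGONS)


def lookup(fib: dict, addr: str):
    """Longest-prefix match over a {prefix: interface} FIB; None if no match."""
    ip = ipaddress.ip_address(addr)
    matches = [(ipaddress.ip_network(p), iface) for p, iface in fib.items()
               if ip in ipaddress.ip_network(p)]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def strict_urpf_ok(fib: dict, src: str, in_iface: str) -> bool:
    """Strict uRPF: the best route back to src must use the arrival interface."""
    return lookup(fib, src) == in_iface


def loose_urpf_ok(fib: dict, src: str) -> bool:
    """Loose uRPF: any route back to src will do; the default route is
    ignored here because it would match every source."""
    ip = ipaddress.ip_address(src)
    return any(ip in net for net in map(ipaddress.ip_network, fib)
               if net.prefixlen > 0)


fib = {"10.1.1.0/24": "host-port-1", "10.1.2.0/24": "host-port-2",
       "0.0.0.0/0": "uplink"}
print(strict_urpf_ok(fib, "10.1.1.5", "host-port-1"))  # True
print(strict_urpf_ok(fib, "10.1.2.5", "host-port-1"))  # False: spoofed source
print(is_bogon("192.0.2.10"))                          # True: documentation range
```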

Three-layer design:
Core => Forwarding
Distribution => Aggregation & Policy
Access => Admittance
Two-layer design:
Core => Forwarding & Policy
Aggregation => Aggregation & Admittance
Two-layer design (alternative):
Core => Forwarding
Aggregation => Policy & Aggregation & Admittance
- Core => Policy & Aggregation
- Aggregation => Admittance

Common Topologies:

Degree of Connection
Regularity
Path Characteristics
- Longest Path
- Shortest Path
Convergence Characteristics
Troubleshooting Characteristics
Flexibility
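Several of these characteristics can be measured directly on a graph. This sketch treats "Longest Path" as the diameter (the longest of the shortest hop counts between any pair of nodes), which is an interpretation, and compares hypothetical six-node ring, full-mesh, and hub-and-spoke topologies.

```python
from collections import deque
from itertools import combinations


def ring(n):
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def full_mesh(n):
    return {i: set(range(n)) - {i} for i in range(n)}


def hub_and_spoke(n):
    g = {0: set(range(1, n))}          # node 0 is the hub
    g.update({i: {0} for i in range(1, n)})
    return g


def hops(g, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        node = q.popleft()
        for nbr in g[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                q.append(nbr)
    return dist


def describe(g):
    links = sum(len(nbrs) for nbrs in g.values()) // 2
    diameter = max(hops(g, s)[d] for s, d in combinations(g, 2))
    return {"links": links, "max_degree": max(len(n) for n in g.values()),
            "longest_shortest_path": diameter}


for name, g in (("ring", ring(6)), ("full mesh", full_mesh(6)),
                ("hub and spoke", hub_and_spoke(6))):
    print(name, describe(g))
```

The trade is visible in the numbers: the full mesh buys one-hop paths with nearly three times the links of the ring, while the hub and spoke keeps links minimal by concentrating degree (and fate) on the hub.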

Hybrid Device Model:

A hybrid of protocol operation and network device operation models:
- DARPA four-layer model with the network layer broken apart
- Adds in network device (router) pieces
Includes the control plane:
- Control plane protocols normally fall "outside" layered models
Application:
- Uses information transported across the network
- Presents data through formatting and marshalling
- Can provide all four services "over the top"
- Primary consumer and producer of data
Transport: TCP/UDP, etc.
- Flow control
- Error correction
- Application multiplexing
- Quality of service
- Helping fairness in the transport (WRED)

PlAwAnSaI · Sep 4, 2017

Hybrid Device Model:

Network: IP
- End-to-end transport (including addressing)
- Transport multiplexing
Link: FIB
- Single hop transport (including addressing)
- Media access flow control
- Media access framing and marshalling
- Per-hop (media specific) error correction
- Fast forwarding in soft real-time
- Consuming and producing packets
Control Plane: RIB
- Discovering end-to-end reachability
- Providing forwarding information to data plane (link level) and virtualization
- Consuming reachability information provided by the data plane (virtualization and link)
- Consuming addressing and other information provided by the network
- Providing mapping between locations and identities to the transport
- Routing protocols
- Intersection of naming and addressing (DNS)
SDN adds a relationship between the application and the control plane
Virtualization:
- Providing tunnel headers to the data plane (link)
- Providing virtual topology state to the control plane
- Consuming link-state information from the data plane (link)
- Consuming reachability information from the control plane
- Tunnel/overlay headends and tailends
"...complexity is most succinctly discussed in terms of functionality and its robustness. Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts."
www.ietf.org/rfc/rfc3439.txt
"In this model, the thin waist of the hourglass is envisioned as the (minimalist) IP layer, and any additional complexity is added above the IP layer. In short, the complexity of the Internet belongs at the edges, and the IP layer of the Internet should remain as simple as possible."
www.ietf.org/rfc/rfc3439.txt
"It is always possible to add another level of indirection."
www.ietf.org/rfc/rfc1925.txt

Understanding Complexity Tradeoffs:

State:
- How often does state change?
- How rapidly does state change?
- How much state is there?
- How often does the set of reachable destinations change?
- How often does the state of a link change?
- How quickly is a link state change flooded?
- How quickly is a failed link detected?
- How many routes are in the routing table?
- How many policies are configured on a device?
Optimization:
- Network utilization
- Optimal path
- Application support
- What is the excess capacity across the network?
- What is the utilization of this link?
- What is the stretch?
- What is the jitter?
- What is the delay?
- How long does it take to converge?
Surface:
- How deep?
- How broad?
- How many redistribution points?
- How many places is the same policy configured?
- One link failure impacts many virtual topologies (fate sharing)
No Aggregation:
- Optimal routing (Optimization)
- More control plane state (State)
- Changes often (State)
- No configuration (Surface)
Aggregation:
- Suboptimal routing (Optimization)
- Less control plane state (State)
- Changes less often (State)
- Configured aggregation (Surface)

If you haven't found the tradeoffs, you haven't looked hard enough.

Link State Review:

Dijkstra's algorithm example:
https://www.youtube.com/watch?v=5GT5hYzjNoo
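A compact Python version of Dijkstra's algorithm, the shortest-path-first calculation each link state router runs over its database. The graph and costs are hypothetical, and the priority queue comes from the standard `heapq` module.

```python
import heapq


def dijkstra(graph, source):
    """Return (cost, previous-hop) maps for the shortest-path tree rooted at source."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale queue entry; a cheaper path was already settled
        visited.add(node)
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    return dist, prev


def path_to(prev, source, dest):
    """Walk the previous-hop map back from dest to source."""
    path = [dest]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


graph = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}
dist, prev = dijkstra(graph, "A")
print(dist)                       # {'A': 0, 'B': 3, 'C': 1, 'D': 8}
print(path_to(prev, "A", "D"))    # ['A', 'C', 'B', 'D']
```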

Splitting Flooding Domains:

What technical reasons are there for splitting a flooding domain?
Essentially... None
This doesn't mean you shouldn't split flooding domains...
- But technical reasons are rarely the deciding factor
Still - what technical reasons are there?
SPF runtime
- Solid workload indicator
- 100 ms is the rule of thumb
If SPF runtime is a problem, try these first:
- Incremental/Partial SPF
- Exponential backoff
- Reducing the size of the link state database
If all else fails, split the flooding domain
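Exponential backoff keeps SPF responsive to a single change while damping a burst of changes. This is a generic sketch of the idea, not the exact state machine of any vendor or of RFC 8405; the class name and the 50/200/5000 ms parameters are hypothetical.

```python
class SpfThrottle:
    """Generic exponential SPF backoff (hypothetical parameters, in ms)."""

    def __init__(self, initial_ms=50, increment_ms=200, max_ms=5000):
        self.initial_ms = initial_ms
        self.increment_ms = increment_ms
        self.max_ms = max_ms
        self.next_wait_ms = initial_ms
        self.last_event_ms = None

    def schedule(self, now_ms: float) -> float:
        """Return the delay before running SPF for a topology change at now_ms."""
        quiet = (self.last_event_ms is None
                 or now_ms - self.last_event_ms > 2 * self.max_ms)
        if quiet:
            self.next_wait_ms = self.initial_ms  # network was stable: react fast
        wait = self.next_wait_ms
        # Back off: the next wait jumps to the increment, then doubles, up to the max.
        self.next_wait_ms = min(self.max_ms, max(self.increment_ms, 2 * wait))
        self.last_event_ms = now_ms
        return wait


throttle = SpfThrottle()
for t in (0, 100, 300, 700, 1500, 20_000):
    print(t, throttle.schedule(t))
```

In this sequence, back-to-back events push the wait from 50 ms up to 1.6 s, and the long quiet period before the last event resets it to the initial value.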
Non-technical Reasons:
- Providing good measurement points
  - Troubleshooting, security, etc.
- Break up failure domains
  - Prevent massive failures in one part of the network from impacting the rest of the network
- Provide policy choke points
- Policy and management should drive splitting flooding domains
  - Not "because it's too big"
  - Be intentional and thoughtful here

Optimizing Link State Convergence:

Why?:
- Reduce convergence time
- Allow for larger flooding domains
Optimization:
- Optimize timers
- Reduce scale of SPF runs

LFA:

Useful for three-hop rings (triangles)
Little/no additional state
Trivial computation costs
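The LFA check is a single inequality from RFC 5286: neighbor N of S is loop-free for destination D when Dist(N, D) < Dist(N, S) + Dist(S, D). This sketch (hypothetical topologies, equal link costs) shows why LFA covers triangles but leaves gaps in larger rings.

```python
import heapq


def build(links):
    """Undirected graph from (node, node, cost) tuples."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def distances(graph, src):
    """Shortest-path cost from src to every node (Dijkstra)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue
        for nbr, cost in graph[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return dist


def is_lfa(graph, s, n, d):
    """Loop-free condition for neighbor n of s protecting destination d."""
    from_n = distances(graph, n)
    return from_n[d] < from_n[s] + distances(graph, s)[d]


triangle = build([("S", "D", 10), ("S", "N", 10), ("N", "D", 10)])
ring = build([("S", "A", 10), ("A", "B", 10), ("B", "C", 10),
              ("C", "N", 10), ("N", "S", 10)])
print(is_lfa(triangle, "S", "N", "D"))  # True
print(is_lfa(ring, "S", "N", "A"))      # False
print(is_lfa(ring, "S", "N", "B"))      # True
```

In the five-node ring, N cannot protect traffic from S toward A, because N's own shortest path to A runs back through S; that is the case Remote LFA addresses.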

Remote LFA:

Provides repair paths in larger rings, where no direct neighbor is loop-free and sending repair traffic to a neighbor would loop
Requires tunnels
Tunnel information must be carried somehow
More complex computations

Fast Convergence Considerations:

Discover:
- How long does it take to discover the failure?
- IS-IS neighbour liveness
Report:
- How long does it take to spread the news?
- IS-IS flooding process
Calculate:
- How long does it take to find a new path?
- IS-IS shortest path tree
Install:
- How long does it take to change to the new path?
- IS-IS interaction with the RIB/FIB

Convergence Optimization:

Tuned Flooding and SPF Calculation
Loop-Free Alternates
Ordered Route Installation
Remote Loop-Free Alternates
Wait for BGP (hold the IS-IS overload bit at startup until BGP has converged)

IS-IS Metrics and Externals:

RFC5305:

draft-ietf-isis-prefix-attributes:

IS-IS IPv6:

Deploying IS-IS Metrics:

Always use wide metrics
- If you're using narrow metrics, transition to wide metrics
There is no auto-cost
- Keep metrics consistent by using a formula
- For instance, 1,000,000/bandwidth(kb)
- Set the default metric on all intermediate systems to something high, like 100,000, so links without a deliberately chosen metric are avoided
- For instance, with the metric-default command in IOS
Tune metrics on individual links sparingly, and document tuning well
Make certain to use symmetric metrics
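A sketch of the metric formula above, clamped to the wide-metric range (RFC 5305 excludes the maximum 24-bit value, 16,777,215, from normal SPF, so 16,777,214 is used as the largest metric here), plus a check for asymmetric metrics. The function names and the alternative 100,000,000 reference are hypothetical. With a 1,000,000 reference and bandwidth in kb, every link of 1 Gb/s or faster floors to 1, so a larger reference may be needed to keep fast links distinct.

```python
def isis_metric(bandwidth_kbps: int, reference: int = 1_000_000,
                max_metric: int = 16_777_214) -> int:
    """metric = reference / bandwidth(kb), clamped to 1..max_metric."""
    return max(1, min(max_metric, reference // bandwidth_kbps))


def asymmetric_links(metrics: dict) -> list:
    """Links whose metric differs by direction, given {(from, to): metric}."""
    return sorted({tuple(sorted((a, b))) for (a, b), m in metrics.items()
                   if metrics.get((b, a)) != m})


for kbps in (10_000, 100_000, 1_000_000, 10_000_000):
    print(kbps, isis_metric(kbps), isis_metric(kbps, reference=100_000_000))

metrics = {("R1", "R2"): 10, ("R2", "R1"): 10, ("R2", "R3"): 10, ("R3", "R2"): 20}
print(asymmetric_links(metrics))  # [('R2', 'R3')]
```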

BGP Advertisement Rules:

Routes learned through eBGP can be advertised to all peers
Routes learned through iBGP can only be advertised to eBGP peers
- iBGP will not carry routes across more than one iBGP hop
- Hence - iBGP speakers must be configured in a (logical) full mesh
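The two advertisement rules reduce to a small decision function, plus the session count that makes the full mesh expensive. This sketch ignores route reflection, confederations, and policy; the function names are hypothetical.

```python
def may_advertise(learned_from: str, to_peer: str) -> bool:
    """Classic BGP rule without route reflection or confederations.
    learned_from and to_peer are each "ebgp" or "ibgp"."""
    if learned_from == "ebgp":
        return True              # eBGP-learned: advertise to all peers
    return to_peer == "ebgp"     # iBGP-learned: advertise to eBGP peers only


def full_mesh_sessions(speakers: int) -> int:
    """iBGP sessions needed for a logical full mesh: n(n-1)/2."""
    return speakers * (speakers - 1) // 2


for src in ("ebgp", "ibgp"):
    for dst in ("ebgp", "ibgp"):
        print(f"learned via {src}, to {dst} peer: {may_advertise(src, dst)}")
print(full_mesh_sessions(10))  # 45
```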

PlAwAnSaI · Dec 4, 2017

Reflection Rules:

Routes received from clients are reflected to all peers (clients and non-clients)
Routes received from non-clients are reflected to clients only
Reflect best paths only
Do not change any attributes (including the next hop)
If a route is received with "my" originator-ID, discard the route
If a route is received with "my" cluster-ID in the cluster list, discard the route
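A sketch of the reflection rules as a small Python class, following RFC 4456: when reflecting, the reflector prepends its cluster ID to CLUSTER_LIST and sets ORIGINATOR_ID only if it is absent; these loop-prevention attributes are the only ones it touches. Only iBGP peers are modeled, the best path is assumed already chosen, and the class, peer names, and IDs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Route:
    prefix: str
    originator_id: Optional[str] = None
    cluster_list: List[str] = field(default_factory=list)


@dataclass
class Reflector:
    router_id: str
    cluster_id: str
    clients: Set[str]

    def accept(self, route: Route) -> bool:
        """Discard routes carrying our router ID as ORIGINATOR_ID,
        or our cluster ID in CLUSTER_LIST."""
        return (route.originator_id != self.router_id
                and self.cluster_id not in route.cluster_list)

    def targets(self, from_peer: str, ibgp_peers: List[str]) -> List[str]:
        """iBGP peers that receive the best path learned from from_peer."""
        others = [p for p in ibgp_peers if p != from_peer]
        if from_peer in self.clients:
            return sorted(others)                              # client: all peers
        return sorted(p for p in others if p in self.clients)  # non-client: clients only

    def reflect(self, route: Route, from_peer_router_id: str) -> Route:
        """Only ORIGINATOR_ID (set if absent) and CLUSTER_LIST (our ID prepended)
        change; other attributes such as NEXT_HOP would be copied unchanged
        (they are not modeled here)."""
        return Route(route.prefix,
                     route.originator_id or from_peer_router_id,
                     [self.cluster_id] + route.cluster_list)


rr = Reflector("192.0.2.1", "192.0.2.1", {"C1", "C2"})
peers = ["C1", "C2", "N1", "N2"]
print(rr.targets("C1", peers))  # ['C2', 'N1', 'N2']
print(rr.targets("N1", peers))  # ['C1', 'C2']
```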

EPN 4.0 Transport Infrastructure Design and Implementation Guide:
www.cisco.com/c/dam/en/us/td/docs/solutions/Enterprise/Mobility/EPN/4_0/EPN_4_Transport_Infrastructure_DIG.pdf

Evolving Technologies Study Guide:
learningnetwork.cisco.com/docs/DOC-31004

lostintransit.se/category/ccde/page/3

learningnetwork.cisco.com/community/learning_center/ccde-training-videos
