It’s time to talk about good ole’ TCP. Assuming that you’ve already read the above joke about TCP, it’s a good time to start into a disclaimer: Other than this little joke, I won’t be going into how the 3-way handshake works. Nor will I be describing the difference between TCP and UDP, or defining OSI Layer 4. This is a CCIE-level blog and it just makes sense to assume that anyone reading these posts already has exposure to those concepts. So let’s move on and get into the weeds about TCP Operations! Here are the exam blueprint topics we’ll be covering next:
1.1.e Explain TCP operations
1.1.e (i) IPv4 and IPv6 PMTU
1.1.e (ii) MSS
1.1.e (iii) Latency
1.1.e (iv) Windowing
1.1.e (v) Bandwidth delay product
1.1.e (vi) Global synchronization
1.1.e (vii) Options
IPv4 and IPv6 PMTU
Path MTU Discovery (PMTUD) is a relatively simple concept. Hosts that support PMTUD set the DF bit in the IP packet header so that routers are instructed not to fragment packets. Most hosts do support PMTUD, so the majority of IP packets will have this flag set, and while IPv4 routers do support fragmentation, it’s actually rather uncommon for them to have to do so. If an IPv4 packet with DF set exceeds the MTU of the outgoing link, the router discards it and sends back an ICMP type 3 code 4 “fragmentation needed, but DF bit set” unreachable message. In that ICMP message, the router also includes the MTU of the next-hop link. That way, the sending host can retransmit with an appropriate packet size. Remember, just because the DF bit is set, that does not mean that the packet CANNOT be fragmented, just that routers in the transit path are instructed NOT to PERFORM fragmentation of the packet while in transit. It does NOT mean that the router should block fragmented traffic. In actual operation, many hosts are performing fragmentation on these packets themselves, and many routers are indeed passing fragmented traffic.
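To make the back-and-forth concrete, here is a minimal Python sketch of the discovery loop. It is a toy model, not real networking code: the function name discover_path_mtu and the list of per-hop MTUs are made up for illustration, and the "ICMP message" is simply the MTU value of the first hop that is too small.

```python
def discover_path_mtu(link_mtus, initial_mtu):
    """Toy model of IPv4 PMTUD over a hypothetical list of per-hop link MTUs.

    Returns the discovered path MTU and how many packets had to be sent.
    """
    current = initial_mtu
    attempts = 0
    while True:
        attempts += 1
        # Find the first hop whose MTU is smaller than our DF-marked packet.
        blocking = next((mtu for mtu in link_mtus if mtu < current), None)
        if blocking is None:
            # The packet fit through every hop, so this size works.
            return current, attempts
        # That router drops the packet and reports its next-hop MTU
        # (ICMP type 3 code 4). The sender retries at the reported size.
        current = blocking


path = [1500, 1400, 1476, 1300]  # assumed link MTUs along the path
mtu, attempts = discover_path_mtu(path, 1500)
print(mtu, attempts)
```

Notice that the sender only learns about one bottleneck at a time, so a path with several progressively smaller links costs one dropped packet per step down.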
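IPv6 PMTUD works similarly, but hosts don’t necessarily have to support it. IPv6 routers never fragment transit packets, so there is no DF bit to set; a router that can’t forward a packet simply drops it and sends back an ICMPv6 “Packet Too Big” message (type 2) containing the MTU of the next-hop link. Per the IPv6 RFC specifications, every IPv6 link must support an MTU of at least 1280 bytes, so hosts could simply keep all of their packets at or under 1280 bytes and operate under the premise that the packet will not experience MTU issues.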
For both protocols, since the MTU could potentially increase mid-conversation, either because of a configuration change or a path change, hosts can occasionally send packets that are bigger than their current session MTU. RFCs dictate that this should happen no more often than once every 5 minutes, and recommend setting it to 10 minutes. This way, hosts are not constantly sending traffic, knowing full well that it will likely get dropped, but they still give themselves a chance at discovering an increase in the MTU of a given conversation.
Because receiving these “MTU exceeded” messages is critical to the operation of hosts that use PMTUD, it is recommended that you do not block them. If you need to block ICMP in general, you can still put a “permit icmp any any unreachable” ACE in front of an ACL that contains a “deny icmp any any” entry.
Because an attacker could deliberately send oversized packets in an attempt to cause a DoS attack (by forcing the router to constantly drop packets and send unreachable messages to a host), Cisco allows administrators to throttle the frequency of these messages with the ip icmp rate-limit unreachable milliseconds command.
MSS
TCP MSS is defined in RFC 879, with more in RFC 2385. Then RFC 6691 comes along and tears those two RFCs apart for containing incorrect statements. I can honestly say I had never read an RFC before starting down the CCIE path, and it’s incredibly enlightening (and even slightly entertaining) to have gone through a few now.
The rule states that the default IP datagram size to send is 576 bytes. This includes 40 bytes for the IP header and TCP header (20 bytes each, without options), so when looking purely at the data that follows the L4 header, the default MSS is actually 536 bytes. When a connection is being established, the RECEIVER can indicate via the options field in the TCP header that it is able to accept segments of a different size. The advertised MSS does not account for IP or TCP options, so if a sender is crafting a packet with options that increase the size of the IP or TCP headers, the sender must reduce the amount of data it places in the segment so that the size of the packet does not exceed 576 bytes… or whatever size the advertised MSS allows. A host might advertise an MSS that is larger than the path will actually allow, in which case PMTUD will help the sender discover the true limit for that conversation. The MSS is advertised separately in each direction, and the two values don’t have to match.
There are a couple of relevant commands that might be easy to confuse when configuring MSS, so it’s important to keep them straight. First, the ip tcp mss bytes command, entered in global configuration mode, affects TCP connections that are going TO or FROM the router itself. It does not affect traffic that is flowing through the router. On the other hand, if the ip tcp adjust-mss command is configured on a router interface, that will indeed affect the MSS of transit traffic, because the router rewrites the MSS option in TCP SYN packets passing through that interface. Note that this command will not INCREASE any packet’s MSS; it is only used to decrease an MSS that exceeds the configured threshold.
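The arithmetic is easy to sanity-check. The Python sketch below is only an illustration: the helper names mss_for_mtu and clamp_mss are my own, the header sizes assume no IP or TCP options, and clamp_mss is just a model of the "lower but never raise" behavior described for ip tcp adjust-mss, not Cisco's implementation.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed IPv6 header only
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu, ipv6=False):
    """MSS a host would advertise for a given MTU (fixed headers only)."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def clamp_mss(advertised_mss, configured_max):
    """Model of MSS clamping on a transit SYN: lower it if needed, never raise it."""
    return min(advertised_mss, configured_max)


print(mss_for_mtu(576))              # 536, the default MSS
print(mss_for_mtu(1500))             # 1460, typical Ethernet over IPv4
print(mss_for_mtu(1500, ipv6=True))  # 1440, typical Ethernet over IPv6
print(clamp_mss(1460, 1360))         # 1360, lowered to the configured value
print(clamp_mss(1200, 1360))         # 1200, already below it, left alone
```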
Latency
Latency comes from delays in propagation, serialization, queueing and processing. There isn’t much we can do about propagation delay. If you send a packet across the globe, then you are actually going to have to take into account the speed of light. That’s right… that slow rate of 186,282 miles per second can actually have a measurable effect on your traffic. Even at that speed, light takes about 134 milliseconds to travel the entire circumference of the globe (roughly 24,900 miles), and signals in fiber travel noticeably slower than light in a vacuum. Talk to someone who works with VoIP and they’ll tell you that it’s a significant delay that must be taken into consideration. Talk to someone in the financial trading industries and they’ll tell you that 134 ms is an eternity.
And that’s before you ever even begin to take into account the other types of delay that affect a packet due to serialization, queueing and processing. Granted, those tend to be the far bigger issues with delay, but it’s worth noting that there are certain “sunk costs” so to speak, alongside other delays that we can affect through configuration, QoS and adherence to best practices. Since QoS is a topic unto itself on the CCIE exam, I’ll wait to discuss that until we broach the deep dive on that topic.
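For anyone who wants to check the speed-of-light math, here is a small Python sketch. The circumference constant is an approximate equatorial figure that I am supplying, and the velocity_factor parameter is a hypothetical knob for media slower than a vacuum; the 0.67 used for fiber is a common rule of thumb, not a measured value.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282
EARTH_CIRCUMFERENCE_MILES = 24_901  # approximate equatorial circumference


def propagation_delay_ms(distance_miles, velocity_factor=1.0):
    """One-way propagation delay in milliseconds.

    velocity_factor is the fraction of the vacuum speed of light at which
    the signal travels (1.0 for a vacuum).
    """
    speed = SPEED_OF_LIGHT_MILES_PER_S * velocity_factor
    return distance_miles / speed * 1000


# Once around the globe in a vacuum: roughly 134 ms.
print(round(propagation_delay_ms(EARTH_CIRCUMFERENCE_MILES), 1))

# Same trip assuming fiber at about two thirds of c (rule-of-thumb value).
print(round(propagation_delay_ms(EARTH_CIRCUMFERENCE_MILES, 0.67), 1))
```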
Bandwidth Delay Product
I’m going to address bandwidth delay product before I talk about windowing, as BDP feeds into the calculations behind windowing. Bandwidth delay product is a calculation that can give you a picture of how much data is still sitting on the wire, or at least how much is capable of sitting out there still in the process of being delivered. It is derived by multiplying a link’s bandwidth (in bits per second) by the end-to-end delay in seconds; for windowing purposes, the delay that matters is the round-trip time. The bigger the number, the more data there could potentially be “out there” still waiting to be delivered. If you have high bandwidth links that also happen to have high delay, you are said to have a “Long Fat Network” (LFN).
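A quick worked example helps here. The Python sketch below uses a made-up helper, bdp_bytes, and an assumed link of 1 Gbps with an 80 ms round-trip time; both numbers are purely illustrative.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth delay product in bytes.

    bandwidth_bps: link bandwidth in bits per second
    rtt_ms: round-trip time in milliseconds
    """
    bits_in_flight = bandwidth_bps * rtt_ms // 1000  # ms to seconds
    return bits_in_flight // 8                       # bits to bytes


# Assumed example: a 1 Gbps path with an 80 ms round-trip time.
print(bdp_bytes(1_000_000_000, 80))  # 10000000 bytes, about 10 MB in flight

# A 10 Mbps link with the same delay holds far less.
print(bdp_bytes(10_000_000, 80))     # 100000 bytes
```

The first path can have roughly 10 MB outstanding before the first acknowledgement could possibly come back, which is far more than a 65,535-byte window can cover.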
Windowing
I have to admit, this concept took me quite a while to wrap my head around!
The first thing we need to understand about TCP windowing is that it is not a concept that affects the size of individual packets. Rather, it defines how much data can be sent before a sender needs to wait for an acknowledgement so that it knows that it is safe to send more. Proper TCP windowing can help to ensure that each individual TCP conversation is optimized. One of the big factors that affects optimum windowing is bandwidth delay product. This comes into play because it doesn’t make sense to send the maximum amount of data, then stop completely while you await an acknowledgement of the data that has been sent. On long fat networks, this waiting process would become especially burdensome because there is lots of available bandwidth that would be unused while the sender awaits the ACK. Optimally, a host would anticipate how much time it is going to take for the ACK to arrive, and continue sending data while the ACK is on its way.
During the 3-way handshake, a host will begin the conversation with an initial window length. The “normal” maximum for this window would be 65,535 bytes, as this field is 16 bits in length in the TCP header. However, hosts can choose to set a TCP window scaling option, which simply shifts the 16-bit window left by the number of bits given in the scaling option. So if the scaling option was set to 3, you would essentially take the binary value of the 16-bit window field and add three zeroes to the end. This can also be expressed as multiplying the given value in the window by 2^3, or multiplying it by eight. While the VALUE of the WINDOW SIZE can change during a conversation, the value of the scaling option can ONLY be set once, during the 3-way handshake. The actual implementation of this windowing concept changes from one operating system to another, so don’t be surprised if you see different algorithms for calculating and implementing the window length.
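The shift is easier to see in code than in prose. This Python sketch uses hypothetical helper names (effective_window, min_scale_for); the only outside fact it relies on is that the window scale RFC caps the shift count at 14.

```python
MAX_WINDOW_FIELD = 65_535  # largest value a 16-bit window field can hold
MAX_SCALE = 14             # maximum shift count allowed for window scaling


def effective_window(window_field, scale):
    """Window in bytes after applying the scale factor (a left shift)."""
    return window_field << scale


def min_scale_for(target_bytes):
    """Smallest scale that lets a full 16-bit window cover target_bytes.

    Returns None if even the maximum scale is not enough.
    """
    for scale in range(MAX_SCALE + 1):
        if effective_window(MAX_WINDOW_FIELD, scale) >= target_bytes:
            return scale
    return None


print(effective_window(65_535, 3))  # 524280, i.e. 65,535 * 2^3
print(min_scale_for(10_000_000))    # 8, enough for a 10 MB bandwidth delay product
```

The second call ties windowing back to BDP: to keep a path with 10 MB in flight full, the receiver would need to negotiate a scale of at least 8 during the handshake.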
An author out at PacketLife named Jeremy Stretch does a terrific job of detailing this concept here: http://packetlife.net/blog/2010/aug/4/tcp-windows-and-window-scaling/
Global synchronization
Global synchronization is the problem that occurs because TCP-based communication will try to send the maximum amount of information possible until it is told to slow down. Unfortunately, when a congested queue fills up and starts dropping packets, more than one host gets told to slow down at the same moment, meaning that every host sending through that path backs off at once, which leads to wasted bandwidth. As the hosts ramp back up together, the waste shrinks, but as soon as limits are exceeded again, everyone gets told to slow down again… all at the same time. To combat this, QoS mechanisms such as WRED can drop traffic selectively before the queue is full, so that flows back off at different times rather than all together. Again, since WRED and QoS are topics unto themselves on the CCIE exam, I’ll address those in detail when we get to them.
Options
An option may be a single octet, referred to as the “option kind”, in which case the option is essentially just a value. Or, it can be formatted with a first octet defining the option kind, a second octet defining the length, and a final field carrying the actual data for that kind. The length field counts the kind and length octets too, so it gives the length of the entire option. The Wikipedia entry for TCP gives the following list of common options:
Some options may only be sent when SYN is set; they are indicated below as [SYN]. Option-Kind and standard lengths given as (Option-Kind,Option-Length).
- 0 (8 bits) – End of options list
- 1 (8 bits) – No operation (NOP, Padding) This may be used to align option fields on 32-bit boundaries for better performance.
- 2,4,SS (32 bits) – Maximum segment size (see maximum segment size) [SYN]
- 3,3,S (24 bits) – Window scale (see window scaling for details) [SYN]
- 4,2 (16 bits) – Selective Acknowledgement permitted. [SYN] (See selective acknowledgments for details)
- 5,N,BBBB,EEEE,… (variable bits, N is either 10, 18, 26, or 34)- Selective ACKnowledgement (SACK) These first two bytes are followed by a list of 1–4 blocks being selectively acknowledged, specified as 32-bit begin/end pointers.
- 8,10,TTTT,EEEE (80 bits)- Timestamp and echo of previous timestamp (see TCP timestamps for details)
- 14,3,S (24 bits) – TCP Alternate Checksum Request. [SYN]
- 15,N,… (variable bits) – TCP Alternate Checksum Data.
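To see how the kind/length/value layout plays out on the wire, here is a minimal Python sketch of an option parser. The function name parse_tcp_options and the sample bytes are made up for illustration; the sample encodes an MSS of 1460, a NOP, a window scale of 7, SACK permitted, and end of options.

```python
def parse_tcp_options(data: bytes):
    """Walk a TCP options field and return a list of (kind, value_bytes)."""
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:
            # End of options list: a single octet, nothing follows.
            options.append((0, b""))
            break
        if kind == 1:
            # NOP: a single octet used for padding and alignment.
            options.append((1, b""))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]  # includes the kind and length octets
        if length < 2 or i + length > len(data):
            raise ValueError("bad option length")
        options.append((kind, data[i + 2 : i + length]))
        i += length
    return options


# Hypothetical options field from a SYN.
sample = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 4, 2, 0])
for kind, value in parse_tcp_options(sample):
    print(kind, value.hex())
```

The NOP between the MSS and window scale options is the padding mentioned in the list above: the 3-byte window scale option plus one NOP keeps what follows aligned on a 32-bit boundary.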
In addition to the options field in the TCP header, there is a field that contains certain flags used by TCP. Many of these flags are intuitive, but two of them are decidedly not: PSH and URG. Yet again, I found no better explanation than the ones created by Jeremy Stretch at Packet Life. Rather than rewrite the wheel (what?), it just makes more sense to link you to the post here: http://packetlife.net/blog/2011/mar/2/tcp-flags-psh-and-urg/