TCP Troubleshooting
Common TCP problems and how to identify them with JitterTrap. Each section describes what to look for in the charts and how to diagnose the root cause.
Contents
- Bufferbloat — Latency increases under load
- Receive Window Starvation — Slow receiver limits throughput
- Retransmission Storms — Frequent packet loss
- Head-of-Line Blocking — Stalls from in-order delivery
- Nagle's Algorithm + Delayed ACK — 40ms latency on small writes
- Congestion Window Collapse — Sawtooth throughput pattern
- RTO Stalls — Multi-second stalls after loss
- Silly Window Syndrome — Tiny packets, poor efficiency
- TCP vs UDP — When TCP's guarantees hurt performance
- General Diagnostic Workflow — Step-by-step approach
- References — Key RFCs
Bufferbloat
Symptoms: Latency increases dramatically under load. A connection that shows 20ms RTT when idle may spike to 500ms+ when saturated. Interactive applications become sluggish during bulk transfers.
What It Looks Like in JitterTrap
In the TCP RTT chart:
- Baseline RTT is low (e.g., 20ms) when idle
- RTT climbs steadily as throughput increases
- RTT may reach 500ms or more at full load
- RTT returns to baseline when transfer completes
In the Throughput chart:
- High throughput correlates with high RTT
- The correlation is the signature—RTT tracks throughput
How to Test for Bufferbloat
- Start JitterTrap and establish baseline RTT to a remote host
- Begin a large file transfer (saturate the link)
- Watch the RTT chart; if it climbs from 20ms to 200ms+, you have bufferbloat (a simple command-line probe is sketched after this list)
- Stop the transfer and confirm RTT returns to baseline
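For a quick cross-check outside JitterTrap, the sketch below (a minimal probe, not part of JitterTrap) times repeated TCP connection setups to a host of your choosing while the bulk transfer runs. Connect time approximates one round trip plus queueing delay, so it climbs sharply when buffers are bloated; the host and port are placeholders.

```python
# Minimal RTT probe: time a TCP connect once per second while the link is
# saturated. Pick a nearby host you control so the congested path is the
# one being measured; the address below is a documentation placeholder.
import socket
import time

HOST = "192.0.2.10"   # placeholder (TEST-NET-1), replace with a real host
PORT = 22             # any open TCP port on that host

for _ in range(30):
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            pass
    except OSError as exc:
        print(f"connect failed: {exc}")
    else:
        print(f"connect time ~ {(time.monotonic() - start) * 1000:.1f} ms")
    time.sleep(1)
```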
Causes: Oversized buffers in routers, switches, or host network stacks that allow excessive queuing.
Solutions:
- Enable Active Queue Management (AQM) like fq_codel on routers
- Reduce buffer sizes on network equipment
- Use TCP congestion control algorithms designed for bufferbloat (BBR, CUBIC with ECN)
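On a Linux host, a quick way to confirm which congestion control algorithm and default queueing discipline are active is to read the relevant sysctls; a minimal, Linux-specific sketch:

```python
# Read the sysctls governing TCP congestion control and the default qdisc.
# Values such as "bbr" and "fq_codel" mean the bufferbloat-aware options
# above are in effect. These paths exist only on Linux.
from pathlib import Path

def sysctl(path: str) -> str:
    try:
        return Path(path).read_text().strip()
    except OSError:
        return "unavailable"

print("tcp_congestion_control:", sysctl("/proc/sys/net/ipv4/tcp_congestion_control"))
print("default_qdisc:         ", sysctl("/proc/sys/net/core/default_qdisc"))
```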
References: RFC 7567 (AQM Recommendations), RFC 8289 (CoDel), Bufferbloat.net
Receive Window Starvation
Symptoms: Throughput is limited even though the network has capacity. The receiver can't process data fast enough.
What It Looks Like in JitterTrap
In the TCP Window chart:
- Advertised window drops toward zero
- Zero Window markers (⚠) appear
- Window may oscillate between zero and small values
- Pattern is consistent regardless of RTT
In the Throughput chart:
- Throughput drops when window shrinks
- May see "staircase" pattern as window opens and closes
How to Diagnose
- Watch the TCP Window chart for a suspect flow
- If window drops to zero while throughput also drops, receiver is the bottleneck
- Capture packets during the event
- In Wireshark, look for Window Full and Zero Window events
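If you prefer to scan the capture programmatically rather than in Wireshark, a minimal sketch using the third-party scapy library (the capture filename is a placeholder):

```python
# List zero-window advertisements in a capture file. RST segments are skipped
# because they legitimately carry a zero window. Requires scapy (pip install scapy).
from scapy.all import rdpcap, IP, TCP

for pkt in rdpcap("capture.pcap"):                    # placeholder path
    if IP in pkt and TCP in pkt:
        tcp = pkt[TCP]
        if tcp.window == 0 and "R" not in str(tcp.flags):
            print(f"{float(pkt.time):.6f}  {pkt[IP].src}:{tcp.sport} -> "
                  f"{pkt[IP].dst}:{tcp.dport}  advertised zero window")
```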
Causes: Slow application not reading from socket buffers, or socket receive buffer too small.
Solutions:
- Profile and optimize the receiving application
- Increase the socket receive buffer size (SO_RCVBUF); see the sketch after this list
- Check for application-level backpressure
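A minimal sketch of raising the receive buffer; note that on Linux the kernel clamps the request to net.core.rmem_max and disables receive-buffer autotuning for that socket once SO_RCVBUF is set explicitly:

```python
# Request a larger receive buffer. Set it before connect() so the window
# scale option negotiated in the handshake can accommodate it.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)  # request 4 MiB
# Linux reports roughly double the requested value (bookkeeping overhead).
print("effective SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
# sock.connect((host, port)) would follow here
```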
References: RFC 793 (TCP Flow Control), RFC 7323 (Window Scaling)
Retransmission Storms
Symptoms: Poor throughput despite adequate bandwidth. High CPU usage on endpoints.
What It Looks Like in JitterTrap
In the TCP Window chart:
- Frequent Retransmit markers (↩)
- Markers may be clustered (burst loss) or evenly distributed (steady loss)
- Window size may fluctuate as congestion control reacts
In the TCP RTT chart:
- RTT may spike during retransmission events
- Erratic RTT pattern if loss is causing timeout-based retransmits
- Smoother RTT if fast retransmit (duplicate ACKs) is working
How to Diagnose
- Count retransmit markers over time—occasional is normal, frequent indicates a problem
- Note if retransmits are clustered (burst loss) or distributed (random loss)
- Set a trap to capture packets when retransmits exceed a threshold
- Analyze in Wireshark to determine if loss is at a specific hop
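As a cross-check outside JitterTrap, Linux keeps host-wide TCP counters in /proc/net/snmp; sampling the retransmitted-to-sent segment ratio over a short interval gives a rough sense of how lossy the host's traffic is (Linux-specific sketch):

```python
# Sample the host-wide TCP retransmission rate from /proc/net/snmp.
import time

def tcp_counters() -> dict:
    with open("/proc/net/snmp") as f:
        lines = [line.split() for line in f if line.startswith("Tcp:")]
    # The first "Tcp:" line holds field names, the second holds values.
    return dict(zip(lines[0][1:], map(int, lines[1][1:])))

before = tcp_counters()
time.sleep(10)
after = tcp_counters()

sent = after["OutSegs"] - before["OutSegs"]
retrans = after["RetransSegs"] - before["RetransSegs"]
rate = 100.0 * retrans / sent if sent else 0.0
print(f"{retrans} retransmits / {sent} segments sent = {rate:.2f}% over 10 s")
```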
Causes: Packet loss from congestion, bad links, or MTU issues.
Solutions:
- Identify where loss is occurring (use packet capture)
- Check for duplex mismatches
- Verify MTU is consistent across path
- Look for congested links or failing hardware
References: RFC 5681 (Fast Retransmit), RFC 6298 (RTO Calculation)
Head-of-Line Blocking
Symptoms: Periodic stalls in data delivery even when packets are arriving.
What It Looks Like in JitterTrap
In the Throughput chart:
- Gaps or dips that don't correlate with network congestion
- Throughput returns to normal after brief pause
- Pattern may be periodic if loss tends to hit the same position in the stream (for example, the tail of each burst)
In the TCP Window chart:
- Dup ACK markers during the stall
- Window may remain healthy (receiver has space, just waiting for in-order data)
How to Diagnose
- Look for throughput dips that don't match RTT spikes
- Check for Dup ACK markers (indicate out-of-order arrival)
- If application streams multiple independent data types over one TCP connection, head-of-line blocking is likely
- Capture during a stall to see the out-of-order packets in Wireshark
Causes: TCP's in-order delivery requirement means one lost packet stalls all following data.
Solutions:
- Consider QUIC or other protocols with stream multiplexing
- Use multiple TCP connections for independent data streams
- Reduce RTT to minimize stall duration
References: RFC 793 (In-Order Delivery), RFC 9000 (QUIC)
Nagle's Algorithm + Delayed ACK
Symptoms: Small writes have unexpectedly high latency (often ~40ms).
What It Looks Like in JitterTrap
In the TCP RTT chart:
- Very consistent ~40ms RTT spikes
- The regularity is the key signature—network jitter is random, this is fixed
- Pattern appears on request/response workloads with small messages
In the Throughput chart:
- Low throughput with periodic bursts
- Each burst separated by ~40ms gaps
How to Diagnose
- Look for suspiciously consistent 40ms RTT
- Check if the pattern occurs only with small messages
- Capture packets and look for delayed ACKs (the classic delayed-ACK timer is up to 200ms; Linux typically waits about 40ms)
- Test with TCP_NODELAY to confirm—if RTT drops dramatically, this was the cause
Causes: Nagle's algorithm waits for ACK before sending small packets. Delayed ACK waits ~40ms before acknowledging. Together they create artificial delays.
Solutions:
- Set TCP_NODELAY on latency-sensitive sockets (see the sketch after this list)
- Use TCP_QUICKACK on the receiver
- Batch small writes into larger ones
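A minimal sketch of the first two solutions on a single socket; TCP_QUICKACK is Linux-only and can be cleared again by the kernel, so it is guarded here and usually re-armed after each receive:

```python
# Disable Nagle so small writes go out immediately, and (on Linux) request
# immediate ACKs instead of delayed ones.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

if hasattr(socket, "TCP_QUICKACK"):           # Linux-only option
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)

# sock.connect((host, port)) and the request/response exchange would follow;
# latency-critical receivers typically set TCP_QUICKACK again after each recv().
```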
References: RFC 896 (Nagle's Algorithm), RFC 1122 §4.2.3.2 (Delayed ACK)
Congestion Window Collapse
Symptoms: Throughput drops sharply and recovers slowly after packet loss.
What It Looks Like in JitterTrap
In the Throughput chart:
- Sawtooth pattern: gradual increase, sharp drop, slow recovery
- Each cycle takes several RTTs to recover
- May see multiple cycles during sustained transfer
In the TCP RTT chart:
- RTT increases as congestion builds (bufferbloat)
- Retransmit markers appear
- RTT drops when congestion control backs off
How to Diagnose
- Look for the sawtooth throughput pattern
- Note if RTT spikes precede the throughput drops (bufferbloat triggering loss)
- Time the recovery—slow ramp indicates traditional AIMD congestion control
- Compare behavior with different congestion control algorithms (BBR vs CUBIC)
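To run that comparison from a test client, the algorithm can be selected per socket on Linux (Python 3.6+ exposes TCP_CONGESTION); the chosen module must be loaded and allowed by net.ipv4.tcp_allowed_congestion_control, so treat this as a sketch:

```python
# Select the congestion control algorithm for a single socket (Linux-specific).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
except OSError as exc:
    print("could not switch to bbr (module loaded? allowed?):", exc)

name = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print("congestion control in use:", name.rstrip(b"\x00").decode())
```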
Causes: TCP's congestion control cuts the sending rate dramatically after detecting loss.
Solutions:
- Reduce packet loss (the real fix)
- Consider BBR congestion control for lossy links
- Use ECN to get early congestion signals before loss occurs
References: RFC 5681 (Congestion Control), RFC 8312 (CUBIC), RFC 3168 (ECN)
Retransmission Timeout (RTO) Stalls
Symptoms: Long stalls (1-3+ seconds) followed by a burst of activity. Much worse than typical packet loss recovery.
What It Looks Like in JitterTrap
In the TCP RTT chart:
- Gaps of 1+ seconds with no data
- Multiple Retransmit markers (↩) clustered after the gap
- Pattern: silence, then burst of retransmits, then recovery
In the Throughput chart:
- Complete stop, then sudden burst
- Much longer pause than normal retransmission
How to Diagnose
- Time the stall duration—1+ seconds indicates RTO, not fast retransmit
- Check if retransmits cluster after the gap (RTO fired)
- Look for patterns—tail loss (end of burst) often triggers RTO
- Capture packets and check if fast retransmit (3 dup ACKs) failed
Causes: When fast retransmit (3 duplicate ACKs) fails, TCP falls back to RTO-based recovery. The minimum RTO is often 200ms-1s, and it doubles with each failed attempt (exponential backoff). A lost retransmit can cause multi-second stalls.
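To see why this escalates so quickly, a small worked example of the doubling, assuming a 200ms minimum RTO (a common Linux default):

```python
# Exponential backoff of the retransmission timer across consecutive losses.
rto_ms = 200.0      # assumed minimum RTO; RFC 6298 recommends 1 s, Linux commonly uses 200 ms
stall_ms = 0.0
for attempt in range(1, 6):
    stall_ms += rto_ms
    print(f"retransmission {attempt}: wait {rto_ms:.0f} ms  (total stall {stall_ms / 1000:.1f} s)")
    rto_ms *= 2     # timer doubles after each unsuccessful retransmission
```

Five consecutive losses already amount to more than six seconds of stall, which matches the multi-second pauses described above.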
Solutions:
- Investigate why fast retransmit is failing (tail loss, small windows)
- Enable TLP (Tail Loss Probe) and RACK if available
- For latency-sensitive applications, these stalls may be unacceptable—consider UDP
References: RFC 6298 (RTO Calculation), RFC 5681 §3.2 (Fast Retransmit)
Silly Window Syndrome
Symptoms: High packet rate but low throughput. Lots of small packets instead of full-sized segments.
What It Looks Like in JitterTrap
In the TCP Window chart:
- Very small advertised window values (bytes, not KB)
- Window may oscillate between tiny values
In the Top Talkers:
- High packet count relative to byte count
- Throughput is a fraction of expected
How to Diagnose
- Compare packet rate to byte rate; if the packet rate is high but throughput is low, the segments are small (a quick arithmetic check follows this list)
- Check TCP Window for tiny values
- Look for recovery pattern after window starvation
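The packet-rate-to-byte-rate comparison above boils down to average segment size; a tiny illustration with made-up numbers:

```python
# Average segment size = bytes per second / packets per second.
# The figures below are illustrative, not real measurements.
packets_per_sec = 4_200
bytes_per_sec = 340_000

avg_segment = bytes_per_sec / packets_per_sec
print(f"average segment ~ {avg_segment:.0f} bytes")   # ~81 bytes, far below a ~1460-byte MSS
```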
Causes: Receiver advertises tiny windows (e.g., after window starvation recovery). Sender sends tiny segments to fill the advertised window. Overhead dominates.
Solutions:
- Most TCP stacks have SWS avoidance built in
- If you're seeing this, check for broken or embedded TCP implementations
- Increase receive buffer sizes
References: RFC 813 (Window and Acknowledgement Strategy), RFC 1122 §4.2.3.4 (SWS Avoidance)
TCP vs UDP: When TCP Hurts
TCP is designed for reliable, ordered delivery of bulk data. These guarantees come at a cost that's often invisible until you look closely:
| TCP Behavior | Cost for Real-Time Systems |
|---|---|
| Guaranteed delivery | Stalls waiting for retransmits of data that may no longer be relevant |
| In-order delivery | Head-of-line blocking—one lost packet blocks everything behind it |
| Congestion control | Throughput collapse after loss; slow recovery; competing flows affect each other |
| Connection establishment | 1.5 RTT before first data byte; connection state on both ends |
| Flow control | Slow receiver blocks fast sender, even if data could be dropped |
Consider UDP when:
- You can tolerate some loss
- Need lowest latency
- Data has a "freshness" deadline
- You want application-level control over retransmission decisions
Examples: VoIP, video conferencing, gaming, live telemetry, sensor data, financial trading, DNS.
General Diagnostic Workflow
For any TCP performance issue:
- Establish baseline — Observe charts during normal operation. Know what "good" looks like.
- Identify the flow — Use Top Talkers to find the specific connection with issues.
- Check RTT first — High or variable RTT affects almost everything else.
  - High RTT → check for bufferbloat, long paths, or congestion
  - Variable RTT → check for jitter, route changes, or competing traffic
- Check the Window — If throughput is limited but RTT is reasonable:
  - Small window → receiver issue (application not reading, buffer too small)
  - Window collapse → congestion control reacting to loss
- Look for markers — Retransmit (↩) and Zero Window (⚠) markers tell you what's happening:
  - Many retransmits → packet loss problem
  - Zero window → receiver backpressure
- Correlate events — The most useful insights come from correlating multiple charts:
  - RTT spike + throughput drop → bufferbloat
  - Window drop + throughput drop → receiver starvation
  - Retransmit + throughput drop → packet loss
- Capture packets — Set traps to automatically capture when thresholds are exceeded. Analyze in Wireshark for definitive diagnosis.
References
Key RFCs
| RFC | Title |
|---|---|
| RFC 793 | Transmission Control Protocol |
| RFC 896 | Congestion Control in IP/TCP (Nagle) |
| RFC 1122 | Requirements for Internet Hosts |
| RFC 3168 | Explicit Congestion Notification |
| RFC 5681 | TCP Congestion Control |
| RFC 6298 | Computing TCP's Retransmission Timer |
| RFC 7323 | TCP Extensions for High Performance |
| RFC 7567 | IETF Recommendations Regarding AQM |
| RFC 8312 | CUBIC Congestion Control |
| RFC 9000 | QUIC Transport Protocol |
Related
- Media Streaming — How these problems affect streaming applications
- Network Impairments — Test how your application handles these conditions