
Excessive TCP Dup ACK and TCP Retransmissions

I'm doing an SFTP transfer between two servers about 70 ms RTT apart and seeing excessive TCP Dup ACKs and TCP Retransmissions. The circuit is 50 Mbit/s, but I'm getting a transfer speed of 500 kbit/s or less. What could be causing this?

Receiver https://www.cloudshark.org/captures/e...

Sender https://www.cloudshark.org/captures/e...

anonymous user
asked 2018-12-05 16:45:01 +0000

Comments

Could you please enable SACK on both endpoints and do the capture again? An absence of SACK option makes loss recovery very inefficient.

Where was the "sender" capture taken? It shows a slightly unusual IP TTL of 60. Is the capture point several hops away from the endpoint, or does the sender just use an unusual initial TTL?

Packet_vlad (2018-12-05 18:11:53 +0000)

The sender is an AIX server which is why the TTL is unusual and starts at 60. The receiver is a Linux server with SACK already enabled. I will check the setting on the AIX sender.

Based on what you're seeing so far, what do you think is the most likely cause?

neteng.ams (2018-12-06 00:55:41 +0000)

I need to take a closer look at it, but it looks like micro-bursting: the sender bursts at its 1 Gbit/s interface speed and hits a buffer or policer hard enough to cause bulk packet loss. At the same time, the recovery process is extremely slow because SACK is absent.
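To put a rough number on the micro-burst idea, here is a minimal sketch of a tail-drop FIFO being filled faster than it drains. The 1 Gbit/s ingress and 50 Mbit/s egress come from this thread; the 100 KB buffer size and the function name are assumptions for illustration only, and the model ignores framing overhead and any traffic already in the queue.

```python
def bytes_accepted_before_drop(buffer_bytes, ingress_bps, egress_bps):
    """Bytes of a back-to-back burst that a FIFO absorbs before tail drop starts.

    While the burst arrives, the queue grows at (ingress - egress). It is full
    after buffer / (ingress - egress) seconds, and everything that arrived in
    that time was accepted.
    """
    if ingress_bps <= egress_bps:
        return float("inf")  # the queue never builds up
    fill_time_s = buffer_bytes * 8 / (ingress_bps - egress_bps)
    return ingress_bps * fill_time_s / 8


# Assumed values: 100 KB buffer (hypothetical), 1 Gbit/s in, 50 Mbit/s out.
accepted = bytes_accepted_before_drop(100_000, 1_000_000_000, 50_000_000)
print(f"burst accepted before drops: about {accepted / 1000:.0f} KB")
```

Under those assumptions only the first 105 KB or so of a line-rate burst gets through, and the rest is dropped until the sender pauses. A policer with a similar burst allowance would behave in much the same way.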

Packet_vlad (2018-12-06 09:17:24 +0000)

The 3-way handshake in the sender capture tells us that the AIX sender doesn't support SACK. It's possible that SACKs may not help in this case - but as @Packet_vlad suggests, SACK is more efficient in general and so you should enable it if you can.

The MSS=1380 in the server's SYN-ACK is a strong clue that there's a Cisco ASA firewall in the path.

The minimum RTT of 68.1 ms means that the client and server are relatively far apart.

Thanks for this very interesting capture. There are a couple of elements to the problem.
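For anyone who wants to check the SACK and MSS points above without a dissector, here is a minimal sketch that walks the raw TCP options of a SYN or SYN-ACK. The option bytes are made-up examples, not taken from the captures, and the function names are hypothetical; option kinds 2 (MSS) and 4 (SACK Permitted) are the standard values.

```python
def parse_tcp_options(raw: bytes) -> dict:
    """Walk a TCP options field and return {kind: value_bytes}."""
    options = {}
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:          # End of Option List
            break
        if kind == 1:          # No-Operation padding, a single byte
            i += 1
            continue
        if i + 1 >= len(raw):  # truncated option, stop parsing
            break
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            break              # malformed length, stop parsing
        options[kind] = raw[i + 2 : i + length]
        i += length
    return options


def summarize_syn(raw: bytes) -> dict:
    """Report the advertised MSS and whether SACK Permitted is present."""
    opts = parse_tcp_options(raw)
    mss = int.from_bytes(opts[2], "big") if len(opts.get(2, b"")) == 2 else None
    return {"mss": mss, "sack_permitted": 4 in opts}


# Made-up option bytes: MSS=1380, SACK Permitted, NOP, window scale 7.
with_sack = bytes([2, 4, 0x05, 0x64, 4, 2, 1, 3, 3, 7])
# Same options without SACK Permitted.
without_sack = bytes([2, 4, 0x05, 0x64, 1, 3, 3, 7])

print(summarize_syn(with_sack))
print(summarize_syn(without_sack))
```

SACK is only used if both SYN and SYN-ACK carry the SACK Permitted option, so one endpoint leaving it out is enough to disable it for the connection.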

Philst (2018-12-07 03:11:13 +0000)

1 Answer

There's a very consistent regularity to the way the packets flow from the client to the server. A particular pattern is repeated again and again, at roughly 7-second intervals, with about 500 KB transferred per interval. I'll define the large burst of packets as the start of the pattern.
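As a quick sanity check on that pattern, here is a small sketch that converts the per-interval volume into an average rate. It assumes roughly 500,000 bytes moved per 7-second interval; both figures are rounded readings, and the function name is just for illustration.

```python
def average_throughput_kbit_s(bytes_per_cycle, cycle_seconds):
    """Average rate in kbit/s when bytes_per_cycle are moved every cycle_seconds."""
    return bytes_per_cycle * 8 / cycle_seconds / 1000


# Assumed values: about 500,000 bytes per pattern, one pattern every 7 seconds.
rate = average_throughput_kbit_s(500_000, 7.0)
print(f"average throughput: about {rate:.0f} kbit/s")
```

That works out to somewhere around 570 kbit/s, which is in the same range as the transfer speed reported in the question.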

Here are my key observations (with some supporting TCP Trace charts below):

1) The client sends a large burst of 250 KB, but large portions are lost after the first 100 KB. In the first TCP-Trace chart below, we see the 100 KB successful burst, the yellow area with no packets, a few subsequent packets that made it through, then one RTT later the second 100 KB burst.

2) The server's receive window is close to 1 MB, but the sender appears to treat its own RWIN of 261,288 bytes as its transmit "limit". The sender manages to keep close to this amount of data in flight throughout the whole period.

3) One RTT later, the sender receives ACKs for the 100 KB that wasn't lost and manages to transmit a further 100 KB without any errors. The many originally lost packets trigger many Dup-ACKs and, in response, the sender retransmits a single packet to begin to fill the gap. Following the horizontal "Ack line" on the chart, we see the single retransmitted packet and the step up of the Ack line.

4) One RTT after that, there's another single packet retransmission.

5) The large initial gap in the received data is then filled in at the rate of just one packet per RTT. Also, after several RTTs (perhaps as the sending congestion window is opened), the sender begins to send small bursts of new data so that the in-flight value of 250 KB is maintained.

6) On the second chart below, we've zoomed out to encompass a full pattern and the start of the next one. The dark blue circle is around the initial two large bursts, the red circle is around all the single-packet retransmissions and the light blue circle is around all the small bursts of new packets. It looks like the sender eventually waits for every sixth round trip so that it can send a full 6-packet application "block" (there's a Push flag at the end of these 6-packet bursts).

7) Eventually, the original large gap has been completely filled-in and we see the Ack line jump all the way up as the original two large bursts and all the smaller "new" bursts are fully acknowledged. At this point, in-flight data is zero and the sender is now free to begin the whole pattern all over again.
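The observations above can be turned into two rough estimates. This is only a sketch under stated assumptions: the gap is taken as at most 150 KB (the 250 KB burst minus the 100 KB that got through), each retransmission is assumed to carry a full 1380-byte segment, and the RTT is the 68.1 ms minimum mentioned in the comments. The function names are hypothetical.

```python
import math


def recovery_seconds(gap_bytes, segment_bytes, rtt_s):
    """Time to fill a gap when only one segment is retransmitted per RTT."""
    packets = math.ceil(gap_bytes / segment_bytes)
    return packets * rtt_s


def window_limited_mbit_s(window_bytes, rtt_s):
    """Best-case rate when at most window_bytes can be in flight per RTT."""
    return window_bytes * 8 / rtt_s / 1e6


# Assumed values: gap of up to 150,000 bytes, 1380-byte segments, 68.1 ms RTT.
recovery = recovery_seconds(150_000, 1380, 0.0681)
print(f"gap filled after about {recovery:.1f} s")

# Sender's apparent in-flight limit of 261,288 bytes over the same RTT.
ceiling = window_limited_mbit_s(261_288, 0.0681)
print(f"loss-free ceiling: about {ceiling:.1f} Mbit/s")
```

The first figure comes out at a little over 7 seconds, which fits the interval at which the pattern repeats. The second suggests that even with no loss at all, the 261,288-byte limit would cap the transfer at roughly 30 Mbit/s on this path, below the 50 Mbit/s circuit rate.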

TCP Trace - Initial Burst

TCP Trace - Full Pattern

So, what are the things that need to be investigated further?

A) The bulk packet loss, always after a large burst of 100 KB, points to a device in the path that has only about 100 KB of buffer space. The most likely candidate will be the router where the path ... (more)

Philst (answered 2018-12-07 05:34:35 +0000, updated 2018-12-11 07:58:08 +0000)
