
What causes retransmissions?


Don't have enough points to post a picture so here's what's happening.

I have two servers, both running Linux Mint 19.3. I tried Mellanox 10Gbps cards (with a Mellanox DAC) and Intel 10Gbps NICs (with an Intel-branded DAC). There is no switch: a 5 meter DAC connects the two servers directly. Both servers also have a 1Gbps NIC that was active the entire time. I edited the 'hosts' file on each server and entered the host name and 10Gbps IP address of the other box.

When I copy a 15, 30, or 50 gig file between the two servers, I get about 450-500MB/s one way. Copying the same file back in the other direction, speeds start off around 350-400MB/s but quickly fall to 150+MB/s. I've tested the IO subsystem on both servers, and the SSDs inside them can read/write at about 550MB/s.

I used Wireshark on one of the boxes and saw this:

Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)

I see that error repeated non-stop while a file copy is going on. I'm not a Linux (or networking) expert, but I'm thinking this might be a case for setting up a proper route on the Linux boxes so that ALL traffic between these boxes absolutely must stay on the 10Gbps NICs. Since I'm no Linux expert, I'm stuck here.

iPerf shows 9.6Gbps back and forth.

If I can't figure out the Linux route stuff, should I just grab a switch that has 10Gbps ports, have these servers talk through that, and pull their 1Gbps CAT5 cables?

Road Hazard
asked 2020-03-09 00:28:47 +0000


1 Answer


This is probably due to packet loss in your "slow" direction, with no loss in your "fast" direction.

The TCP "Congestion Avoidance" algorithm slows down the transmit rate when it detects packet loss (and assumes "congestion" somewhere in the path).
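To illustrate why even occasional loss can pull the transfer rate down so sharply, here is a rough sketch of a Reno-style additive-increase/multiplicative-decrease window. It is a toy model, not the Linux stack: the function name, the starting window, the slow-start threshold and the loss pattern are all made-up assumptions, and Linux typically defaults to CUBIC, which differs in detail.

```python
def simulate_aimd(rounds, loss_rounds, initial_cwnd=10.0, ssthresh=64.0):
    """Toy Reno-style model: return the congestion window (in segments)
    at the start of each round trip. All parameters are illustrative."""
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease: loss is interpreted as congestion,
            # so the window (and the threshold) is cut to half.
            ssthresh = max(cwnd / 2.0, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: the window doubles each round trip, capped at ssthresh.
            cwnd = min(cwnd * 2.0, ssthresh)
        else:
            # Congestion avoidance: additive increase of one segment per round trip.
            cwnd += 1.0
    return history


# Hypothetical scenarios: one direction with no loss, one with periodic loss.
clean = simulate_aimd(rounds=40, loss_rounds=set())
lossy = simulate_aimd(rounds=40, loss_rounds={8, 12, 16, 20, 24, 28, 32, 36})

print("no loss   avg cwnd:", sum(clean) / len(clean))
print("with loss avg cwnd:", sum(lossy) / len(lossy))
```

In this model each loss halves the window and recovery is only one segment per round trip, so repeated losses keep the average window (and therefore the throughput) well below the loss-free case.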

We'd need to see a packet capture file to prove this. However, you can see this for yourself by looking at Wireshark's graph: Statistics - TCP Stream Graphs - Window Scaling.
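If you save a capture of each direction, one rough way to compare them is to count the packets that Wireshark flags as retransmissions. The sketch below assumes `tshark` is installed and on your PATH; the helper names and any capture file name you pass in are hypothetical. It relies on the `tcp.analysis.retransmission` display filter.

```python
import subprocess
import sys


def build_tshark_cmd(capture_path, display_filter="tcp.analysis.retransmission"):
    """Build a tshark command that prints one frame number per matching packet."""
    return [
        "tshark",
        "-r", capture_path,      # read packets from a saved capture file
        "-Y", display_filter,    # apply a Wireshark display filter
        "-T", "fields",          # output selected fields only
        "-e", "frame.number",    # one frame number per line
    ]


def count_matches(capture_path, display_filter="tcp.analysis.retransmission"):
    """Run tshark and count the packets that match the display filter."""
    result = subprocess.run(
        build_tshark_cmd(capture_path, display_filter),
        capture_output=True, text=True, check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python count_retrans.py slow_direction.pcapng
    print(count_matches(sys.argv[1]))
```

A much higher count in the capture of the "slow" direction than in the "fast" one would support the packet loss explanation.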

Have a look at my answer to question number 15002.

https://ask.wireshark.org/question/15...

Philst
answered 2020-03-09 03:34:50 +0000, updated 2020-03-09 03:35:42 +0000

Comments

I guess my next obvious question is (and I know it's silly even asking, but :) ... would you care to hazard a guess as to what could cause this congestion? My SSDs can read/write way faster than the speeds I'm seeing, so I don't think that's the problem. Tonight, I was sitting in front of server A and accessing the NFS share on server B. The file copied over at 450MB/s and was slowly, SLOWLY dropping in speed. Copying the same file back to server B started at around 200MB/s and very, VERY quickly dropped to 60MB/s. This weekend, I'll grab a packet capture from each PC and post it if I can.

I have a 2.5M DAC on order that should arrive tomorrow. Maybe I'm at the edge of what a passive DAC is capable of? If that ... (more)

Road Hazard (2020-03-10 00:37:37 +0000)

That's a good question, given that you are directly connected.

A capture would prove the TCP "congestion" hypothesis and/or any other TCP stack issues.

As you suggest, the odds are good that you are looking at a hardware issue, so swapping things around might be the answer.

I'm not a big fan of guessing, so one or more capture files would provide more solid evidence.

Philst (2020-03-11 04:02:55 +0000)

A full capture can give you a good head start in pinpointing issues. But the exact reason why a system is slower might not be revealed in a packet capture.

hugo.vanderkooij (2020-03-11 14:14:42 +0000)

True, but a capture will help you eliminate "network" causes. It can also tell you that performance issues are inside a server and/or client, which transaction types are slow and which aren't, etc.

In the case above, I was suggesting that a capture would prove/disprove the TCP "Congestion Avoidance" hypothesis.

Philst (2020-03-17 01:39:16 +0000)