The network performance impact of DDoS protections

Digital platforms and services need to protect themselves from Distributed Denial of Service (DDoS) attacks. These attacks rely on a network of bots (commonly thousands of compromised machines connected to the Internet). Each of these bots contributes to saturating your digital platform (at the network, load balancer and server levels) and making your service unavailable. The fact that these attacks are distributed across thousands of devices spread across a multitude of operator networks makes their mitigation a difficult task (if you want to dig further into the nature of DDoS: https://en.wikipedia.org/wiki/DDoS_mitigation). Most organizations relying on digital services / web applications to run their business subscribe to DDoS protection services. The question is then to understand how these DDoS protections influence the network path and how fast users can access the digital platform. Every customer of these services needs to evaluate the performance impact of DDoS protections for their users.

Most world-scale cloud / CDN providers have their own DDoS mitigation offerings: 

 If you are looking for a good comparison of these services, please take a look at this article

How DDoS protections work

All of the DDoS protection services listed above rely on massive scale cloud infrastructure and especially distributed network presence. This enables them to remain unaffected by massive quantities of DDoS traffic they have to filter out. 

Among these services, the first distinction to make is between:

  • the cloud service providers offering DDoS protection for the workloads hosted in their cloud platform,
  • the providers who offer that service for workloads regardless of where they are hosted (e.g. in your own datacenter). 

Finally, they all aim to filter out botnet traffic using a series of heuristics powered by network inspection, proxies, etc. They perform that filtering through different means: 

  • Acting as a reverse proxy, like Project Shield
  • Redirecting traffic to the DDoS protection platform by changing your DNS records
  • Having the DDoS protection provider announce your IP prefixes in its BGP policy so that your traffic is redirected through its network. 

All of these methods consist in rerouting traffic that would normally flow directly from your users’ networks / autonomous systems to your platform through the provider’s platform instead, so that only “clean” traffic is delivered to you.
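
As a quick illustration of the DNS-based approach, the sketch below (Python, standard library only) resolves a hostname and checks whether the returned addresses fall within your own prefixes or within the protection provider’s. The hostname and both prefix lists are placeholders to adapt to your own environment; it is a minimal sanity check, not a monitoring tool.

    import ipaddress
    import socket

    HOSTNAME = "www.example.com"                                   # your public entry point
    OWN_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]        # placeholder: your prefixes
    PROVIDER_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]  # placeholder: provider prefixes

    def classify(address: str) -> str:
        ip = ipaddress.ip_address(address)
        if any(ip in net for net in OWN_PREFIXES):
            return "direct (your own prefix)"
        if any(ip in net for net in PROVIDER_PREFIXES):
            return "via the DDoS protection provider"
        return "unknown"

    # getaddrinfo returns every address the resolver currently hands out
    for info in socket.getaddrinfo(HOSTNAME, 443, proto=socket.IPPROTO_TCP):
        addr = info[4][0]
        print(f"{HOSTNAME} -> {addr}: {classify(addr)}")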

This DDoS protection may be active 365 days a year or on demand. 

What’s the performance impact of a DDoS protection? 

In case of a DDoS attack your platform will be protected, but what is the performance impact in normal conditions? 

The very first thing to understand is that the path from your users to your platform will be affected, either for all users or for certain users at certain times. 

Let’s take the example of a DDoS protection service provided by CloudFlare through BGP routing. In this case, CloudFlare announces its customer’s prefixes in its BGP policy. This is done 365 days a year, and the CloudFlare route may be preferred in case of a DDoS attack to avoid flooding the customer’s infrastructure with attack traffic.

In the screenshot below, we see the route taken from one of Kadiska’s Stations to the customer’s platform. We can identify multiple paths to go from one end to another. 

We see two main routes from our Station in Tokyo to this customer’s platform:

  • the traffic leaves a first provider, which decides to route it all through iD3.net;
  • iD3.net then, in most cases, routes the traffic through its own network straight to the customer’s autonomous system (most likely that AS peers directly with the customer’s AS), but in some cases it can decide to route it through CloudFlare’s infrastructure. 

How DDoS protection influences my BGP routes and the reachability of my cloud platform

The most important fact to retain from this diagram is that routing decisions are driven by the BGP policy of the operator / AS on your users’ end (left side of the diagram), which selects routes based on performance (shortest path) and economics (cost of transit / peering arrangements). If you need some clarification on how BGP manages this, please refer to this article. 

What are the key performance questions to answer?

  • First of all, you need to understand where your users are located and through which operator or AS they connect. 
  • Second question on your list: where are your points of presence (or those of your cloud / hosting providers)? How far are they, in terms of latency and number of hops, from your users’ operators? 
  • Third question on your list: where are your DDoS vendor’s points of presence and how close are they to your users? 
  • Finally, how do the most important AS’s on your users’ end route the traffic to the different elements of your platform? 

This will tell you whether, in normal conditions, your traffic reaches your AS directly or goes through your DDoS protection vendor, and whether that translates into a performance loss or gain (more or less latency, packet loss and hops). 

Depending on the route taken, latency will be higher or lower, the number of hops larger or smaller. Packet loss will also vary. 

What we see above is the evolution over 2 weeks, but keep in mind that all of this is dynamic, driven by multiple factors, and subject to frequent change: 

  • Each operator’s BGP policy on the users’ side (left-hand side)
  • The evolution of network conditions (unavailability, congestion, events) affecting all the possible routes that are evaluated. 

What can you do to optimize the reachability of your platform? 

Monitoring data is only as useful as it is actionable! Obviously your BGP policy impacts how you route traffic from your autonomous system to the rest of the internet. 

As we focus on incoming traffic, the key question here is how you can affect the route taken to reach your own platform. It is easy to understand that the further an AS sits from your own AS, the smaller your ability to act on it.  

The only actions you can take are: 

  • Amending peering and transit arrangements (to avoid advertising routes from / to your AS through AS’s that deliver poor performance). 
  • Asking your transit providers to act in the same way and avoid poorly performing routes. 

As an example (see figure below), you may want to avoid the route from Cogent to CloudFlare (leading to the customer’s AS) as it shows high packet loss rates and prefer other routes (including through the same tier 1 but through a different path). 

How to optimize your incoming AS path when using DDoS protections

If you would like to take action and start optimizing the reachability of your platform (going through a DDoS protection service or not), I suggest you take a look at this article.


How does BGP routing work?

BGP stands for Border Gateway Protocol. It refers to the routing protocol used to ensure proper interconnection between autonomous systems (AS). eBGP (external BGP) is used between AS’s, while iBGP (interior BGP) is used within an AS.

These basic concepts are explained in our article “What is BGP?”.

Let’s now deep dive a bit into how BGP actually works.

BGP is the protocol used in the backbone of the Internet. It allows organizations that have their own AS (typically Internet Service Providers and large organizations) to interconnect with others. This type of interconnection between AS’s is called a peering.

The basics of BGP peering

The Tier-1 club

When an AS gets set up, it peers with other AS’s to declare its IP prefixes (prefixes are the IP subnets it owns), which are then declared to other AS’s, and so on. In this way, when new prefixes are announced, they get propagated around the Internet.

Owning an AS does not mean, though, that you can automatically make it reachable globally! Among the roughly 100,000 AS’s, only about twenty can reach all Internet destinations without purchasing transit from any other AS, forming the so-called Tier-1 club.

The BGP routes

Unlike other routing protocols, BGP has no peer discovery process.

Each BGP speaker, which is called a “peer”, exchanges routing information with its neighboring peers in the form of network prefix announcements.

With prefix announcements, the information is sufficient to construct a graph of AS connectivity, as illustrated hereunder.

As you can see, communication between two prefixes can often occur through different paths.

Prefix J from AS 1559 can for example reach Prefix G from AS 257 via AS 20 or AS 13936.

So how do routers choose between different possible routes?

The routing decision

The BGP AS path

BGP does not work like traditional routing protocols, which use metrics such as distance or cost (for example, bandwidth) to make routing decisions. Instead, BGP uses various attributes to route the traffic. 

The prime attribute of BGP is called “AS path”. This is a list of AS numbers describing the inter-AS path to a destination. The AS path is so critical to the function of BGP that the protocol is often referred to as a Path Vector routing protocol.

The figure here above shows how the AS path is propagated.

The AS 1 peer sends its prefix to the AS 6 and AS 5 peers (AS path [1]), which in turn send the prefix list respectively to AS 3 (AS path [6, 1]) and AS 2 (AS path [5, 1]) peers. AS 2 peer propagates this prefix list to AS 4 peer (AS Path [2, 5, 1]). Finally, AS 4 peer propagates the AS 1 peer prefix to AS 3 peer (AS Path [4, 2, 5, 1]).

So as a result, AS 3 is accessible from AS 1 through the AS path [6, 1] as well as AS Path [4, 2, 5, 1].
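
To make this propagation mechanism concrete, here is a small, hypothetical simulation of the figure’s topology. Each AS prepends its own number when re-announcing a prefix and rejects any path that already contains it (BGP loop prevention); unlike real BGP speakers, which only re-advertise their single best path, the sketch floods every loop-free path so that the two paths learned by AS 3 ([6, 1] and [4, 2, 5, 1]) become visible.

    from collections import deque

    # Peerings mirroring the example: 1-6, 1-5, 6-3, 5-2, 2-4, 4-3
    PEERINGS = {1: [6, 5], 6: [1, 3], 5: [1, 2], 2: [5, 4], 4: [2, 3], 3: [6, 4]}

    def propagate(origin_as: int):
        """Flood a prefix from its origin; every AS keeps all loop-free paths."""
        learned = {asn: [] for asn in PEERINGS}              # AS -> learned AS paths
        queue = deque((nbr, [origin_as]) for nbr in PEERINGS[origin_as])
        while queue:
            asn, path = queue.popleft()
            if asn in path or path in learned[asn]:           # loop prevention / duplicate
                continue
            learned[asn].append(path)
            for nbr in PEERINGS[asn]:                         # re-announce, prepending our ASN
                queue.append((nbr, [asn] + path))
        return learned

    for asn, paths in propagate(1).items():
        print(f"AS {asn} reaches AS 1's prefix via: {paths}")
    # AS 3 ends up with two paths: [6, 1] and [4, 2, 5, 1], as in the figure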

The BGP routing decision process

From the previous example, you may think that the chosen path between AS 1 and AS 3 peers will be via the AS 6 because this is the shortest path. 

Well, it can be the case, but this is not a strict rule! In fact, the best path is chosen based on policies, which are configured via various prefix filters, by announcing specific routes or by manipulating BGP attributes. 

When a destination is reachable from two different paths, BGP selects the best path by sequentially evaluating the path attributes:

  • Weight
  • Local preference
  • Originate
  • AS path length
  • Origin code
  • MED (Multi Exit Discriminator)
  • eBGP path over iBGP path
  • Shortest IGP path to BGP next hop
  • Oldest path
  • Router ID
  • Neighbor IP address.

The main point here is not to go into all details of these attributes, but to understand the basic principle of the routing decision process.

Coming back to the example above, if the “weight” attribute of the AS path [4, 2, 5, 1] from AS 1 to AS 3 is greater than that of AS path [6, 1], then this path is chosen. If the “weight” attribute is equal for both paths, then the next attribute is evaluated (local preference), and so on.
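
The sequential tie-break can be sketched in a few lines of code. The routes and attribute values below are hypothetical, and only a subset of the attribute list above is modeled; the point is simply that a higher weight wins before the AS path length is even considered.

    from dataclasses import dataclass

    @dataclass
    class Route:
        as_path: list
        weight: int = 0        # higher wins (local to the router that received the route)
        local_pref: int = 100  # higher wins
        med: int = 0           # lower wins

    def best_path(candidates):
        # A tuple key gives a lexicographic comparison: the next attribute only
        # matters when the previous ones are tied, which is exactly the
        # sequential evaluation described above.
        return max(candidates,
                   key=lambda r: (r.weight, r.local_pref, -len(r.as_path), -r.med))

    routes = [
        Route(as_path=[6, 1]),                    # shortest AS path
        Route(as_path=[4, 2, 5, 1], weight=200),  # higher weight: evaluated first, wins
    ]
    print("Selected AS path:", best_path(routes).as_path)   # -> [4, 2, 5, 1]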

So, in short, by using BGP attributes, you can make sure your traffic will transit through your preferred AS’s, based for example on non-technical parameters like financial agreements you may have with other AS owners.

How and when are BGP routing protocol data exchanged?

BGP uses the TCP transport protocol to transfer data. This provides reliable delivery of the BGP updates. BGP uses TCP port 179 for this.

It uses the Finite State Machine (FSM) model to maintain a table of all BGP peers and their operational status.

Compared to other routing protocols, BGP does not send any periodic updates of routing data. Instead, it sends updates only when changes occur on the network. For example, these changes can be due to session resets, link failures and policy changes.

Finally, BGP periodically sends keep-alive messages to check the TCP connection.

What can go wrong with BGP?

First, we have seen that BGP peering is configured manually. Human configuration is prone to errors and, worse, to malicious attacks.

As an example, do you remember the IBM cloud outage back in June 2020? This was due to a BGP hijacking!

Another recent example from December 2020 is the Google Euro-Cloud outage due to an incorrect Access Control List configuration, which led the BGP routing protocol to withdraw the europe-west2-a availability zone from the rest of the Google backbone network.

Secondly, and this is certainly one of the major challenges for BGP, processing updates of large routing tables can be a problem for some routers.

Each router needs to store a local database of all prefixes announced by each routing peer. A router has a finite capacity to process updates and once the update rate exceeds its local processing capability, then the router will start to queue up unprocessed updates. In the worst case, the router will start to lag in real time, so that the information a BGP speaker is propagating reflects a past local topology, not necessarily the current local topology. At its most benign, the router will advertise ‘ghost’ routes where the prefix is no longer reachable, yet the out-of-sync router will continue to advertise reachability.

The following graph shows the evolution of the IPv4 BGP routing table size since the very beginning of BGP:

As you can see, even with the exhaustion of available IPv4 addresses, the size of routing tables still dramatically increases!

The APNIC organization has published the following detailed article on this topic.

Not only does the average BGP routing table size increase, the frequency of BGP updates also follows the same path. In another article, APNIC takes a specific AS (AS 131072) into consideration to measure the evolution of the routing table updates per year. Starting at 300.000 updates in 2009 (about 30 updates per hour), it exceeded 800.000 updates in 2020 (about 90 updates per hour).

Takeaway

BGP is an important part of the Internet foundations.

As it must be configured manually, it is prone to human error as well as security attacks.

Furthermore, the evolution of networks makes it more and more susceptible to instabilities and service disruptions. 

In a global digital services context, monitoring BGP behavior in terms of path discovery, performance, and path changes becomes critical!

If you would like to learn more about how you can monitor network performance from the internet to your AS (or your cloud provider’s AS) or from your AS to different digital assets, I recommend that you read this article.


How to measure network latency: the 5 best tools

The goal of this article is to summarize classical ways used to measure network latency through synthetic testing, that is by injecting traffic on the network. We will highlight the pros and cons of each tool to make sure you pick the right one for your needs. 

These 5 tools are:

  • Ping
  • Traceroute
  • OWAMP
  • TWAMP
  • iPerf

Simple tests using your PC

Let’s start with the ones whose implementation is straightforward.

1. Measure network latency with PING 

Let’s begin with the simplest way you may think about when trying to measure network latency: using your PC.

How it works

Ping is a standard tool available on all types of OS platforms that measures the round trip time between your PC and the target you specify (domain or IP address).

Just open your console and type ping followed by a domain name or an IP address. If you provide a domain name, the first step ping will perform is resolving it to the corresponding IP address. As an alternative, you can provide the IP address directly.

The following example shows that the minimum, maximum and average round trip times are respectively 20ms, 24ms and 21ms.

ping measure network latency

By default, a ping command sends four ICMP Echo Request packets to the destination, which responds back with ICMP Echo Reply packets. 
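
If you want to script this rather than read the console output, a minimal sketch is to run the system ping and parse its round-trip summary line. The example below assumes the Linux/macOS flags and output format; on Windows the flag is -n and the summary line differs.

    import re
    import subprocess

    def ping_rtt(host: str, count: int = 4):
        """Run the system ping and return {'min': .., 'avg': .., 'max': ..} in ms."""
        result = subprocess.run(["ping", "-c", str(count), host],
                                capture_output=True, text=True)
        # Typical Linux summary line: "rtt min/avg/max/mdev = 20.1/21.3/24.0/1.2 ms"
        match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", result.stdout)
        if not match:
            return None            # host unreachable or unexpected output format
        return dict(zip(("min", "avg", "max"), map(float, match.groups())))

    print(ping_rtt("example.com"))  # e.g. {'min': 20.1, 'avg': 21.3, 'max': 24.0}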

Advantages

The main advantage of this method is its simplicity. You do not need anything installed on your PC (except the ping tool itself) and you can directly target any domain or corresponding IP address without any additional configuration or software component.

Limitations

Unfortunately, simplicity often comes with limitations.

  • First, for security reasons, ICMP packets may be blocked by intermediate firewalls. In this case, the target will never respond to your ICMP Echo Request and you will not be able to measure the network latency.
  • Secondly, the ICMP protocol may be handled with low priority by intermediate routers, distorting the accuracy of the measurements.
  • Last but certainly not least, as ping measures a round trip delay, you cannot differentiate the network latency in both directions. So it is not possible to detect a network problem that occurs in a specific traffic direction.

2. Measure network performance with TRACEROUTE 

As an alternative to ping, you can make use of traceroute:

  • tracert command on Windows machines
  • traceroute command on Linux machines

How it works

Traceroute uses the TTL (Time To Live) field of IP packets to discover intermediate routers between a source and a destination.

The principle is quite simple. Each time a router forwards a packet, it decrements the TTL field value by 1. When the TTL reaches 0, the router drops the packet and sends an ICMP error message “TTL exceeded in transit” back to the source. This mechanism prevents packets from looping indefinitely, which could otherwise congest the whole network.

Receiving ICMP packets back from intermediate routers allows the source to discover all of them as well as measure the network latency to reach them.

traceroute to measure network latency
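
For illustration, the TTL mechanism described above can be reproduced in a few lines: send UDP probes with an increasing TTL and listen on a raw ICMP socket for the “TTL exceeded” replies. This is a bare-bones sketch (raw sockets require root privileges, and unrelated ICMP packets are not filtered out), not a replacement for a real traceroute.

    import socket
    import time

    def traceroute(dest: str, max_hops: int = 30, port: int = 33434):
        dest_ip = socket.gethostbyname(dest)
        for ttl in range(1, max_hops + 1):
            recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
            send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
            recv.settimeout(2.0)
            send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            start = time.time()
            send.sendto(b"", (dest_ip, port))
            try:
                _, addr = recv.recvfrom(512)      # ICMP "TTL exceeded" from the hop router
                rtt_ms = (time.time() - start) * 1000
                print(f"{ttl:2d}  {addr[0]:15s}  {rtt_ms:.1f} ms")
                if addr[0] == dest_ip:            # destination reached (port unreachable)
                    break
            except socket.timeout:
                print(f"{ttl:2d}  *  (no reply)")
            finally:
                send.close()
                recv.close()

    traceroute("example.com")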

In its simplest form, traceroute uses ICMP protocol for sending packets (Echo Request). And as you know now, some routers may not respond to ICMP packets for security reasons. This phenomenon is shown hereunder.

example of a traceroute command that measure network latency

On this figure, you can also see that traceroute sends four packets per hop by default on a Windows platform.

More advanced implementations also use the UDP or TCP transport protocols and offer more options (packet size, probe interval, number of probes per hop, …). This ensures better alignment with real network traffic and the way it is routed. Nevertheless, all traceroute implementations still rely on the ICMP “TTL exceeded” messages sent back by intermediate routers, which can be handled with much lower priority, as explained previously.

Furthermore, the reported value corresponds to the round trip delay. So again, it is not possible to detect asymmetric network problems.

Advantages

The main pros of traceroute are:

  • Easy to use and implement
  • Gives a representation of the network path

Limitations

On the other hand, Traceroute also comes with some limitations:

  • If traceroute uses ICMP, all the limitations listed for Ping are also valid.
  • Round trip delay only.

More advanced tests to measure network latency

More advanced synthetic test techniques can help get around the problem of ICMP packets being handled with low priority, as well as the limitation of round-trip-only measurement. 

Let’s introduce the most well known.

3. Using OWAMP for one-way and two-way network response times

How it works

OWAMP stands for One-Way Active Measurement Protocol.

It is standardized under RFC 4656.

Compared to its counterparts ping and traceroute, OWAMP measures network latency in one direction and does not rely on the ICMP protocol to calculate it.

OWAMP therefore provides more precise data, as it uses UDP packets sent in one direction to measure the latency. You can fine-tune your tests to better align with your specific requirements and use case. For example, you can define the size of each packet, the interval between two consecutive packets in a test, as well as the number of packets to send per test.

And of course, it is easy to detect any network problem related to a specific traffic direction by performing the same test in both directions separately.

As a result, you get the minimum, median, and maximum values of the network latency between your source and the targeted destination (as well as other useful data like the one-way jitter and the packet loss).

example of an OWAMP test that measures network latency

In addition, OWAMP supports authentication mechanisms to ensure security.
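
OWAMP itself is a full client/server protocol (with session setup, authentication and standardized packet formats), so the sketch below is not an OWAMP implementation; it only illustrates the underlying idea of one-way measurement: the sender timestamps each UDP packet and the receiver computes the one-way delay on arrival, which is only meaningful if both clocks are synchronized. The port number is arbitrary.

    import socket
    import struct
    import sys
    import time

    PORT = 8620  # arbitrary example port

    def sender(receiver_ip: str, count: int = 10):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for seq in range(count):
            # 4-byte sequence number + 8-byte send timestamp
            sock.sendto(struct.pack("!Id", seq, time.time()), (receiver_ip, PORT))
            time.sleep(0.1)

    def receiver():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", PORT))
        while True:
            data, _ = sock.recvfrom(64)
            seq, sent = struct.unpack("!Id", data)
            # Only meaningful if both machines' clocks are synchronized (NTP/PTP)
            print(f"packet {seq}: one-way delay {(time.time() - sent) * 1000:.2f} ms")

    if __name__ == "__main__":
        receiver() if sys.argv[1] == "recv" else sender(sys.argv[1])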

OWAMP seems to be the way to go, right?

The answer is “Yes”, but only if you manage both ends of the test. Indeed, OWAMP requires the implementation of a client/server architecture. A piece of software must be installed on both the source and the destination of the test. 

Furthermore, the accuracy of the data depends on proper clock synchronization between the two ends.

Last but not least, OWAMP does not properly support NAT (Network Address Translation) configurations. Certainly something to take into account…

Advantages

  • One way delay measurement
  • Accuracy

Limitations

  • Requirements:
    • Control of both ends
    • Proper clock synchronization
  • No support for NAT

4. TWAMP for 2-way latency measurement

How it works

TWAMP, which stands for Two-Way Active Measurement Protocol, is an alternative to OWAMP. It is standardized under RFC 5357.

Compared to OWAMP, TWAMP measures latency in both directions.

It first uses TCP to establish a connection between the source and destination, then uses UDP packets to measure the latency. It also uses a client/server architecture.

Advantages and limitations

TWAMP globally shares the same advantages and limitations as its counterpart OWAMP.

5. Measure network performance with iPerf3

You may think of iPerf (the latest version is iPerf3) as an alternative to OWAMP or TWAMP.

How it works

This tool also uses a client/server model, where data can be analyzed from both ends.

Nevertheless, as the data collected are throughput, jitter and packet loss, it is used more to measure the overall link quality between two endpoints than to measure network latency itself.
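
If you still want to automate it, iperf3 can report its results as JSON, which is convenient to post-process. The sketch below assumes an iperf3 server is already running on the target ("iperf3 -s") and that iperf3 is installed locally; the exact JSON field names may vary slightly between iperf3 versions.

    import json
    import subprocess

    def iperf3_udp_test(server: str, bandwidth: str = "10M", seconds: int = 5):
        out = subprocess.run(
            ["iperf3", "-c", server, "-u", "-b", bandwidth, "-t", str(seconds), "-J"],
            capture_output=True, text=True, check=True).stdout
        summary = json.loads(out)["end"]["sum"]        # UDP test summary section
        return {"jitter_ms": summary["jitter_ms"],
                "lost_percent": summary["lost_percent"]}

    print(iperf3_udp_test("192.0.2.10"))   # placeholder server address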

Advantages

  • Firstly, the main advantage of iPerf is that it supports a variety of parameters and can use UDP as well as TCP to send probe packets, which can better align with your specific use cases.
  • Secondly, iPerf also provides throughput information

Limitations

  • iPerf requires a client / server implementation (i.e. on both ends).
  • iPerf provides no network latency measurement.

Summary

Here is how the five tools compare (metrics, implementation, pros, and limitations / drawbacks):

Ping
  • Metrics: round trip delay, packet loss (approx.)
  • Implementation: simple test from any machine
  • Pros: simple
  • Limitations / drawbacks: ICMP accuracy, round trip only

Traceroute
  • Metrics: round trip delay, packet loss (approx.), network path
  • Implementation: simple test from any machine
  • Pros: simple, shows the network path
  • Limitations / drawbacks: ICMP accuracy (if not using UDP/TCP), round trip only

OWAMP
  • Metrics: one way and round trip delay, jitter, packet loss (minimum, median and max values)
  • Implementation: client/server model
  • Pros: accurate, one way delay measurement
  • Limitations / drawbacks: deployment model, requires proper clock synchronization, no support for NAT

TWAMP
  • Metrics: one way and round trip delay, jitter, packet loss (minimum, median and max values)
  • Implementation: client/server model
  • Pros: accurate, one way delay measurement
  • Limitations / drawbacks: deployment model, requires proper clock synchronization

iPerf
  • Metrics: jitter, packet loss, throughput
  • Implementation: client/server model
  • Pros: accurate, throughput information
  • Limitations / drawbacks: no network latency measurement, deployment model

Takeaways

Being able to correctly measure network latency is a key aspect for ensuring digital services performance.

On one hand, measuring it can be as simple as performing ping or traceroute commands from your PC. Just bear in mind that these techniques, although simple to implement, suffer from some significant limitations:

  • Prone to measurement imprecision due to the nature of ICMP handling
  • They only report a round trip value (no problem detection for a specific traffic direction)

On the other hand, if you manage both ends of the test, a better option would be to use OWAMP or TWAMP. The measurements will be more precise and you will be able to detect problems related to specific traffic direction. Nevertheless, one important question stays open: “How to orchestrate such tests in a distributed and complex production environment?”. This is where specialized solutions certainly come into play.

Finally, if you are interested in this topic you may also find it worthwhile to look at these two other articles:

  • “Why network latency drives digital performance” explains why network latency is a critical driver for web applications performance. It is all the more true if you think about cloud-based services, where the network path may be complex. Nowadays, the network is definitely one of the major drivers of digital services performance.
  • In “What is network latency? How it works and how to reduce it”, we define network latency as the delay for a packet to travel on a network from one point to another. We introduce some techniques used to measure this latency, like RTT (Round Trip Time) measurement.


What is BGP?

Whenever you surf the internet, you use BGP.

Let's start with the very basics:

  1. BGP stands for Border Gateway Protocol.
  2. This is a routing protocol that enables the Internet to work.

Secondly, if BGP is a “routing protocol”, what does “routing” then mean? Well, routing is the process of selecting the best path to a destination (an IP address in our case) across one or multiple networks. To summarize, BGP is the protocol that ensures that you can get the best possible performance when you access a resource on the internet! 

History

In the early days of the Internet, only a small number of networks had to connect to each other. At that time, routing between them was quite static and easy. But as you know, the Internet expanded massively, making this static configuration impracticable.

This is why dynamic routing protocols were invented. In 1982, Eric C. Rosen from BBN Technologies defined EGP (Exterior Gateway Protocol). In 1984, David L. Mills formally specified it, and it was finally ratified under RFC 904.

EGP is a protocol for exchanging routing information between two neighbor gateway hosts in a network of autonomous systems. An autonomous system is a collection of networks under a common administrative domain.

The basics of EGP routing protocol

EGP has three major functions:

  1. Establish a set of neighbors
  2. Check neighbors availability
  3. Inform neighbors about reachable networks within their autonomous systems

 

Even if EGP was a big step forward, it has some major limitations: 

  • EGP supports only tree-structured topologies and hence does not support multi-path network environments. This limits its efficiency, as illustrated hereunder.
  • Central management: this reduces its scalability, which is a major drawback in today’s fast-growing Internet, where no central authority is in control.

The BGP protocol

First let’s explain what an Autonomous System (AS) really is.

The Internet is a network of networks; it involves hundreds of thousands of smaller networks known as AS’s. Each of these networks consists of a large pool of routers run and administered by a single organization. Autonomous systems typically belong to Internet Service Providers (ISPs) or other large organizations (technology agencies, universities, government agencies, scientific institutions, …). Each AS is represented by a unique number called an ASN (Autonomous System Number). The Internet Assigned Numbers Authority (IANA) assigns ASNs to Regional Internet Registries (RIRs), which in turn assign them to ASN owners.

In 2020, the number of ASNs nearly reached 100,000. Already in the late 80’s, the number of AS’s had grown to the point where the limitations of EGP became more and more pronounced.

In June 1989, the first version of a new routing protocol was formalized. Its name is BGP, which stands for Border Gateway Protocol. The current version of BGP is version 4, published in 2006 under RFC 4271.

Compared to its predecessor EGP, BGP supports fully meshed topologies, making multi-path routing possible. BGP is used to route traffic from AS to AS. In this case, we are talking about eBGP (external BGP). It makes intelligent routing decisions based on different parameters like reliability, speed and cost. The routing is said to be “policy-based”.

Within an AS, other routing protocols like OSPF, EIGRP and IS-IS can be freely used. More generally, we are talking about IGP (Interior Gateway Protocol) when it comes to routing within AS's. iBGP (interior BGP) is also one protocol that you may use for this.

iBGP vs eBGP

How BGP impacts the performance of your platform

As a conclusion, BGP drives network performance on public networks and hence has a direct impact on the digital experience of your users if they use the internet to connect to your application.

  • To understand how you can monitor network performance in public networks, I recommend that you take a look at this article.
  • To understand how BGP can impact the reachability of your app or platform, you should read this article.

 


Why network latency drives digital performance

Each time a packet traverses a network to reach a destination, it takes time! Network latency drives performance.

As this blog article explains, latency is the delay for a packet to travel on a network from one point to another. Different factors like processing, serialization and queuing drive this latency. With newer hardware and software capabilities, you can potentially reduce the impact these elements have on latency. But there is one thing you will never improve: the speed of light!

As Einstein outlined in his theory of special relativity, the speed of light is the maximum speed at which all energy, matter, and information can travel. With modern optical fiber, you can reach around 200.000.000 meters per second, the theoretical maximum speed of light (in a vacuum) being 299.792.458 meters per second. Not too bad!

Considering a communication between New York and Sydney, the one-way latency is about 80ms. This value assumes a direct link between both cities, which will of course usually not be the case. Packets will traverse multiple hops, each one introducing additional routing, processing, queuing and transmission delays. You’ll probably end up with a latency between 100 and 150ms. Still pretty fast, right?
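
As a rough sanity check of that figure (assuming a great-circle distance of about 16,000 km between New York and Sydney and a propagation speed of about 200,000 km/s in fiber): 16,000 km ÷ 200,000 km/s = 0.08 s, i.e. roughly 80 ms one way.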

Well, latency stays the performance bottleneck for most websites! Let’s see why.

The TCP/IP protocol stack

As of today, the TCP/IP protocol stack dominates the Internet. IP (Internet Protocol) is what provides the node-to-node routing and addressing, while TCP (Transmission Control Protocol) is what provides the abstraction of a reliable network running over an unreliable channel.

IP and TCP were published respectively under RFC 791 and RFC 793, back in September 1981. Quite old protocols…

Even if new UDP-based protocols are emerging, like HTTP/3 discussed in one of our future articles, TCP is still in use today for most popular applications: World Wide Web, email, file transfers, and many others. 

One could argue TCP cannot cope with performance requirements of today’s modern systems. Let’s explain why.

The three-way handshake

As stated before, TCP provides an effective abstraction of a reliable network running over an unreliable channel. The basic idea behind this is that TCP guarantees packet delivery. So it cares about retransmission of lost data, in-order delivery, congestion control and avoidance, data integrity, and more.

In order for all of this to work, TCP gives each packet a sequence number. For security reasons, the first packet does not simply use a sequence number of 1. Each side of a TCP-based conversation (a TCP session) sends a randomly generated ISN (Initial Sequence Number) to the other side, providing the starting sequence number.

This information exchange occurs in what is called the TCP “three-way handshake”:

  • Step 1 (SYN): The client wants to establish a connection with the server, so it sends a packet (called a segment at TCP layer) with SYN (Synchronize Sequence Number) signal bit set, which informs the server that it intends to start communicating. This first segment includes the ISN (Initial Sequence Number).
  • Step 2 (SYN/ACK): The server responds to the client’s request with SYN/ACK signal bits set. It provides the client with its own ISN and confirms the good reception of the first client’s segment (ACK).
  • Step 3 (ACK): The client finally acknowledges the good reception of the server’s SYN/ACK segment.

At this stage, the TCP session is established.
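
A simple way to observe this from the client side is to time the connect() call: it returns once the SYN / SYN-ACK exchange has completed (about one round trip), since the final ACK is sent immediately without waiting for anything. A minimal sketch (note that it also includes the DNS lookup for the hostname):

    import socket
    import time

    def tcp_connect_time_ms(host: str, port: int = 443) -> float:
        start = time.perf_counter()
        # create_connection resolves the hostname, then performs the handshake
        with socket.create_connection((host, port), timeout=5):
            pass
        return (time.perf_counter() - start) * 1000

    print(f"{tcp_connect_time_ms('example.com'):.1f} ms")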

The impact of TCP on total latency

Establishing a TCP session costs 1.5 round trips. So, taking the example of a communication between New York and Sydney (100 to 150ms one way, i.e. a 200 to 300ms round trip), this introduces a setup delay typically between 300 and 450ms!

This is without taking secured communications (HTTPS through TLS) into consideration, which introduces additional round trips to negotiate security parameters. This part will be covered in a future article.

How to reduce the impact of latency on performance?

So how to reduce the impact of latency on performance if you cannot improve the transmission speed?

In fact, you can leverage two  factors:

  1. The distance between the client and the server
  2. The number of packets to transmit through the network

There are different ways to reduce the distance between the client and the server. First, you can make use of Content Delivery Network (CDN) services to deliver resources closer to the users. Secondly, caching resources makes data available directly from the user’s device: no data at all to transfer through the network in this case.

In addition to reducing the distance between the client and the server, you can also reduce the number of packets to transmit on a network. One of the best examples is the use of compression techniques.

Nevertheless, the optimization you can achieve has some limits, because of how transmission protocols work… The TCP handshake process does require 1.5 round trips. The only solution to avoid this would be to replace TCP with another protocol, which is the trend we’ll certainly see in the future.


Network performance and user experience: network latency vs throughput vs packet loss

What is the network performance metric which has the greatest influence on user experience and the delivery of your applications? 

This is a common question for anyone delivering critical applications over public and long distance networks. Which metric should I focus on to maximize the speed of the data transfers that support modern applications? 

Network performance: throughput is what matters most for user experience

If you put yourself in the shoes of a user connecting to an application, what matters is how fast you see your application appear and how fast you can interact with it. What matters from a user experience standpoint is: 

  • how fast you establish your connection to the digital assets that provide the different resources
  • how fast you load them.

From a user point of view, network performance is about throughput!  

How to measure throughput? 

A first clarification: many people mix up throughput and bandwidth. Although they are related concepts, they measure two different things. 

  • First, throughput is the speed at which two devices actually transfer data from one to another. 
  • Second, bandwidth corresponds to the maximum amount of data that can be transferred on a link. 

We use the same unit for both metrics: bits or bytes per second. 

While throughput can be measured easily, it is quite hard to measure whether this throughput represents the maximum speed a user can get. Quite often network operations will look at the drivers of throughput to identify potential bottlenecks. 

What drives throughput?

In case of network degradation or outage, throughput drops. Monitoring network performance is a must to identify when the network is slow and what is the root cause.

Whichever tools you are using (packet analyzer like Wireshark, SNMP polling like PRTG or Cacti, Traffic loading, active testing) you need indicators that will help you understand whether your users can make the most of the network infrastructure to transfer data. 

This article explains 3 key metrics of network performance (latency, throughput and packet loss) and how they influence the speed of transfer depending on the protocol used (UDP or TCP).  

  • Latency is the time required to transmit a packet across a network:
    • There are different ways to measure latency: round trip, one way, etc.
    • Any element on the path used to transmit data can impact latency: end user device, network links, routers, proxies, local area network (LAN), server,…
    • The ultimate limit of latency on large networks is… the speed of light.
      If you wish to learn more about latency and the different ways to measure it, I recommend you take a look at this article
  • Throughput is the quantity of data being sent/received by unit of time
  • Packet loss is the number of packets lost per 100 packets sent by a host

Now that we understand each of them, let’s look at how they interact with each other. 

Understand the impact of latency and packet loss on throughput

This can help you understand the mechanisms of network slowdowns.

The protocol used for the communication will impact how things work, so we have to analyze things in a different way for UDP and for TCP. 

Measuring Network Performance on UDP

Latency has no impact on throughput on UDP

When using UDP, we assume that all packets sent are received by the other party (transmission control is executed at a different layer, often the application itself).

In theory or for some specific protocols (if no control is undertaken at a different layer; e.g., one-way transmissions), the rate at which packets can be sent by the sender is not impacted by the time required to deliver the packets to the other party (= latency). The sender will send a given number of packets per second, which depends on other factors (application, operating system, resources, …).

As a conclusion, latency has no impact on throughput on UDP.

Measuring network performance on TCP

Latency has a direct impact on throughput on TCP

TCP is a more sophisticated protocol: it involves a transmission control which checks the proper delivery of all packets. This mechanism is called acknowledgment: the receiver responds back with a specific packet or flag to the sender to confirm the proper reception of each packet.

TCP Congestion Window

As an optimization, not all packets will be acknowledged one by one; the sender does not wait for each acknowledgment before sending new packets. Indeed, the number of packets that may be sent before receiving the corresponding acknowledgement packet is managed by a value called TCP congestion window.

How the TCP congestion window impacts throughput

If we assume that no packet gets lost, the sender will send a first set of packets (corresponding to the TCP congestion window) and, when it receives the corresponding acknowledgment packets, it will increase the TCP congestion window; progressively, the number of packets that can be sent in a given period of time will increase (and so will the throughput). 

The delay before acknowledgement packets are received (= latency) will have an impact on how fast the TCP congestion window increases (hence the throughput).

When latency is high, it means that the sender spends more time idle (not sending any new packets), which reduces how fast throughput grows.

The test values below speak for themselves:

Measuring network performance – TCP throughput vs latency

Round trip latency    TCP throughput
0 ms                  93.5 Mbps
30 ms                 16.2 Mbps
60 ms                 8.07 Mbps
90 ms                 5.32 Mbps
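
These figures are consistent with the simple upper bound throughput ≤ window size / round trip time. A back-of-the-envelope check, assuming a fixed 64 KB window (an assumption, not a value taken from the test above):

    # With a fixed window, TCP throughput is bounded by window_size / RTT.
    WINDOW_BYTES = 64 * 1024          # assumed window size

    for rtt_ms in (30, 60, 90):
        max_mbps = WINDOW_BYTES * 8 / (rtt_ms / 1000) / 1e6
        print(f"RTT {rtt_ms} ms -> at most {max_mbps:.1f} Mbps")

    # RTT 30 ms -> at most 17.5 Mbps   (measured above: 16.2 Mbps)
    # RTT 60 ms -> at most 8.7 Mbps    (measured above: 8.07 Mbps)
    # RTT 90 ms -> at most 5.8 Mbps    (measured above: 5.32 Mbps)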


Retransmission and packet loss impact throughput on TCP. 

How TCP congestion control handles missing acknowledgment packets

The TCP congestion window mechanism handles missing acknowledgment packets this way: if an acknowledgment packet is still missing after a period of time, the packet is considered lost and the TCP congestion window is reduced by half (and so is the throughput), which corresponds to the sender’s perception of limited capacity on the route. The TCP congestion window can then start growing again once acknowledgment packets are received properly.

Packet loss will have two effects on the speed of transmission of data:

  1. Packets have to be retransmitted (even if only the acknowledgment packet got lost and the data packets were actually delivered)
  2. The TCP congestion window size will not permit an optimal throughput

Measuring network performance – The impact of packet loss and latency on TCP throughput

As an illustration, with 2% packet loss, TCP throughput is between 6 and 25 times lower than with no packet loss.

Round trip latency    TCP throughput (no packet loss)    TCP throughput (2% packet loss)
0 ms                  93.5 Mbps                          3.72 Mbps
30 ms                 16.2 Mbps                          1.63 Mbps
60 ms                 8.7 Mbps                           1.33 Mbps
90 ms                 5.32 Mbps                          0.85 Mbps

This will apply irrespective of the reason for losing acknowledgement packets (i.e., genuine congestion, server issue, packet shaping, etc.). 
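
A commonly used approximation of this effect is the Mathis formula, which bounds TCP throughput by (MSS / RTT) × 1 / √(loss rate). The sketch below uses an assumed MSS of 1460 bytes; it gives the right order of magnitude rather than the exact figures measured above.

    from math import sqrt

    MSS_BYTES = 1460   # assumed maximum segment size

    def mathis_mbps(rtt_ms: float, loss: float) -> float:
        # TCP throughput ~ (MSS / RTT) * 1 / sqrt(loss)
        return (MSS_BYTES * 8 / (rtt_ms / 1000)) / sqrt(loss) / 1e6

    for rtt in (30, 60, 90):
        print(f"RTT {rtt} ms, 2% loss -> ~{mathis_mbps(rtt, 0.02):.2f} Mbps")
    # -> ~2.75, ~1.38, ~0.92 Mbps: same order of magnitude as the measurements above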

As a conclusion, if you monitor network performance to maximize user experience your primary focus should be on packet loss rates.


What is network latency? How it works and how to reduce it

Anyone delivering applications over networks sees network response time as a critical factor for digital experience.

Let’s explain the basics of network latency:

  • What is network latency?
  • How to measure network latency? 
  • How does it work? What drives a slow or a fast response from the network? 
  • How to reduce it?

What is network latency? A definition

First, latency is the delay for a packet to travel on a network from one point to another. In most cases, we will refer to network latency as the time needed for a packet to move from a client / user to a server across the network.

Multiple flavors of network latency

Secondly, there are multiple ways to measure network latency:

  • one way delay or one way latency: this is the time needed for a packet to go from one end (the user device) to another end (the server)
  • two way delay or round trip time: this is the time needed for a packet to go from one end to the other end, and back

Depending on how you try to measure latency, you will get either a round trip delay or a one way delay metric. As the conditions on the network vary and may be different in both directions, both types of measurements have pros and cons.

How to measure network latency? 

Let’s look at the different ways to measure latency.

  • Traffic capture for TCP based traffic

Capturing packets will give a feel for the round trip delay provided you capture TCP-based traffic. The TCP protocol includes an acknowledgment mechanism, which can be a good basis to evaluate the two-way latency between a client and a server.

This approach has 2 significant drawbacks:

  1. First, acknowledgments can be delayed and handled in different ways depending on the system (read this article for more info).
  2. Second, the status of the systems (clients and servers) can impact the level of priority put on dealing with the acknowledgment mechanism; as an example, an overloaded server may delay ACKs and generate high RTT values, which do not align with real latency.

The metrics you can expect:

  • RTT (Round Trip Time)
  • CT (Connection Time), which measures the time required to execute the TCP session establishment (SYN – SYN/ACK – ACK). This has an upside: systems handle these steps with a higher priority, which means the measurement rarely suffers a delay and does not generate wrong latency values. On the other hand, it also has a downside: it represents one and a half round trips, which is not easy to interpret.
  • TTFB (Time To First Byte) was in the past sometimes considered an evaluation of latency, but this is not the case anymore. Earlier, some of us used to consider it as the time interval between the SYN packet and the first packet of response from the server. Nowadays, TTFB corresponds to the time between a web request and the first packet returned in response by the server. Even in its older definition, it reflects network latency poorly as it incorporates the server response time. 

A side note on network latency measurement based on traffic capture: on UDP one can easily monitor jitter (put simply, the variation of latency) but cannot measure latency this way.

Put in a few words, you can measure latency with a traffic-based approach by capturing traffic. The pro is that you do not need to control either of the two ends. The main drawback is that the metric is rather imprecise.

  • Active testing of the network

You can also test your network latency by emitting packets and checking the time needed for them to make the one way or two way trip.

ICMP based testing

  • The client computer sends an ICMP Echo request packet (commonly known as “ping”) to the target computer.
  • The target machine receives the request packet and builds an ICMP Echo Reply packet.

The client computer will use the timestamps corresponding to the emission of the Echo Request and the reception of the Echo reply to calculate the round trip time.

The cons of ICMP are:

  • ICMP is handled on networks in a different way than protocols which convey application data (mostly TCP and UDP). As an example it can get a higher or lower priority on operator networks.
  • The handling of ICMP requests by network devices is a second class service and this may impact the quality of the latency measurement.
  • ICMP handling may be disabled on the target device which may simply not respond to ICMP requests. 

UDP / TCP based testing

You can send TCP or UDP packets and validate the reception of the packets.

The pro is that it is possible to measure one way latency and get more precise results.

The constraint is that you either need to have control of both ends or to use a testing protocol which has to be supported by both ends, for example TWAMP.

How it works: what drives network latency?

The 4 main factors that drive latency are:

  1. Propagation

This is the time required for a packet to go from one interface of a network device to another network device’s interface over a cable. The rules of physics drive this factor, so you can expect it to be very stable (approximately the speed of light x 2/3). Put in a few words, the shorter the distance you need to cover in your network, the lower the latency you will get.

2 & 3. Processing and Serialization

The serialization time on a piece of network equipment is the time needed for a packet to be serialized for transmission on a cable. It depends on the packet size and the link speed, but remains constant and relatively negligible.
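
For example, serializing a 1,500-byte packet (12,000 bits) takes 12,000 ÷ 1,000,000,000 = 12 microseconds on a 1 Gbps interface, but 1.2 milliseconds on a 10 Mbps link.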

The processing time is a lot more variable on each network router or equipment on the network path. It depends on:

  • the services provided by each router (bridging, routing, filtering, encrypting, compressing, tunneling)
  • the capacity of the device vs its current load.

When considering the overall path, the number of hops / routers the traffic goes through will impact the overall time necessary for all the serialization and processing included in the total network latency.

  4. Queueing

This is the time spent on average by every packet in the router queue.

It mostly depends on the size of the queue.

The size of the queue depends on the overall traffic size (vs its capacity) and its burstiness.

Longer queues will translate in additional latency and jitter.

How does this translate into latency for an overall network path?

Consider these drivers in the context of an end-to-end network path. The drivers for the end-to-end network path will be, by decreasing order of importance:

Drivers and main factors influencing them

  1. Propagation

Geographical distance on the overall network path drives that time. 

  2. Processing

The overall processing time depends on:

  • Number of hops on the path
  • Type of function activated
  • Load vs. Capacity for each router
  3. Queuing

The overall queueing time depends on:

  • Queue size (hence load vs capacity of the router or link)
  • Number of hops on the path with queueing
  4. Serialization

The overall serialization time depends on:

  • Number of hops on the path
  • Packet size

Next steps