User-to-cloud network path monitoring is an essential part of network performance management in the cloud age. It drives a large part of your users’ digital experience. Many people think “because it is in the cloud, it is all taken care of”, but that is not always the case, and there is a lot you can do to optimize it, provided you have the right visibility.

Cloud: It is all taken care of, isn’t it? 

One of the main reasons for moving workloads to the cloud is to outsource your datacenter infrastructure in exchange for a better one. The cloud is more secure, more stable, more scalable, more automated and more flexible (see this article for more). Basically, it is all taken care of, and done in a much better way than you could ever do it yourself. It also takes away a lot of things you had to manage until then (datacenter operations, systems, backup, virtualization, capacity planning, connectivity…).
Among other things, you rely on your cloud service provider(s) to provide the right connectivity on public networks for your users to access your applications through the Internet. In principle, they monitor network traffic and network performance as part of that service. 

Most people conclude that they do not need to monitor connectivity from their users to the cloud. They assume their cloud provider does it in accordance with best practices. 

Network performance on the user to cloud path: what can go wrong? 

Your cloud provider obviously monitors its internet reachability. But this monitoring does not capture everything that can still happen to the connectivity between your users and your application. 

Network incidents on public and cloud networks

Network outages can impact the availability or performance of your cloud providers. These incidents can occur at different levels: ISPs, peering / transit points, tier one networks or the cloud gateways themselves. They can also affect essential services like DNS. 

They may also be the consequence of attacks like BGP hijacks and DDoS targeting the infrastructure of your CSP. 

Depending on the cause and the part of the infrastructure that is impacted, the scope of the degradation may vary from certain users / geographies to all of them, and from some services to all. 

An example of this would be the incident impacting IBM Cloud in June 2020; the network outage was caused by a failure from one of IBM Cloud’s 3rd party network providers (details here). 

Unpredictable changes at multiple layers

The path from your users to the cloud is influenced by a series of network infrastructures whose behaviors can be unpredictable: 

  • The users’ own behavior, their local network connectivity and the security gateways used
  • How the users’ operators route the traffic to your platform based on their peering / transit arrangements
  • The BGP policy influencing the AS path in both directions (if you are interested in this specific topic, we recommend that you take a look at this article)
  • Any congestion and degradation taking place on the path (driven either by traffic or by device malfunctions)
  • The destination to which DNS servers will point your users (your hostname may be resolved to different IP addresses based on your DNS setup)
  • The behavior inside your cloud infrastructure at the network level, and also how your provider manages the load on the server side
All the performance uncertainty factors on the path from the users to the cloud platform.
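The DNS point in the list above is easy to check for yourself. The following is a minimal sketch, assuming Python's standard `socket` module is enough for your needs; the helper name `resolve_all` is hypothetical. It only reports what the local resolver returns, so running it from different networks or regions may give different answers.

```python
import socket

def resolve_all(hostname, port=443):
    """Return the sorted list of distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})

# Example call (the result depends on your resolver and location):
# resolve_all("www.example.com")
```

Comparing the output from several vantage points shows whether users in different places are being sent to different destinations.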

Uneven cloud coverage

Depending on where each of your users is located, your CDN (Content Delivery Network) and cloud platform may or may not be well located. 

If you are using one of the leading cloud service providers (say GCP, Azure, AWS or Alibaba), you have the choice of spreading your compute capacity across multiple regions to put your front end (and possibly some of the back end compute) closer to your users. This is a good starting point, but not all CSPs are created equal, and that applies to geographical coverage too. The same thinking applies to CDN providers. 

Some areas are particularly poorly covered by global providers (Africa and LATAM being the most obvious examples): 

AWS in 2020

AWS geographical coverage 2020

Azure in 2020

Azure geographical coverage 2020

GCP in 2020

GCP geographical coverage 2020

Prices for the same services can vary between regions. This makes a massive difference to the actual coverage, because some zones become cost-prohibitive. 

Let’s assume that you are running a multi cloud infrastructure. You may reduce the gap by leveraging each provider’s specifics to your advantage. 

Nevertheless, in some regions your CSP will remain far away from your users from a latency standpoint. 

What should you do to optimize your user to cloud network path? 

What are the key steps to optimize the path from your users to your cloud platform? 

  1. Understand where your users are and, beyond the rough numbers, which regions / countries generate the largest part of your revenue.
  2. List all the hosts and services used to deliver your digital service (including your CDN and 3rd party services), and make sure you have a clear view of how they are hosted. 
  3. Identify the user to cloud connectivity gaps which are structural for your strategic regions.
  4. Be aware of the events / incidents which impact user performance day to day, and distinguish quickly between those on which you and your providers can take action and those where no action can be taken. 

Getting the right observability: network path monitoring 

The very first thing you need is observability on the user to cloud path and the network performance attached to it. 

You need to:

  1. Know where your users are located and what experience they get:
Geography of digital experience
Geographical distribution of users by experience
Digital Experience Scope Analysis
Understand what drives digital experience in each geography

  2. Understand which parts of your app are close to or far from (meaning with low or high latency) the key ISPs (Internet Service Providers) offering connectivity to your users. 

User Latency by Host and ISP
Understand the latency between users by ISP and your platform

  3. Monitor this 365 days a year and locate where degradations and changes on the route to your cloud platform are coming from: 

Example-network-path-degradation
Example of a network path degradation to a cloud platform
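For the second point above (latency per host and ISP), here is a minimal sketch of the aggregation. It assumes you already collect latency samples as `(host, isp, latency_ms)` tuples. The function names, the tuple layout and the 100 ms threshold are illustrative assumptions, not part of any specific product.

```python
from collections import defaultdict
from statistics import median

def latency_by_host_and_isp(samples):
    """Group (host, isp, latency_ms) samples and return the median latency per (host, isp)."""
    groups = defaultdict(list)
    for host, isp, latency_ms in samples:
        groups[(host, isp)].append(latency_ms)
    return {key: median(values) for key, values in groups.items()}

def far_pairs(medians, threshold_ms=100.0):
    """Return the (host, isp) pairs whose median latency exceeds threshold_ms, worst first."""
    slow = [(key, value) for key, value in medians.items() if value > threshold_ms]
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

The median is used rather than the mean so that a few extreme samples do not hide the typical experience on a given ISP.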

Here is an example of a strong degradation on a route. Packet loss, latency and the number of hops from users to the application platform spiked for a period of time. It is useful to know whether such changes are one-off events or persist over time. 
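One simple way to separate one-off events from lasting ones is to compare each sample with a baseline and look at how long the excess persists. The sketch below assumes a plain list of latency samples taken at regular intervals, and assumes the first few samples are healthy. The function names, the factor of 2 and the three-sample persistence rule are hypothetical choices to adapt to your own data.

```python
from statistics import median

def find_degradations(latencies_ms, baseline_window=5, factor=2.0):
    """Return (start, end) index ranges where latency exceeds factor x the baseline median.

    The baseline is the median of the first baseline_window samples,
    which are assumed to be healthy.
    """
    baseline = median(latencies_ms[:baseline_window])
    events, start = [], None
    for i, value in enumerate(latencies_ms):
        degraded = value > factor * baseline
        if degraded and start is None:
            start = i  # a degradation begins here
        elif not degraded and start is not None:
            events.append((start, i - 1))  # the degradation ended on the previous sample
            start = None
    if start is not None:
        events.append((start, len(latencies_ms) - 1))  # still degraded at the end of the series
    return events

def classify(event, persistent_after=3):
    """Label an event 'persistent' if it spans at least persistent_after samples, else 'one-off'."""
    start, end = event
    return "persistent" if end - start + 1 >= persistent_after else "one-off"
```

The same approach can be applied to packet loss or hop count series by swapping the input list.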

Understand the root cause for network performance to the cloud

Once you have identified the event, you have to determine whether it falls within the scope of what you control directly or indirectly: 

  • Did the cloud destination (host) change location?  
  • Did the route change? 
  • Is there congestion on the way? Where is it located?
Network Path Visualization
Visualize multiple DNS resolutions leading to different paths with different network performance levels
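The route and congestion questions above can be approached by comparing two traceroute-style measurements taken before and during the event. This is a rough sketch under stated assumptions: each path is a list of hop addresses, each RTT list holds cumulative round-trip times per hop along the same route, and the function names are hypothetical. Real per-hop RTTs are noisy (routers often rate-limit or deprioritize probe replies), so treat the result as a hint rather than proof.

```python
def first_divergence(path_before, path_after):
    """Return the index of the first hop that differs between two hop lists, or None if identical."""
    for i, (a, b) in enumerate(zip(path_before, path_after)):
        if a != b:
            return i
    if len(path_before) != len(path_after):
        # One path is a strict prefix of the other: they diverge where the shorter one ends.
        return min(len(path_before), len(path_after))
    return None

def congestion_hop(rtts_before, rtts_after):
    """Return (hop_index, added_ms) for the hop where most extra delay was introduced.

    Returns (None, 0.0) if no hop added any delay.
    """
    previous_delta = 0.0
    worst = (None, 0.0)
    for i, (before, after) in enumerate(zip(rtts_before, rtts_after)):
        delta = after - before          # total extra delay observed up to this hop
        added = delta - previous_delta  # extra delay introduced at this hop specifically
        if added > worst[1]:
            worst = (i, added)
        previous_delta = delta
    return worst
```

If `first_divergence` returns an index, the route changed at that hop. If it returns `None` but `congestion_hop` reports a large increase, the route is the same and the delay was most likely added at or just before the reported hop.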

Comparing cloud performance in multi cloud deployments

If the structure of the path changes or the network performance on the path is not satisfactory, what are your options? 

Cloud providers

If you run a multi cloud architecture, you can consider:

  • switching to another gateway for the region in question, or
  • using another way to access your cloud (e.g. AWS Global Accelerator) or redirecting to another cloud provider. 

CDN providers

In the same way, CDN providers offer very different regional coverage. For static content, switching to another CDN provider for the region in question can be an excellent option. 

3rd party service providers

Your 3rd party providers also have an underlying infrastructure whose performance will vary on a per region basis. 

To make decisions on your architecture, you need hard data on

  • the performance from the location of your users to the different elements of your platform
  • the route taken and
  • the resulting performance (network latency, packet loss, number of hops and stability). 
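Once you have that data, comparing options per region can be as simple as the sketch below. It assumes each measurement is a dictionary with hypothetical keys `region`, `option`, `latency_ms` and `loss_pct`, and it only ranks on latency after filtering on packet loss. Hop count and stability are left out to keep the example short.

```python
def best_option_per_region(measurements, max_loss_pct=1.0):
    """Pick, for each region, the option with the lowest latency among those with acceptable loss."""
    best = {}
    for m in measurements:
        if m["loss_pct"] > max_loss_pct:
            continue  # too lossy to be considered, whatever the latency
        current = best.get(m["region"])
        if current is None or m["latency_ms"] < current["latency_ms"]:
            best[m["region"]] = m
    return {region: m["option"] for region, m in best.items()}
```

An "option" here can be a cloud provider, a gateway, a CDN or a 3rd party service, as long as all the options compared for a region were measured from the same user locations.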

To find out more on how to manage performance in public networks, take a look at this article.
We also recommend that you take a look at how to deploy cloud services at global scale with the best possible user experience: article.