The Ping Mesh

We have all wondered how long it takes to get from here to there. Our younger selves may have asked, "Are we there yet?" Any uncertainty in the answer leads to a sense of angst or concern, which is why the question is repeated so often.

Similar questions and concerns apply to IP networks, such as when we are waiting for a web page to load. We are connected to the internet at some point of presence, and the sites and applications we access are located elsewhere. Network latency has a huge impact on network performance [1], so it's natural to wonder:

  • How long does it take to get from here to there?
  • How does network performance from A to B change over time?
  • What does wide area network performance look like across many sites?

We can answer the first question with an application like perftest, which reports detailed timing for HTTP(S) requests from your laptop to any web server. But it would be nice to see results across a mesh, like this for example:

PM1-1

This point-to-point latency information helps us understand a critical factor that affects application response time. For example, your web tier may use a database or other remote services to complete a request, so network latency between service points is key to application performance. The pingmesh app provides insight into network latency, or round-trip time (RTT). By measuring to and from multiple points of presence, we learn how the network is connected, where and when links break, and so forth. Few applications have the luxury of choosing among many locations, since running in many places brings operational and management challenges. But over time we are seeing more of these widely distributed, replicated applications, where an understanding of network performance is important.

Network latency data can help inform the behavior of WAN acceleration and data replication applications, for example, the choice of locations to route through or to replicate to or from. If the network performs poorly or erratically, your application will not work as well as it does when performance is reasonable and stable.

Network RTT data is also important for deciding where to place the replicas of a replicated caching or computing application: you do not need to put multiple replicas near each other for latency reasons, though you may need to for load distribution and balancing.

(I use the term "ping" for this application, but it makes HTTP(S) requests, not the ICMP Echo requests that the real ping utility makes. The app uses the TCP connection setup time to approximate the network RTT. The TCP handshake requires one round trip.)
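To make the idea concrete, here is a minimal Python sketch of timing a TCP connect to approximate RTT. It is not the pingmesh implementation; the function name `tcp_connect_rtt` and its parameters are my own, chosen for illustration. The address is resolved before the timer starts so that DNS lookup time is not counted.

```python
import socket
import time


def tcp_connect_rtt(host, port, timeout=3.0):
    """Return the seconds taken to complete a TCP connect to (host, port).

    Hypothetical helper for illustration only. connect() returns once the
    SYN / SYN-ACK exchange has completed, so the elapsed time is roughly
    one network round trip plus local processing overhead.
    """
    # Resolve the name first so DNS time is excluded from the measurement.
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.connect(addr)
        return time.perf_counter() - start
```

Calling `tcp_connect_rtt("example.com", 443)` would return the connect time in seconds for that host. A single sample is noisy, so in practice you would take many samples and look at a low percentile.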

Application Design

The pingmesh application makes requests to a set of peer pingmesh instances, measures network latency, and optionally reports that information to CloudWatch to visualize performance over time. Each pingmesh instance runs a web server that can deliver a simple location response (saying "I am here"), or return the list of currently active peers, and a web client that talks to its peers. It can measure RTT to other web servers as well.

Pingmesh can report peer response time information via the web interface, through log information written to stdout, and by publishing a metric to AWS CloudWatch. From stdout you can watch the performance of the active peers in a table with one line per request. On the web interface pingmesh returns self-describing, structured information in JSON both to humans and to peer application instances (or other applications).

You can learn more about the implementation in the README file on GitHub.

A Surprising Observation

Whenever I look at performance metrics I discover interesting information. Here's a CloudWatch graph of pingmesh across ten of our Rafay sandbox cluster locations showing the 10th percentile (p10) of network RTT between each pair of locations. The p10 metric closely approximates the latency, defined here as the minimum RTT between two points. The median (p50) and especially the average are skewed by a few long responses, which I don't care about right now; I want to study network distance.
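A small sketch shows why a low percentile tracks the minimum RTT while the average does not. The RTT samples below are made up for illustration (they are not pingmesh measurements), and the `percentile` helper uses the simple nearest-rank method, which may differ slightly from how CloudWatch computes percentiles.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical RTT samples in milliseconds: mostly near 42 ms,
# with two slow outliers (180 ms and 250 ms).
rtts_ms = [
    42.1, 41.8, 42.0, 41.9, 43.5, 42.2, 41.7, 250.0, 42.3, 41.9,
    42.0, 44.1, 41.8, 42.4, 180.0, 42.1, 41.9, 42.6, 42.0, 41.8,
]

summary = {
    "min": min(rtts_ms),
    "p10": percentile(rtts_ms, 10),
    "p50": percentile(rtts_ms, 50),
    "mean": statistics.mean(rtts_ms),
}

for name, value in summary.items():
    print(f"{name:>4}: {value:.2f} ms")
```

With this data the p10 value sits within a fraction of a millisecond of the minimum, while the two outliers pull the mean well above every typical sample.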

In the first graph above, we see performance is pretty stable over time, and symmetric, as expected. But there are some interesting dips and spikes, some of which are quite long-lasting. These are usually due to changes in routing between data centers. Here's one that is really dramatic: I've narrowed the above view to just three of the lines, showing a change that made a huge difference in one of them, a small benefit in a second, and a small increase in a third.

PM2-1

I hope you've found this interesting. The application is freely available in the Rafay open-source GitHub repo. If you need a set of widely distributed clusters, Rafay's sandbox currently has a modest number. To access it, just click Sign Up on https://app.rafay.dev for a free trial account. Once signed up, you can also bring your own clusters to the Rafay platform, which provides abstraction, distribution, scaling, and lifecycle management of containerized applications like pingmesh.

You can learn more about Rafay on our blog, in our LinkedIn posts, or on Twitter @rafaysystemsinc. As always, we would love to hear from you. Feel free to reach out to info@rafay.co, or contact me on LinkedIn.

References

  1. Latency and Response Time on LinkedIn
  2. Rafay Systems on LinkedIn
  3. rafayopen/perftest on GitHub
  4. rafayopen/pingmesh on GitHub

This blog was originally posted on LinkedIn.

Posted by John Dilley