Welcome to the second post in my series of Builder’s Library Notes. Last time, I focused on pieces related to continuous delivery. This time, I’m focusing on papers and talks related to Amazon’s monitoring process. These include papers that deep dive into dashboards, health checks, and instrumentation, as well as a couple about the challenges of fairness and designing distributed systems.

Amazon’s approach to failing successfully by Becky Weiss

This re:Invent talk is about accepting that failures can happen and making the most of them to improve your processes. Specifically, it’s about AWS’s approach to post-mortems. The best case scenario for a failure is to see it before your customers do, and tooling should help gain visibility and insight into the issue. As Werner Vogels says, “there’s no compression algorithm for experience.” A correction of error (COE) is a structured analysis of a customer-impacting event and goes well beyond simply asking how to stop it from happening again. The meetings are open to hundreds of employees to help spread learnings across the company. Corrections of error start with the customer’s impact and work backward. The issue is summarised with a narrative description, metrics and graphs show the primary impact, and supporting graphs help to dig deeper. It’s important to know what the customer impact was, who was affected and what the experience was.

There are a lot of angles from which the process can be improved. For determining the root cause, a popular tactic is asking “why?” of each answer to why the issue occurred, five times total. This can help get to a more detailed cause than “the service was down” and narrow down what needs to be fixed. For limiting an error, it’s important to contain the blast radius and keep reducing it further. Containment is a feature, and the more cellular the architecture, the better. Controlling the event duration will improve the response as well: look into how the event was detected and how it could be detected faster. CloudWatch alarms and automated detection are the best case, with human detection being the worst. Also look at how the mitigation was figured out and how that can happen faster; automated rollback can be the best case since no human work is needed.

Metrics and dashboards can help show errors before customers see them. Health metrics can show if failure is occurring and alarms should be set on them, but they don’t answer why the service is failing. Diagnostic metrics can be defined more liberally and provide directions to help figure out the cause of the error. When defining metrics, percentiles are more meaningful, as an average is easily distorted by outliers. Some of the concerns from Building dashboards for operational visibility are repeated here as well, such as replacing sparse metrics with zeros and how to lay out dashboards. AWS services can help detect errors: CloudWatch handles metrics, logs and alarms, and services such as Lambda and EC2 (via an agent) can send logs. CloudWatch can also help with anomaly detection, and logs can be queried with Insights to help dig deeper.
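To see why percentiles beat averages, here’s a small sketch (my own, not from the talk) showing how a couple of slow outliers drag up the mean while leaving the median untouched:

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 98 fast requests plus 2 slow outliers (latencies in milliseconds)
latencies = [20] * 98 + [5000] * 2

mean = statistics.mean(latencies)  # 119.6 ms: dragged up by two outliers
p50 = percentile(latencies, 50)    # 20 ms: the typical request is unaffected
p99 = percentile(latencies, 99)    # 5000 ms: the tail customers actually feel
```

An average of ~120 ms suggests a modestly slow service; the p50/p99 split makes it clear that most requests are fine and a small tail is badly broken.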

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/amazon-approach-to-failing-successfully/

Beyond five 9s: Lessons from our highest available data planes by Colm MacCarthaigh

This talk by Colm covers running high-availability architectures and keeping things running. He discusses both leadership standards and technological patterns for achieving these goals. It describes the divide between control plane servers, which set configurations and tell the data plane servers what work to do, and the data plane servers, which do most of the important work. Services such as Route 53 have 100 percent availability SLAs, and IAM is invoked with every single API call, showing the need to keep services available.

This is an example of “insisting on the highest standards,” one of Amazon’s leadership principles, which in this case means “raising the bar” with thousands of unit tests, hundreds of integration tests, and simulations matching prod with realistic workloads. Every lesson learned becomes a regression test, and behavior when rolling forward and back is also tested. These tests are automated with CI/CD pipelines, and there are additional papers detailing this process. Another leadership idea is that teams are “technically fearless” and that healthy services are built by healthy teams. The goal is to have teams that don’t blame, where people can voice their faults and concerns safely and be encouraged for it. Fear can tell you what to test and scrutinize, and nothing good can come from trying to cover up a failure.

Some technical concepts for achieving these goals are modeling a system and proving it works for every case. Focusing on limiting blast radius can also help contain an issue if something unforeseen arises. Services should be designed so that a failure stays inside a blast radius, ideally contained to a single server; services are split into availability zones and regions for this reason. Shuffle sharding is another example of improving this. Similarly, the idea of modular separation splits a system into components that call each other and can limit a failure to the individual service. Limiting features and doubling down on simplicity help as well, and added complexity should be justified by what the service needs. Ultimately, the most critical components should be the least complex.

Services should be able to handle the failure of other services as well and keep working in other zones, regions or stripes if one fails. Static stability (which has its own article) is a pattern where a failed service can go back to working fully with a reboot, achieved by limiting dependencies to only what’s necessary. Services are still expected to encrypt data, keeping keys in memory or using KMS as allowed low-level dependencies. Additionally, when failures do occur, some level of service should still be provided, such as still running a system with limited logging.

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/beyond-five-9s-lessons-from-our-highest-available-data-planes/

Building dashboards for operational visibility by John O’Shea

This article is all about building dashboards and goes into a lot of detail on the different types of them. Amazon has thousands of servers and containers running across availability zones and regions and automates monitoring and remediation, such as traffic shifting or deployment systems, in order to detect and resolve issues at that scale. It’s important to be able to see what a process is doing at any moment. Dashboards help to keep on top of activity by providing human-facing views into systems, summarising their behavior by displaying metrics, logs, traces and alarm data. Creation, usage and maintenance of dashboards is a first-class activity known as dashboarding and is just as important as designing or building services.

Generally, no one is watching the monitoring system, and processes requiring manual review are prone to failure due to human error. Automated alarms evaluate important data to indicate that the system is approaching limits (for proactive detection) or requires reactive measures if the system is impaired. Alarms can trigger remediation workflows and notify operators, pointing them to dashboards and runbooks with information on what to do. It’s important to know what remediation is doing, and potentially to require a manual approval. During an issue, multiple on-call operators step through tasks to quantify impact, trace through services to determine the root cause, and observe or run the necessary mitigation steps.

Peer teams and stakeholders will also monitor the impact using dashboards, and weekly review meetings are held to go over audit dashboards containing data from all availability zones and regions. For long-term planning, higher-level metrics are typically used. Different types of dashboards are used based on who will use them and why. Most people create overloaded dashboards; instead, dashboards should be created working backwards from the customer.

Types of high-level dashboards include customer experience, system-level, service instance, service audit and capacity planning or forecasting dashboards. Customer experience dashboards are the most important and are widely used at Amazon, designed for a broad group including service operators and stakeholders. These contain data on overall service health and progress toward goals, drawing on data from services, testers and client instrumentation such as SDKs. The data can be used to determine the depth and breadth of an impact and to know who was impacted and how much.

System-level dashboards have data for operators to see how a system and its customer-facing endpoints (UI or API) are behaving. Three categories of monitoring data are collected per API: input data, processing data and output data. For input data, the number and size of requests and any authentication failures are examples of what could be collected; if a queue-based system is used, work pulled from queues is another example. Processing data covers business logic for different modes and counts of which code paths were taken. Backend microservice data such as request counts, failures, latency, logs and request traces are also included here. Finally, the output category contains counts of each response type, error breakdowns per customer, response size, and percentiles for the time to write the first response. Both customer experience and system-level dashboards are kept as high level as possible, avoiding too many metrics so they show a clear message.

Service instance dashboards evaluate the customer experience on a single service instance. Service audit dashboards contain info on all instances in all availability zones and regions for a service and are used to audit automated alarming as well; these are commonly reviewed in meetings. The last type of high-level dashboard mentioned is the capacity planning and forecasting dashboard, used for longer-term needs such as visualizing service growth over time.

There are also low-level dashboards, usually relating to a specific microservice, with metrics based on each unit of work. These can be implemented by the team that owns the microservice or created for a dependency of that team’s service. The low-level categories include microservice-specific dashboards, infrastructure dashboards and dependency dashboards. Microservice-specific dashboards require deep knowledge of a service and its implementation and are usually used by the service owner. The data is typically aggregated to be friendly to an operator and contains links to tools and information for diving deeper with queries and request traces. Infrastructure dashboards contain metrics emitted from the compute resources the services run on, such as EC2, EKS or Lambda. These metrics can include CPU utilization, network traffic, disk I/O, space utilization, autoscaling and quota details, plus more as needed. Lastly, dependency dashboards show how upstream services such as load balancers, and downstream services and data stores, are behaving. These dashboards can also monitor security certificate expiration dates and dependency quota usage.

When designing dashboards, there should be consistency across them. Dashboards should be accessible to the broadest audience, and success is measured by how quickly a new operator can understand one and use it to learn the service. Building dashboards backwards from customers helps avoid building a dashboard only the creator understands. Since dashboards render from top to bottom, it’s suggested to put the most important information at the top, which for web services is usually aggregated availability and end-to-end latency percentile graphs. More important metrics should have larger graphs, and legends shouldn’t interfere with graph data. To make dashboards easier to follow, design them for a minimum display size so they don’t require scrolling. The time zone should be displayed, the shortest sensible time interval and data-point period should be used, and thresholds should be shown as horizontal lines (for graphs with two y-axes, split them and add the lines separately). Avoid plotting too many unnecessary metrics and data points. Alarm statuses and simple numbers as widgets can help keep things simple. If sparse metrics are included, zeros should be emitted so that missing data can be assumed to be a problem. Additionally, for functions with different modes, a different graph should be used for each.
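The zero-filling advice can be sketched with a hypothetical helper (not from the article): a metric that only reports when something happens gets explicit zeros for the quiet periods, so a genuinely missing datapoint stands out as a problem rather than blending in.

```python
def zero_fill(datapoints, start, end, period):
    """Return one value per period in [start, end); missing periods become 0.

    datapoints maps period-aligned timestamps to the reported metric value.
    """
    return [datapoints.get(t, 0) for t in range(start, end, period)]

# Error counts were only reported for the minutes where errors occurred.
sparse = {60: 3, 300: 1}
series = zero_fill(sparse, start=0, end=360, period=60)  # [0, 3, 0, 0, 0, 1]
```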

These dashboards are created via pipelines for different environments and are evaluated based on their usefulness in postmortems. Developers are permitted to update dashboards as needed and should provide context on the meaning of metrics, along with what normal looks like, and include links to help determine the root cause. These links can include runbooks, dashboards, pipelines and contact info.

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/

Instrumenting distributed systems for operational visibility by David Yanacek

Reducing high-percentile latency often reduces the median latency, but the reverse isn’t true as often. Each component’s metrics should measure both its own behavior and how the services that call it perceive its performance; bringing these metrics together often makes it easier to find the source of a problem. To help debug, explicit instrumentation can be added to code rather than relying on defaults. The time taken by tasks, code paths taken, success or failure, and metadata about what was worked on can all be output by the code, along with info on the caller, storage failures and the size of objects. By logging anything that could be of use, you ensure that a log is there with the information needed to solve a problem. Libraries can help implement logging as well, but it’s important to know what needs to be logged. Trace IDs are another important tool for tracking requests across services and collecting them in one place (X-Ray can help with this task). High-level metrics are useful as they can narrow down which areas need focus and direct you to API logs containing info on the source of the problem.
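Trace ID propagation can be as simple as reusing the caller’s ID when present and forwarding it on every downstream call. A sketch using a hypothetical `X-Trace-Id` header (X-Ray uses its own header format):

```python
import uuid

def get_or_create_trace_id(headers):
    """Reuse the caller's trace ID if present, else start a new trace.

    'X-Trace-Id' is a hypothetical header name used for illustration.
    """
    return headers.get("X-Trace-Id") or str(uuid.uuid4())

def call_downstream(incoming_headers, outgoing_headers):
    # Forward the same ID so logs from every hop can be joined later.
    outgoing_headers["X-Trace-Id"] = get_or_create_trace_id(incoming_headers)
    return outgoing_headers
```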

Instrumentation requires coding, and Amazon uses standard libraries and log-based metric reporting standards for their implementations. They get metrics on HTTP calls automatically with a custom metrics library, and telemetry data is written to a log file per unit of work. Aggregation runs later to collect logs and calculate metrics. Two common types of log data are request data and debugging data. Request data has one log entry per unit of work, with information on who made the request, what it was made for, and a detailed trace with counters of actions taken and the time spent. It’s important to keep all data from a request in a single log entry so everything can be found in one place, though long-running tasks may need to emit data periodically across multiple entries to signal that they’re still working. Debugging data is unstructured lines written by the application. The CloudWatch agent can ship both in real time, and aggregate metrics can be produced from CloudWatch Logs, triggering alarms as needed. The logging lifecycle should be planned upfront for a non-blocking service, as it can be difficult to fix later. Logging can be expensive, but it’s important for knowing how your system is running.
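A minimal sketch of the one-log-per-unit-of-work idea (the class and field names are illustrative, not Amazon’s actual library): counters and timers accumulate on one object during the request and are emitted as a single structured line at the end.

```python
import json
import time

class RequestLog:
    """Accumulates counters for one request; emitted as a single JSON line."""

    def __init__(self, operation, request_id):
        self.entry = {"operation": operation, "requestId": request_id}
        self.start = time.monotonic()

    def count(self, name, value=1):
        # Increment a named counter, e.g. cache misses or DB reads.
        self.entry[name] = self.entry.get(name, 0) + value

    def emit(self):
        # Stamp the total time and serialize everything as one log line.
        self.entry["timeMs"] = round((time.monotonic() - self.start) * 1000, 1)
        return json.dumps(self.entry)

log = RequestLog("GetItem", "req-123")
log.count("cacheMiss")
log.count("dbReads", 2)
line = log.emit()  # one line holding all the info for this unit of work
```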

It’s important to log request details before validation (avoiding passwords) and to have the ability to increase log verbosity as needed. Metric names should be short but detailed, and volumes should be able to handle logging at maximum throughput, with logging and shipping on different partitions. Testing logging capability at full volume is critical, and nearly full disks should be detected before failure. Again, sparse data should use zero counts and clocks should be synchronized.

Amazon instruments the availability and latency of dependencies explicitly, avoiding having to compare different services’ logs. Their dependency metrics are broken out by call, resource and status code, with separate latency metrics kept per status code. Queue depths are recorded when queues are accessed, along with an enqueue-time metric. A counter is added per error reason, with errors categorized at least into client and server faults. Metadata is logged per unit of work (who and why), with different expected metadata per resource. Logs are protected via access control and encryption, and overly sensitive information is omitted. Trace IDs are also propagated to backend calls to help with tracing.

Best practices for application logs are to use an unstructured debug log but review its noise level, potentially rate-limiting log error spam. Keep the request ID in the log lines, and don’t waste work formatting lines that won’t be written (defer formatting to write time). Log the request IDs from failed service calls to help follow up with the service owner.
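In Python’s standard logging module, for instance, deferring formatting means passing the template and arguments separately, so the string is only built if the line will actually be written:

```python
import logging

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)  # DEBUG lines are below this level and skipped

payload = {"items": list(range(1000))}

# Eager: the f-string is built even though the line is never written.
# logger.debug(f"payload: {payload}")

# Lazy: formatting only happens if DEBUG is enabled for this logger.
logger.debug("payload: %s", payload)
```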

For high-throughput services, it’s often still best to log all requests, though this can be a gray area. If there’s too much, log sampling can be used, writing every Nth entry or using an algorithm such as reservoir sampling. This can cause issues when troubleshooting a specific request, and compliance may rule sampling out entirely. Separate threads should be used for serializing and flushing logs, and logs should be rotated frequently to spread out the workload. Compressed logs can be streamed, and writing them pre-compressed helps avoid a CPU spike when archiving. In-memory aggregates can help with too many requests, though you lose some visibility. It’s also important to monitor resource utilization and make sure logging agents can keep up with requests. Log analysis can be done locally with Linux command-line utilities or distributed with big data analytics and AWS services such as CloudWatch, X-Ray and Athena.
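Reservoir sampling keeps a uniform random sample of fixed size from a stream without knowing the stream’s length in advance. A minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

kept = reservoir_sample(range(100_000), k=100)  # fixed-size uniform sample
```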

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/

Implementing Health Checks by David Yanacek

This paper is about how to use health checks to deal with server failure and what happens when they aren’t used properly. Health checks are a way of asking a service whether it can perform its work and are often used by load balancers, monitoring systems and the services themselves. Failing servers may not simply shut down and can impact the rest of the system if not detected and responded to. An example of a small failure having a disproportionate impact: a failing web server responding with empty HTML can appear to have the fastest response time and be prioritized by a load balancer, leading to more failures. The “least requests” load-balancing algorithm is commonly used; the problem can be mitigated by slowing down failed requests, but that is more difficult if queue pollers are used instead, as they keep taking on jobs. Partial failures can involve authentication, writing to disk, encryption, bugs, deadlocks and more. Servers can also fail for correlated reasons such as dependency or networking issues. It’s important that health checks test every aspect of a server and application, even the non-critical parts, but also that automation doesn’t remove servers from service unless absolutely necessary. This balance between thorough health checks with quick mitigation versus false positives is an important design consideration. A common solution is to stop directing traffic to a single failing server but keep running if the whole fleet has issues.

Ways to measure a system’s health include liveness, local and dependency health checks, plus anomaly detection. Liveness health checks test basic connectivity and that a server process exists. These are used by load balancers and monitoring agents and typically come pre-configured with a service. Liveness checks don’t care about the application logic itself and only look for responses such as an HTTP 200 or an EC2 server status check. Local health checks should be customized by the user and verify that the application can function. They test resources not shared with peers, to avoid impacting many servers at once. Examples include disk access, critical processes returning a correct response beyond just a 200, and monitoring support daemons. Dependency health checks test that the app can interact with the services it needs. They can catch expired credentials and confirm communications are working, though they can also give false positives if a dependency itself has an issue, so they shouldn’t be reacted to hastily. They can test for stale metadata and configurations, network issues, and other bugs requiring a process bounce.
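The difference between the check types can be sketched as follows; the local check here tests disk writability, one of the unshared resources the article mentions (function names are illustrative, not from the article):

```python
import os
import tempfile

def liveness_check():
    """Basic liveness: the process is up and can answer at all."""
    return True

def local_health_check(data_dir):
    """Local check: test a resource not shared with peers, here disk writes."""
    try:
        fd, path = tempfile.mkstemp(dir=data_dir)
        os.write(fd, b"health-check")
        os.close(fd)
        os.remove(path)
        return True
    except OSError:
        return False
```

A dependency check would additionally make a real call to a downstream service, which is exactly why it can false-positive when the dependency, not the server, is unhealthy.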

Anomaly detection is a good catch-all and notices when a server is behaving differently from its peers. These checks are created by aggregating monitoring data and comparing servers against each other, which can find issues that checking a single server wouldn’t. Anomaly detection can catch issues such as clock skew, which can cause AWS requests to fail if a clock isn’t within minutes of the actual time. Old code can also cause problems if a server comes back online with stale software, and unanticipated failure modes create issues as well; both can be caught by anomaly detection. This is a useful tool when servers are doing approximately the same work and are of the same instance type. Load balancers can also provide metrics on server response times and types to help with this process.
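A deliberately naive sketch of peer comparison (my own illustration; real anomaly detection is far more robust): aggregate one metric per server and flag anything far from the fleet’s median.

```python
import statistics

def find_outliers(metric_by_server, factor=5.0):
    """Flag servers whose metric is far above the fleet median."""
    median = statistics.median(metric_by_server.values())
    return sorted(s for s, v in metric_by_server.items() if v > median * factor)

# Per-server error rates aggregated from monitoring data (made-up numbers).
fleet = {"i-a": 0.010, "i-b": 0.012, "i-c": 0.011, "i-d": 0.350}
suspect = find_outliers(fleet)  # ["i-d"]
```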

Ways to react to health check failure include servers taking themselves out of service, or informing an authority and letting it decide what to do. Network load balancers fail open and keep working if no servers are healthy, but will stop sending to an individual availability zone or server that’s individually reported unhealthy. Application load balancers and Route 53 support failing open as well. When relying on failing open, make sure to check the failure modes of the dependency health checks, as a dependency failure could take down servers if lag causes one server to fail even when nothing’s wrong with it. Amazon restricts quick reactions to local health checks and uses centralized systems for dependencies, with deeper analysis required. They use fail-open sometimes but are skeptical about acting on many servers at once.

When running health checks without a circuit breaker, the work producer (load balancer or polling thread) should perform liveness and local health checks and remove servers based on local issues. External monitoring systems should be configured to perform dependency health checks and anomaly detection, and to automatically terminate servers or alert an operator.

Health checks are important and should be prioritized over regular work, even when the server is overloaded, to avoid it being removed from service and making the problem worse. Make sure to respond to a load balancer ping in time and set resources aside for health-check responses. For example, you can allow more max connections on the server than on the load balancer to always leave room for health checks (up to double the proxy’s maximum). Servers can use local request concurrency enforcement, allowing health checks through while limiting other requests via a semaphore or other tools. Be careful about failing because of dependency health checks if other parts of an application can keep working. One example is SQS, which redelivers unprocessed messages; an alarm can trigger investigation if too many unprocessed messages build up. External services can detect server issues in cases like this, and it’s important for services to have both internal and external reporting. Responses to health checks should include the application version to keep code from getting out of sync.
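The concurrency-reservation idea can be sketched with a semaphore that only regular work has to acquire; health checks bypass it, so an overloaded server can still report that it’s alive (a toy sketch, not a real server):

```python
import threading
from dataclasses import dataclass

@dataclass
class Request:
    path: str

MAX_CONCURRENT_REQUESTS = 2  # tiny limit, just for illustration
request_slots = threading.BoundedSemaphore(MAX_CONCURRENT_REQUESTS)

def handle(request):
    # Health checks are answered outside the regular concurrency limit,
    # so an overloaded server still responds to the load balancer in time.
    if request.path == "/healthcheck":
        return "OK"
    if not request_slots.acquire(blocking=False):
        return "503 overloaded"  # shed regular work instead of queueing it
    try:
        return "200 done"
    finally:
        request_slots.release()
```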

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/implementing-health-checks/

Challenges with distributed systems by Jacob Gabrielson

The different types of distributed systems include offline, soft real-time and hard real-time systems. Offline systems do batch processing, data analytics, and other long-running work where nothing is waiting for a response while they do their job. They have all the upsides of a distributed system, being scalable and fault tolerant, with almost none of the downsides. Soft real-time systems are critical systems that produce results within a generous time window, such as search index builders or systems that clean up impaired servers. Hard real-time systems are request-reply services, which are Amazon’s default and the most difficult to get right: requests arrive unpredictably, and a customer is waiting for a result which needs to be quick. These are the major focus of the article.

Hard real-time systems work by sending messages from one domain to another. When a message is sent, at least eight steps run. In the simplest description, the client sends a message over a network to a server, and the server sends a reply over the network to the client. In detail, this means the client posts a message to the network, the network delivers the message to the server, and the server validates the message and updates its state. Then the server puts its reply on the network, which delivers it to the client, who validates the reply and updates its own state accordingly. Code must be written to handle failure at any of these steps.

Generally, unit tests don’t have to handle a CPU failing, since in a single-machine system nothing else would keep running; but in a distributed system, other machines could be waiting on it, and that failure needs to be tested. To handle failure modes in hard real-time distributed systems, every aspect of network failure must be tested, as servers and the network don’t share the same fate. The client must handle timeouts, have a way to deal with an unknown response, and understand the reply in order to pick the code to run or the error to give. The server has to decide whether to give up when waiting on a client or update a keep-alive table, and must be able to look up a user’s info and give a response. Future requests from clients need to be handled whether earlier ones failed or not. This means that one expression of code on a single-machine system becomes sixteen steps, with additional code to handle the different response types: a failed post, a retry, a fatal error, an unknown outcome (such as a timeout), or success will all run different code, and even a successful call requires validating the response. Additionally, tests will need to run on both client and server to handle every possible error, with the server also needing to simulate sequences of calls and invalid requests.
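The client-side branching on those outcome types might look like this sketch (the exception types and retry policy are illustrative; retrying an unknown outcome is only safe for idempotent calls):

```python
import socket

def call_server(send_request, retries=3):
    """Branch on the distinct outcomes of one network call."""
    last_error = None
    for _ in range(retries):
        try:
            reply = send_request()
        except socket.timeout as err:
            last_error = err   # UNKNOWN: the request may or may not have run;
            continue           # retrying is only safe if the call is idempotent
        except ConnectionError as err:
            last_error = err   # failed to post the message: safe to retry
            continue
        if not isinstance(reply, dict) or "status" not in reply:
            raise ValueError("invalid reply")  # fatal: validate before trusting
        return reply           # success still required validating the response
    raise TimeoutError("no definitive reply") from last_error
```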

Handling unknown unknowns is one major challenge, as it becomes easy to distrust everything: the client might not know if its request succeeded, and the server won’t know if the client received the update. Herds of real-time distributed systems add further problems. Real-time systems can be viewed as individual machines, groups such as availability zones, or groups of groups (regions). The same problems still apply, but the failure modes change. Groups will use load balancers, and one instance may have to send its state to another to avoid a state-update failure. This adds additional networking, at which all the same failures can occur, and every connection needs tests for each case, adding even more complexity.

If not caught, distributed bugs are devastating but can take a while to hit, and because they travel over networks they are more likely to spread across machines. One example at Amazon involved a remote catalog server’s disk filling up and returning empty responses, which the load balancer prioritized, taking down the whole website. This was fixed by removing the bad server, and processes were improved. It can be difficult to work with distributed systems since engineers must handle each error condition individually, and the results of network operations can be unknown. Any recursion makes this worse, and the laws of networking can’t be changed.

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/challenges-with-distributed-systems/

Fairness in multi-tenant systems by David Yanacek

As the Builder’s Library talks a lot about running distributed systems, one of the main challenges is handling multiple customers using the same services and servers. Resource sharing makes scaling and adding clients easier, but comes with challenges. This article talks about “fairness”: ensuring that one customer can’t overload a system and impact another customer’s ability to get their work done. Each customer should appear to have a single-tenant experience. A commonly used method for this is rate limiting or throttling, which shapes traffic and keeps clients that are still within capacity prioritized. Load shedding (rejecting traffic) can also be used to prevent overload.

The article traces the transition from the early days of reusing databases by adding new tables, which avoided the work of standing up new databases but had the drawback of a replica sync affecting multiple services at once. RDS automated much of that work and helped single-tenant systems run as separate databases, while DynamoDB provided a scalable multi-tenant database, capitalizing on server space across multiple servers without needing a new EC2 instance for each tenant. Many other services use this kind of resource sharing as well, such as Lambda with its Firecracker VMs, though they all have to deal with the challenges of multi-tenancy under the hood.

The fairness challenge is described as a sort of bin-packing problem, using placement algorithms to find a spot for a new workload and to move workloads or add resources based on fleet utilization. Additionally, a workload should be able to go outside its allotted boundaries if there’s room, with the boundaries enforced when another customer within its limits needs the space. This switch back should be instant and go unnoticed. Quotas can also vary, such as when a customer grows, or in situations such as an EC2 instance boot where a workload may need more resources for a brief period.

Services use many layers of admission control (systems shaping traffic and implementing quotas), both internal and external, such as API Gateway, CloudFront, or WAF/load balancers, with Shield for DDoS protection. Locally, a token bucket fills up over time; tokens are removed when requests are admitted, and requests are rejected when the bucket is empty. This allows for a controlled way of letting workloads burst. When rate limited, asynchronous work may be handled later, while synchronous requests need an immediate rejection. While acknowledging that quotas are inconvenient for customers, transparency and burst options, along with bulk/describe APIs, can make the issues easier to live with.
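A minimal token bucket sketch (an illustrative implementation, not a specific AWS library): the bucket refills at a steady rate, its capacity bounds the burst size, and an empty bucket means reject.

```python
import time

class TokenBucket:
    """Admit requests while tokens remain; refill at a steady rate.

    `capacity` bounds the burst size; `rate` is tokens added per second.
    `now` is injectable so the clock can be faked in tests.
    """

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.now = now
        self.last = now()

    def try_admit(self):
        # Refill based on elapsed time, capped at capacity.
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # empty bucket: reject (or delay, for async work)
```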

For the full article these notes are taken on, check out https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems
