Clabby Analytics has been following the APM (Application Performance Management) market for more than ten years. From tools that do “deep-dive” root-cause performance troubleshooting to middleware-centric APM solutions, end-user experience monitoring tools and network-based APM, the market has evolved to keep pace with the changing application and business landscape. Some APM vendors have responded by adding capabilities and offering comprehensive enterprise solutions that meld several approaches, while others focus on a specific niche. Many tools are now available in a SaaS (software-as-a-service) delivery model, driving down the acquisition cost for smaller businesses. It has been an interesting journey…
Today’s applications look very different from the legacy applications of the past. Monolithic architectures have been replaced by application services composed of hundreds of dynamic, distributed microservices, with many different developers contributing code in many different programming languages and working across cloud and/or hybrid infrastructure. Theoretically, this modular approach means that each service or component can be managed in isolation from the others for greater efficiency and more rapid updates. In reality, the interrelationships between these microservices have made diagnosing problems much more complex. The sheer volume of data collected in order to diagnose problems is more than many traditional APM tools can handle.
In addition, the rise of DevOps (close collaboration between Development and Operations intended to improve software quality and the pace of innovation) and the accelerating adoption of Agile methodologies have resulted in both developers and operations teams using APM tools. A complete end-to-end picture of all services all the time is required, so that both teams can understand how each microservice is performing and how the microservices are interacting, in order to identify bottlenecks, code defects, and other issues affecting application performance.
In this report, we’ll look closely at LightStep, a vendor that is closely following these trends and has delivered a solution, LightStep [x]PM, designed to monitor, diagnose and analyze application performance specifically for today’s complex microservices-based applications.
Background: Distributed Tracing and OpenTracing
Distributed tracing provides engineers with a detailed view of requests across microservices so they can identify issues that are affecting overall application performance. It offers visibility into complex applications, often composed of hundreds of services distributed across cloud and hybrid architectures, and shows overall performance in the context of how each service is performing and, equally important, the latency between services. Distributed tracing is made possible by instrumenting application code, which provides this end-to-end view of applications. However, with so many application components written in different programming languages by different developers using different tracing tools (each with its own instrumentation), this process is both time-consuming and complex.
OpenTracing provides a standard and common API for distributed tracing and its instrumentation, streamlining the process and providing interoperability. In October 2016, OpenTracing became a project of the Cloud Native Computing Foundation (CNCF), which is part of the Linux Foundation and was founded in 2015 to promote containers. CNCF is advancing OpenTracing as an open, vendor-neutral standard for distributed systems instrumentation, enabling end-to-end tracing of requests in microservices architectures.
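To make the idea of instrumentation concrete, here is a minimal sketch of how a trace context can be passed from one service to the next so that their spans end up in the same trace. It deliberately uses only the Python standard library and is not the OpenTracing API itself; the header names (x-trace-id, x-span-id), the service names and the in-memory FINISHED_SPANS list are all assumptions made for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical in-memory sink standing in for a real tracing backend.
FINISHED_SPANS: List["Span"] = []


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start: float = field(default_factory=time.monotonic)
    duration: Optional[float] = None

    def finish(self) -> None:
        """Record the elapsed time and hand the span to the sink."""
        self.duration = time.monotonic() - self.start
        FINISHED_SPANS.append(self)


def start_span(service: str, operation: str,
               headers: Optional[Dict[str, str]] = None) -> Span:
    """Start a span, continuing the caller's trace if the headers carry one."""
    headers = headers or {}
    return Span(
        trace_id=headers.get("x-trace-id", uuid.uuid4().hex),
        span_id=uuid.uuid4().hex[:16],
        parent_id=headers.get("x-span-id"),
        service=service,
        operation=operation,
    )


def inject(span: Span) -> Dict[str, str]:
    """Serialize the span context into headers for the downstream service."""
    return {"x-trace-id": span.trace_id, "x-span-id": span.span_id}


def billing_service(headers: Dict[str, str]) -> None:
    span = start_span("billing", "charge_card", headers)
    span.finish()


def checkout_service() -> Span:
    span = start_span("checkout", "place_order")
    billing_service(inject(span))  # stands in for an HTTP or RPC call
    span.finish()
    return span


root = checkout_service()
```

Both spans share one trace ID, and the billing span names the checkout span as its parent. A common API matters because every service, whatever its language, has to agree on this kind of context hand-off.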
LightStep – APM at Scale
LightStep was founded in 2015 by former Google engineers including Ben Sigelman, who, while at Google, designed and developed global-scale monitoring technologies including Dapper, a distributed system tracing infrastructure capable of analyzing transactions at scale (up to 2 billion transactions per second). Sigelman is also the co-creator of the OpenTracing standard. The company is funded by Redpoint, Sequoia, Cowboy Ventures and Harrison Metal, and its mission is to “deliver insights that put organizations back in control of their complex software applications.” Sigelman believes that the combination of massive scale and the complexity of microservices-based applications has driven the need for a new generation of application performance monitoring tools and standards.
LightStep came out of stealth in November 2017 and has benefited from the spread of OpenTracing as an instrumentation standard. Interestingly, although the solution was developed originally for new-age technology businesses such as Google, Lyft and others, the company discovered that large, established enterprises are also struggling with scale, as well as with the need to integrate legacy systems with newer microservices-based applications and cloud computing infrastructure. LightStep customers include Lyft, Twilio, GitHub, Under Armour, DigitalOcean and others.
LightStep [x]PM – A Closer Look
How does it work?
LightStep [x]PM is “engineered to see everything” and “always on” in production even for microservices-based applications and distributed architectures at scale. The solution provides distributed data collection and statistical analysis for insights into application performance—monitoring and diagnosing transactions across web, mobile, legacy infrastructure, microservices, and serverless functions.
LightStep [x]PM ingests data through native integration with the OpenTracing standard, other open source projects from the tracing community, data-logging applications, and mesh technologies and load balancers (Envoy, linkerd, nginx, haproxy and others). The LightStep SaaS solution captures data on “Satellites” deployed locally within the application’s datacenter or VPC, but not on the application’s VM like a traditional agent. This allows for very high-bandwidth data collection and on-premises scrubbing and aggregation without risk of CPU or memory overhead in the application itself. The Satellites apply statistical analytics to captured data in-memory, working in tandem with the LightStep Engine (running as a SaaS in the cloud) to deliver value through the user-facing product. The LightStep Engine is the transaction analyzer that assembles the traces captured by its Satellites into end-to-end, cross-service analyses; it also facilitates LightStep [x]PM’s real-time exploratory feature set and powers the monitoring and alerting functionality that makes LightStep [x]PM useful during emergencies. This fine-grained approach enables early detection of problems and easier root cause analysis (see Figure 1, below).
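The two-tier arrangement described above can be pictured with a small sketch: collectors near the application buffer spans in memory, and a central step merges the fragments belonging to one trace. This is our own simplified illustration, not LightStep's implementation; the Satellite class, the SpanRecord fields and the assemble_trace function are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class SpanRecord(NamedTuple):
    trace_id: str
    service: str
    start_ms: int
    duration_ms: int


class Satellite:
    """Hypothetical collector that buffers spans in memory near the application."""

    def __init__(self) -> None:
        self._by_trace: Dict[str, List[SpanRecord]] = defaultdict(list)

    def report(self, span: SpanRecord) -> None:
        """Accept a span from an instrumented service and index it by trace ID."""
        self._by_trace[span.trace_id].append(span)

    def fragments(self, trace_id: str) -> List[SpanRecord]:
        """Return the spans this collector holds for one trace (possibly none)."""
        return list(self._by_trace.get(trace_id, []))


def assemble_trace(trace_id: str, satellites: List[Satellite]) -> List[SpanRecord]:
    """Central step: merge fragments from every collector, ordered by start time."""
    spans = [s for sat in satellites for s in sat.fragments(trace_id)]
    return sorted(spans, key=lambda s: s.start_ms)
```

The point of the design is that the application only pays the cost of reporting a span; buffering and assembly happen on separate machines.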
Beyond application monitoring, LightStep [x]PM can also be used for customer performance management, product experiments and software releases.
Figure 1 – LightStep Architecture
Source: LightStep 2018
All Transactions All the Time
Because the solution was designed specifically for Google-type scale (trillions and trillions of requests), LightStep [x]PM is able to monitor every transaction and every request at scale. This is possible because the solution doesn’t use “heavy” agents that degrade application performance. Rather, it uses a decentralized two-tier Satellite architecture that scales horizontally and stores data in memory, with Satellites running as independent virtual instances that don’t run where the customer application is running. Unlike other solutions that use a “sampling” technique to minimize impact on overall system performance, LightStep [x]PM monitors every single transaction so no potential problem is missed.
LightStep [x]PM’s end-to-end traces show full transactions spanning web and mobile clients and backend microservices, across private and public clouds. LightStep [x]PM pinpoints actual instances of latency issues and errors by computing statistics on 100% of transactions across all services, providing a detailed root cause analysis to identify issues quickly and reduce MTTR (Mean Time to Resolution). In addition, the analysis is very flexible and can be tailored to the needs of a specific customer, whether that means focusing on a particular aspect of the system, a particular product or feature, or an individual customer.
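The concern with sampling can be put in numbers. The sketch below assumes a simple model in which each transaction is kept independently with a fixed probability; real sampling schemes vary, and the 1% rate and 50 occurrences are our own illustrative figures rather than values from any vendor.

```python
def miss_probability(sample_rate: float, occurrences: int) -> float:
    """Chance that none of `occurrences` anomalous transactions is sampled.

    Assumes every transaction is kept independently with probability
    `sample_rate` (a simplification for illustration).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return (1.0 - sample_rate) ** occurrences


# With 1% sampling, an anomaly that occurs 50 times is missed entirely
# roughly 60% of the time under this model.
print(f"{miss_probability(0.01, 50):.3f}")
```

Under the same model, capturing every transaction (a sample rate of 1.0) drives the miss probability to zero for any anomaly that occurs at least once.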
Latency histograms offer detailed analysis and interactive granular filtering in real time, enabling users to isolate and zoom in on any aspect of the application. Example traces matching specified criteria are available for even the most anomalous behaviors (see Figure 2).
Historical diagrams provide guidance to determine what’s normal for a given situation, and to identify when the anomalous behavior began and what may have triggered it. Administrators can easily understand when and where performance is improving or declining, as well as determine how application changes are affecting performance.
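As a rough picture of what filtering a latency histogram involves, the following sketch buckets latencies, optionally restricts them to records carrying given tag values, and keeps one example trace ID per bucket. The bucket boundaries, the record layout and the tag names are assumptions chosen for illustration, not details of the product.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical bucket upper bounds in milliseconds; the last bucket is open-ended.
BUCKET_BOUNDS_MS = [10, 50, 100, 500, 1000]

Record = Tuple[str, float, Dict[str, str]]  # (trace_id, latency_ms, tags)


def latency_histogram(
    records: Iterable[Record],
    tag_filter: Optional[Dict[str, str]] = None,
) -> Tuple[List[int], Dict[int, str]]:
    """Return per-bucket counts and one example trace ID per non-empty bucket."""
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    examples: Dict[int, str] = {}
    for trace_id, latency_ms, tags in records:
        # Skip records that do not match every requested tag value.
        if tag_filter and any(tags.get(k) != v for k, v in tag_filter.items()):
            continue
        # Bucket i holds latencies above the previous bound and up to bound i.
        bucket = bisect_left(BUCKET_BOUNDS_MS, latency_ms)
        counts[bucket] += 1
        examples.setdefault(bucket, trace_id)
    return counts, examples
```

Calling latency_histogram(records, {"customer": "acme"}) would narrow the view to one customer, and the example trace ID kept for the slowest bucket gives a starting point for investigating an outlier.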
Figure 2 – Latency Histogram – Filtering and Example Traces
Source: LightStep 2018
LightStep [x]PM Major Features:
- Full fidelity timeseries and embedded distributed traces provide speedy root-cause analysis
- Employs point-in-time or historical analysis (1 hour, 1 day, 1 week, etc.)
- Any service, operation, or tag can be graphed and monitored
- Can be used to firefight ongoing outages or latency issues, optimize existing applications, or to test applications prior to release in staging environments and alongside canaries in order to quickly detect potential issues
- Role-based access
- Customizable alerts based on latency, error or throughput provide real-time example traces to visualize, identify, and resolve issues on a customer-by-customer basis
- SLAs can be defined to capture specific metrics and information
- Automated critical path detection identifies problems proactively and enables focused performance optimization
- Best-in-class performance analytics can be embedded into dashboards and workflows via webhooks and turnkey integrations with Grafana, Slack, and PagerDuty.
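Of the features listed above, critical path detection is perhaps the least self-explanatory. One common simplification, sketched here, walks a trace from the root span and at each level follows the child that finishes last, since that child determines when its parent can complete. This is our own illustration of the general idea and not a description of LightStep's algorithm; the Node class and the span names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    start_ms: int
    end_ms: int
    children: List["Node"] = field(default_factory=list)


def critical_path(root: Node) -> List[str]:
    """Follow, at each level, the child span that finishes last."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.end_ms)
        path.append(node.name)
    return path


trace = Node("gateway", 0, 200, [
    Node("auth", 5, 30),
    Node("orders", 35, 190, [
        Node("inventory", 40, 80),
        Node("payments", 85, 185),
    ]),
])
print(critical_path(trace))
```

In this made-up trace the path runs through gateway, orders and payments, so speeding up the auth or inventory calls would not shorten the overall request.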
Customer success stories
Lyft, a well-known ride-sharing business, relies heavily on the functionality and speed of its mobile application. With more than one million people taking rides daily, performance is extremely important. Even the smallest lapse or delay in mobile application performance leads to lost revenue. The industry is very competitive, and rates between ride-sharing companies often don’t differ by much. In the words of Lyft’s President of Engineering, Pete Morelli, “The bigger you get, the better you have to be. Half an hour of downtime may have cost you five rides early on, now it costs millions of dollars in rides. The level of reliability expected of Lyft is not trivial. People are riding to work or to doctors’ appointments.”
By moving monoliths to a distributed architecture using LightStep, Lyft can meet the demands of its customers without downtime. According to Morelli, “LightStep is the future of monitoring and was instrumental in our move to microservices.” Morelli goes on to say, “Our systems generate more than 100 billion microservice calls per day. LightStep is one of the only systems that can make sense of that firehose: it jumps to the root cause of performance problems anywhere from mobile all the way to the bottom of our distributed stack.” With LightStep, Lyft has increased the efficiency of customer ride routes, accelerated response times by 60%, lowered root cause analysis time by 60%, and ensured end-to-end performance management across systems (mobile, microservices, and monoliths). With the help of LightStep, Lyft can meet the demands of its current customers while planning for the growing community it will support as more people start to use its platform.
Twilio, a leading cloud service provider (CSP), offers a cloud communications platform for building SMS, voice and messaging applications via a specialized API built for global scale. As a result, software performance and reliability are extremely important to the brand’s reputation. With services that are complicated and plentiful (more than 40 core services), performance issues must be diagnosed, analyzed and fixed quickly so they don’t impact customers. By adopting LightStep, the company was able to analyze all of its performance data without incurring any overhead, while also getting a macro view of the system’s components. David Dunstan, Director of Insight Engineering, Twilio, observed, “The challenge is finding those insights in large sets of data. With LightStep’s ability to parse through high-cardinality data, LightStep [x]PM puts those data sets at our fingertips and has proven to be a critical technology deployment for us.”
Because performance issues and anomalies can be quickly identified and resolved, MTTR for production issues has improved by 92%. LightStep also offers a quick time-to-value. In fact, after implementing LightStep [x]PM, Twilio found issues in its first hour of use that led to a 70% reduction in latency. Now, with new information from LightStep visualizations, the Insight Engineering team has saved 20 hours a week that it can spend on work other than monitoring. Finally, segmented and detailed performance monitoring and root cause analysis have been added for Twilio’s top-tier customers, helping Twilio quickly gain the confidence and trust of its highest revenue-generating customers.
Additional information on LightStep customers is available here.
LightStep is taking a different approach to APM, offering a solution that employs distributed tracing through the OpenTracing standard (along with others who support OpenTracing, such as Jaeger and Zipkin), which is ideal for the scale of today’s complex applications that generate very large volumes of data. LightStep [x]PM tracks 100% of transactions across all services, and employs a statistical engine that can quickly and conclusively identify issues and anomalies.
It remains to be seen whether traditional APM vendors will fully support the OpenTracing standard, though vendors such as New Relic, Datadog and Instana have all signed on to the project. In fact, some argue that OpenTracing isn’t really a standard, but rather an API. Call it what you will. OpenTracing has the support of the CNCF, which had 236 members as of June 2018, including Amazon AWS, Dell, Azure, Cisco, Docker, Red Hat, VMware, Twitter and Google Cloud. Beyond that, LightStep’s impressive customer list and the tangible results those customers are seeing illustrate the value of the solution for a variety of use cases across a range of industries. For these reasons, Clabby Analytics believes that both established enterprises and technology-driven start-ups should look closely at LightStep [x]PM.