Advanced Performance Engineering: Mastering Custom Metrics, Stress Testing, and Web Vitals with k6

In the modern landscape of distributed systems, standard load testing is no longer sufficient. As applications migrate to complex microservices architectures hosted on Kubernetes, the difference between a "passing" test and a "resilient" system often hides in the nuances of user experience and downstream service interdependencies. This article—the third in a four-part series—explores the transition from basic load testing to sophisticated performance engineering using the k6 framework.

Following the successful implementation of a layered test suite against Google’s Online Boutique on a Kubernetes homelab, this installment dives into the unresolved challenges identified in previous testing cycles: executing a stress test, deciphering a Cumulative Layout Shift (CLS) of 0.117, and leveraging k6’s custom metric types to gain granular observability.

The Architecture of Stress: Beyond Simple Load

While a standard load test validates performance under expected traffic, a stress test serves a diagnostic purpose: finding the "breaking point" and understanding the shape of the degradation that leads up to it.

The stress test configuration employed here utilized a ramping virtual user (VU) strategy:

  • Ramping Phase: Stepping up to 50, 100, and then 150 VUs in 2-minute stages.
  • Peak Hold: Sustaining 150 VUs before stepping down.
  • Threshold Strategy: Rather than enforcing strict Service Level Objectives (SLOs), the thresholds were deliberately relaxed. A crossed k6 threshold marks the run as failed (or aborts it, when abortOnFail is set), so loose thresholds let the test run to completion and expose the system's natural degradation pattern, which provides richer data for performance engineers.
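The profile above can be sketched as a k6 options object. This is a minimal sketch, not the configuration from the original suite: the article does not state the hold or ramp-down durations or the relaxed threshold values, so the 2m hold, the 1m ramp-down, and both limits below are assumptions, as is the name stressOptions.

```javascript
// Stress profile from the article: step up to 50, 100, then 150 VUs in
// 2-minute stages, hold the peak, then step down.
// Assumptions (not from the original test): the 2m hold, the 1m ramp-down,
// and both threshold limits are illustrative values.
const stressOptions = {
  stages: [
    { duration: '2m', target: 50 },
    { duration: '2m', target: 100 },
    { duration: '2m', target: 150 },
    { duration: '2m', target: 150 }, // peak hold
    { duration: '1m', target: 0 },   // step down
  ],
  thresholds: {
    // Deliberately loose: record the degradation instead of failing on it.
    http_req_failed: ['rate<0.10'],
    http_req_duration: ['p(95)<5000'],
  },
};

// In the k6 script itself, expose it with:
//   export const options = stressOptions;
```

k6 only reads the configuration when it is exported under the name options, which is why the final comment matters when this is pasted into a script.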

Decoding the Degradation Curve

The stress test yielded a fascinating result: the homepage remained performant throughout the ramp, while the product browse pages exhibited significant latency at the 150 VU mark. This divergence is highly informative. Because both pages reside within the same frontend service, a global bottleneck would have impacted both equally.

The root cause lay in the call graph. The homepage performs a single, lightweight call to the catalog service. Conversely, the product page triggers a fan-out process: fetching product details, requesting recommendations, and performing currency conversion. Under high concurrency, the recommendation and currency services began queuing requests. This highlights a fundamental truth in microservices: the depth and complexity of your call graph determine your resilience far more than the raw capacity of your frontend.

The Metric Shift: group_duration vs. http_req_duration

A common pitfall in performance monitoring is relying solely on http_req_duration. The metric is recorded once per request and, unless it is filtered by tag, aggregated across every request in the test, so a large number of fast requests can mask a slow journey. During our stress test, individual HTTP requests to the frontend completed relatively quickly; the "slowness" was hidden in the aggregate wait time caused by the downstream fan-out.

This is where group_duration proves indispensable. When the entire "browse product" user journey is wrapped in a group() call, k6 automatically records the end-to-end wall-clock time of everything executed inside the group, including any sleep() calls. While individual requests might look healthy, the group duration metric captures the reality of the user’s experience. Shifting from infrastructure-centric metrics to user-journey metrics is the hallmark of mature SRE practices.
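A minimal sketch of the pattern follows. The BASE_URL default, the product path, and both threshold limits are assumptions made for illustration; the group name simply mirrors the "browse product" journey described above.

```javascript
import http from 'k6/http';
import { check, group, sleep } from 'k6';

// Assumptions: BASE_URL and the product path are placeholders for an
// Online Boutique deployment; the threshold limits are illustrative.
const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';

export const options = {
  thresholds: {
    // Per-request latency, aggregated over every request in the test.
    http_req_duration: ['p(95)<500'],
    // End-to-end time of the whole journey. k6 prefixes group tag values
    // with "::", hence the three colons after "group".
    'group_duration{group:::browse product}': ['p(95)<1500'],
  },
};

export default function () {
  group('browse product', function () {
    const home = http.get(`${BASE_URL}/`);
    check(home, { 'home returned 200': (r) => r.status === 200 });

    const product = http.get(`${BASE_URL}/product/OLJCESPC7Z`);
    check(product, { 'product returned 200': (r) => r.status === 200 });
  });

  // Think time stays outside the group so it does not inflate group_duration.
  sleep(1);
}
```

With both thresholds in place, a run can pass on per-request latency while failing on the journey, which is exactly the divergence this section describes.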

Strategic Observability: The Four Pillars of Custom Metrics

k6 provides four distinct custom metric types, each serving a specific analytical requirement. Misusing these types can lead to misleading dashboards and false confidence.

1. Trend: Analyzing Distributions

A Trend collects every value added to it and reports the minimum, maximum, average, median, and percentiles, which makes it the right type for latency distributions such as p95 response times. Setting the constructor's isTime argument to true tells k6 that the values are durations, so they are formatted as time in the summary, allowing for easy identification of "the slowest 5% of users."

2. Rate: Quantifying Success

A Rate tracks the percentage of added values that are non-zero (or true), which makes it the natural type for success/failure ratios. Unlike raw counts, a Rate provides a percentage that can be directly mapped to business SLOs. For example, ensuring that at least 80% of checkout attempts succeed is a clear, actionable business goal.

3. Counter: Tracking Magnitude

While a Rate provides proportion, a Counter provides magnitude. In a high-traffic environment, a 0.5% error rate might seem negligible, but if the absolute number of failures (the Counter) equates to hundreds of lost sales per hour, the severity of the issue changes. At a hypothetical 60,000 checkout attempts per hour, for instance, 0.5% is 300 failed checkouts every hour.

4. Gauge: Capturing Instantaneous State

A Gauge keeps only the most recent value added to it (k6 also reports the minimum and maximum it has seen), making it ideal for point-in-time observations. Whether monitoring the number of active sessions or the current queue depth, a Gauge provides a snapshot of the system’s immediate health, provided the script has a way to read that value.
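The four types can be seen side by side in one short script. This is a sketch under stated assumptions: the metric names, BASE_URL, the product path, and the threshold limits are all hypothetical, and the response body size stands in for a point-in-time value only because it is something a script can observe directly.

```javascript
import http from 'k6/http';
import { Counter, Gauge, Rate, Trend } from 'k6/metrics';

// Assumptions: BASE_URL, the product path, the metric names, and the
// threshold limits are illustrative and not taken from the original suite.
const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';

const productLatency = new Trend('product_page_latency', true); // true: values are durations
const productSuccess = new Rate('product_page_success');        // share of true samples
const productFailures = new Counter('product_page_failures');   // running total
const lastBodyBytes = new Gauge('product_page_last_bytes');     // most recent value

export const options = {
  thresholds: {
    product_page_latency: ['p(95)<800'],
    product_page_success: ['rate>=0.80'],
  },
};

export default function () {
  const res = http.get(`${BASE_URL}/product/OLJCESPC7Z`);
  const ok = res.status === 200;

  productLatency.add(res.timings.duration); // distribution of response times
  productSuccess.add(ok);                   // proportion of successes
  if (!ok) {
    productFailures.add(1);                 // absolute number of failures
  }
  lastBodyBytes.add(res.body ? res.body.length : 0); // snapshot of the latest response
}
```

Recording the same outcome in both a Rate and a Counter is deliberate: the Rate answers "what share failed?" while the Counter answers "how many failed?".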

The Browser Module: Uncovering Web Vitals

The most glaring issue identified in the previous testing phase was a Cumulative Layout Shift (CLS) score of 0.117. Standard HTTP load testing is blind to this, as it does not render the page. To diagnose it, we used the k6 browser module, which drives a real Chromium instance.

The Reality of CLS

CLS measures the visual stability of a page: how much visible content moves unexpectedly while the page is in use. A score of 0.117 exceeds the "good" threshold of 0.10 (scores above 0.25 are rated "poor"), which indicated that content was shifting during the initial load. In the Online Boutique example, this was caused by product images loading asynchronously into containers without predefined dimensions.
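To make the score concrete, the sketch below follows the Web Vitals definition: each layout shift scores its impact fraction multiplied by its distance fraction, and CLS is the largest sum of scores within one session window (shifts less than 1 second apart, in a window of at most 5 seconds). The function names and every number in exampleShifts are hypothetical; they are not measurements from the Online Boutique and do not reproduce the 0.117 result.

```javascript
// Score of a single layout shift: impact fraction * distance fraction.
function layoutShiftScore(impactFraction, distanceFraction) {
  return impactFraction * distanceFraction;
}

// CLS = largest sum of shift scores inside one session window.
// A window ends when 1 s passes without a shift or when it reaches 5 s.
// `shifts` must be sorted by time (milliseconds since navigation).
function cumulativeLayoutShift(shifts) {
  let best = 0;
  let windowSum = 0;
  let windowStart = null;
  let previousTime = null;

  for (const { time, score } of shifts) {
    const startsNewWindow =
      windowStart === null ||
      time - previousTime >= 1000 ||
      time - windowStart >= 5000;

    if (startsNewWindow) {
      windowStart = time;
      windowSum = 0;
    }
    windowSum += score;
    previousTime = time;
    best = Math.max(best, windowSum);
  }
  return best;
}

// Hypothetical page load: two late images push content down in quick
// succession, then a small unrelated shift happens seconds later.
const exampleShifts = [
  { time: 300, score: layoutShiftScore(0.5, 0.14) },
  { time: 700, score: layoutShiftScore(0.4, 0.1) },
  { time: 4000, score: 0.02 },
];

console.log(cumulativeLayoutShift(exampleShifts).toFixed(2));
```

In this made-up case the two image shifts fall in the same window and together cross the 0.10 line, while the later shift starts a new window and does not add to the score.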

This is a classic "User Experience vs. Engineering" gap. An HTTP test confirms the data arrived; a browser test confirms the user saw it correctly. By adding explicit width and height attributes (or CSS aspect-ratio declarations), we can force the browser to reserve the required space, eliminating the layout shift. This simple fix significantly improves the user experience, proving that performance testing is as much about layout stability as it is about server response times.

Defensive Implementation

Using the browser module requires a more disciplined approach to coding. Two critical practices must be observed:

  1. Asynchronous Handling: Every asynchronous browser API call (e.g., page.goto, page.title) must be awaited. Without await, the call returns a pending Promise, and because a Promise object is always truthy, a check against it can pass without ever inspecting the page: the script validates the Promise rather than the actual content.
  2. Resource Lifecycle Management: Wrap page interactions in try/finally and close the page in the finally block, so that browser resources are released even if an assertion fails. Neglecting this leaves pages open, and the memory they hold can accumulate until the test runner becomes unstable.
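Both practices fit in a short script. The sketch below assumes a recent k6 release in which the browser module is imported from k6/browser (older releases used k6/experimental/browser); the scenario name, BASE_URL, and the CLS threshold expression are illustrative choices rather than the original test's settings.

```javascript
import { browser } from 'k6/browser';
import { check } from 'k6';

// Assumptions: BASE_URL, the scenario name "ui", and the threshold limit
// are placeholders chosen for this sketch.
const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: {
        browser: { type: 'chromium' },
      },
    },
  },
  thresholds: {
    // Web Vitals are reported by the browser module as built-in metrics.
    browser_web_vital_cls: ['p(75)<0.1'],
  },
};

export default async function () {
  const page = await browser.newPage();
  try {
    // Awaited: the script continues only after navigation has finished.
    await page.goto(BASE_URL);

    // Awaited: `title` is a string, not a pending Promise.
    const title = await page.title();
    check(title, { 'page has a title': (t) => t.length > 0 });
  } finally {
    // Runs even when a step above throws, so the page is always released.
    await page.close();
  }
}
```

Dropping the await in front of page.title() would hand check() a Promise, and t.length would be undefined rather than the title's length, which is the kind of silent mismatch the first rule guards against.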

Implications for Modern SRE

The journey from simple load testing to comprehensive performance engineering is about unifying the observability story. By streaming results from k6 run directly into platforms like Grafana Cloud, performance data becomes a first-class citizen in the existing observability stack.

The ability to use the same script for local development, load testing, and synthetic production monitoring is the ultimate goal of "Shift Left" performance testing. When developers can use the same metrics, query languages, and alerting systems across the entire lifecycle, the barrier between "it works on my machine" and "the service is reliable in production" begins to dissolve.

Conclusion and Future Outlook

The stress testing and metric instrumentation discussed here represent a shift from reactive troubleshooting to proactive performance management. By understanding how call graph complexity drives degradation and how browser-based testing captures the user’s visual reality, engineers can build systems that are not just fast, but inherently stable.

In the final part of this series, we will explore the transition from testing to production monitoring, demonstrating how to turn these scripts into permanent synthetic probes. As we move toward a world where performance is measured by the quality of the human experience, the tools and techniques discussed here provide the foundation for building the next generation of resilient, high-performance web applications.


For those interested in the implementation, all code, configurations, and test scenarios can be found in the k6-playground repository.