1 The Problem

Production services experience periodic 5xx bursts driven by CLOSE_WAIT socket accumulation — a failure mode where HttpClient connections are not explicitly closed on halt/error paths. This creates a positive feedback loop:

The CLOSE_WAIT Cascade
CLOSE_WAIT sockets accumulate on backend pods → connection pool pressure builds → 5xx burst begins → pod restarts to clear state → recovery → repeat. Each burst window exposes real end-users to hard errors with no retry protection.

The root cause was confirmed through Envoy metrics, connection pool analysis, and incident post-mortems in Q1 2026. Q2 is the execution quarter: explicit HttpClient close on halt/error paths, maxRequestsPerConnection and connectionIdleTimeout tuning, and a reduction of nginx keepalive_requests.

Source document: Cloud Infrastructure Improvements [Revenue Protection] BET — Stream 2: End-User Latency & 5xx Downstream.

2 Methodology: How We Built This Case

Read the BET document

Extracted the root cause hypothesis (CLOSE_WAIT/socket leaks), affected services, and success criteria. Identified key signal types needed: 5xx rate, connection lifecycle, latency, pod restarts.

Discovered available Grafana dashboards

Searched Grafana via MCP for existing dashboards covering latency, 5xx, Envoy, nginx. Found two immediately relevant: Envoy MTLS Services and NGINX Ingress Controller. Verified their panel queries against our target signals.

Listed all Prometheus metric names for target signals

Queried Thanos (default datasource) to enumerate all envoy_cluster_upstream_cx_*, envoy_cluster_upstream_rq_*, node_sockstat_*, and kube_pod_* metrics available in production. Identified which metrics best proxy CLOSE_WAIT (a kernel socket state not directly exposed by Prometheus).

Queried 30-day baselines from Thanos

Ran eight parallel instant PromQL queries using increase(...[30d]) across: 5xx volume by service, pod restarts, CLOSE_WAIT proxies, connection pool overflow, request timeouts, pending queue overflow, keep-alive cycling. Then pulled daily-rate range queries for top services to identify burst patterns.

Validated CLOSE_WAIT proxy — narrowed to most specific metric

Initial broad proxies (destroy_with_active_rq) were challenged as too noisy (inflated by deployments). Re-queried with: rq_rx_reset (stream reset on active request — most specific), cx_destroy_remote_with_active_rq ratio (not absolute), and cx_connect_fail. Also evaluated retry policy data to determine user visibility of errors.

Correlated with JPROD incidents (last 90 days)

Fetched all JPROD tickets from Jira (JQL: project = JPROD AND created >= -90d). Identified incidents matching connection exhaustion, 5xx bursts, and restart-to-recover patterns. Confirmed burst dates align with Prometheus spike data.

Anchored business metrics from Tableau (Superfunnel)

Retrieved GMV/session from Tableau Superfunnel: $2.37/session (global, L12M, 753M visitor-days). Applied 25% profit margin. Derived sessions/second at peak to calculate sessions exposed per burst. Combined with burst frequency and duration from JPROD data.

3 Grafana Dashboards

Two existing dashboards are directly relevant. They cover part of the signal set but have gaps for CLOSE_WAIT-specific diagnosis.

Envoy MTLS Services (folder: Conveyor, UID: ce93g993yyyfff)
  Covers: 5xx/min, active/pending requests, retries, connect failures, connect timeouts, close notifications, cx_max_requests — all scoped by service+pod.
  Gaps for this BET: no latency histogram (rq_time_bucket), no pool overflow panel, no CLOSE_WAIT-specific panels.

NGINX Ingress Controller (folder: CICD & Observability, UID: nginx)
  Covers: p50/p90/p99 end-user latency, success rate (non-4xx/5xx), connection count, request volume.
  Gaps for this BET: covers monitoring/internal ingresses only, not the app traffic path (app traffic goes through Envoy/Conveyor directly).
Recommended: New "Stream 2 — Connection Health & 5xx" Dashboard
The existing dashboards do not surface CLOSE_WAIT signals. A new dashboard with the panels described in Section 4 is needed to monitor the fix in production. Suggested row groups: 5xx Rate & Restarts → CLOSE_WAIT Proxy → Pool Exhaustion → Latency (p50/p99) → Timeout Budget → Keep-alive Cycling → TCP Health.

4 Prometheus Metrics Map

All metrics below are confirmed present in the Thanos querier (datasource UID: bdr94pj4npuyoa, default). The source label identifies the Conveyor service: format is conveyor-cloud.ns.<service-name>-production.

CLOSE_WAIT / Connection Lifecycle

Metric (priority), followed by what it measures:

envoy_cluster_upstream_rq_rx_reset (Primary)
  Stream reset received from upstream mid-request — the RST that fires when Envoy tries to reuse a CLOSE_WAIT socket. Most specific CLOSE_WAIT proxy available without kernel instrumentation.

envoy_cluster_upstream_cx_destroy_remote_with_active_rq (Ratio only)
  Remote (backend) closed the connection while Envoy had active requests. Use as a ratio to cx_destroy_remote, not as an absolute count (inflated by rolling deployments). A ratio >10% on a service indicates a structural problem.

envoy_cluster_upstream_cx_idle_timeout (Post-fix validation)
  Idle timeout fires on a kept-alive connection. Validates connectionIdleTimeout tuning post-fix.

envoy_cluster_upstream_cx_max_requests (Post-fix validation)
  Connection recycled because it hit maxRequestsPerConnection. Validates the per-connection limit tuning.

envoy_http_downstream_cx_destroy_local_with_active_rq (Confirms fix scope)
  Downstream (client-facing) equivalent. Returned empty in the 30d baseline — the problem is upstream-side only, confirming the HttpClient close path as the fix location.

Connection Pool Exhaustion

envoy_cluster_upstream_cx_overflow
  Connection pool overflow — a direct 503 source. Zero for all services over 30d: the pool is not overflowing, meaning CLOSE_WAIT sockets keep the pool open but unusable.

envoy_cluster_upstream_cx_pool_overflow
  Pool-level overflow. Also zero — confirms pool capacity isn't the issue; it's the socket state within the pool.

envoy_cluster_upstream_rq_pending_overflow
  Pending request queue overflow — request rejected before connecting. api-lazlo-sox: 101,821 events in 30d. The pool is not full, but backend latency causes queue backup.

envoy_cluster_upstream_cx_none_healthy
  No healthy upstream — fires after the pool is fully consumed by stale connections.

Latency (p99 Baseline)

envoy_cluster_upstream_rq_time_bucket/sum/count
  Upstream request latency histogram. Use histogram_quantile(0.99, ...) for p99. Available per service via the source label.

nginx_ingress_controller_request_duration_seconds_bucket
  Ingress controller latency (p50/p90/p99). Available in the NGINX dashboard but only covers monitoring ingresses, not app traffic.

Retry Policy (user visibility of errors)

envoy_cluster_upstream_rq_retry
  Total retry attempts by service.

envoy_cluster_upstream_rq_retry_success
  Successful retries. Ratio = success/total. Critical finding: next-pwa-app success rate = 0.004% — retries are configured but fail during pool exhaustion, providing zero user protection.

Pod Restarts & TCP Health

kube_pod_container_status_restarts_total
  Container restart counter. Tracks the "restart-to-recover" frequency — the BET target is a ≥50% reduction.

node_netstat_TcpExt_TCPTimeouts
  TCP-level retransmit timeouts — an OS-level socket pressure indicator.

node_netstat_TcpExt_ListenDrops
  Listen queue drops — OS-level backpressure when the socket backlog fills.
CLOSE_WAIT is not directly measurable in Prometheus
CLOSE_WAIT is a kernel TCP socket state. Standard node_exporter exposes node_sockstat_TCP_inuse and node_sockstat_TCP_alloc but not a CLOSE_WAIT count. For direct measurement, run the following on suspect pods during a 5xx burst to confirm accumulation:

    kubectl exec -n <namespace> <pod> -- ss -s | grep CLOSE-WAIT
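The same check can be scripted. A minimal sketch that tallies socket states from captured `ss -tan` output; the sample lines below are fabricated for illustration only:

```python
from collections import Counter

def count_socket_states(ss_output: str) -> Counter:
    """Tally TCP socket states from `ss -tan` output; the first line is the header."""
    states = Counter()
    for line in ss_output.strip().splitlines()[1:]:
        fields = line.split()
        if fields:
            states[fields[0]] += 1
    return states

# Fabricated sample output for illustration only
sample = """State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:8080       10.0.0.9:41822
CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.0.7:51234
CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.0.8:51235"""

states = count_socket_states(sample)
```

A rising CLOSE-WAIT count across successive samples during a 5xx burst would confirm accumulation directly.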

5 CLOSE_WAIT Diagnosis: Signal Validation

The most specific proxy for CLOSE_WAIT is envoy_cluster_upstream_rq_rx_reset — a stream reset received from the upstream while a request was already in flight. This fires when Envoy reuses a stale keep-alive connection and the backend OS RSTs it (the CLOSE_WAIT eviction).

PromQL — Most Specific CLOSE_WAIT Proxy (run in Grafana Explore)

    # Stream resets received from upstream during active requests
    # These fire when Envoy reuses a CLOSE_WAIT socket and gets RST'd
    sum by (source) (
      rate(envoy_cluster_upstream_rq_rx_reset{
        source=~"conveyor-cloud.ns..*-production.*"
      }[$__rate_interval])
    )
PromQL — CLOSE_WAIT Ratio (use ratio, NOT absolute value)

    # Remote-closed-with-active-requests as % of all remote closes
    # High ratio (>10%) indicates a structural backend close problem
    # Use the ratio because absolute counts are inflated by rolling deployments
    sum by (source) (rate(envoy_cluster_upstream_cx_destroy_remote_with_active_rq[$__rate_interval]))
    /
    sum by (source) (rate(envoy_cluster_upstream_cx_destroy_remote[$__rate_interval]))

Retry Policy Assessment

To confirm how many 5xx errors reach real users (vs. being absorbed by Envoy retries), we evaluated retry success rates over 30 days:

Service        Retries (30d)   Success rate   User impact
next-pwa-app   21.6M           0.004%         Retries configured but fail 99.996% of the time — worst case: 100% user-visible
api-proxy      3,330           4%             Near-zero protection during pool exhaustion — worst case: ~96% user-visible
api-lazlo      26,881          93.6%          Retries work — absorbed before users. Internal service only.
Key Finding: Retries amplify load without protecting users
next-pwa-app generated 21.6M retry attempts in 30 days with essentially 0% success. During CLOSE_WAIT bursts, the retry policy is retrying into the same exhausted connection pool — each user request generates multiple upstream attempts, worsening the burst, while still returning an error to the browser.
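The scale of the protection gap can be sanity-checked from the two figures above:

```python
# Retry protection gap on next-pwa-app, 30-day window (figures from this section)
retries_30d = 21_600_000        # total retry attempts
success_rate = 0.00004          # 0.004% success

successful_retries = retries_30d * success_rate          # only ~864 retries succeeded
wasted_upstream_attempts = retries_30d - successful_retries
```

Roughly 864 successful retries against 21.6M attempts: effectively every retry was extra upstream load with no user benefit.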

6 30-Day Production Baseline

All data from Thanos querier (bdr94pj4npuyoa), window: last 30 days as of 2026-04-16.

5xx Volume by Service

api-proxy        17.7M 5xx in 30d · ~590K/day
next-pwa-app     12.1M 5xx in 30d · ~403K/day
api-lazlo         5.6M 5xx in 30d · ~187K/day
Org-wide total   37.9M 5xx in 30d · ~1.26M/day

api-proxy + next-pwa = 79% of all org-wide 5xx
Both are user-facing: next-pwa-app is the Next.js SSR frontend; api-proxy is the API gateway. Their combined 5xx volume dominates all other services.
PromQL — Reproduce 5xx by Service (30d total)

    sort_desc(sum by (source) (
      increase(envoy_cluster_upstream_rq_xx{
        envoy_response_code_class="5"
      }[30d])
    ))
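The 79% share quoted above follows directly from the card figures:

```python
# Share of org-wide 5xx from the two user-facing services (30d figures above)
api_proxy_5xx = 17.7e6
next_pwa_5xx = 12.1e6
org_total_5xx = 37.9e6

user_facing_share = (api_proxy_5xx + next_pwa_5xx) / org_total_5xx   # ~0.79
```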

Burst Pattern — Daily 5xx Rate (per second)

Queried as a daily range query to expose burst events. Values are per-second 5xx rates averaged over each one-day window.

Date          api-proxy (5xx/s)   next-pwa-app (5xx/s)   api-lazlo (5xx/s)   Event
Typical day   4–8                 2–6                    1–2                 Baseline
Mar 19        14.7                14.9                   1.3                 Coordinated spike
Apr 8         15.3                9.4                    9.7                 lazlo spike
Apr 13        20.6                19.6                   3.2                 Worst event in 30d
PromQL — Daily Burst Pattern (range query, last 30d)

    sum by (source) (
      rate(envoy_cluster_upstream_rq_xx{
        envoy_response_code_class="5",
        source=~"conveyor-cloud.ns.(api-proxy|next-pwa-app|api-lazlo)-production"
      }[1d])
    )

CLOSE_WAIT Proxy: rq_rx_reset (most specific)

Service                Stream resets received (30d)   Interpretation
api-proxy-production   1,042                          Backend RSTs during active requests — bursty, not constant
api-lazlo-production   135                            Lower volume, same pattern
deckard-production     33                             Low
api-lazlo-sox          19                             Low
All others             0                              Not affected

Small absolute numbers confirm CLOSE_WAIT accumulation is bursty, not constant. These events cluster during the burst windows, not spread uniformly across 30 days.

Pending Queue Overflow

Service         rq_pending_overflow (30d)   Note
api-lazlo-sox   101,821                     Pool not full (cx_overflow = 0) — backend latency causes queue backup. Timeout-budget misalignment.
api-lazlo       1,685                       Minor
All others      0

p99 Latency Baseline (Envoy upstream, 24h)

Service           p99 upstream latency (24h)
pull-production   1,799 ms
next-pwa-app      1,478 ms
api-lazlo-sox     996 ms
api-proxy         867 ms
PromQL — p99 Upstream Latency by Service

    histogram_quantile(0.99, sum by (le, source) (
      rate(envoy_cluster_upstream_rq_time_bucket{
        source=~"conveyor-cloud.ns.(api-proxy|next-pwa-app|pull|api-lazlo)-production.*"
      }[$__rate_interval])
    ))

7 JPROD Incident Correlation (Last 90 Days)

Queried via Jira JQL: project = JPROD AND created >= -90d ORDER BY created DESC. Of 44 incidents in the first page, the following match the CLOSE_WAIT burst pattern (connection exhaustion → 5xx spike → restart-to-recover).

JPROD-538 · 2026-04-16 · In Progress
  Checkout Failure via Order Request Timeout, NA — GQL order requests exceeding the 15s threshold. Services: next-pwa-app, API Proxy, Order service. Pattern: active now.

JPROD-525 · 2026-04-12 · Done
  Spike in 503s from GROUT to MBNXT, NA. Rollout restart of grout initiated. Correlates directly with the Apr 13 Prometheus spike (20.6/s api-proxy). Pattern: burst + restart.

JPROD-523 · 2026-04-10 · Done
  Groupon Intl is down — HTTP 504 Gateway Timeout; EMEA full outage; traffic reaching GROUT but timing out to MBNXT; grout pod restart mitigated. Pattern: restart-to-recover.

JPROD-529 · 2026-04-13 · Done
  GSS Pods CrashLoopBackOff — DB max connection breach due to idle connection accumulation. Structurally identical to CLOSE_WAIT: idle connections exhaust the pool → pods restart to recover. Pattern: idle connection exhaustion.

JPROD-530 (P0) · 2026-04-13 · Done
  Sub-task: "GSS to investigate how we can terminate idle connections at DB" — P0. Confirms idle-connection termination is a recognized fix pattern across teams. Pattern: fix confirmation.

JPROD-486 · 2026-03-24 · Done
  5xx Error Spikes and Pod Crashes on Incentive Service, NA — major 5xx spikes, pods crashing post-deployment, latency alerts triggered, ~113,000 pages affected. Pattern: 5xx + crash.
JPROD-538 is open today
The checkout timeout incident created this morning (Apr 16) confirms this analysis is not historical — the CLOSE_WAIT/timeout cascade is actively impacting revenue today.

CLOSE_WAIT is not named explicitly in any JPROD ticket. The failure pattern is instead described as "connection timeout," "503 spike," "pods need restart," or "idle connections." This is expected: CLOSE_WAIT is a TCP socket state visible only via ss on the pod itself, not in application logs or Jira descriptions. The pattern match is structural, not keyword-based.

Burst frequency from JPROD: 3–4 qualifying incidents in 90 days = ~1.2–1.5 per month. Combined with the Prometheus burst data (3 spikes in 30 days from the daily rate query), we use 3–3.5 bursts/month as the model input.

8 Business Metrics

Sourced from Tableau Superfunnel (GMV/session dashboard, global, Apr 16 2025 – Apr 15 2026).

Metric                 Value     Basis
GMV / session          $2.37     Global · L12M · Tableau Superfunnel
Total sessions         753M      Visitor-days · L12M
Sessions / day         2.06M     753M ÷ 365
Peak sessions/s (NA)   ~36/s     2.5× avg · 60% NA share
Profit / session       $0.59     $2.37 × 25% profit margin
Total GMV              $1.785B   GB + ILS + OD · net of Groupon promos
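The derived figures above reproduce from the two Tableau inputs. A quick sketch; the 2.5× peak factor and 60% NA share are modeling assumptions from this section, not measurements:

```python
# Reproduce the derived business inputs from the Superfunnel base figures
gmv_per_session = 2.37            # $ per session, global L12M
total_sessions = 753e6            # visitor-day sessions, L12M
profit_margin = 0.25              # assumed margin per instruction

sessions_per_day = total_sessions / 365                       # ~2.06M/day
avg_sessions_per_sec = sessions_per_day / 86_400              # ~24/s global average
peak_na_sessions_per_sec = avg_sessions_per_sec * 2.5 * 0.6   # 2.5x peak, 60% NA share -> ~36/s
profit_per_session = gmv_per_session * profit_margin          # ~$0.59
```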

Session Definition

Tableau Superfunnel defines a session as one unique visitor on one calendar day — COUNT(DISTINCT unique_visitors) from gbl_traffic_superfunnel, deduped daily. A visitor who comes on 50 different days = 50 sessions. This is the right denominator for a per-session monetisation rate, since GMV contribution is spread across all visit days.

Profit Margin

25% of GMV applied as the profit layer per instruction. This accounts for the fact that not all GMV is retained — merchant payouts, refunds, operational costs reduce the effective margin. Revenue impact shown at both GMV and profit levels throughout the model.

9 Impact Model: From Bursts to Revenue

Sessions Affected Per Burst

During a CLOSE_WAIT burst on next-pwa-app:

  • Peak 5xx rate: ~19–20/s on next-pwa-app (confirmed Apr 13)
  • Each failed SSR render = one user sees a broken/error page
  • Retry success rate: 0.004% → effectively no protection
  • Manual retry factor: ~1.5× (user tries once before abandoning)
  • Unique affected sessions/second: 19 ÷ 1.5 = ~13 sessions/s
  • As % of NA peak traffic (36/s): ~36% of NA sessions impacted during burst window
Model correction — April 2026
The initial model treated Envoy upstream 5xx/s as equivalent to user session failures/s. That was wrong. envoy_cluster_upstream_rq_xx measures internal API call failures, not user-facing request failures. next-pwa-app makes ~23 upstream API calls per page render (confirmed: 4,700 req/s total ÷ ~205 user-facing page renders/s at peak). A 20/s upstream failure rate corresponds to ~0.9 failed page renders/s, not 13 failed sessions/s. Numbers below are corrected.
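The corrected conversion can be written out explicitly. Note the 205 renders/s figure is this section's estimate, not a direct measurement:

```python
# Corrected mapping from upstream 5xx/s to failed page renders/s
upstream_calls_per_sec = 4_700      # total upstream API call rate at peak
page_renders_per_sec = 205          # estimated user-facing renders/s at peak

api_calls_per_render = upstream_calls_per_sec / page_renders_per_sec   # ~23
upstream_5xx_per_sec = 20           # Apr 13 peak on next-pwa-app
failed_renders_per_sec = upstream_5xx_per_sec / api_calls_per_render   # ~0.9, not 13
```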

Per-Burst Financial Impact

Burst duration based on restart-to-recover pattern: JPROD-523 and JPROD-525 both resolved by pod restart. Typical time from onset to restart = 15–45 min (mid-case: 30 min). Burst error rate confirmed via Prometheus: 1–1.5% of upstream calls fail during burst (vs 0.05% baseline). Actual burst peak 5xx ranges 40–214/s, with 10–15 distinct events observed per month.

Metric                              Calculation                       Value
Extra upstream API failures/burst   4,700/s × 1.45% excess × 1,800s   ~122,700
Failed page renders/burst           122,700 ÷ 23 API calls/page       ~5,300
Lost sessions/burst (30% abandon)   5,300 × 30%                       ~1,600
GMV at risk per burst               1,600 × $2.37                     ~$3,800
Profit at risk per burst            1,600 × $0.59                     ~$940
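The per-burst chain reproduces step by step from the section's stated inputs:

```python
# Per-burst impact chain (figures from this section)
upstream_calls_per_sec = 4_700
excess_error_rate = 0.0145          # 1.45% above the 0.05% baseline
burst_duration_s = 1_800            # ~30 min restart-to-recover window

extra_failures = upstream_calls_per_sec * excess_error_rate * burst_duration_s  # ~122,700
failed_renders = extra_failures / 23           # ~5,300 at ~23 API calls/render
lost_sessions = failed_renders * 0.30          # 30% abandon assumption -> ~1,600
gmv_at_risk = lost_sessions * 2.37             # ~$3,800
profit_at_risk = lost_sessions * 0.59          # ~$940
```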

Monthly & Annual Impact

Scenario                 Bursts/mo   Avg error rate   Sessions/mo   GMV/month   Profit/month   Profit/year
Conservative (NA only)   6           1.0%             6.4K          $15K        $3.8K          ~$45K
Mid-case (NA + EMEA)     10          1.5%             22.4K         $53K        $13.2K         ~$160K
High (NA + EMEA)         15          2.0%             45K           $107K       $26.7K         ~$320K

EMEA note: JPROD-523 was a confirmed full international outage (EMEA down). All scenarios above NA-only include EMEA at 40% of NA sessions — both mid-case and high. Conservative uses NA-only as a floor.
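The scenario figures reproduce from the per-burst chain with the 0.05% baseline subtracted and a 1.4× multiplier for EMEA, under the assumptions stated in this section:

```python
def monthly_lost_sessions(bursts_per_month: int, error_rate: float, include_emea: bool) -> float:
    """Lost sessions/month under this section's per-burst model."""
    excess = error_rate - 0.0005                 # subtract the 0.05% baseline error rate
    failures_per_burst = 4_700 * excess * 1_800  # extra upstream failures per 30-min burst
    lost = failures_per_burst / 23 * 0.30 * bursts_per_month
    return lost * 1.4 if include_emea else lost  # EMEA adds 40% of NA sessions

conservative = monthly_lost_sessions(6, 0.010, False)   # ~6.4K
mid_case = monthly_lost_sessions(10, 0.015, True)       # ~22.4K
high_case = monthly_lost_sessions(15, 0.020, True)      # ~45K
```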

What this model excludes (additional upside)

Checkout-funnel weighting (↑ higher impact)
  $2.37 GMV/session is an all-traffic average. Users mid-checkout who hit a 5xx have 5–10× higher expected value. JPROD-538 confirms bursts hit the checkout flow.

Latency drag, non-burst (↑ ongoing cost)
  p99 at ~1.5s on next-pwa continuously degrades conversion outside burst windows. Not modeled here.

Retry amplification (↑ worsens severity)
  21.6M failed retries on next-pwa increase backend load during bursts, potentially extending burst duration and severity.

Brand / LTV churn (↑ longer-term)
  Users who experience hard errors have measurably lower 30-day return rates. Not quantified.

10 ROI Conclusion

Stream 2 — End-User Latency & 5xx: Business Case

Fix cost (15 MDs × $600/MD blended, Conor + Sidiney)   $9,000
Conservative annual profit at risk (NA only)           ~$45K/year
Mid-case annual profit at risk (NA + EMEA)             ~$160K/year
High case annual profit at risk (NA + EMEA)            ~$320K/year
Payback period (mid-case)                              ~20 days
Year-1 ROI (mid-case)                                  ~18× return
Year-2+ ROI (zero fix cost, recurring savings)         Compound — growing as traffic grows
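The payback and ROI rows follow from the mid-case figure:

```python
# ROI arithmetic behind the business-case table (mid-case)
fix_cost = 15 * 600                      # 15 MDs at $600/MD blended = $9,000
annual_profit_at_risk = 160_000          # mid-case (NA + EMEA)

monthly_profit_at_risk = annual_profit_at_risk / 12      # ~$13.3K/month
payback_days = fix_cost / monthly_profit_at_risk * 30    # ~20 days
year1_roi = annual_profit_at_risk / fix_cost             # ~18x
```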

Confidence assessment

GMV per session ($2.37) · High
  Tableau Superfunnel, L12M, 753M sessions.

Burst frequency (10–15/mo observed) · Medium
  Prometheus 30d range query shows ~20 threshold events; unknown how many are CLOSE_WAIT vs deployment/other causes.

Burst error rate (1–6%, typical 1.5%) · High
  Prometheus error-rate query confirmed; range 0.7–5.8% across burst events vs a 0.05% baseline.

API calls per page render (~23) · Medium
  Derived: 4,700 upstream req/s ÷ estimated ~205 page renders/s at peak. Not directly measured — the biggest remaining uncertainty in the model.

Retry protection (0%) · High
  rq_retry_success / rq_retry = 0.004%, Prometheus 30d.

25% profit margin · Assumed
  Per instruction — cross-check with Finance if needed.

$600/MD blended engineering rate · Assumed
  Blended direct labor rate — direct cost basis, not fully loaded.

Burst duration (30 min) · Medium
  Estimated from the JPROD restart-to-recover pattern; no exact timestamps in post-mortems. Primary model uncertainty.

% sessions in checkout · Excluded
  Model uses average GMV/session — checkout users are worth more. Conservative assumption.

The two numbers that sharpen this most

1. API calls per page render. The model uses ~23 (derived from total upstream call rate ÷ estimated page render rate). If next-pwa makes fewer API calls per render (e.g. 10), the impact doubles; if more (e.g. 40), it halves. A direct measurement via envoy_http_downstream_rq_completed on next-pwa's inbound traffic would anchor this.

2. CLOSE_WAIT attribution rate. Prometheus shows 10–15 burst events/month where next-pwa 5xx exceeds 5/s, but not all are CLOSE_WAIT-triggered — some are deployment restarts, upstream failures, or other transients. If only 50% of burst events are CLOSE_WAIT-caused, the mid-case halves to ~$57K/year. IMOC post-mortem correlation with deployment timestamps would separate the two.
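The sensitivity in point 1 is a simple inverse relationship, sketched below:

```python
# Failed renders scale inversely with API calls per page render
def failed_renders(extra_upstream_failures: float, api_calls_per_render: float) -> float:
    return extra_upstream_failures / api_calls_per_render

base = failed_renders(122_700, 23)            # model's current estimate
low = failed_renders(122_700, 10) / base      # ~2.3x -> impact roughly doubles
high = failed_renders(122_700, 40) / base     # ~0.58x -> impact roughly halves
```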

A Appendix: All PromQL Reference Queries

All queries target datasource UID bdr94pj4npuyoa (Thanos querier, default). Run in Grafana Explore or paste into a panel.

5xx monitoring

5xx rate by service (live monitoring)

    sum by (source) (
      rate(envoy_cluster_upstream_rq_xx{
        envoy_response_code_class="5",
        source=~"conveyor-cloud.ns..*-production.*"
      }[$__rate_interval])
    )
5xx 30-day total by service

    sort_desc(sum by (source) (
      increase(envoy_cluster_upstream_rq_xx{
        envoy_response_code_class="5"
      }[30d])
    ))

CLOSE_WAIT proxies

Stream resets (most specific CLOSE_WAIT proxy) — should → 0 post-fix

    sum by (source) (
      rate(envoy_cluster_upstream_rq_rx_reset{
        source=~"conveyor-cloud.ns..*-production.*"
      }[$__rate_interval])
    )
Remote-close-with-active-requests RATIO (use ratio, not absolute)

    sum by (source) (
      rate(envoy_cluster_upstream_cx_destroy_remote_with_active_rq[$__rate_interval])
    )
    /
    sum by (source) (
      rate(envoy_cluster_upstream_cx_destroy_remote[$__rate_interval])
    )

Connection pool health

Pool overflow events (should be 0; non-zero = pool exhausted)

    sum by (source) (
      rate(envoy_cluster_upstream_cx_overflow[$__rate_interval])
      + rate(envoy_cluster_upstream_cx_pool_overflow[$__rate_interval])
    )
Pending queue overflow (rejected before connection)

    sort_desc(sum by (source) (
      rate(envoy_cluster_upstream_rq_pending_overflow[$__rate_interval])
    ))

Keep-alive tuning validation (post-fix)

Connection cycling via maxRequestsPerConnection and idle timeout

    sum by (source) (
      rate(envoy_cluster_upstream_cx_max_requests[$__rate_interval])
      + rate(envoy_cluster_upstream_cx_idle_timeout[$__rate_interval])
    )

Latency

p99 upstream latency by service

    histogram_quantile(0.99, sum by (le, source) (
      rate(envoy_cluster_upstream_rq_time_bucket{
        source=~"conveyor-cloud.ns.(api-proxy|next-pwa-app|pull|api-lazlo)-production.*"
      }[$__rate_interval])
    ))

Pod restarts (restart-to-recover tracking)

Container restarts by namespace (target: ≥50% reduction)

    sort_desc(sum by (namespace, pod) (
      increase(kube_pod_container_status_restarts_total{
        namespace=~"api-proxy-production|next-pwa-app-production|api-lazlo-production"
      }[30d])
    ) > 0)

Retry policy assessment

Retry success rate — confirms user visibility of errors

    sum by (source) (rate(envoy_cluster_upstream_rq_retry_success[$__rate_interval]))
    /
    sum by (source) (rate(envoy_cluster_upstream_rq_retry[$__rate_interval]))

Node-level socket health

TCP timeout and listen drop indicators

    # TCP-level retransmit timeouts (socket pressure)
    rate(node_netstat_TcpExt_TCPTimeouts[$__rate_interval])

    # Listen queue drops (OS backpressure)
    rate(node_netstat_TcpExt_ListenDrops[$__rate_interval])
Direct CLOSE_WAIT check (run on the pod, not Prometheus)

    # Run during a 5xx burst to confirm CLOSE_WAIT accumulation
    kubectl exec -n api-proxy-production <pod> -- ss -s | grep CLOSE-WAIT
    kubectl exec -n next-pwa-app-production <pod> -- ss -tan | awk '{print $1}' | sort | uniq -c