Audit Your Stack: A DevOps Playbook to Detect Underused SaaS with Logs & Billing
A practical DevOps playbook (with scripts and queries) to find underused SaaS using logs, billing and API probes — and safely kill or consolidate them.
Your SaaS bills are climbing while feature flags gather dust — here's how to prove what to kill.
Tool sprawl and hidden SaaS costs are a predictable drag: recurring invoices, duplicated features, integration maintenance and mental overhead for teams. If you’re a DevOps or platform lead asked to cut costs without breaking workflows, you need more than opinions: you need repeatable, scriptable audits that surface actual usage, per-feature ROI and technical overlap so you can justify consolidation or sunsetting with data.
What this playbook delivers (fast)
- Step-by-step procedures to build a scriptable SaaS inventory from logs, billing exports and repo scans.
- Queries and scripts (BigQuery / Athena / ELK / Python / bash) you can run today to quantify usage and cost per action.
- API probe patterns to validate live dependencies and surface shadow integrations.
- A practical scoring matrix to rank consolidation candidates and an operational checklist for safe decommissioning.
Context: Why 2026 makes this urgent
By 2026, SaaS proliferation has accelerated: more micro-SaaS vendors, AI-native feature add-ons and consumption-based pricing mean bills can spike unpredictably. FinOps practices matured across many orgs in 2024–2025, and major cloud providers expanded granular billing exports and real-time usage APIs. That makes this the year to move from manual guesswork to automated, evidence-based SaaS pruning.
Overview: The audit workflow (inverted pyramid — do this order)
- Inventory: Collect a canonical list of all SaaS subscriptions, service accounts, API keys and integrations.
- Telemetry: Gather usage logs, billing exports and API usage telemetry into a queryable store.
- Probes: Actively test endpoints, webhooks and tokens to confirm live dependencies.
- Normalize & Analyze: Map cost to activity and features; detect overlap.
- Score & Prioritize: Rank candidates to kill, consolidate, or keep.
- Decommission Plan: Migration steps, data retention, and rollback triggers.
1) Inventory: Automated discovery (80% of the battle)
Start with a canonical SaaS inventory. Manual spreadsheets are fragile; automate discovery from three sources:
- Billing exports (credit card statements, vendor invoices, cloud marketplace).
- Source repositories and IaC (search for SDKs, providers, and env vars).
- Secrets stores and CI/CD config (service tokens live here).
Repo & config scan (fast wins)
Use ripgrep or ag to find SDK usage and env var names. Run from your monorepo root and aggregate results.
# Find common provider SDKs and API keys with ripgrep (rg)
rg --hidden --no-ignore-vcs "(AWS|GCP|AZURE|SLACK|SENDGRID|STRIPE|SENTRY|DATADOG|ROLLBAR|NEWRELIC|AUTH0|OKTA)" -S -n
# Scan for environment variables that look like API keys
rg --hidden --no-ignore-vcs "(API_KEY|_TOKEN|_SECRET|CLIENT_ID|CLIENT_SECRET|SERVICE_ACCOUNT)" -S -n
Export results to CSV and correlate with billing records. This uncovers hidden integrations and developer experiments.
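To turn raw rg output into that CSV, a small tally script is enough. This is a minimal sketch: the vendor keyword list is an assumption that should mirror whatever patterns you grep for, and the `path:line:match` input shape is rg's default `-n` output.

```python
import csv
from collections import Counter

# Vendor keywords to tally; an assumption — keep this in sync with your rg pattern.
VENDORS = ["AWS", "GCP", "AZURE", "SLACK", "SENDGRID", "STRIPE",
           "SENTRY", "DATADOG", "AUTH0", "OKTA"]

def tally_rg_output(lines):
    """Count hits per vendor keyword from rg's `path:line:match` output lines."""
    counts = Counter()
    for line in lines:
        upper = line.upper()
        for vendor in VENDORS:
            if vendor in upper:
                counts[vendor] += 1
    return counts

def write_vendor_csv(counts, path):
    """Write vendor hit counts to a CSV for correlation with billing records."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["vendor", "hits"])
        for vendor, hits in counts.most_common():
            writer.writerow([vendor, hits])
```

Feed it the captured rg output (e.g. lines read from a file) and join the resulting CSV against your billing export by vendor name.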
Secrets & CI systems
Query your secrets manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) for key names. Many vendors use standard prefixes — this helps find service accounts used only in pipelines.
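Once you have the key names out of the secrets manager, classifying them by prefix is a one-pager. The prefix-to-vendor map below is illustrative (your org's naming conventions will differ), not a canonical list:

```python
# Assumed prefix conventions — replace with the prefixes your org actually uses.
PREFIX_TO_VENDOR = {
    "DD_": "Datadog",
    "SENTRY_": "Sentry",
    "STRIPE_": "Stripe",
    "SLACK_": "Slack",
    "SG_": "SendGrid",
}

def classify_secret(name: str) -> str:
    """Map a secret/key name to a vendor by prefix; 'unknown' needs manual triage."""
    upper = name.upper()
    for prefix, vendor in PREFIX_TO_VENDOR.items():
        if upper.startswith(prefix):
            return vendor
    return "unknown"
```

Run it over the full key listing and the "unknown" bucket becomes your manual-review queue.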
2) Telemetry: Ingest billing & usage data into a single store
Combine three streams into one analytics layer: billing exports, SaaS provider usage APIs, and your own request logs. Use an existing data warehouse (BigQuery, Snowflake) or cloud storage + query engine (S3 + Athena).
Common ingestion sources and how to get them
- Cloud billing: AWS Cost and Usage Report (CUR) to S3 → Athena; GCP Billing export to BigQuery; Azure Cost Management exports to Storage → Synapse/Azure Data Explorer.
- SaaS invoices: Pull vendor invoices via accounting exports (CSV) or via vendor billing APIs where available.
- Usage APIs: Many vendors (Datadog, Snyk, Sentry, etc.) expose endpoints for API key usage, seat counts and metered features.
- Application logs: Centralize to ELK/Opensearch or Splunk and forward to data warehouse.
Sample SQL: find top SaaS cost centers (BigQuery / Athena style)
SELECT
service AS vendor,
SUM(cost) AS total_cost,
COUNT(DISTINCT invoice_id) AS invoices
FROM
billing_exports.saas_costs
WHERE
usage_start BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) AND CURRENT_DATE()
GROUP BY
service
ORDER BY
total_cost DESC
LIMIT 50;
Map cost to teams and tags
If you tag invoices or use cost centers in cloud providers, join billing data with your org mapping table. If not, infer team ownership by searching for account emails or project IDs in invoice metadata.
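The inference step can be sketched as a lookup chain: email local-part first, then project-ID prefix, then fall through to "unassigned". The field names (`account_email`, `project_id`) and the mapping tables are assumptions; adapt them to your invoice export schema.

```python
# Hypothetical ownership maps — replace with your org's real data.
TEAM_BY_EMAIL_LOCAL = {"platform-bot": "platform", "data-eng": "data"}
TEAM_BY_PROJECT_PREFIX = {"ml-": "data", "infra-": "platform"}

def infer_team(invoice: dict) -> str:
    """Infer owner_team from invoice metadata when no cost-center tags exist."""
    email = invoice.get("account_email", "")
    local = email.split("@", 1)[0]
    if local in TEAM_BY_EMAIL_LOCAL:
        return TEAM_BY_EMAIL_LOCAL[local]
    project = invoice.get("project_id", "")
    for prefix, team in TEAM_BY_PROJECT_PREFIX.items():
        if project.startswith(prefix):
            return team
    return "unassigned"
```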
3) Logs: Usage telemetry queries that reveal real users
Logs answer different questions than billing. Billing shows spend; logs show who actually used a feature and when. Look at API call counts, feature flags, webhook deliveries and metric emission.
ELK / OpenSearch query examples
# Find unique users hitting a vendor integration endpoint in the last 30 days
POST /app-logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"path": "/thirdparty/slack"}},
        {"range": {"@timestamp": {"gte": "now-30d"}}}
      ]
    }
  },
  "aggs": {
    "users": {"cardinality": {"field": "user.id"}}
  }
}
SQL on events (data warehouse)
SELECT
  integration_name,
  COUNT(*) AS calls,
  COUNT(DISTINCT user_id) AS active_users,
  -- PERCENTILE_CONT is analytic-only in BigQuery and can't be mixed with GROUP BY;
  -- APPROX_QUANTILES gives a grouped p95 instead.
  APPROX_QUANTILES(call_duration, 100)[OFFSET(95)] AS p95_ms
FROM
  telemetry.api_calls
WHERE
  integration_name IS NOT NULL
  AND event_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AND CURRENT_TIMESTAMP()
GROUP BY
  integration_name
ORDER BY
  active_users DESC;
Key metrics to compute
- Active users: distinct users who used a product feature in the last 30/90/365 days.
- Cost per active user: vendor monthly cost / active users.
- Calls per day: signal for automated integrations.
- Peak vs median: to spot over-provisioning or spikes from cron jobs.
- Error surface: high error rates on vendor API calls indicate brittle integrations.
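As a rough sketch, the metrics above can be computed from raw call records. The record fields (`user_id`, `day`, `error`) are illustrative assumptions; map them to whatever your telemetry actually emits.

```python
from statistics import median

def usage_metrics(calls, monthly_cost):
    """Compute active users, cost per active user, calls/day and error rate
    from a list of call records like {'user_id', 'day', 'error': bool}."""
    active_users = len({c["user_id"] for c in calls})
    per_day = {}
    for c in calls:
        per_day[c["day"]] = per_day.get(c["day"], 0) + 1
    daily = list(per_day.values()) or [0]
    errors = sum(1 for c in calls if c.get("error"))
    return {
        "active_users": active_users,
        "cost_per_active_user": monthly_cost / active_users if active_users else None,
        "calls_per_day_median": median(daily),
        "calls_per_day_peak": max(daily),  # peak vs median flags cron spikes
        "error_rate": errors / len(calls) if calls else 0.0,
    }
```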
4) API probes: Confirm live dependencies with safe probes
Logs show historical usage; probes validate current, possibly undocumented, dependencies. Build ephemeral probes with a low blast radius and rate limits.
Probe patterns
- Token introspection: Use vendor APIs to list active API keys and last-used timestamps.
- Webhook delivery audits: Check vendor webhook management APIs and verify destination success rates.
- Endpoint reachability probes: Regularly call integration endpoints (with test payloads) to confirm they're being hit and measure latencies.
Python example: list active API keys from a hypothetical vendor
import requests

API_BASE = 'https://api.vendor.example/v1'
ADMIN_KEY = 'REDACTED_ADMIN_KEY'

resp = requests.get(
    f"{API_BASE}/admin/api_keys",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    timeout=10,  # never probe without a timeout
)
resp.raise_for_status()
for key in resp.json().get('keys', []):
    print(key['id'], key['last_used_at'], key['owner'])
Use pagination and exponential backoff. Many vendors return last_used timestamps — that's gold for identifying stale keys.
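A reusable pagination-with-backoff wrapper keeps probes polite. This is a sketch against a hypothetical page shape (`{'keys': [...], 'next_cursor': ...}`), not a real vendor API; inject your own `fetch_page` that does the HTTP call.

```python
import time

def fetch_all_keys(fetch_page, max_retries=5, base_delay=1.0):
    """Drain a cursor-paginated endpoint with exponential backoff on failures.
    `fetch_page(cursor)` returns a page dict or raises on transient errors."""
    keys, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(cursor)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        keys.extend(page["keys"])
        cursor = page.get("next_cursor")
        if not cursor:
            return keys
```

Injecting `fetch_page` also makes the probe trivially testable against canned pages before you point it at a live vendor.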
5) Integration discovery: scan for inbound & outbound hooks
Webhooks are the usual culprit for hidden dependencies. Check both sides:
- Vendor dashboard — list webhook subscriptions and targets.
- Your system logs — search for 2xx responses to vendor IP ranges or user-agents.
Example: find webhook receivers in Nginx logs
# count requests whose user-agent contains "vendor-webhook", grouped by request path
# (field positions vary by log format — adjust $7 to match yours)
zgrep "vendor-webhook" /var/log/nginx/*access*.gz | awk '{print $7}' | sort | uniq -c | sort -rn
6) Normalize & analyze: Join cost, logs and inventory
Now join the datasets into a single table keyed by vendor and team. Weight metrics to compute a usage score and cost-impact. Example schema columns:
- vendor, service_id, owner_team
- monthly_cost, invoices_last_12m
- active_users_30d, calls_30d, errors_30d
- last_api_key_use, webhook_success_rate
Scoring algorithm (practical)
Compute two normalized scores in [0, 1] — UsageScore and CostScore — plus a RiskFactor. Then compute CandidateScore = CostScore * (1 - UsageScore) * RiskFactor.
# Pseudocode
UsageScore = normalize(active_users_30d / team_size)        # 0–1
CostScore = normalize(monthly_cost / total_platform_cost)   # 0–1
RiskFactor = 1 if last_api_key_use > 180 days ago else 0.5  # stale keys are safer to cut
CandidateScore = CostScore * (1 - UsageScore) * RiskFactor
# High CandidateScore -> prioritize for consolidation or sunsetting
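A concrete implementation of this scoring is short. Note the assumptions: scores are clamped to [0, 1], and stale keys (unused for more than 180 days) are treated as safer to cut, so they raise the candidate score. Row field names are illustrative.

```python
def normalize(value, max_value):
    """Clamp a ratio into [0, 1]; all scores live on this scale."""
    if max_value <= 0:
        return 0.0
    return min(value / max_value, 1.0)

def candidate_score(row, team_size, total_platform_cost):
    """High score = expensive, lightly used, and safe to remove."""
    usage = normalize(row["active_users_30d"], team_size)
    cost = normalize(row["monthly_cost"], total_platform_cost)
    # Keys unused > 180 days carry less removal risk, so full weight.
    risk = 1.0 if row["days_since_last_key_use"] > 180 else 0.5
    return cost * (1 - usage) * risk
```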
7) Detect feature overlap and consolidation paths
Feature overlap can be the hardest to quantify. Build a feature matrix with boolean flags and usage counts. Example features: notifications, error-tracking, APM, SSO, secrets management, email delivery.
Practical method
- List features for each vendor (manual + vendor docs).
- Map events or logs to features (e.g., errors → error-tracking).
- For each feature, compute active_users and calls across vendors.
- If a single vendor covers 90%+ of feature events vs others covering <10%, consolidation is feasible.
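Computing per-feature coverage shares is the mechanical part; a minimal sketch, assuming you have already mapped events to features per vendor:

```python
def feature_coverage(event_counts):
    """Given {vendor: event_count} for one feature, return each vendor's share.
    If one vendor's share approaches 0.9+, consolidation is feasible."""
    total = sum(event_counts.values())
    if total == 0:
        return {vendor: 0.0 for vendor in event_counts}
    return {vendor: n / total for vendor, n in event_counts.items()}
```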
Useful thresholds (opinionated but practical)
- Decommission candidate: monthly_cost > $500 and active_users_30d < 5% of expected users.
- Consolidation candidate: feature overlap > 60% and combined operational cost > 1.5x a single vendor option.
- Keep but review: high cost but high criticality (SSO, secrets, billing).
8) Decommissioning checklist (operational playbook)
- Stakeholder communication: announce impact, timeline, owners.
- Map affected teams, SLA changes, and rollback owners.
- Data export: export historical data and retention policy (CSV, JSON, or vendor export API).
- Migration runbook: cutover steps, integration points to update, migration scripts.
- Monitoring: create pre/post health checks (uptime, error rates, user complaints) and set rollback triggers.
- Contracts: review termination clauses, notice periods and data deletion timelines.
- Post-mortem: capture lessons and update your procurement guardrails.
Real-world example (short case study)
In late 2025, a mid-market SaaS company ran this audit and discovered two underused tools: a separate error-tracking product and a lightweight SaaS that handled internal notifications. The audit found:
- Monthly spend: $3.2k combined
- Error-tracking: 85% of events were already captured in the APM provider (which had a lower cost per event).
- Notifications: only 12 active users (internal engineers) and webhooks had a 95% delivery rate to a general Slack channel — trivial to migrate.
They consolidated error-tracking into the APM and replaced the notification SaaS with internally-managed webhooks. Annual savings: ~$30k and a 30% reduction in alerts maintenance time. The migration took 6 weeks from discovery to full decommission, with zero customer impact because the audit produced exact usage evidence and a tested rollback plan.
Automation recipes & CI checks (make audits routine)
Treat the audit as a pipeline job. Example pipeline tasks:
- Nightly job to update SaaS inventory (billing + repo + secrets scan).
- Weekly BigQuery/Athena queries to recompute candidate scores.
- Alert if a vendor's cost increases > 30% month-over-month or API key last_used > 365 days without recent calls.
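The alert condition above reduces to a predicate you can run in the weekly job. The thresholds (30% month-over-month, 365 days) come straight from the list; the argument names are assumptions about your joined table.

```python
def should_alert(prev_cost, curr_cost, days_since_key_use, calls_30d):
    """Flag a vendor when cost rises >30% month-over-month, or an API key
    has been unused for >365 days with no recent calls."""
    cost_spike = prev_cost > 0 and (curr_cost - prev_cost) / prev_cost > 0.30
    stale_key = days_since_key_use > 365 and calls_30d == 0
    return cost_spike or stale_key
```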
Sample GitHub Actions step (pseudo)
name: saas-audit
on:
  schedule:
    - cron: '0 3 * * 1'  # weekly, Mondays 03:00 UTC
jobs:
  run-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run inventory scan
        run: bash scripts/inventory_scan.sh
      - name: Upload to BigQuery
        run: python scripts/upload_to_bq.py
Common pitfalls & how to avoid them
- Relying on billing alone — bills lack granular feature mapping. Always join with logs and API telemetry.
- False positives from stale keys — verify last-used and actual calls before cutting access.
- Underestimating internal workflows — interview power users before killing tools used by a small but critical team.
- Ignoring contractual obligations — some contracts have minimum terms or data retention rules.
"Decommissioning without data is risky. Decommissioning with telemetry is accountable and reversible."
Advanced strategies for large orgs (scale & governance)
If you manage hundreds of vendors, add these layers:
- Feature catalogue: a centralized CMDB-style catalogue mapping vendors to capabilities and SLAs.
- Automated dependency graphs: instrument ingestion of CI/CD pipelines, Cloud IAM, and runtime logs to build a service dependency graph (e.g., Neo4j).
- Procurement policy enforcement: CI gate that rejects new SaaS unless a business case, owner and tag are provided.
- Cost guardrails: Programmatic alerts for per-vendor monthly spend limits using cloud provider budgets and vendor webhooks.
Future predictions (2026 and beyond)
Expect these trends in 2026–2027:
- Richer vendor usage APIs: Vendors will provide more granular, machine-readable usage telemetry as customers demand transparent unit pricing.
- Real-time FinOps: Real-time billing streams and predictive budgeting will make sudden spikes easier to catch.
- Policy-as-code for SaaS: Governance frameworks will extend to SaaS procurement, enabling automated approvals and spend limits.
Actionable takeaways — run this in 30/60/90 days
- 30 days: Run the repo + secrets scan and ingest last 90 days of billing into a warehouse. Produce a top-20 vendors by spend report.
- 60 days: Join billing with telemetry, compute CandidateScore, and run API probes for top 10 candidates. Schedule stakeholder reviews.
- 90 days: Execute 1–2 low-risk decommissions, measure savings and update procurement policy to prevent recurrence.
Call to action
If you want a reproducible starter repository with the scans, BigQuery templates and Python probe scripts used in this playbook, grab the frees.cloud DevOps SaaS Audit repo and run the included GitHub Action. Or reach out to your platform team and propose a 90-day audit sprint — start small, automate often, and make each decommission a data-driven win.