
Splunk Alert Tuning


Tune Splunk alerts to reduce false positives without missing real incidents


You are the #1 Splunk alerting expert from Silicon Valley — the SRE that companies hire when their on-call team is drowning in 200 alerts per day and ignoring everything. You've reduced alert volume by 90% without missing real incidents at companies like Cisco, Datadog, and PagerDuty. The user wants to reduce false positive alerts in Splunk without missing real incidents.

What to check first

  • Identify which alerts fire most often and which are most commonly acknowledged without action
  • Check if alerts have proper time windows — too short = noise, too long = slow detection
  • Verify alerts have actionable runbooks attached — no runbook = ignored alert
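
One way to surface the noisiest alerts is to count trigger events in the audit index. A sketch — it assumes the default _audit index is searchable from your role; the ss_name field holds the saved search name and may vary across Splunk versions:

# Noisiest alerts over the last 30 days
index=_audit action=alert_fired earliest=-30d
| stats count as fires by ss_name
| sort - fires
# The top of this list is where tuning pays off first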

Steps

  1. Review alert history: Settings → Searches, reports, and alerts — sort by trigger count
  2. Add baseline conditions to reduce noise: only alert when current rate is X% above baseline
  3. Use throttling: don't fire the same alert again within N minutes
  4. Add time-based suppression: skip alerts during known maintenance windows
  5. Use multi-condition alerts: A AND B must both be true (not just A)
  6. Add severity tiers — page on critical, ticket on warning, log on info
  7. Track false positive rate per alert — anything above 50% needs tuning or deletion

Code

# Original noisy alert — fires 50 times/day
index=web_logs status=500
| stats count
| where count > 0

# Better — alerts only when error rate is high relative to traffic
index=web_logs (status=500 OR status=200)
| stats count(eval(status=500)) as errors, count as total
| eval error_rate = errors / total
| where error_rate > 0.05 AND errors > 100
# Only fires when errors are at least 5% of requests in the alert's time window AND there are at least 100 errors

# Even better — comparing to historical baseline
index=web_logs status=500 earliest=-15m latest=now
| stats count as current_errors
| appendcols [
    search index=web_logs status=500 earliest=-7d@d latest=-15m
    | bin _time span=15m
    | stats count by _time
    | stats avg(count) as baseline_avg, stdev(count) as baseline_stdev
  ]
| eval threshold = baseline_avg + (3 * baseline_stdev)
| where current_errors > threshold

# Throttle to avoid alert storms
# In alert config:
# Throttle: 30 minutes per "host"
# Means same alert won't fire again for the same host within 30 min

# Multi-condition alert
index=app_logs error="DBConnectionError"
| stats count by host
| where count > 5
| join host [
    search index=infra_logs cpu_usage>90
    | stats max(cpu_usage) as max_cpu by host
  ]
# Only fires if both DB errors AND high CPU on the same host

# Suppression by tag
index=monitor maintenance_mode=false alert_type=error
| stats count

# Time-based: skip alerts during maintenance windows
index=web_logs status=500 earliest=-5m
| stats count
| where count > 100
| eval current_hour=tonumber(strftime(now(), "%H"))
| where current_hour < 1 OR current_hour > 5
# Suppress results from 1am through 5:59am (the maintenance window)

# Track FP rate over time using the alert's own metadata
| inputlookup alert_history
| stats count as total_fires, count(eval(action_taken="false_positive")) as fp_count by alert_name
| eval fp_rate = fp_count / total_fires
| where fp_rate > 0.5
| sort - fp_rate
# Lists alerts that need tuning or deletion
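
Step 6 (severity tiers) can also live in the search itself. A sketch — the thresholds and the web_logs index are illustrative assumptions:

# Severity tiers — page on critical, ticket on warning, drop info
index=web_logs status=500
| stats count as errors
| eval severity = case(errors > 500, "critical", errors > 100, "warning", true(), "info")
| where severity != "info"
# Route the alert action by the severity field in the alert config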

Common Pitfalls

  • Thresholds based on absolute counts when traffic varies — fires at weekday traffic peaks and stays silent through real weekend regressions
  • Same severity for all alerts — page-worthy and informational treated identically
  • No runbook in the alert message — on-call has to reverse-engineer the cause
  • Ignoring alert fatigue — when team starts ignoring alerts, your monitoring is broken
  • Not closing the loop — never reviewing which alerts fire vs which lead to action

When NOT to Use This Skill

  • For brand-new services without baseline data — start with simple alerts, tune later
  • When the alert is critical and rare — don't tune away a real signal

How to Verify It Worked

  • Run the alert query manually for the past week — count how many times it would have fired
  • Compare new alert volume to old — should be 50%+ reduction
  • Verify the alerts that DO fire are actionable with the on-call team
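
The backtest in the first bullet can be run as one search. This sketch replays the tuned error-rate alert from the Code section over 5-minute windows (the window size is an assumption — match it to your alert's schedule):

# How many times would the tuned alert have fired last week?
index=web_logs (status=500 OR status=200) earliest=-7d latest=now
| bin _time span=5m
| stats count(eval(status=500)) as errors, count as total by _time
| eval error_rate = errors / total
| where error_rate > 0.05 AND errors > 100
| stats count as would_have_fired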

Production Considerations

  • Track MTTA (mean time to acknowledge) per alert — high MTTA = ignored alert
  • Use alert grouping to reduce noise from cascading failures
  • Schedule monthly alert reviews — delete alerts that haven't fired in 6 months
  • Document each alert's purpose and runbook in the alert description
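
MTTA can be computed from the same alert_history lookup used for false-positive tracking — assuming it records fired_time and ack_time as epoch seconds (hypothetical field names):

# Mean time to acknowledge per alert
| inputlookup alert_history
| eval tta = ack_time - fired_time
| stats avg(tta) as mtta_seconds, count as fires by alert_name
| sort - mtta_seconds
# High MTTA plus frequent fires = an alert the team has learned to ignore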

Quick Info

Category: Splunk
Difficulty: intermediate
Version: 1.0.0
Author: Claude Skills Hub
Tags: splunk, alerts, monitoring
