Maintenance & SRE | Effektiv

73%

reduction in human pages — ASX-200 retailer, 200 pages a week to 54

8 min

Black Friday CDN fault caught before it reached production — precursor in telemetry, zero human pages

$1.2M

annual run-cost saved at the ASX-200 retail engagement in year one

≤60s

mean time to detect for known fault patterns, versus 4–12 minutes before the triage agents

The structural problem with traditional SRE

Your on-call team is spending 70% of their nights on alerts a machine could close.

Read 12 months of incident logs for a mid-market stack and the same five fault patterns account for the majority of out-of-hours pages. The same alert fires. The same runbook gets pulled. The same engineer types the same commands at 3am. One ASX-200 retailer was sending 200 pages a week to a small SRE team. Effektiv read 12 months of its incident logs, extracted the fix steps from actual resolution data rather than from runbooks, and built triage agents around those patterns.

Human pages dropped 73%. Eight minutes before a Black Friday CDN failure would have reached production, the triage agent caught the precursor in telemetry and closed the incident without waking anyone. A vendor priced on seat count has no incentive to reduce the volume of incidents a human handles. Effektiv's retainer is written the opposite way: the bill goes down as the automation rate goes up.

What changes

The same challenge. Two very different outcomes.

Without Effektiv

200 pages a week into a small SRE team
Same five fault patterns re-paged every week
Engineers type the same commands at 3am, every week
Post-mortems in a shared drive nobody re-reads
Mean time to detect 4–12 minutes from precursor to alert
Vendor priced by seat — no incentive to reduce volume

With Effektiv

27% of pages reach a human — the rest auto-close
Triage agents read alerts against 12 months of actual fix data
Known faults close behind a rollback gate without a person in the loop
Resolution database queryable and extendable by your team
Mean time to detect under 60 seconds for known patterns
Retainer goes down as the automation rate goes up — incentives aligned

Three on-call vendor models.

Dimension	Effektiv agent triage	Vendor priced by seat	Alert-to-Jira automation
Incentive alignment	Bill goes down as automation rises	Bill rises with seats	Per-event pricing
Human-page reduction	50–70%	0–10%	15–25%
Rollback gate per step	Yes, named in Design	None	Manual rollback
Mean time to detect	≤60s for known patterns	4–12 minutes	1–3 minutes
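
To make "rollback gate per step" concrete, here is a minimal sketch, not Effektiv's implementation. It assumes each remediation step carries its own apply, check, and rollback callables. The step names and the `run_with_rollback` helper are hypothetical.

```python
def run_with_rollback(steps, page_human):
    """Apply steps in order. If a step fails its check, undo every
    completed step in reverse order and page a human."""
    done = []
    for step in steps:
        step["apply"]()
        done.append(step)
        if not step["check"]():
            for prev in reversed(done):
                prev["rollback"]()
            page_human(f"step '{step['name']}' failed its check; rolled back")
            return False
    return True


# Hypothetical two-step fix where the second step fails its check.
log, pages = [], []
steps = [
    {"name": "drain", "apply": lambda: log.append("apply drain"),
     "check": lambda: True, "rollback": lambda: log.append("undo drain")},
    {"name": "restart", "apply": lambda: log.append("apply restart"),
     "check": lambda: False, "rollback": lambda: log.append("undo restart")},
]
closed = run_with_rollback(steps, pages.append)
```

In this sketch a failed check never leaves a half-applied fix in place: the incident either closes cleanly or lands with a person along with the reason.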

How we deliver

Diagnose. Design. Deliver.

Two weeks of listening before a line of code. The price is fixed at the end of Design — not at kick-off.

Phase 1 · 1–2 weeks

Diagnose

We map your incident log, runbooks, and cost telemetry. We read 12 months of actual incident history — the commands engineers actually ran to resolve each fault, not the runbooks people meant to follow. We identify which patterns are candidates for automation and which need a human in the loop by design.
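
As a rough illustration of the Diagnose step, the sketch below ranks fault patterns by the number of out-of-hours pages they caused and reports the share covered by the top few. The record fields (`pattern`, `out_of_hours`) and the sample data are hypothetical, not a real incident schema.

```python
from collections import Counter


def top_pattern_share(incidents, top_n=5):
    """Return (top pattern names, share of out-of-hours pages they cover)."""
    counts = Counter(i["pattern"] for i in incidents if i["out_of_hours"])
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(top_n)
    covered = sum(n for _, n in top)
    return [name for name, _ in top], covered / total


# Hypothetical incident records.
incidents = (
    [{"pattern": "cdn-origin-timeout", "out_of_hours": True}] * 6
    + [{"pattern": "disk-pressure", "out_of_hours": True}] * 3
    + [{"pattern": "one-off-dns", "out_of_hours": True}] * 1
    + [{"pattern": "disk-pressure", "out_of_hours": False}] * 4
)
patterns, share = top_pattern_share(incidents, top_n=2)
```

Patterns with a high share and a repeatable fix are the natural automation candidates. The long tail stays with a human.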

Phase 2 · 1–2 weeks

Design

We specify the triage rig, the rollback rules, and the eval gates, and document the human-in-the-loop requirements. Any fault pattern touching a money write or a record of truth stays gated. All model inference runs on Amazon Bedrock in AU regions, inheriting VPC, IAM, PrivateLink, CloudTrail, and KMS controls.
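
One way to picture the gating rule is the sketch below. It assumes each fault pattern is tagged with the kinds of state its fix touches. The tag names, the `FaultPattern` type, and the extra rule that a pattern with no rollback step also goes to a human are illustrative assumptions, not the Design output itself.

```python
from dataclasses import dataclass, field

# Assumed tags for fixes that must stay behind human approval.
GATED_TAGS = frozenset({"money_write", "record_of_truth"})


@dataclass(frozen=True)
class FaultPattern:
    name: str
    touches: frozenset = field(default_factory=frozenset)
    has_rollback: bool = False


def requires_human(pattern: FaultPattern) -> bool:
    """Gate when the fix touches a gated kind of state, or (an assumption
    of this sketch) when no rollback step has been defined for it."""
    return bool(pattern.touches & GATED_TAGS) or not pattern.has_rollback


# Hypothetical patterns.
cache_flush = FaultPattern("cdn-cache-flush", frozenset({"cache"}), has_rollback=True)
refund_retry = FaultPattern("refund-retry", frozenset({"money_write"}), has_rollback=True)
```

Here `cache_flush` may close automatically, while `refund_retry` always routes to a person no matter how well the pattern is understood.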

Phase 3 · 4–8 weeks

Deliver

Triage agents are built and tuned in a parallel run alongside your existing on-call process. The switch-over is incremental, not a single cut-over. The outcome contract names the deflect rate and mean-time-to-detect targets, both measured and reported weekly.

What you walk away with

Everything ships to your team at exit. No lock-in.

🛠

Triage agents in production

Trained on 12 months of your incident history. Your repo, your control. Extendable by your team without us.

🧪

Resolution database

Real fix steps from real incidents. Indexed, queryable, and extendable — not a static runbook.

🗄

Eval gates as code

Triage accuracy, mean time to detect, false-positive rate, human-page reduction, and incident review completion, all as runnable code.

📒

Detection latency board

Mean time to detect tracked weekly against the contract target. Visible to your team and ours.

🎓

On-call handover pack

Roles and protocols documented. Your team extends with new fault patterns without calling us back.

Quality gates

What the eval rig measures.

Every output passes a multi-gate evaluation before it merges or ships. Outputs that fail do not proceed. The eval rig and all gate code are yours at exit.

Triage accuracy — correct routing as a percentage of total alerts, threshold agreed in Design
Mean time to detect for known fault patterns — target under 60 seconds
False-positive rate on AI triage decisions — any drift triggers an eval refresh and a paused-automation period
Human-page reduction vs the prior baseline — measured weekly against the contract target
Incident review completion rate — AI agent contributes diagnostics on every paged incident
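
The five gates above could be expressed as code along these lines. This is a sketch under stated assumptions: only the 60-second detection target comes from this page. Every other threshold is a placeholder for a value agreed in Design, and the metric keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    name: str
    passes: Callable[[float], bool]


# Placeholder thresholds, except the 60-second detection target.
GATES = {
    "triage_accuracy": Gate("Triage accuracy", lambda v: v >= 0.95),
    "mttd_seconds": Gate("Mean time to detect", lambda v: v <= 60),
    "false_positive_rate": Gate("False-positive rate", lambda v: v <= 0.02),
    "human_page_reduction": Gate("Human-page reduction", lambda v: v >= 0.50),
    "review_completion": Gate("Incident review completion", lambda v: v >= 1.0),
}


def run_gates(metrics):
    """Return {gate name: 'PASS' or 'FAIL'}; a missing metric fails its gate."""
    results = {}
    for key, gate in GATES.items():
        value = metrics.get(key)
        ok = value is not None and gate.passes(value)
        results[gate.name] = "PASS" if ok else "FAIL"
    return results


def may_proceed(results):
    """An output proceeds only if every gate passed."""
    return all(status == "PASS" for status in results.values())


# Hypothetical weekly metrics.
metrics = {
    "triage_accuracy": 0.97,
    "mttd_seconds": 42,
    "false_positive_rate": 0.01,
    "human_page_reduction": 0.73,
    "review_completion": 1.0,
}
results = run_gates(metrics)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that cannot be measured should block rather than pass silently.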

Eval rig · sample run

Triage accuracy	PASS
Mean time to detect for known fault patterns	PASS
False-positive rate on AI triage decisions	PASS
Human-page reduction vs the prior baseline	PASS
Incident review completion rate	PASS

Eval rig source code shipped to your repo at exit.

Other ways we work with you.

Modernisation

Operations & Integration

AI Adoption

Customer Experience

Software Build

Frequently asked questions.

What does 'AI in the alert path' mean without autonomous remediation?

How does the retainer pricing work and what does it include?

We have a 24/7 on-call roster that is burning out our engineers. How quickly can you change that?

What does the eval rig look for in a Maintenance & SRE engagement?

Can you take over maintenance for a system Effektiv didn't build?

What happens when something goes seriously wrong — a major outage?

See what your on-call stack looks like with AI in the alert path.