IoT Predictive Maintenance on AWS: 40% Downtime Reduction

Challenge: Reactive Maintenance on High-Value Production Equipment

A mid-size manufacturer of precision industrial components operated 14 production lines across two facilities, running 280 production assets — CNC machining centers, assembly robots, conveyor systems, and HVAC units supporting controlled manufacturing environments. The maintenance strategy was reactive: equipment was repaired after it failed.

The cost of this approach had become quantifiable. Over the previous fiscal year, unplanned downtime had caused:

$4.2M in lost production output across 847 unplanned downtime events
$680,000 in emergency maintenance parts — purchased at spot prices when planned procurement would have cost 30-40% less
3 customer delivery delays that resulted in contractual penalty payments totaling $290,000
Elevated safety incident risk — three of the year’s twelve OSHA recordable incidents were associated with equipment failure events

The plant manager had read about predictive maintenance but was skeptical of the investment: “We’ve had sales people come in here and talk about AI predicting failures. I want to know: which motor, which bearing, how far in advance, and how often is it wrong?”

That framing shaped the engagement design.

Solution: Asset-First Approach with Measurable Anomaly Detection

The engagement began not with AWS, but with the maintenance team’s history. Three months of work order data, downtime records, and data from the plant historian (an on-premises OSIsoft PI installation) were analyzed to identify which asset types had the most frequent and most expensive failure modes.

The analysis identified three priority asset categories that accounted for 71% of unplanned downtime cost:

CNC spindle motors (38 assets) — bearing failures averaging $82,000 per event in downtime and parts
Robotic arm servo drives (64 assets) — drive overtemperature failures averaging $34,000 per event
Conveyor drive systems (22 assets) — belt tension failures averaging $28,000 per event

These 124 assets became the Phase 1 scope. The remaining 156 assets were connected for data collection but predictive models were deferred to Phase 2.
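The prioritization step above amounts to a cost Pareto over downtime history. A minimal sketch follows, assuming each downtime event can be reduced to an (asset type, cost) pair. The sample records and asset type names are hypothetical placeholders for illustration, not the client's data.

```python
from collections import defaultdict


def rank_asset_types_by_cost(events):
    """Rank asset types by total downtime cost, highest first.

    `events` is an iterable of (asset_type, cost_usd) pairs, one per
    downtime event. Returns a list of
    (asset_type, total_cost, cumulative_share_of_all_cost).
    """
    totals = defaultdict(float)
    for asset_type, cost_usd in events:
        totals[asset_type] += cost_usd

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result = []
    running = 0.0
    for asset_type, total in ranked:
        running += total
        result.append((asset_type, total, running / grand_total))
    return result


# Hypothetical sample events, for illustration only.
sample_events = [
    ("cnc_spindle_motor", 80_000.0),
    ("servo_drive", 40_000.0),
    ("conveyor_drive", 30_000.0),
    ("hvac_unit", 50_000.0),
]

for asset_type, total, share in rank_asset_types_by_cost(sample_events):
    print(f"{asset_type:20s} ${total:>10,.0f}  cumulative {share:.0%}")
```

Reading down the cumulative column shows where a scope cutoff would capture a chosen share of total cost.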

Edge Layer: AWS IoT Greengrass v2 Gateways

Four AWS IoT Greengrass v2 gateways were deployed — two per facility — on industrial-grade fanless PCs in the existing OT network DMZ. Greengrass gateways provide:

OPC-UA protocol translation. The Siemens and Fanuc PLCs controlling the CNC machines, robots, and conveyors expose OPC-UA servers. Greengrass reads from these OPC-UA servers at 100ms intervals (10 Hz sampling) for priority assets and 1-second intervals for standard assets. No changes to PLC configuration were required.

Edge buffering for offline-first operation. Production lines cannot wait for cloud connectivity. Greengrass buffers up to 72 hours of telemetry locally and replays it to AWS IoT Core once connectivity is restored.
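The buffer-and-replay behavior can be illustrated with a small time-bounded queue. This is a sketch of the idea only, not Greengrass code: the class name, the `publish` callback, and the in-memory storage are assumptions, and a real gateway would persist the buffer to disk.

```python
from collections import deque

RETENTION_SECONDS = 72 * 3600  # 72-hour window stated in the text


class TelemetryBuffer:
    """Hold readings while offline and replay them oldest-first."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention_seconds = retention_seconds
        self._queue = deque()  # items are (timestamp_seconds, payload)

    def append(self, timestamp, payload):
        """Store a reading and drop anything older than the retention window."""
        self._queue.append((timestamp, payload))
        cutoff = timestamp - self.retention_seconds
        while self._queue and self._queue[0][0] < cutoff:
            self._queue.popleft()

    def replay(self, publish):
        """Send buffered readings in order; stop at the first failed publish.

        `publish(timestamp, payload)` is a hypothetical callback that
        returns True on success. Returns the number of readings sent.
        """
        sent = 0
        while self._queue:
            timestamp, payload = self._queue[0]
            if not publish(timestamp, payload):
                break  # still offline; keep the remainder buffered
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)
```

A reading is removed only after its publish succeeds, so a connection that drops mid-replay loses nothing.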

Local alert logic. Critical safety alerts — spindle overtemperature above 95°C, servo drive fault codes — are evaluated by a Greengrass Lambda component locally. Alert notification to plant floor HMI screens happens within 400ms, independent of cloud connectivity.
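A local rule check of this kind might look like the sketch below. The 95°C threshold comes from the text. The reading field names (`asset_id`, `spindle_temp_c`, `servo_fault_code`) and the convention that a fault code of 0 means "no fault" are assumptions made for illustration.

```python
SPINDLE_OVERTEMP_C = 95.0  # threshold stated in the text


def evaluate_local_alerts(reading):
    """Return a list of alert strings for one telemetry reading.

    `reading` is a dict with hypothetical keys: asset_id,
    spindle_temp_c (optional), servo_fault_code (optional, 0 = no fault).
    """
    alerts = []
    asset_id = reading.get("asset_id", "unknown")

    temp = reading.get("spindle_temp_c")
    if temp is not None and temp > SPINDLE_OVERTEMP_C:
        alerts.append(f"SPINDLE_OVERTEMP {temp:.1f}C on {asset_id}")

    fault = reading.get("servo_fault_code")
    if fault:  # any non-zero code is treated as an active fault
        alerts.append(f"SERVO_FAULT code {fault} on {asset_id}")

    return alerts
```

Because the check uses only the reading itself, it can run on the gateway with no cloud round trip.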

AWS IoT SiteWise: Asset Modeling and Anomaly Detection

AWS IoT SiteWise asset models were built to mirror the manufacturer’s equipment hierarchy:

Facility (2)
└── Production Line (14)
    └── Machine (280)
        └── Asset Properties
            ├── Spindle Speed (RPM)
            ├── Spindle Load (%)
            ├── Vibration X/Y/Z (g)
            ├── Bearing Temperature (°C)
            └── Drive Current (A)
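The machine level of this hierarchy could be expressed as a SiteWise asset model request along the following lines. This is a sketch that only builds the request payload. The model name, description, unit strings, and the choice to split vibration into three separate measurements are assumptions. The dict is shaped for the `create_asset_model` operation of the boto3 `iotsitewise` client, which is not called here.

```python
def build_machine_asset_model(name="Machine"):
    """Return keyword arguments shaped for IoT SiteWise CreateAssetModel."""
    # (property name, unit) pairs taken from the hierarchy in the text;
    # the X/Y/Z split and unit spellings are illustrative assumptions.
    measurements = [
        ("Spindle Speed", "RPM"),
        ("Spindle Load", "%"),
        ("Vibration X", "g"),
        ("Vibration Y", "g"),
        ("Vibration Z", "g"),
        ("Bearing Temperature", "Celsius"),
        ("Drive Current", "A"),
    ]
    return {
        "assetModelName": name,
        "assetModelDescription": "Production machine (illustrative sketch)",
        "assetModelProperties": [
            {
                "name": prop_name,
                "dataType": "DOUBLE",
                "unit": unit,
                "type": {"measurement": {}},  # raw sensor stream
            }
            for prop_name, unit in measurements
        ],
    }


# With AWS credentials configured, the payload could be submitted as:
#   boto3.client("iotsitewise").create_asset_model(**build_machine_asset_model())
```

Facility and Production Line models would be defined the same way and linked to their children through model hierarchies.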

IoT SiteWise native anomaly detection was configured for the 124 priority assets. The anomaly detection training process uses 6 weeks of historical data to learn normal operational patterns for each asset — accounting for different operating modes, product changeovers, and shift-related load variations.

The anomaly model outputs a continuous anomaly score for each monitored property, ranging from 0 (normal) to 1 (highly anomalous). SiteWise alerts are configured at three thresholds:

Score	Severity	Response
0.7 - 0.85	Watch	Log to maintenance queue, review at next shift change
0.85 - 0.95	Warning	Notify maintenance supervisor via SNS → SMS
>0.95	Critical	Page on-call maintenance tech + create CMMS work order automatically
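The threshold table maps onto a simple classification function. In the sketch below the function name is hypothetical, and the handling of boundary values is an assumption: the table lists 0.85 in both the Watch and Warning bands, so a score on a shared boundary is assigned to the higher severity here.

```python
def classify_anomaly_score(score):
    """Map an anomaly score in [0, 1] to a severity label, or None.

    Boundary handling is an assumption: a score exactly on a shared
    boundary (0.85) is escalated to the higher tier.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"anomaly score out of range: {score}")
    if score > 0.95:
        return "Critical"
    if score >= 0.85:
        return "Warning"
    if score >= 0.7:
        return "Watch"
    return None  # below the lowest alert threshold
```

For example, `classify_anomaly_score(0.9)` returns `"Warning"` and `classify_anomaly_score(0.5)` returns `None`.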

Real-Time Alert Pipeline

The alert flow from anomaly detection to maintenance action:

IoT SiteWise Anomaly Score > 0.85
    → SiteWise Alarm → CloudWatch Alarm
    → SNS Topic
    → Lambda: Create Work Order in Maintenance CMMS (API call)
    → Lambda: Send SMS to Maintenance Supervisor (SNS → SMS)
    → Lambda: Log to Maintenance Dashboard (DynamoDB)

End-to-end latency from SiteWise anomaly detection to SMS delivery is consistently under 8 seconds — within the latency budget for maintenance personnel to reach equipment before a borderline condition becomes a failure.

OEE Dashboard: Amazon QuickSight

Equipment effectiveness data flows from IoT SiteWise to Amazon QuickSight via a scheduled Glue job that extracts hourly OEE calculations into a Redshift Serverless warehouse. The QuickSight dashboard provides:

Current OEE by line (refreshed hourly)
30-day OEE trend by asset category
Downtime Pareto — top 10 downtime causes by duration and frequency
Predictive maintenance queue — assets sorted by current anomaly score

Plant managers access the dashboard on tablets mounted at each line entry. Each maintenance technician sees a personal queue view showing their assigned work orders ranked by anomaly severity.
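The source does not spell out how the hourly OEE figures are calculated. A common definition is availability × performance × quality, and the sketch below assumes that definition. The function name and input names are hypothetical.

```python
def compute_oee(planned_minutes, run_minutes, ideal_cycle_minutes,
                total_count, good_count):
    """Compute OEE for one period using the standard three-factor form.

    availability = run time / planned production time
    performance  = (ideal cycle time * total units) / run time
    quality      = good units / total units
    """
    if planned_minutes <= 0 or run_minutes <= 0 or total_count <= 0:
        # No production in the period: report zeros rather than divide by zero.
        return {"availability": 0.0, "performance": 0.0,
                "quality": 0.0, "oee": 0.0}

    availability = run_minutes / planned_minutes
    performance = (ideal_cycle_minutes * total_count) / run_minutes
    quality = good_count / total_count
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Hypothetical hour: 60 planned minutes, 48 running, 0.5 min ideal cycle,
# 80 units produced, 76 good.
hour = compute_oee(60, 48, 0.5, 80, 76)
print(f"OEE: {hour['oee']:.1%}")
```

Keeping the three factors separate in the output lets a dashboard show which one is dragging a line's OEE down.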

Results: 40% Downtime Reduction in 9 Months

The Phase 1 deployment covered 124 priority assets across both facilities. Results measured over the 9 months following full deployment:

Unplanned downtime: 847 events in the baseline year → 506 events, annualized from the 9-month measurement period. That is a 40% reduction in event count and a 44% reduction in total downtime hours; the hours fell further because predictive detections caught failures earlier, when repairs are faster.
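The event-count figure can be checked directly from the two numbers given. A minimal sketch, with a hypothetical helper name:

```python
def percent_reduction(baseline, current):
    """Percentage drop from baseline to current."""
    return (baseline - current) / baseline * 100


# 847 baseline events vs. 506 annualized events, both from the text.
reduction = percent_reduction(847, 506)
print(f"{reduction:.1f}% fewer unplanned downtime events")
```

The result is about 40.3%, consistent with the reported 40%.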

Predictive alert accuracy: The predictive system created 847 work orders in the 9-month period. In the maintenance team's post-work-order analysis, 78% were confirmed as legitimate anomaly conditions that maintenance action addressed before failure, 14% were false positives where no defect was found, and 8% were unverified (maintenance found no root cause). The false positive rate was higher than target in the first 60 days and declined as the SiteWise anomaly models accumulated more operational data.

Maintenance cost: Emergency parts procurement fell 28% in the measurement period as advance warning enabled planned parts ordering. Total maintenance cost per production unit decreased 22%.

Return on investment: The Phase 1 deployment (hardware, software, and professional services) cost $340,000. The 9-month savings in downtime, parts, and avoided penalties totaled $1.2M, roughly 3.5x the deployment cost within the first nine months of operation.
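The return figure is a gross multiple (savings divided by cost). The sketch below, with a hypothetical helper name, computes that alongside the net ROI, which subtracts the cost first. Both rest on the rounded $1.2M savings figure, so the outputs are approximate.

```python
def roi_summary(cost, savings):
    """Return the gross savings multiple and the net ROI for a project."""
    return {
        "gross_multiple": savings / cost,       # savings per dollar spent
        "net_roi": (savings - cost) / cost,     # gain beyond the cost itself
    }


# Figures from the text: $340,000 cost, $1.2M in 9-month savings.
summary = roi_summary(340_000, 1_200_000)
print(f"gross multiple: {summary['gross_multiple']:.2f}x")
print(f"net ROI: {summary['net_roi']:.2f}x")
```

The gross multiple comes out near 3.5x and the net ROI near 2.5x; the two differ by exactly one, so it is worth stating which is being quoted.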

Plant manager assessment: “In nine months I’ve had three warning alerts that I’m convinced stopped a catastrophic failure — a spindle bearing on our highest-value machining center that we would never have caught until it seized. That one event alone would have cost us more than the whole project.”

Phase 2 — extending anomaly detection to the remaining 156 assets and adding energy consumption analytics — began in month 7, funded entirely from Phase 1 savings.

This case study describes a composite engagement based on anonymized client work. All identifying details have been removed or modified.
