OT Change and Incident Management: How to avoid the Titanic moment

Why OT needs structure more than ever and what ITIL can (and can’t) teach us about handling change, failure and recovery in industrial systems

Sep 07, 2025

Let’s start in the real world

Everyone saw the iceberg. The problem was that the ship slowed down too late. The Titanic didn’t sink because of bad luck. It sank because known risks were ignored, warnings went unheeded and there were no real processes for dealing with a crisis. In OT environments, it’s often the same: we know the risks, we see the signs, but we’re missing structure.

Picture this: It’s a regular Thursday morning at a high-speed bottling plant in southern Germany. Production is humming. Operators monitor multiple lines. Everything runs like clockwork until someone schedules a firmware update to the PLC that controls Line 3’s conveyor system. The update is approved informally and applied during a shift handover, on the assumption that it won’t interfere with production.

Within 30 seconds, bottles start piling up and falling over. The line jams. A sensor fails to react. The entire cell halts. Operators respond quickly, but the issue isn’t immediately clear. There’s no detailed change log. No formal rollback strategy. No incident response plan tailored to this scenario. Downtime ticks on, with every minute costing thousands.

Now rewind the story, but imagine it in a typical IT context. A software patch is applied to a CRM server. Users can’t log in. But IT has a structured response:

The issue is logged.
An incident is opened.
A change record links to the patch.
A rollback plan exists.
The team meets later to analyze what went wrong and how to prevent it.

Same logic, different world. This article explores why OT (Operational Technology) needs its own approach and how to borrow the best of ITIL without forcing a square peg into a round hole.

Why It Matters

In IT, a failed change might mean frustrated users. In OT, it could mean production losses, safety risks or regulatory violations.

OT systems run:

Manufacturing lines
Water treatment facilities
Power generation
Railway signals
Building automation systems

And unlike IT, these systems were not designed with frequent change or rapid incident response in mind. Many still run legacy hardware or software not touched in over a decade.

But change is coming:

Digital transformation is connecting OT systems to networks and clouds.
Cyberattacks on OT environments are rising sharply.
New compliance and safety standards demand structured operations.

You can’t wing it anymore. Even if your OT team isn’t big, you still need a playbook.

ITIL in a nutshell (for comparison)

ITIL (Information Technology Infrastructure Library) is the de-facto standard in IT for managing services, systems and processes. It defines:

A. Incident Management

Restore normal service as quickly as possible.
Log, prioritize, escalate, resolve and close.
Examples: Server crash, user login failure, application error.

B. Change Management (now Change Enablement in ITIL 4)

Ensure that changes are assessed, approved and implemented with minimal disruption.
Risk assessment, CAB meetings, rollback plans.

C. Problem Management

Identify and remove the root cause of recurring incidents.
Conduct root cause analysis, track known errors.

It also introduces:

Defined roles (e.g., Incident Manager, Change Manager)
Workflows supported by ITSM tools (like ServiceNow or Jira)
Focus on service quality, risk reduction and continuous improvement

The OT World: What's different? (simplified)

OT doesn’t lack discipline. But it has different priorities, constraints and historical context.

Key Differences:

Story Example: An IT team deploys weekly updates to improve user experience. In a paper mill, even rebooting a PLC may require halting production, draining tanks and obtaining safety clearance.

Frameworks that help in OT

Let’s look at the major frameworks and how they support structured incident, change and problem processes in OT environments.

A. IEC 62443 (especially part 2-1)

This is the leading global standard for OT cybersecurity and operational governance. Think of it as the OT equivalent of ISO/IEC 27001, but with safety and reliability front and center.

What it offers:

Requirements for a Cybersecurity Management System (CSMS)
Guidance on secure system design, integration, maintenance
Maturity models for assessing process consistency

How it maps to ITIL-style processes:

Change Management:
- Documented change process
- Impact and risk assessment required
- Mandatory testing and rollback
Incident Management:
- Incident response plans
- Defined escalation paths
- Role-specific responsibilities (e.g. integrator, asset owner)
Problem Management:
- Root cause analysis as part of continuous improvement
- Structured documentation of learnings

Use Case Example: A machine builder wants to update SCADA software at a client site. Under IEC 62443, they must submit a documented change request including:

Purpose
Risk assessment
Test results
Backup strategy
Rollback plan
Communication with all involved parties

B. NIST SP 800-82 Rev. 3 + NIST Cybersecurity Framework (CSF)

SP 800-82 is a U.S. publication, but globally respected. It applies the NIST CSF to OT systems.

Core functions:

Identify: Asset inventory, roles, data flows
Protect: Access control, maintenance, data integrity
Detect: Anomalies, log monitoring
Respond: Contain, analyze, report incidents
Recover: Plans to restore systems and learn

How it aligns with ITIL-style logic:

Clear steps for incident response and containment
Encourages planning around configuration changes
Emphasizes feedback loops for process improvement

Example: An oil pipeline control center detects odd valve behavior. Following NIST:

Detect abnormal behavior (via logs or operators)
Trigger incident response
Isolate affected segment
Notify stakeholders
Analyze root cause
Update system rules and training

C. ISO/IEC 27001 (with OT integration)

While not OT-specific, ISO 27001 can support security governance in OT, especially in regulated industries (e.g. pharma, energy).

Supports structured risk management
Can be combined with IEC 62443 for certification

Detailed comparison: OT vs. ITIL processes

Challenges in Bridging IT and OT

Language Barrier
- IT talks services, SLAs, apps
- OT talks loops, sensors, safety interlocks
Tooling Gap
- IT has mature platforms; OT often lacks centralized tooling
Change Aversion in OT
- Not cultural laziness, but safety-first thinking
Incident Visibility
- Many OT incidents (e.g. unplanned stops) aren’t logged as such
Problem Follow-up
- “We fixed it” often replaces “we understood why it happened”

Getting started (without overengineering it)

Starting structured OT processes doesn’t mean implementing ITIL overnight. It means building lightweight, useful habits that create clarity and reduce risk.

A. Build a simple Asset Inventory

List your key control systems, their location, owner and purpose
Track firmware/software versions
Include critical dependencies (e.g. power, cooling, network links)

Why? You can’t manage changes or respond to incidents if you don’t know what you have.
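A minimal sketch of such an inventory in Python, assuming a flat list of records is enough to start with. The field names, asset IDs, owners and versions are hypothetical examples, not taken from any standard.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One control-system asset; all field names are illustrative assumptions."""
    asset_id: str
    location: str
    owner: str
    purpose: str
    firmware_version: str
    dependencies: list[str] = field(default_factory=list)


def assets_depending_on(inventory: list[Asset], dependency: str) -> list[str]:
    """Return the IDs of assets that list the given dependency."""
    return [a.asset_id for a in inventory if dependency in a.dependencies]


# Hypothetical sample data for illustration only.
inventory = [
    Asset("PLC-L3-CONV", "Line 3", "J. Doe", "Conveyor control", "2.4.1",
          ["UPS-A", "Switch-03"]),
    Asset("HMI-L3", "Line 3 control room", "J. Doe", "Operator panel", "1.9.0",
          ["Switch-03"]),
    Asset("PLC-L1-FILL", "Line 1", "A. Smith", "Filler control", "2.3.7",
          ["UPS-B", "Switch-01"]),
]

# Which assets are affected if the (hypothetical) switch "Switch-03" goes down?
affected = assets_depending_on(inventory, "Switch-03")
print(affected)
```

The same structure works as columns in a spreadsheet. The point of recording dependencies is that one lookup answers "what else is affected?" before a change or during an incident.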

B. Define roles and responsibilities (lightweight RACI)

Create a one-page matrix answering:

Who approves a change?
Who can execute it?
Who needs to be informed?
Who leads during an incident?

Tip: Keep it visual and post it near operator stations or control rooms.

C. Create a basic change checklist

Use paper, Excel or a digital form. For each planned change:

What is being changed?
Why is it needed?
What is the expected impact?
Who tested it, where and when?
What’s the rollback plan?
When will it happen?
Who approved it?

Bonus: Add a checkbox: “Will this impact safety, quality or regulatory compliance?”
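If the checklist lives in a digital form, a completeness check can be sketched as below. This is one possible shape, not a prescribed format: the class, the field names and the sample change are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeRequest:
    """Mirrors the checklist questions; field names are assumptions."""
    what: str = ""
    why: str = ""
    expected_impact: str = ""
    tested_by_where_when: str = ""
    rollback_plan: str = ""
    scheduled_for: str = ""
    approved_by: str = ""
    affects_safety_quality_compliance: bool = False


def missing_answers(change: ChangeRequest) -> list[str]:
    """List the text fields of the checklist that are still blank."""
    return [
        f.name
        for f in fields(change)
        if isinstance(getattr(change, f.name), str)
        and not getattr(change, f.name).strip()
    ]


def ready_to_execute(change: ChangeRequest) -> bool:
    """A change is ready only when every question has an answer."""
    return not missing_answers(change)


# Hypothetical example echoing the bottling-plant story: no test, no rollback.
firmware_update = ChangeRequest(
    what="Firmware update on Line 3 conveyor PLC",
    why="Vendor bug fix",
    expected_impact="Line 3 stopped for about 30 minutes",
    scheduled_for="Planned maintenance window",
    approved_by="Asset Owner",
    affects_safety_quality_compliance=True,
)

print(missing_answers(firmware_update))
print(ready_to_execute(firmware_update))
```

Here the check reports the missing test evidence and rollback plan, which are exactly the gaps that hurt in the opening example. It only verifies that answers exist, not that they are good. That judgment stays with the approver.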

D. Establish a simple Incident Response Playbook

For example:

Detection: Alarm, operator report, unusual behavior
Containment: Isolate the system if possible (e.g. unplug network cable)
Notification: Call pre-defined contacts
Documentation: What happened? When? What was done?
Recovery: Apply fix, validate functionality
Review: Document learnings

Deliverable: Print this as a laminated card and mount it near SCADA workstations.
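For teams that prefer a digital log alongside the laminated card, the six steps can be sketched as an ordered, timestamped record. The class and method names are hypothetical, and enforcing a strict order is a simplifying assumption: in practice, notification often runs in parallel with containment.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# The six playbook steps listed above, in order.
PLAYBOOK_STEPS = (
    "Detection",
    "Containment",
    "Notification",
    "Documentation",
    "Recovery",
    "Review",
)


@dataclass
class IncidentLog:
    """Records which playbook steps were completed, with a note and a time."""
    title: str
    entries: list[tuple[str, str, datetime]] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next open step, or None when all steps are done."""
        done = len(self.entries)
        return PLAYBOOK_STEPS[done] if done < len(PLAYBOOK_STEPS) else None

    def complete(self, step: str, note: str) -> None:
        """Log a step with a UTC timestamp; refuse steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"Expected step {expected!r}, got {step!r}")
        self.entries.append((step, note, datetime.now(timezone.utc)))


# Hypothetical usage based on the Line 3 example.
log = IncidentLog("Line 3 conveyor jam")
log.complete("Detection", "Operator reported bottles piling up")
log.complete("Containment", "Line 3 cell halted")
print(log.next_step())
```

The timestamps answer the "What happened? When? What was done?" questions almost for free, which makes the later review much easier.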

E. Introduce "Lessons Learned" Rituals

After any incident or failed change:

Host a 20-minute team huddle
Ask: What happened? Why? What can we improve?
Write it down, even in a shared Word file or notebook

Cultural tip: Celebrate the fact that you reviewed, not that it was perfect.

F. Train using real incidents

Once a month, review a real event:

Was it handled well?
What would we do differently?
Are our procedures clear?

Benefit: Makes documentation practical, not bureaucratic.

G. Build from the bottom up

Don’t wait for enterprise software. Use what you have:

Excel or Google Sheets for tracking
Printed forms for incident logs
Shared folders for documentation

Once habits form, you can layer better tools later (e.g. asset management platforms, change tracking systems).

Why it’s more critical than ever to formalize OT processes (with verified data)

1. Rising Cyber Threats in OT amid IT/OT convergence

OT systems no longer run in isolation. They’re increasingly connected to IT and cloud environments. According to ITPro, ransomware and wiper malware incidents in OT environments rose sharply, from 32% in 2023 to 56% in 2024. This trend reflects the growing vulnerability as barriers between OT and IT vanish.

Additionally, the 2025 Security Navigator Report noted a 39% increase in cyberattacks targeting OT systems between 2023 and 2024.

2. Critical infrastructure under fire

A Semperis survey, cited by Infosecurity Magazine, found that 62% of water and electricity operators in the US and UK were targeted by cyberattacks in the past year. Of those, 80% were attacked multiple times, 59% experienced operational disruption and 54% suffered permanent data or system damage.

3. Legacy OT Systems heighten risk

These environments often rely on decades-old hardware and software that lack patching, encryption and modern security controls. They were never designed for today's connected threat landscape.

4. Regulatory pressure and need for resilience

Governance expectations now include proactive security and process documentation. In the EU, this pressure is mounting through directives like NIS 2 (Network and Information Security Directive) and the Cyber Resilience Act (CRA).

NIS 2 requires essential and important entities to establish structured cybersecurity risk management, incident reporting and governance processes, including for OT systems.
CRA mandates secure-by-design principles and lifecycle cybersecurity controls for connected devices and industrial products placed on the EU market.

In the U.S., the EPA reports that as of late 2024, 97 drinking water systems serving 26.6 million people have critical or high-risk vulnerabilities, raising the bar for OT resilience planning.

5. Reputational and Business Continuity costs

It's no longer just about downtime. It’s about trust, safety and financial fallout. And the stakes are only rising, as seen in recent utility and critical infrastructure breaches.

6. Building resilience through process discipline

Structured processes (change control, incident response, root-cause reviews) become resilience enablers. Organizations that manage without them risk repeating mistakes.

Summary table: Why process maturity in OT is non-negotiable today

Final Thoughts

You don’t need to force ITIL into OT. But you can (and should) build a process culture that reflects the same ideas:

Plan before you change.
React quickly and safely when things go wrong.
Learn and improve after every event.

Frameworks like IEC 62443 and NIST SP 800-82 give you guidance adapted to the realities of OT, where systems are physical, risks are real and failure isn't just an error message.

Start small. Be pragmatic. Involve your people. And over time, bring structure to the chaos without getting in the way of the work.

What else should be considered? (important additions)

To round out the picture, here are additional key areas that make your OT process model robust and future-proof:

A. OT-Specific roles and responsibilities

While ITIL has clearly defined roles, OT environments often operate with lean teams. Still, clarity is critical.

Typical OT Roles:

Asset Owner - Responsible for lifecycle decisions and approvals
Control System Engineer - Designs and maintains the automation logic
Maintenance Lead - Manages availability and response to technical failures
OT Security Lead - Aligns security controls with process needs
System Integrator - Executes changes across multiple systems or vendors

Why this matters: Clear roles avoid confusion in high-pressure situations and improve coordination with IT.

B. Change testing and validation in OT

Unlike IT, OT systems can rarely be duplicated for test environments. That doesn’t mean testing is optional.

Options in OT:

Simulation mode: Some controllers offer a virtual run-through
Offline testing: Use spare hardware or cloned PLCs
Shadowing: Monitor a proposed change in read-only mode first
Digital Twin: Advanced but effective for complex plants

Goal: Always validate logic and side-effects before deploying to production.

C. Cross-functional integration

Change or incident processes in OT don’t happen in isolation. They must align with:

Production planning: To schedule changes without disrupting OEE
Quality assurance: To validate product integrity post-change
Health & Safety: To ensure safe work procedures during interventions

Tip: Involve these functions early in your templates or workflows.

D. OT-specific KPIs and metrics

If you can’t measure it, you can’t improve it. But classic IT metrics often miss the mark in OT.

Useful OT Metrics:

Change Success Rate (CSR) - % of changes without rollback or incident
Mean Time to Repair (MTTR) - Average recovery time from incidents
Unplanned Downtime per Line/Asset - Tracks stability over time
Recurring Incident Rate - Helps identify weak spots in root cause work

Use: Simple Excel dashboards can suffice at first.
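As a rough sketch of how two of these metrics could be calculated from simple records, assuming each change carries a rollback flag and an incident flag, and each incident has a start and end time. The record layout, IDs and sample values are hypothetical.

```python
from datetime import datetime


def change_success_rate(changes: list[dict]) -> float:
    """Percent of changes that needed no rollback and caused no incident."""
    if not changes:
        return 0.0  # Assumption: report 0 when there is nothing to measure.
    ok = sum(
        1 for c in changes
        if not c["rolled_back"] and not c["caused_incident"]
    )
    return 100.0 * ok / len(changes)


def mean_time_to_repair_minutes(
    incidents: list[tuple[datetime, datetime]]
) -> float:
    """Average minutes between incident start and restored operation."""
    if not incidents:
        return 0.0
    total_seconds = sum(
        (end - start).total_seconds() for start, end in incidents
    )
    return total_seconds / len(incidents) / 60.0


# Hypothetical records for illustration only.
changes = [
    {"id": "CHG-001", "rolled_back": False, "caused_incident": False},
    {"id": "CHG-002", "rolled_back": True, "caused_incident": True},
    {"id": "CHG-003", "rolled_back": False, "caused_incident": False},
    {"id": "CHG-004", "rolled_back": False, "caused_incident": False},
]
incidents = [
    (datetime(2025, 3, 6, 6, 0), datetime(2025, 3, 6, 6, 45)),
    (datetime(2025, 3, 20, 14, 10), datetime(2025, 3, 20, 14, 25)),
]

csr = change_success_rate(changes)
mttr = mean_time_to_repair_minutes(incidents)
print(f"CSR: {csr:.0f}%  MTTR: {mttr:.0f} min")
```

With this sample data, three of four changes succeed (75%) and the two incidents average 30 minutes to repair. The same arithmetic fits in two spreadsheet formulas. What matters is that every change and incident gets recorded in the first place.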

E. Tools and practical workarounds

Not every OT environment has a ServiceNow license. But you don’t need one to get started.

Practical Tools:

Excel/SharePoint: Use structured templates with dropdowns
Paper-based logs: Still useful in field operations
Shift logs: Expand them to capture incidents and changes
Low-code apps: Use tools like Microsoft Power Apps for forms

The key is consistency, not sophistication.

F. Maturity Model for OT process adoption

Where are you today and where do you want to be?

Advice: Aim for Level 3 before considering software or certification.

A note for Experts: What this article is and isn’t

This article is not a blueprint for compliance or a substitute for deep technical security architecture. It’s a practical guide for OT leaders, plant managers and technical decision-makers who need to bring more structure into environments where safety, reliability and legacy constraints shape reality.

Yes, every plant is different. Yes, not all IEC 62443 parts apply equally to every sector. And yes, ITIL wasn’t made for PLCs or real-time loops. But the core message stands: in modern OT environments, lack of structured processes is no longer defensible. Not operationally, not legally and not in front of your customers.

The goal here is clarity, not completeness. The analogies are deliberate simplifications, because a well-run shopfloor needs clear thinking more than it needs buzzword fluency.

If you're already operating at Maturity Level 4+, this article probably isn’t for you. But if you're somewhere between tribal knowledge and Excel sheets, this might just help get the next conversation started.
