T3 Platform Operations Specialist / Operations Manager (m/f/d) Compute / ITSM / SLA

Start date:

May

End date:

End of 2026 + option to extend

Employment type:

Freelance

Region:

Remote + FFM or Berlin

Description:

For our client in the energy sector, we are looking for experienced support starting in May as a T3 Platform Operations Specialist / Operations Manager (m/f/d) Compute / ITSM / SLA, remote and in FFM or Berlin. The work is performed remotely, with approx. 1-3 days per month on site by arrangement.

Project Description

The team is building an internal platform for software product developers that accelerates the development and delivery of software products addressing the massive challenges facing the energy sector. The Platform is a service-oriented, cloud-native platform that provides application teams with self-service capabilities to develop, run and operate their software products. It provides services for application infrastructure, data, service lifecycle management, application build and delivery, as well as services to operate software products. The Platform is deployed as a hybrid cloud, encompassing both private cloud and selected public clouds.

General Description

Local Operations manages the on-premises production platform, which serves as the primary host for all mission-critical business applications. Local Operations is responsible for the following core areas:

• Platform Stability: Ensuring the high availability and performance of the on-premises private cloud environment.

• Application Hosting: Consulting on the seamless operation of Germany-specific production business applications.

• Incident Management: Resolving technical issues within standard business hours to minimize operational downtime.

• Lifecycle Maintenance: Executing routine updates, patches, and system optimizations within the local infrastructure.

Scope of Work

Provide Tier-3 operational ownership for Compute & Operating System services for Local Production (DE).

Tasks:

• Handling of complex incidents, deep troubleshooting, and root cause analysis; drive permanent fixes and preventive measures.

Handling of Operational Readiness

Objective: Ensure operational readiness for deployments

Tasks:

• Validation of deployment artifacts from an operations perspective.

• Defining and enforcing quality assurance measures (e.g. required documentation of standard operating procedures, successful test reports, …) to ensure the high quality of delivered products and services.

• Ensuring rollback strategies and operational monitoring (observability) are in place for production deployments.

Monitoring, Incident, Problem and Change Management in the specific context of providing Compute & Operating System services

Objective: Ensure operational stability and responsiveness for the managed Kubernetes platform

Tasks:

• Monitoring system health, performance metrics, and service availability across multi-tenant environments.

• Identifying, analyzing, and resolving incidents, minimizing service disruption.

• Triggering root cause analysis and implementation of corrective and preventive actions.

Automation of operations-critical standard processes

Objective: Reduce operational toil and improve service reliability

Tasks:

• Addressing operational issues by automating standard remediation procedures.

• Validating all automated procedures in line with the established software development lifecycle, including staging, testing, and validation reviews.

Security and Compliance Enforcement

Objective: Ensure platform operations adhere to security and compliance standards

Tasks:

• Implementing monitoring and logging strategies to support audit and compliance requirements.

• Performing routine security scans and remediating identified vulnerabilities.

Profile Requirements

The contractor must be a senior-level professional with proven experience in operations management of private cloud solutions and proficiency in managing compute & OS operations on the platform, with the following experience:

• 5-10+ years in IT operations / service delivery / platform operations with demonstrated leadership in mission-critical environments.

• Proven experience implementing/leading Incident, Problem, Change, and Release governance in production.

• Expertise with ITSM: Jira Service Management (JSM), Jira, Confluence.

• Experience with core operations processes (incident management, change management, problem management, IT service management) as well as SRE concepts.

• Experience in gathering operational insights from monitoring or observability data, including SLI/SLA/SLO management and tracking.

• Hands-on experience in documenting procedures properly and enforcing clear runbooks or playbooks.

• Observability: hands-on experience with monitoring and logging tools (e.g., Prometheus, Grafana, Datadog, Mimir, Loki).

• Familiarity with enterprise DevOps toolchains is a plus (GitLab, JFrog Artifactory, Backstage, Harness).

• Expertise in modern platform operations (Kubernetes/containers, automation, observability), sufficient to govern specialists.

• Platform delivery concepts: GitOps and IaC awareness (Terraform/OpenTofu, ArgoCD, Helm) to govern deployment/readiness standards.

Preferred experience

• Experience operating in regulated / high-availability industries (banking, telco, public sector, healthcare).

• Experience with SRE practices (SLOs/SLIs, error budgets) and reliability management.

Must-have language skills

• Proficiency in spoken and written English (at least C1).

• Proficiency in spoken and written German (at least C1).