Operations SME / Senior Site Reliability Engineer (m/f/d) Kubernetes / on-premises - Remote & Berlin or Frankfurt am Main (FFM)

Start date:

12.01.2025

End date:

30.04.2026 + extension option

Employment type:

Freelance

Region:

Remote / FFM / Berlin


Description:

For our client in the energy sector, we are looking for experienced operations support for a newly built platform, starting in early January. The work is done remotely, with regular in-person meetings of 1-4 consecutive days in Frankfurt or Berlin by arrangement. Beyond the initial engagement, there is an option for a longer-term extension.

 

The focus is on providing products and services that empower the Engineering and Operations teams within the program and support the needs of Product Engineering. The Platform Team is committed to delivering, managing, and optimizing essential DevOps tools that facilitate seamless continuous integration (CI) and continuous delivery and deployment (CD) across the platform and its services.

 

Consulting on CI/CD and Operational Readiness

Objective: Consult on CI/CD pipelines and ensure operational readiness for deployments

 

Tasks:

• Validation of deployment artifacts from an operations perspective.

• Defining and enforcing quality assurance measures (e.g. required documentation of standard operating procedures, successful test reports, …) to ensure the high quality of delivered products and services.

• Ensuring rollback strategies and operational monitoring (observability) are in place for production deployments.

 

Monitoring, Incident, Problem and Change Management

Objective: Ensure operational stability and responsiveness

Tasks:

• Monitoring system health, performance metrics, and service availability across multi-tenant environments.

• Identifying, analyzing, and resolving incidents, minimizing service disruption.

• Triggering root cause analysis and implementation of corrective and preventive actions.

 

Automation of operations-critical standard processes following established software development lifecycles

Objective: Reduce operational toil and improve service reliability

Tasks:

• Address recurring operational issues by automating remedial standard operations processes

• Validate all automated procedures following the established software development lifecycle, including staging, testing, and validation reviews

 

Security and Compliance Enforcement

Objective: Ensure platform operations adhere to security and compliance standards

Tasks:

• Implementing monitoring and logging strategies to support audit and compliance requirements.

• Performing routine security scans and remediating identified vulnerabilities.

 

Runbooks and documentation

Objective: Ensure documentation is accurate and up to date

Tasks:

• Provide feedback to owners of runbooks

• Provide improvements to runbooks

 

 

Must-have experience

• At least 5 years of operational experience with self-managed Kubernetes clusters, self-managed services providing Kubernetes clusters, and production applications or systems in on-premises environments

• Deep understanding and expertise in networking concepts, including protocols, load balancing, and security.

• Profound knowledge of and implementation experience with CI/CD processes, tooling (e.g. GitLab, Jenkins, Tekton, Argo Workflows, and Argo CD), concepts, and associated quality and security assurance for software delivery

• Fundamental understanding of core operations processes (incident management, change management, problem management, IT Service Management) as well as SRE concepts

• Experience in gathering operational insights from monitoring or observability, including SLI/SLA/SLO management and tracking.

• Hands-on experience in documenting procedures properly and enforcing clear runbooks or playbooks.

• Hands-on experience with monitoring and logging tools (e.g., Prometheus, Grafana, Datadog).

 

Must-have language skills:

• Proficiency in spoken and written English (at least C1).

 

Preferred experience

• Project experience in software engineering (in Go, C/C++ or Python) with significant experience in building RESTful services in distributed environments.