Site Reliability Engineer (SRE)

KeyDev
Minsk

Job description

KeyDev is looking for a Site Reliability Engineer (Application Support Team) to join our Infrastructure and Operations Department.

Main responsibilities:

  • Maintain and enhance monitoring and logging infrastructure.
  • Improve observability processes and implement predictive failure analysis.
  • Optimize alerting systems: reduce noise, fine-tune critical metrics.
  • Define key monitoring parameters and enhance visibility.
  • Support and improve both cloud-based and on-premise environments.
  • Automate processes and configuration management using Infrastructure as Code (IaC) principles.
  • Train and mentor 24/7 App Support staff.
  • Develop runbooks, documentation, and troubleshooting guides.
  • Analyze incidents, identify patterns, and drive proactive monitoring improvements.
  • Establish and support the Monitoring & Diagnostics group within App Support.
  • Develop intelligent troubleshooting instructions for faster incident resolution.
  • Optimize existing monitoring by reducing unnecessary alerts and adding meaningful metrics.
  • Enhance reliability through structured incident management and post-mortem analysis.
  • Implement GitOps best practices for managing infrastructure and configuration.

Requirements:

  • Advanced Linux user with strong command-line and diagnostic skills.
  • 4+ years of experience as an SRE/Monitoring Engineer.
  • Strong understanding of monitoring, logging, and observability in production environments.
  • Experience optimizing alerting systems and implementing predictive analytics.
  • Hands-on experience managing both cloud and on-premise solutions.
  • Automation skills using Python or Go.
  • Proficiency with configuration management tools (Ansible, Terraform).
  • Solid grasp of networking principles and protocols.
  • Understanding of information security principles.
  • Experience with CI/CD pipelines (GitLab, Jenkins).
  • Familiarity with orchestrators (Kubernetes, Rancher).
  • Experience documenting workflows and training support teams.
  • Ability to create intelligent troubleshooting instructions.
  • Skills in incident analysis and pattern recognition.

Nice to Have:

  • Experience working with high-load systems.
  • Deep understanding of APM tools (New Relic, Datadog, etc.).
  • Database and message queue performance tuning.
  • Advanced knowledge of ML-driven monitoring and predictive analysis.
  • Experience with automated incident response (self-healing systems).

Soft Skills:

  • Responsibility, initiative, and strong analytical thinking.
  • Ability to collaborate effectively within a team.
  • Focus on automation and process improvement.
  • Strong documentation and knowledge-sharing skills.
  • Capability to diagnose complex incidents and provide actionable insights.

Benefits:

  • Take advantage of 25 paid calendar vacation days to explore, relax, and unwind.
  • Join us for exciting corporate events that foster team spirit and fun!
  • Indulge in a variety of snacks available in the office.

We will tell you more about all the benefits at the interview :)

This is a prospective position that we plan to open.

Employer

KeyDev

KeyDev is your reliable partner in the world of technology and innovation. Founded in 2024 in Minsk, we are committed to providing high-quality, integrated IT solutions...
