Site Reliability Engineer
Toronto, Ontario - Permanent
Site Reliability Engineers (SREs) are responsible for keeping production systems, both internal and external facing, running smoothly as a unit. You are responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning.
SREs should be pragmatic technical operators and software tool makers. You apply proven engineering principles and operational discipline, and you bring a mature perspective to the automation of production systems.
Responsibilities:
● Be responsible for protecting the health of all production systems
● Work with engineering teams to make deployment as boring as possible and to implement baseline technologies, policies and practices that ensure monitoring is baked in from the start
● Prevent outages from happening. You will help to stand up and support production monitoring systems. You will find ways to detect problematic symptoms in all parts of the stack before they result in a systems outage. You will also provide guidance to engineering on how to best leverage those monitoring systems.
● Design, build and maintain core production infrastructure pieces.
● Establish strong relationships with engineering and product teams to communicate the current state of production and what direction needs to be taken to ensure systems are scalable by design and not by accident.
● Indicate where “the safety rails are” - engineering will rely on you to provide guidelines on what constraints must be met to allow horizontal and vertical scaling of systems considering I/O and compute limitations.
● Debug production issues across services and at different levels of the stack.
● Plan and budget infrastructure
Must Have Skills:
● Embrace DevOps
● Have strong programming skills in Python or Golang, or other backend programming experience
● Know your way around standing up and maintaining monitoring infrastructure in a cloud environment, and providing guidance to engineering on how to maximize its utility
● Don’t like explaining things twice - so you document practices and processes incessantly
● Have experience with CircleCI, Drone.io, Kubernetes (k8s), Helm and Terraform
● Have experience with re-engineering monolithic services into microservices
● Believe that customers should be warned of service degradation, and that alerts to engineering should be both rare and noise-free
Projects you may work on:
● Migrating legacy systems from Google App Engine into a GKE or Anthos environment
● Assisting engineering with capacity planning and authoring minimum acceptable performance requirements for applications
● Working with engineering to improve our deployment and testing processes
● Implementing baseline template projects to assist engineering so that all application containers comply with minimum standards for access control, error handling and logging
● Managing our Airflow infrastructure, including monitoring, cluster scale-up and scale-out, and alerting engineering when cloud resource limits are approaching
● Our Google Cloud Platform architecture and implementation, including authentication, networking, data storage, data loss prevention and disaster recovery
● Compliance audits and requirements across our platform and cloud infrastructure, e.g. PCI, SOC 1 and SOC 2, penetration testing and remediation plans
● Crafting, implementing and refining controls over our Google Cloud infrastructure, including access control, incident response plans and audit/security logging