Site Reliability Engineer (SRE)

Remote / Toronto, Ontario, Canada - Permanent
This job allows you to work remotely 

Job Description

As an SRE, you will combine software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SREs ensure that business-critical services are reliable, meet the uptime required by their SLAs, and improve at a fast rate. SREs also manage system capacity and performance, right-sizing the infrastructure where opportunities are identified. Much of our infrastructure development focuses on optimizing existing systems, building infrastructure, and eliminating manual work through automation.

You will manage the complexity of scale while applying your knowledge and skills in infrastructure analysis and system design. You will expand monitoring capabilities across various systems and establish reporting metrics in a format that business leadership can readily consume.

· Focus on the lifecycle of infrastructure services, from inception to operational readiness, while delivering continuous improvement opportunities
· Implement service level objectives (SLOs)
· Support business-critical services through:
    · Launch readiness
    · Capacity planning
    · System design and implementation
    · Software development frameworks and solutions
    · Business and technology consultancy
· Maintain business-critical services at production level by measuring and monitoring availability and overall system health
· Scale systems sustainably through automation and other technical capabilities
· Partner with DevOps and IT Ops to rationalize maturing capabilities and target state requirements to sustain delivery of service for the company

Must Have Skills:

Minimum qualifications:
· BSc in Computer Science or a related technical field involving system engineering
· Experience programming in one or more languages such as Java, Python, or another suitable language
· Experience with algorithms and data structures

Preferred qualifications:
· Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
· Experience with Infrastructure as Code tools (Terraform, Puppet, Ansible)
· Experience with DevOps practices
· Experience with monitoring platforms and solutions (AppDynamics, New Relic, Dynatrace, Nagios)
· Understanding of Unix/Linux operating systems
· Ability to troubleshoot and optimize code, and to automate routine tasks as they are identified
· Ability to implement monitoring platforms and capabilities to cover business critical services

Knowledge, Skills & Abilities:
· Exhibits expert knowledge of all aspects of technology and of architecture concepts and practices
· Able to drive from strategy to architecture to design & implementation
· Demonstrates ability to work through various levels of program maturity and define target state architectures for company alignment and execution
· Possesses expert knowledge of Infrastructure as Code practices
· Able to implement monitoring capabilities that help mature overall ITSM programs
· Demonstrates expert knowledge of architecture frameworks and methodologies
· Shows a growth mindset and self-empowerment
· Possesses advanced communication skills, both written and oral
· Guides application owners and developers along the cloud journey
· Exhibits strong influencing skills to sell architectural ideas to business partners
· Exhibits solid relationship and managerial leadership abilities


Starting: ASAP
