Site Reliability Engineer

CereCore

Nashville, TN
Full Time
Paid
  • Responsibilities

    Classification: Contract
    Contract Length: 12 months

    Position Summary

    As a Senior Site Reliability Engineer (SRE), you will apply SRE best practices to mission-critical applications across the enterprise. When these applications fail, you’ll have the skills and decision-making capabilities to quickly restore services, investigate the root cause, and develop a plan that mitigates future failures. You will spend time analyzing system performance and identifying ways to enhance the reliability of our environments, including developing dashboards, performing configuration changes, building robust monitoring systems, and learning how to leverage automation to drive efficiencies. You will help drive uptime and reliability across the enterprise.

    Responsibilities

    • Support system upgrades, architecture design, implementations, and deployments.
    • Work in a complex organization, navigate multiple verticals of expertise, and negotiate with, guide, direct, and influence your peers to deliver real solutions.
    • Maintain industry knowledge in software development, architecture, and development products, such as databases, security, and automation products.
    • Promote a collaborative team environment and work closely with colleagues to achieve business objectives.
    • Collaborate with stakeholders (e.g., business stakeholders, product owners, project managers, and end users) to understand functional and non-functional requirements.
    • Lead investigations into development and design problems and propose solutions.
    • Participate with team members in scope of work estimation and forecasting.
    • Improve performance of existing software by diagnosing and resolving critical issues.
    • Prepare technical documentation, including software & architectural design evaluation plans, data flow diagrams, test results, and technical manuals.
    • Adhere to and influence established development practices and processes.
    • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
    • Continually review technology, infrastructure, and code to enhance the applications and build resiliency into them.
    • Create sustainable systems and services through automation and uplifts.
    • Balance feature development & deployments with speed, reliability, and well-defined service-level objectives.
    • Partner with development teams and vendors of third-party applications to improve services through rigorous testing and release procedures.
    • Build automations that “self-heal” applications and reduce the toil of manual operational tasks. Pursue operational excellence, uptime, and reliability for our applications.
    • Participate in, lead, and drive postmortem analyses of why services broke or degraded, including recommendations for long-term fixes. This may require working across multiple teams and organizations within the enterprise. Determine the root cause of all production-level incidents and write corresponding high-quality RCA reports.

    Requirements

    • 5+ years of experience in software development or engineering roles
    • Bachelor's degree in Computer Science or related field
    • Knowledge of infrastructure, frameworks, and software/cloud design patterns for implementing applications in the cloud.
    • Experience using and implementing relevant tools and platforms, e.g., cloud platforms (IaaS and PaaS), web technologies, client-server technologies, and continuous integration and deployment.
    • Experience with version control (Git) and open-source practices.
    • Experience in one or more coding languages (JavaScript/TypeScript, C#, Python, Java, Swift, or Kotlin).
    • Experience with automation of CI/CD pipelines
    • Experience with IaC such as Terraform
    • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
    • Experience helping define SLIs, SLOs, and SLAs, and building observability to report on performance against those objectives.
    • Strong ability to communicate complex technical information in a condensed manner to various stakeholders verbally and in writing.
    • Ability to build and maintain strong cross-functional partnerships at all levels of the organization.
    • Ability to work, make aligned decisions, plan, and accomplish goals without explicit direction/guidance from leadership.
    • Experience with system architectures and how software systems interact and integrate.
    • Ability to evaluate new technologies and help senior leadership align them with the HCA Healthcare strategic roadmap.
    • Strong understanding of SRE practices and implementations.
    • Expertise in Linux and Windows systems administration and in managing systems through code.