- Monitor systems using tools such as Prometheus, Grafana, and Stackdriver; build alerts and dashboards to ensure observability and uptime.
- Participate in incident response, root cause analysis, and ...