Observability Engineer – Prometheus

BEACHWOOD, OH | Transportation | Posted: 1 day ago

Job Description:
Our client is a global leader in transportation and logistics, operating one of the most complex infrastructure environments in the industry. You’ll step into an active build of an enterprise-wide observability platform running on Prometheus, Grafana, and Loki, where server telemetry is already flowing. You’ll help lead the development and launch of the next phase, which integrates all databases, storage, networking, and application transactions into one unified view. As the Observability Engineer, you will support the expansion and maturation of this platform. This is a high-visibility role with real influence over architecture decisions and the opportunity to help launch a tip-of-the-spear initiative combining infrastructure, automation, and emerging AI. If you’re a passionate observability engineer ready to build at enterprise scale, let’s talk!

Requirements:
• 3+ years of infrastructure or platform engineering experience with hands-on Prometheus exposure, including target scraping and PromQL querying
• 1+ years of experience building Grafana dashboards, with an eye for clean, readable data visualization
• Comfortable working in a Linux command line environment, including basic service management and file editing
• Ability to read and edit YAML and JSON files without syntax errors
• Interest in modern observability standards and tools such as OpenTelemetry and Loki

Nice to have:
• Familiarity with OpenTelemetry (OTel) as a vendor-neutral instrumentation standard
• Basic proficiency in Python, including experience querying APIs or manipulating time-series data
• Comfort working in containerized environments such as Docker or Kubernetes

Responsibilities:
• Onboard the remaining systems into the observability platform to expand telemetry coverage beyond basic OS monitoring to include database, storage, network, and application transaction metrics
• Develop and maintain integrations with third-party exporters and agents to ensure complete and reliable data ingestion across all system types
• Design and build multi-tier Grafana dashboards that correlate infrastructure health with application performance and transaction volumes
• Create views that surface key relationships (such as the impact of storage latency spikes on web transaction success rates) to enable faster root cause analysis
• Build and maintain Playwright scripts that simulate critical user transactions, ensuring synthetic uptime data is accurately captured and integrated into the broader monitoring environment
• Refine Alertmanager and Loki configurations to categorize telemetry into clearly defined tiers (Events, Warnings, and Criticals) to reduce alert fatigue for on-call teams
• Audit and enforce consistent labeling conventions across all telemetry sources to ensure time-series data is uniformly structured and ready to support future anomaly-detection models
• Maintain internal wiki documentation covering telemetry schemas, exporter configurations, and operational runbooks
• Collaborate with engineering teams to identify gaps in observability coverage and prioritize improvements that align monitoring maturity with business and reliability goals