Senior Data Engineer – AWS, RAG Pipelines

Posted 1 hour ago

Job Description

The Senior Data Engineer designs and operates cloud data infrastructure for AI initiatives, building data lakes on AWS and real-time pipelines for RAG systems.

Responsibilities:

  • Design and operate the cloud data infrastructure powering AI initiatives.
  • Architect production-scale data lakes on AWS.
  • Build real-time ingestion and observability pipelines.
  • Own the vector search and embedding layers that feed RAG systems and autonomous agents.

Requirements:

  • Overall Experience: 7+ years in Data Engineering, Distributed Systems, or Data Architecture
  • AWS & Infrastructure: 4+ years architecting production-scale data lakes, storage tiers, and event streaming
  • AI/LLM Pipelines: 2+ years building RAG systems, managing embeddings, and orchestrating foundation models
  • Proficiency in AWS Data Lake Architecture & Storage
  • Proficiency in Real-Time Observability & Log Analytics
  • Proficiency in Elasticsearch & OpenSearch Optimization, Vectorization, Embeddings
  • Proficiency in Amazon Bedrock & Generative AI Pipelines
  • Proficiency in Software Engineering & API Ingestion
  • Production-level proficiency in one or more of: C# (.NET Core), Java, Python, or Node.js
  • AWS S3 partitioning strategies, lifecycle policies, and columnar formats (Parquet, Iceberg)
  • AWS Glue Data Catalog and Lake Formation for multi-tenant, fine-grained access control
  • Query optimization over petabyte-scale datasets using Amazon Athena and Redshift Spectrum
  • Distributed OpenTelemetry (OTel) Collector configuration for capturing logs, traces, and metrics and routing them into S3
  • High-volume streaming of system logs, Datadog captures, and raw server events into S3
  • Real-time CDC from PostgreSQL using Debezium or AWS DMS
  • Amazon OpenSearch clusters with simultaneous lexical and high-dimensional vector search
  • OpenSearch index lifecycle management, sharding strategies, and dynamic mappings at scale
  • Amazon Bedrock foundation model APIs (Claude, Titan) for data enrichment, classification, and semantic parsing
  • Knowledge Bases for Amazon Bedrock for automatic chunking, metadata extraction, and vector index syncs from S3
  • ETL/ELT pipelines ingesting unstructured event data from SaaS APIs (e.g., Pendo, Hotjar, Google Analytics)
  • MCP server development to expose data lake context and utilities to AI agents

Benefits:

  • Remote work.
  • 13 floating holidays.
  • 15 vacation days per completed year.
  • Good working environment.