LLM Evaluation Engineer

Posted 2 days ago


Job Description

Develop evaluation systems for AI behavior in enterprise environments using LLMs and real-time data. Collaborate on guardrails and enforcement strategies for AI compliance.

Responsibilities:

  • Build the evaluation layer in the ThirdLaw platform for LLM prompts and responses
  • Design and tune guardrails, classifiers, and semantic judgment systems that operate in real time
  • Implement evaluation strategies with semantic similarity, foundation model scoring, and rule-based systems
  • Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking)
  • Prototype, tune, and productize small language models for classification, labeling, or scoring
  • Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage
  • Build tools to observe, debug, and improve evaluator performance across data distributions
  • Define abstractions for reusable evaluation components that can scale across use cases
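To give a flavor of the evaluation strategies above, here is a minimal sketch of combining a semantic-similarity score with a rule-based check that maps to an enforcement action. All names, thresholds, and the toy vectors are hypothetical illustrations, not part of the ThirdLaw platform:

```python
import math
import re

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rule-based check: a simple pattern for US SSN-formatted strings (illustrative).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def evaluate(response_text, response_emb, reference_emb, threshold=0.8):
    """Return an enforcement action for one LLM response.

    Hypothetical logic: a rule-based PII check triggers redaction,
    and a low semantic-similarity score against a reference embedding
    escalates the response for review.
    """
    if SSN_PATTERN.search(response_text):
        return "redact"      # rule fires: strip PII before delivery
    score = cosine_similarity(response_emb, reference_emb)
    if score < threshold:
        return "escalate"    # semantically off-policy: route to human review
    return "allow"
```

In a production evaluation layer, the toy cosine function and regex would be replaced by real embedding models and a library of tuned classifiers, but the shape (score, compare against policy, emit an enforcement action) is the pattern the role centers on.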

Requirements:

  • 7+ years of experience in ML systems or AI engineering roles
  • 1–2+ years working directly with LLMs, NLP pipelines, or semantic search
  • Deep understanding of foundation models (e.g. OpenAI, Claude, Mistral, Llama) and APIs
  • Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embeddings pipelines
  • Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
  • Strong Python skills, with familiarity with libraries like Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
  • Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production

Benefits:

  • Generous, well-designed benefits
  • Market-rate cash compensation
  • Above-market equity