LLM Evaluation Engineer
Job Description
Develop evaluation systems for AI behavior in enterprise environments using LLMs and real-time data. Collaborate on guardrails and enforcement strategies for AI compliance.
Responsibilities:
- Build the evaluation layer in the ThirdLaw platform for LLM prompts and responses
- Design and tune real-time guardrails, classifiers, and semantic judgment systems
- Implement evaluation strategies with semantic similarity, foundation model scoring, and rule-based systems
- Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking)
- Prototype, tune, and productize small language models for classification, labeling, or scoring
- Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage
- Build tools to observe, debug, and improve evaluator performance across data distributions
- Define abstractions for reusable evaluation components that can scale across use cases
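As a concrete illustration of the rule-based evaluation and enforcement responsibilities above, here is a minimal sketch of a reusable evaluator component. All names (`EvalResult`, `rule_evaluator`, the action labels) are hypothetical, not part of the ThirdLaw platform:

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical result type: a score in [0, 1] plus the
# enforcement action the evaluator recommends downstream.
@dataclass
class EvalResult:
    score: float
    action: str  # e.g. "allow", "redact", "block"

def rule_evaluator(patterns: dict[str, str]) -> Callable[[str], EvalResult]:
    """Build a rule-based evaluator from regex -> action mappings."""
    compiled = {re.compile(p, re.IGNORECASE): a for p, a in patterns.items()}

    def evaluate(text: str) -> EvalResult:
        # First matching rule wins; no match means the text passes.
        for pattern, action in compiled.items():
            if pattern.search(text):
                return EvalResult(score=0.0, action=action)
        return EvalResult(score=1.0, action="allow")

    return evaluate

# Usage: a reusable PII check that recommends redaction on SSN-like strings.
pii_check = rule_evaluator({r"\b\d{3}-\d{2}-\d{4}\b": "redact"})
print(pii_check("My SSN is 123-45-6789").action)  # redact
print(pii_check("All clear here").action)         # allow
```

In practice such rule-based checks would run alongside semantic and model-based evaluators, with the abstraction letting each component plug into the same enforcement pipeline.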
Requirements:
- 7+ years of experience in ML systems or AI engineering roles
- At least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
- Deep understanding of foundation models (e.g. OpenAI, Claude, Mistral, Llama) and their APIs
- Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embeddings pipelines
- Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
- Strong in Python, with familiarity using libraries like Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
- Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production
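The semantic-similarity evaluation mentioned in the requirements can be sketched with plain cosine similarity over embeddings. The toy 3-dimensional vectors below stand in for real model embeddings, and the threshold value is an illustrative assumption:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_score(response_emb: np.ndarray,
                   reference_embs: list[np.ndarray],
                   threshold: float = 0.8) -> tuple[float, bool]:
    """Score a response against reference embeddings.

    Returns the best similarity and whether it clears the threshold;
    a failing score could trigger escalation or blocking downstream.
    """
    best = max(cosine_similarity(response_emb, ref) for ref in reference_embs)
    return best, best >= threshold

# Toy embeddings stand in for outputs of a real embedding model.
refs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
resp = np.array([0.9, 0.1, 0.0])
score, passed = semantic_score(resp, refs)
print(round(score, 3), passed)
```

A production version would batch these comparisons through a vector index such as FAISS or Qdrant rather than a Python loop.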
Benefits:
- Generous, well-designed benefits
- Market-rate cash compensation
- Above-market equity