PromptEval Documentation

Professional testing framework for LLM applications. Validate AI responses with ML-powered semantic matching.

v1.0.0 · Python 3.8+ · MIT License

Quick Start

Get started with PromptEval in under 5 minutes:

# Install PromptEval
pip install prompteval

# Create a test file
cat > test_example.yaml << EOF
tests:
  - name: greeting_test
    prompt: "Say hello"
    expected: "Hello! How can I help you?"
    threshold: 0.85
EOF

# Run tests
prompteval run test_example.yaml

Installation

Requirements

  • Python 3.8 or higher
  • pip or conda
  • API key from your LLM provider

Install via pip

pip install prompteval

Install from source

git clone https://github.com/prompteval/prompteval.git
cd prompteval
pip install -e .

Verify installation

prompteval --version
# Output: PromptEval v1.0.0

Configuration

Configure PromptEval using environment variables or a config file:

Environment Variables

# API Keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Model Settings
export PROMPTEVAL_MODEL="gpt-4"
export PROMPTEVAL_TEMPERATURE="0.7"

# Semantic Matching
export PROMPTEVAL_THRESHOLD="0.85"
export PROMPTEVAL_EMBEDDING_MODEL="text-embedding-ada-002"

Config File

# prompteval.yaml
api:
  provider: openai
  model: gpt-4
  temperature: 0.7
  
semantic:
  threshold: 0.85
  embedding_model: text-embedding-ada-002
  
output:
  format: html
  path: ./reports
  verbose: true

Writing YAML Tests

Define your tests in simple, readable YAML files:

Basic Test Structure

tests:
  - name: customer_support_greeting
    prompt: "Greet the customer"
    expected: "Hello! How can I help you today?"
    threshold: 0.85
    
  - name: product_recommendation
    prompt: "Recommend a laptop for coding"
    expected: "I recommend a laptop with good CPU and RAM"
    threshold: 0.80
    contains:
      - "CPU"
      - "RAM"
      - "laptop"

Advanced Features

tests:
  - name: complex_test
    prompt: "Explain quantum computing"
    
    # Multiple validation methods
    expected: "Quantum computing uses quantum mechanics"
    threshold: 0.85
    
    # Required keywords
    contains:
      - "quantum"
      - "computing"
    
    # Forbidden content
    not_contains:
      - "classical"
      - "traditional"
    
    # Length constraints
    min_length: 50
    max_length: 500
    
    # Custom validation (example validator sketched below)
    custom_validator: "validators.check_technical_accuracy"
    
    # Timeout
    timeout: 30
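
The custom_validator value is a dotted import path to a Python callable. As a reference point, here is a minimal sketch of such a validator; the signature (response text in, boolean out) is an assumption, since the exact contract is not specified above:

# validators.py
# Hypothetical validator for the test above. Assumed contract:
# PromptEval imports "validators.check_technical_accuracy" and calls
# it with the model's response, treating a False return as a failure.
def check_technical_accuracy(response: str) -> bool:
    """Pass only if the response uses at least one core quantum term."""
    required_terms = ("qubit", "superposition", "entanglement")
    return any(term in response.lower() for term in required_terms)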

Semantic Validation

PromptEval uses ML-powered semantic matching to validate meaning, not just text:

Traditional Testing ❌

assert response == "Hello, how can I help?"
# Fails if response is "Hi! How may I assist?"

PromptEval ✅

validator = SemanticValidator(threshold=0.85)
similarity = validator.compare(
    text1="Hi! How may I assist?",
    text2="Hello, how can I help?",
)
# Passes: 94% similarity clears the 0.85 threshold

How It Works

  1. Convert both texts to embeddings using the configured model (text-embedding-ada-002 by default)
  2. Calculate the cosine similarity between the two vectors
  3. Compare the score against the threshold (default 0.85); see the sketch below
  4. Generate a detailed similarity report
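
A minimal sketch of steps 1–3, assuming OpenAI's embeddings endpoint as the backend; PromptEval's internals may differ:

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    # Step 1: convert text to an embedding vector
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def semantic_match(response: str, expected: str, threshold: float = 0.85) -> bool:
    a, b = embed(response), embed(expected)
    # Step 2: cosine similarity between the two vectors
    similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Step 3: pass if the score clears the threshold
    return similarity >= threshold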

HTTP Adapter

Test any HTTP API by describing the request and the shape of its response in YAML:

OpenAI

tests:
  - name: openai_test
    adapter: http
    endpoint: https://api.openai.com/v1/chat/completions
    method: POST
    headers:
      Authorization: "Bearer ${OPENAI_API_KEY}"
      Content-Type: application/json
    body:
      model: gpt-4
      messages:
        - role: user
          content: "{{prompt}}"
    response_path: choices[0].message.content
    expected: "{{expected}}"

Anthropic Claude

tests:
  - name: claude_test
    adapter: http
    endpoint: https://api.anthropic.com/v1/messages
    method: POST
    headers:
      x-api-key: "${ANTHROPIC_API_KEY}"
      anthropic-version: "2023-06-01"
      Content-Type: application/json
    body:
      model: claude-3-opus-20240229
      max_tokens: 1024
      messages:
        - role: user
          content: "{{prompt}}"
    response_path: content[0].text
    expected: "{{expected}}"

Custom API

tests:
  - name: custom_api_test
    adapter: http
    endpoint: https://your-api.com/chat
    method: POST
    headers:
      Authorization: "Bearer ${YOUR_API_KEY}"
    body:
      input: "{{prompt}}"
    response_path: data.response
    expected: "{{expected}}"
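
All three adapters share the same mechanics: {{prompt}} and {{expected}} are filled in from the test case, ${VAR} placeholders are expanded from the environment, and response_path selects the answer text out of the JSON response. A hypothetical sketch of how a path like choices[0].message.content could be resolved (the actual parser may differ):

import re

def resolve_path(payload, path):
    """Walk parsed JSON using dotted keys and [index] segments."""
    for part in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        payload = payload[int(part[1:-1])] if part.startswith("[") else payload[part]
    return payload

data = {"choices": [{"message": {"content": "Hi!"}}]}
assert resolve_path(data, "choices[0].message.content") == "Hi!"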

CI/CD Integration

GitHub Actions

name: LLM Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      
      - name: Install PromptEval
        run: pip install prompteval
      
      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: prompteval run tests/
      
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: test-report
          path: reports/

GitLab CI

llm_tests:
  stage: test
  image: python:3.9
  script:
    - pip install prompteval
    - prompteval run tests/
  artifacts:
    paths:
      - reports/
  only:
    - main
    - merge_requests

API Reference

Command Line

# Run tests
prompteval run <file_or_directory>

# Run with custom config
prompteval run tests/ --config=custom.yaml

# Generate report only
prompteval report results.json

# Validate YAML syntax
prompteval validate tests/

# Show version
prompteval --version

Python API

from prompteval import TestRunner, SemanticValidator

# Initialize runner
runner = TestRunner(config_path="prompteval.yaml")

# Run tests
results = runner.run("tests/")

# Semantic validation
validator = SemanticValidator(threshold=0.85)
similarity = validator.compare(
    text1="Hello, how can I help?",
    text2="Hi! How may I assist?"
)

print(f"Similarity: {similarity:.2%}")

Examples

Customer Support Bot

tests:
  - name: greeting
    prompt: "Hello"
    expected: "Hi! How can I help you?"
    threshold: 0.85
    
  - name: refund_request
    prompt: "I want a refund"
    expected: "I'll help you with that refund"
    contains: ["refund", "help"]
    
  - name: product_question
    prompt: "Tell me about your premium plan"
    expected: "Our premium plan includes..."
    contains: ["premium", "plan"]

Code Assistant

tests:
  - name: python_function
    prompt: "Write a function to reverse a string"
    contains:
      - "def"
      - "return"
      - "[::-1]" or "reversed"
    not_contains:
      - "import"
    
  - name: code_explanation
    prompt: "Explain what this does: [1,2,3].map(x => x*2)"
    expected: "This maps each element and doubles it"
    threshold: 0.80

Content Moderation

tests:
  - name: safe_content
    prompt: "Is this safe: 'Hello friend'"
    expected: "Yes, this content is safe"
    threshold: 0.85
    
  - name: unsafe_content
    prompt: "Is this safe: [offensive content]"
    expected: "No, this content violates policy"
    contains: ["no", "violate", "unsafe"]

Ready to Get Started?

Join 50+ companies using PromptEval to ship AI features faster.
