📚 Playground Documentation

Learn how to use the LLM testing playground features and tools

Table of Contents

• Getting Started with BYOK
• Side-by-Side Prompt Testing
• Document-Based RAG Testing
• Model Fine-Tuning Evaluation
• Cross-Provider Model Testing
• Model Parameter Optimization

Getting Started with BYOK

Set up your API keys to start testing LLMs. Our Bring Your Own Key (BYOK) system gives you complete control over your API keys and costs.

Features:
• Bring Your Own Key (BYOK) - No server-side API keys
• Local storage for key management
• Support for Anthropic (Claude) and OpenAI (GPT) models
• Visual key configuration interface
• Clear all keys functionality
• Security-first approach

How to Set Up Your API Keys

1. Click "Start Here"

In the playground sidebar, click the "Start Here" button at the top. This opens the API key configuration modal.

2. Enter Your API Keys

Add your Anthropic API key for Claude models and/or your OpenAI API key for GPT models. You can use one or both providers.

3. Save and Start Testing

Click "Save Keys" to store your keys locally. You can now use all the playground features with your own API keys.

API Key Sources

🤖 Anthropic (Claude)

Get your API key from the Anthropic Console:

console.anthropic.com →

⚡ OpenAI (GPT)

Get your API key from the OpenAI Platform:

platform.openai.com/api-keys →

Security & Privacy

🔒 Your Keys, Your Control

• API keys are stored in your browser's local storage only
• Keys are never sent to our servers or stored on our systems
• You have complete control over your keys and can clear them anytime
• Keys are only used to make direct API calls to the respective providers
• No server-side API key fallbacks or caching

Managing Your Keys

📝 View Current Keys

When you open the API key modal, it automatically loads and displays your currently configured keys with visual indicators.

🔄 Update Keys

You can update your keys anytime by clicking "Start Here" again. The modal will show your current keys for easy editing.

🗑️ Clear All Keys

Use the "Clear All Keys" button in the modal to remove all stored API keys from your browser's local storage.
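
Under the hood, the whole BYOK workflow reduces to a few local-storage calls. A minimal sketch, assuming illustrative storage key names (the playground's actual names may differ):

```typescript
// Minimal sketch of BYOK key management in the browser.
// The storage key names below are assumptions for illustration only.
const STORAGE_KEYS = {
  anthropic: "playground.anthropicKey",
  openai: "playground.openaiKey",
} as const;

type Provider = keyof typeof STORAGE_KEYS;

// Called when "Save Keys" is clicked: persist whatever the user entered.
function saveKey(provider: Provider, apiKey: string): void {
  localStorage.setItem(STORAGE_KEYS[provider], apiKey.trim());
}

// Called when the modal opens: load current keys for the visual indicators.
function loadKey(provider: Provider): string | null {
  return localStorage.getItem(STORAGE_KEYS[provider]);
}

// Called by "Clear All Keys": remove every stored key from local storage.
function clearAllKeys(): void {
  Object.values(STORAGE_KEYS).forEach((key) => localStorage.removeItem(key));
}
```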

Side-by-Side Prompt Testing

Test and compare multiple prompts simultaneously with a 4-pane text editor interface.

Features:
• 4 independent prompt editors
• Real-time prompt comparison
• Model parameter controls
• Automatic result scrolling
• TanStack Query integration

How to Use

1. Write Prompts

Enter different prompts in each of the 4 text editor panes. Each pane represents a separate prompt strategy.

2. Set Test Input

Enter a test input that will be sent to all prompts for comparison.

3. Test All Prompts

Click "Test All Prompts" to send your test input to all prompts simultaneously and compare responses.

Document-Based RAG Testing

Test Retrieval-Augmented Generation with your documents and additional context.

Features:
• Upload .txt and .json files
• Drag & drop file support
• Additional context input
• Real-time document preview
• Combined document + context queries

How to Use

1. Upload Documents

Upload .txt or .json files using the file picker or drag & drop. Each document will be included in the RAG context.

2. Add Context (Optional)

Enter additional context or API data in the context text box. This will be combined with your documents.

3. Ask Questions

Enter your question in the query text box and click "Test RAG" to get answers based on your documents and context.
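
Conceptually, the RAG request combines the uploaded documents, the optional context, and your question into a single prompt. A sketch of that assembly, with a prompt template that is illustrative rather than the playground's exact wording:

```typescript
// Combine uploaded documents, optional context, and the question into one prompt.
interface RagDocument {
  name: string;     // e.g. "notes.txt" or "products.json"
  content: string;  // extracted text or stringified JSON
}

function buildRagPrompt(
  documents: RagDocument[],
  additionalContext: string,
  question: string,
): string {
  const docSection = documents
    .map((doc) => `--- Document: ${doc.name} ---\n${doc.content}`)
    .join("\n\n");

  return [
    "Answer the question using only the documents and context below.",
    docSection,
    additionalContext ? `--- Additional context ---\n${additionalContext}` : "",
    `Question: ${question}`,
  ]
    .filter(Boolean)
    .join("\n\n");
}
```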

Supported File Types

📄 Text Files (.txt)

Plain text documents with direct content extraction.

📊 JSON Files (.json)

Structured data files with JSON validation.
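
A sketch of how the two upload paths could be handled with the browser File API, assuming a simple extraction helper: .txt content is used as-is, while .json content is parsed so malformed files can be rejected early:

```typescript
// Extract document text from an uploaded file for the RAG context.
async function extractDocumentText(file: File): Promise<string> {
  const raw = await file.text();

  if (file.name.toLowerCase().endsWith(".json")) {
    try {
      // Validate and pretty-print the JSON so the document preview stays readable.
      return JSON.stringify(JSON.parse(raw), null, 2);
    } catch {
      throw new Error(`${file.name} is not valid JSON`);
    }
  }

  // .txt files: direct content extraction, no transformation needed.
  return raw;
}
```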

Model Fine-Tuning Evaluation

Compare fine-tuned models with base models to evaluate the effectiveness of your training data and techniques. Test performance, consistency, and quality improvements across different scenarios.

Features:
• Side-by-side fine-tuned vs base model comparison
• Dataset management and validation tools
• Performance metrics and analysis dashboard
• Training data visualization and quality assessment
• A/B testing for fine-tuning effectiveness
• Cost-benefit analysis of fine-tuning projects

How Fine-Tuning Works

1. Dataset Preparation

Prepare high-quality training data in the correct format for your target model. Include diverse examples that represent your specific use case and desired outputs.
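
For OpenAI-style chat fine-tuning, the dataset is JSONL with one chat-format training example per line. A small helper like the following (the example shape and system prompt are assumptions) shows the target format:

```typescript
// Convert raw examples into the JSONL chat format expected by OpenAI fine-tuning:
// one JSON object per line, each containing a "messages" array.
interface TrainingExample {
  userInput: string;
  idealResponse: string;
}

function toJsonl(examples: TrainingExample[], systemPrompt: string): string {
  return examples
    .map((example) =>
      JSON.stringify({
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: example.userInput },
          { role: "assistant", content: example.idealResponse },
        ],
      }),
    )
    .join("\n");
}
```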

2. Model Training

Use the provider's fine-tuning API to train your model on the prepared dataset. Monitor training progress and validate results during the process.

3. Evaluation & Comparison

Test your fine-tuned model against the base model using the same prompts and scenarios. Compare performance, quality, and consistency to measure improvement.
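
A sketch of such an A/B run against the OpenAI API, assuming the usual ft:... naming for the fine-tuned model (the id below is a placeholder) and a browser-stored key:

```typescript
// Send the same prompts to a base model and a fine-tuned model and collect
// responses plus token usage for side-by-side comparison.
import OpenAI from "openai";

async function compareModels(
  apiKey: string,
  prompts: string[],
  baseModel = "gpt-4o-mini",
  fineTunedModel = "ft:gpt-4o-mini:your-org::example", // placeholder model id
) {
  const client = new OpenAI({ apiKey, dangerouslyAllowBrowser: true });

  const ask = (model: string, prompt: string) =>
    client.chat.completions.create({
      model,
      messages: [{ role: "user", content: prompt }],
    });

  return Promise.all(
    prompts.map(async (prompt) => {
      const [base, tuned] = await Promise.all([
        ask(baseModel, prompt),
        ask(fineTunedModel, prompt),
      ]);
      return {
        prompt,
        base: { text: base.choices[0].message.content, tokens: base.usage?.total_tokens },
        tuned: { text: tuned.choices[0].message.content, tokens: tuned.usage?.total_tokens },
      };
    }),
  );
}
```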

Supported Fine-Tuning Providers

🤖 Anthropic Fine-Tuning

• Claude 3.5 Sonnet fine-tuning
• Custom training data upload
• Training progress monitoring
• Model performance analytics
• Cost tracking and optimization

⚡ OpenAI Fine-Tuning

• GPT-4o and GPT-3.5 Turbo fine-tuning
• JSONL dataset format support
• Hyperparameter optimization
• Training job management
• Model deployment tools

Evaluation Metrics

📊 Performance Metrics

• Response quality comparison
• Consistency across test cases
• Token usage efficiency
• Response time analysis
• Cost per response tracking

🎯 Quality Assessment

• Relevance to training data
• Adherence to desired format
• Domain-specific accuracy
• Hallucination detection
• Bias and safety evaluation

Best Practices

📝 Dataset Quality

• Use high-quality, diverse training examples
• Ensure consistent formatting and structure
• Include edge cases and error scenarios
• Validate data quality before training
• Balance dataset size with quality

🔧 Training Strategy

• Start with smaller datasets for validation
• Use appropriate learning rates and epochs
• Monitor for overfitting during training
• Implement early stopping mechanisms
• Test on held-out validation data

📈 Evaluation Process

• Test on diverse, real-world scenarios
• Compare against multiple baseline models
• Measure both quantitative and qualitative metrics
• Consider cost-benefit analysis
• Iterate based on evaluation results

Use Cases

🎯 Domain-Specific Tasks

Fine-tune models for specific industries like legal, medical, financial, or technical documentation with specialized terminology.

💬 Conversational AI

Improve chatbot responses, customer service interactions, and conversational flows with custom training data.

📝 Content Generation

Optimize for specific writing styles, formats, or content types like marketing copy, technical writing, or creative content.

🔍 Data Analysis

Enhance models for specific data analysis tasks, report generation, or business intelligence applications.

🚀 Coming Soon Features

Advanced Analytics

• Training progress visualization
• Performance trend analysis
• Cost optimization recommendations
• Model comparison dashboards

Evaluation Tools

• Automated quality assessment
• A/B testing frameworks
• Bias detection algorithms
• Safety evaluation metrics

Cross-Provider Model Testing

Compare responses across different LLM providers and models to find the best solution for your specific use case.

Features:
• Multi-provider support (Anthropic & OpenAI)
• Side-by-side response comparison
• Token usage tracking
• Model selection interface
• Unified prompt testing

Available Models

🤖 Anthropic Models

• Claude 3.5 Sonnet - Most capable model
• Claude 3 Haiku - Fast and efficient
• Claude 3 Opus - Most powerful of the Claude 3 family
• Claude 3 Sonnet - Balanced performance

⚡ OpenAI Models

• GPT-4o - Latest and most capable
• GPT-4o Mini - Fast and cost-effective
• GPT-4 Turbo - High performance
• GPT-3.5 Turbo - Fast and reliable

How to Use

1. Select Models

Choose 2-4 models from the available Anthropic and OpenAI options. You can mix and match providers for comprehensive comparison.

2. Write Your Prompt

Enter a single prompt that will be sent to all selected models. The same prompt ensures fair comparison across providers.

3. Compare Results

View side-by-side responses from all models with token usage and performance metrics for each provider.
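
A sketch of a two-model comparison using the official Anthropic and OpenAI SDKs with browser-stored keys, normalizing the responses and token counts into a common shape (model ids are examples; the playground's selection UI supplies the real list):

```typescript
// Send one prompt to an Anthropic model and an OpenAI model, then normalize
// the results so they can be rendered side by side with token usage.
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

interface ComparisonResult {
  provider: "anthropic" | "openai";
  model: string;
  text: string | null;
  inputTokens?: number;
  outputTokens?: number;
}

async function compareProviders(
  keys: { anthropic: string; openai: string },
  prompt: string,
): Promise<ComparisonResult[]> {
  const anthropic = new Anthropic({ apiKey: keys.anthropic, dangerouslyAllowBrowser: true });
  const openai = new OpenAI({ apiKey: keys.openai, dangerouslyAllowBrowser: true });

  const [claude, gpt] = await Promise.all([
    anthropic.messages.create({
      model: "claude-3-5-sonnet-20240620", // illustrative model id
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
    openai.chat.completions.create({
      model: "gpt-4o", // illustrative model id
      messages: [{ role: "user", content: prompt }],
    }),
  ]);

  const claudeBlock = claude.content[0];
  return [
    {
      provider: "anthropic",
      model: claude.model,
      text: claudeBlock?.type === "text" ? claudeBlock.text : null,
      inputTokens: claude.usage.input_tokens,
      outputTokens: claude.usage.output_tokens,
    },
    {
      provider: "openai",
      model: gpt.model,
      text: gpt.choices[0].message.content,
      inputTokens: gpt.usage?.prompt_tokens,
      outputTokens: gpt.usage?.completion_tokens,
    },
  ];
}
```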

Use Cases

🎯 Performance Testing

Compare response quality, speed, and token efficiency across different models.

💰 Cost Analysis

Evaluate token usage and cost-effectiveness for different providers.

🔍 Capability Testing

Test which models perform best for specific tasks or domains.

📊 Provider Comparison

Compare Anthropic vs OpenAI models for your specific use cases.

Model Parameter Optimization

Test and compare different model parameters like temperature, top_p, and penalties to find the optimal settings for your specific use case.

Features:
• Predefined parameter sets (Default, Creative, Focused, Precise)
• Interactive parameter editing with modal interface
• Temperature, top_p, frequency penalty, presence penalty controls
• Side-by-side parameter comparison
• Real-time parameter adjustment

How to Use

1. Select Parameter Sets

Choose 2-4 predefined parameter sets to compare. Each set has different temperature, top_p, and penalty values optimized for specific use cases.

2. Customize Parameters

Click the pencil icon next to any parameter set to open the editing modal. Adjust temperature, top_p, penalties, and max tokens as needed.

3. Test and Compare

Write a prompt and test input, then run the test to see how different parameter settings affect the model's responses.

Parameter Sets

🎯 Default

Balanced settings (temperature: 0.7, top_p: 1.0) for general use cases with moderate creativity and consistency.

🎨 Creative

High creativity (temperature: 0.9, top_p: 0.9) for brainstorming, creative writing, and innovative solutions.

🔍 Focused

Low randomness (temperature: 0.3, top_p: 0.8) for analytical tasks, factual responses, and consistent outputs.

⚡ Precise

Very controlled (temperature: 0.1, top_p: 0.7) for precise answers, technical documentation, and deterministic responses.
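
Expressed as request parameters, the four sets might look like the sketch below; the temperature and top_p values match the descriptions above, while the penalty values and max_tokens are illustrative defaults:

```typescript
// The four predefined sets expressed as chat-completion request parameters.
import OpenAI from "openai";

interface ParameterSet {
  temperature: number;
  top_p: number;
  frequency_penalty: number;
  presence_penalty: number;
  max_tokens: number;
}

const PARAMETER_SETS: Record<string, ParameterSet> = {
  default:  { temperature: 0.7, top_p: 1.0, frequency_penalty: 0, presence_penalty: 0, max_tokens: 1024 },
  creative: { temperature: 0.9, top_p: 0.9, frequency_penalty: 0, presence_penalty: 0, max_tokens: 1024 },
  focused:  { temperature: 0.3, top_p: 0.8, frequency_penalty: 0, presence_penalty: 0, max_tokens: 1024 },
  precise:  { temperature: 0.1, top_p: 0.7, frequency_penalty: 0, presence_penalty: 0, max_tokens: 1024 },
};

// Testing a set is the same prompt and model with the set spread into the request.
async function runWithSet(apiKey: string, prompt: string, set: ParameterSet) {
  const client = new OpenAI({ apiKey, dangerouslyAllowBrowser: true });
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model id
    messages: [{ role: "user", content: prompt }],
    ...set,
  });
  return response.choices[0].message.content;
}
```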

Use Cases

🎯 Parameter Optimization

Find the best parameter settings for your specific tasks and use cases through systematic testing.

📊 Response Analysis

Understand how different parameters affect response quality, creativity, and consistency.

🔧 Model Tuning

Fine-tune model behavior for specific applications like creative writing, technical analysis, or customer support.

📈 Performance Testing

Compare parameter effects on response time, token usage, and overall performance metrics.