Learn how to use the LLM testing playground features and tools
Set up your API keys to start testing LLMs. Our Bring Your Own Key (BYOK) system gives you complete control over your API keys and costs.
Features:
• Bring Your Own Key (BYOK) - No server-side API keys
• Local storage for key management
• Support for Anthropic (Claude) and OpenAI (GPT) models
• Visual key configuration interface
• Clear all keys functionality
• Security-first approach
In the playground sidebar, click the "Start Here" button at the top. This opens the API key configuration modal.
Add your Anthropic API key for Claude models and/or your OpenAI API key for GPT models. You can use one or both providers.
Click "Save Keys" to store your keys locally. You can now use all the playground features with your own API keys.
When you open the API key modal, it automatically loads and displays your currently configured keys with visual indicators.
You can update your keys anytime by clicking "Start Here" again. The modal will show your current keys for easy editing.
Use the "Clear All Keys" button in the modal to remove all stored API keys from your browser's local storage.
Test and compare multiple prompts simultaneously with a 4-pane text editor interface.
Features:
• 4 independent prompt editors
• Real-time prompt comparison
• Model parameter controls
• Automatic result scrolling
• TanStack Query integration
Enter different prompts in each of the 4 text editor panes. Each pane represents a separate prompt strategy.
Enter a test input that will be sent to all prompts for comparison.
Click "Test All Prompts" to send your test input to all prompts simultaneously and compare responses.
Test Retrieval-Augmented Generation with your documents and additional context.
Features:
• Upload .txt and .json files
• Drag & drop file support
• Additional context input
• Real-time document preview
• Combined document + context queries
Upload .txt or .json files using the file picker or drag & drop. Each document will be included in the RAG context.
Enter additional context or API data in the context text box. This will be combined with your documents.
Enter your question in the query text box and click "Test RAG" to get answers based on your documents and context.
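One simple way to build the RAG request is to combine the documents, the additional context, and the query into a single message. The sketch below assumes that approach (prompt stuffing rather than embedding-based retrieval) and uses the Anthropic SDK with an illustrative model name; it is not the playground's exact implementation:

```typescript
import Anthropic from "@anthropic-ai/sdk";

interface RagDocument {
  name: string;    // uploaded file name (.txt or .json)
  content: string; // extracted text content
}

// Sketch: combine documents + additional context + query into one request.
export async function testRag(
  apiKey: string,
  documents: RagDocument[],
  additionalContext: string,
  query: string,
) {
  const client = new Anthropic({ apiKey, dangerouslyAllowBrowser: true });

  const documentBlock = documents
    .map((d) => `--- ${d.name} ---\n${d.content}`)
    .join("\n\n");

  const response = await client.messages.create({
    model: "claude-3-5-sonnet-20241022", // illustrative model name
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content:
          `Answer using only the material below.\n\n` +
          `Documents:\n${documentBlock}\n\n` +
          `Additional context:\n${additionalContext}\n\n` +
          `Question: ${query}`,
      },
    ],
  });

  // The first content block of a text response holds the answer.
  const first = response.content[0];
  return first.type === "text" ? first.text : "";
}
```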
.txt files: plain text documents with direct content extraction.
.json files: structured data files with JSON validation.
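Extracting content from the two formats in the browser can be sketched as follows; treating "JSON validation" as a parse-and-reserialize step is an assumption:

```typescript
// Sketch: extract text from an uploaded .txt or .json File object.
export async function extractDocumentContent(file: File): Promise<string> {
  const text = await file.text();

  if (file.name.toLowerCase().endsWith(".json")) {
    // Validate by parsing; throws on malformed JSON.
    const parsed = JSON.parse(text);
    return JSON.stringify(parsed, null, 2);
  }

  // .txt files: use the raw content directly.
  return text;
}
```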
Compare fine-tuned models with base models to evaluate the effectiveness of your training data and techniques. Test performance, consistency, and quality improvements across different scenarios.
Features:
• Side-by-side fine-tuned vs base model comparison
• Dataset management and validation tools
• Performance metrics and analysis dashboard
• Training data visualization and quality assessment
• A/B testing for fine-tuning effectiveness
• Cost-benefit analysis of fine-tuning projects
Prepare high-quality training data in the correct format for your target model. Include diverse examples that represent your specific use case and desired outputs.
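As an illustration of the expected format, the sketch below builds chat-style examples and serializes them to JSONL (one JSON object per line), which is what OpenAI's chat fine-tuning accepts; the example content is invented:

```typescript
// Sketch: chat-format training examples serialized to JSONL.
interface TrainingExample {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

const examples: TrainingExample[] = [
  {
    messages: [
      { role: "system", content: "You answer questions about our return policy." },
      { role: "user", content: "Can I return an opened item?" },
      { role: "assistant", content: "Yes, opened items can be returned within 30 days with a receipt." },
    ],
  },
  // ...more diverse examples covering your use case
];

// One JSON object per line, ready to upload as a training file.
const jsonl = examples.map((e) => JSON.stringify(e)).join("\n");
```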
Use the provider's fine-tuning API to train your model on the prepared dataset. Monitor training progress and validate results during the process.
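With OpenAI as the provider, uploading the dataset and starting a job looks roughly like this sketch (run server-side with Node, since the training file is read from disk); the file name and base model are assumptions:

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the prepared JSONL dataset.
const trainingFile = await client.files.create({
  file: fs.createReadStream("training.jsonl"), // assumed file name
  purpose: "fine-tune",
});

// Start the fine-tuning job on an assumed base model.
const job = await client.fineTuning.jobs.create({
  training_file: trainingFile.id,
  model: "gpt-4o-mini-2024-07-18",
});

// Poll to monitor progress; the job exposes its status and, once finished,
// the fine-tuned model id to use in later requests.
const current = await client.fineTuning.jobs.retrieve(job.id);
console.log(current.status, current.fine_tuned_model);
```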
Test your fine-tuned model against the base model using the same prompts and scenarios. Compare performance, quality, and consistency to measure improvement.
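A minimal comparison can reuse one prompt against both model ids, as sketched below; both model identifiers are placeholders:

```typescript
import OpenAI from "openai";

// Sketch: run the same prompt against the base and fine-tuned models.
export async function compareFineTune(apiKey: string, prompt: string) {
  const client = new OpenAI({ apiKey });
  const models = {
    base: "gpt-4o-mini-2024-07-18",                     // placeholder base model
    fineTuned: "ft:gpt-4o-mini-2024-07-18:org::abc123", // placeholder fine-tuned id
  };

  const [base, fineTuned] = await Promise.all(
    Object.values(models).map((model) =>
      client.chat.completions.create({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    ),
  );

  return {
    base: base.choices[0].message.content,
    fineTuned: fineTuned.choices[0].message.content,
  };
}
```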
Fine-tune models for specific industries like legal, medical, financial, or technical documentation with specialized terminology.
Improve chatbot responses, customer service interactions, and conversational flows with custom training data.
Optimize for specific writing styles, formats, or content types like marketing copy, technical writing, or creative content.
Enhance models for specific data analysis tasks, report generation, or business intelligence applications.
Compare responses across different LLM providers and models to find the best solution for your specific use case.
Features:
• Multi-provider support (Anthropic & OpenAI)
• Side-by-side response comparison
• Token usage tracking
• Model selection interface
• Unified prompt testing
Choose 2-4 models from the available Anthropic and OpenAI options. You can mix and match providers for comprehensive comparison.
Enter a single prompt that will be sent to all selected models. The same prompt ensures fair comparison across providers.
View side-by-side responses from all models with token usage and performance metrics for each provider.
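A stripped-down version of this comparison, using the official Anthropic and OpenAI SDKs with the keys you configured earlier, might look like the sketch below; the model names are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

interface ComparisonResult {
  model: string;
  text: string;
  inputTokens?: number;
  outputTokens?: number;
}

// Sketch: send one prompt to a Claude model and a GPT model and
// collect the responses along with token usage.
export async function compareProviders(
  keys: { anthropic: string; openai: string },
  prompt: string,
): Promise<ComparisonResult[]> {
  const anthropic = new Anthropic({ apiKey: keys.anthropic, dangerouslyAllowBrowser: true });
  const openai = new OpenAI({ apiKey: keys.openai, dangerouslyAllowBrowser: true });

  const [claude, gpt] = await Promise.all([
    anthropic.messages.create({
      model: "claude-3-5-sonnet-20241022", // illustrative model name
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
    openai.chat.completions.create({
      model: "gpt-4o-mini", // illustrative model name
      messages: [{ role: "user", content: prompt }],
    }),
  ]);

  const firstBlock = claude.content[0];
  return [
    {
      model: claude.model,
      text: firstBlock.type === "text" ? firstBlock.text : "",
      inputTokens: claude.usage.input_tokens,
      outputTokens: claude.usage.output_tokens,
    },
    {
      model: gpt.model,
      text: gpt.choices[0].message.content ?? "",
      inputTokens: gpt.usage?.prompt_tokens,
      outputTokens: gpt.usage?.completion_tokens,
    },
  ];
}
```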
Compare response quality, speed, and token efficiency across different models.
Evaluate token usage and cost-effectiveness for different providers.
Test which models perform best for specific tasks or domains.
Compare Anthropic vs OpenAI models for your specific use cases.
Test and compare different model parameters like temperature, top_p, and penalties to find the optimal settings for your specific use case.
Features:
• Predefined parameter sets (Default, Creative, Focused, Precise)
• Interactive parameter editing with modal interface
• Temperature, top_p, frequency penalty, presence penalty controls
• Side-by-side parameter comparison
• Real-time parameter adjustment
Choose 2-4 predefined parameter sets to compare. Each set has different temperature, top_p, and penalty values optimized for specific use cases.
Click the pencil icon next to any parameter set to open the editing modal. Adjust temperature, top_p, penalties, and max tokens as needed.
Write a prompt and test input, then run the test to see how different parameter settings affect the model's responses.
Default: balanced settings (temperature 0.7, top_p 1.0) for general use cases with moderate creativity and consistency.
Creative: high creativity (temperature 0.9, top_p 0.9) for brainstorming, creative writing, and innovative solutions.
Focused: low randomness (temperature 0.3, top_p 0.8) for analytical tasks, factual responses, and consistent outputs.
Precise: very controlled (temperature 0.1, top_p 0.7) for precise answers, technical documentation, and deterministic responses.
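These presets map directly onto the sampling parameters of a chat completion request. The sketch below encodes the temperature and top_p values listed above and runs the same prompt with each selected set; the penalty values, max_tokens cap, and model name are assumptions:

```typescript
import OpenAI from "openai";

// The four predefined parameter sets with the temperature/top_p values above;
// penalty and max_tokens values are illustrative assumptions.
const PARAMETER_SETS = {
  Default:  { temperature: 0.7, top_p: 1.0, frequency_penalty: 0, presence_penalty: 0 },
  Creative: { temperature: 0.9, top_p: 0.9, frequency_penalty: 0, presence_penalty: 0 },
  Focused:  { temperature: 0.3, top_p: 0.8, frequency_penalty: 0, presence_penalty: 0 },
  Precise:  { temperature: 0.1, top_p: 0.7, frequency_penalty: 0, presence_penalty: 0 },
} as const;

// Sketch: run the same prompt and test input with each selected parameter set.
export async function compareParameterSets(
  apiKey: string,
  prompt: string,
  testInput: string,
  setNames: (keyof typeof PARAMETER_SETS)[],
) {
  const client = new OpenAI({ apiKey, dangerouslyAllowBrowser: true });

  const runs = setNames.map(async (name) => {
    const params = PARAMETER_SETS[name];
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini", // illustrative model name
      max_tokens: 512,      // assumed cap
      messages: [
        { role: "system", content: prompt },
        { role: "user", content: testInput },
      ],
      ...params,
    });
    return { set: name, response: completion.choices[0].message.content };
  });

  return Promise.all(runs);
}
```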
Find the best parameter settings for your specific tasks and use cases through systematic testing.
Understand how different parameters affect response quality, creativity, and consistency.
Fine-tune model behavior for specific applications like creative writing, technical analysis, or customer support.
Compare parameter effects on response time, token usage, and overall performance metrics.