How to Use Large Language Models While Reducing Cost and Improving Performance
Researchers at Stanford have proposed a method called FrugalGPT to harness the power of large language models while significantly reducing their inference cost. It can match GPT-4’s performance while reducing cost by up to 98%.
Some highlights:
- There is a rapidly growing number of large language models (LLMs) available as commercial APIs. Using them can be very expensive: running ChatGPT reportedly costs over $700k per day, and using GPT-4 to support a small business’s customer service could cost over $21k per month.
- The cost of using different LLM APIs varies by up to two orders of magnitude. For example, processing 10 million input tokens costs about $0.20 with GPT-J but $30 with GPT-4.
- The paper proposes three strategies to reduce the cost of using LLMs while maintaining performance (a minimal code sketch of each appears after this list):
  - Prompt adaptation: using shorter prompts to reduce input length and save cost. This includes prompt selection (keeping only the few-shot examples relevant to the current query) and query concatenation (batching multiple queries into one prompt so the examples are sent only once).
  - LLM approximation: approximating expensive LLMs with smaller, cheaper models for specific tasks. This includes caching previously generated answers (a completion cache) and fine-tuning cheap models on answers produced by expensive LLMs.
  - LLM cascade: selectively choosing which LLM APIs to query based on cost and reliability. Cheaper LLMs are tried first, and expensive ones are reserved for the “hard” queries they cannot handle.
- FrugalGPT’s key technique is the LLM cascade. In experiments:
  - It matched GPT-4’s performance while cutting cost by up to 98%!
  - It improved accuracy over GPT-4 by 4% at the same cost.
  - Composing the three strategies promises even greater gains.
- In a simple cascade, affordable APIs like GPT-J and J1-L answer most queries, and GPT-4 is reserved for the hardest ones. This cuts costs drastically while maintaining performance.
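Here is a minimal sketch of prompt adaptation, assuming a generic Q&A-style prompt format. The helper names (`select_examples`, `concatenated_prompt`) and the word-overlap similarity measure are illustrative assumptions, not the paper’s implementation:

```python
# Prompt adaptation sketch (hypothetical helpers, not FrugalGPT's code).
# Similarity here is crude word overlap; the paper leaves the choice of
# relevance measure open.

def select_examples(query: str, example_pool: list[tuple[str, str]], k: int = 2):
    """Prompt selection: keep only the k few-shot examples most similar
    to the query, shrinking the prompt (and the input-token bill)."""
    query_words = set(query.lower().split())

    def overlap(example: tuple[str, str]) -> int:
        return len(query_words & set(example[0].lower().split()))

    return sorted(example_pool, key=overlap, reverse=True)[:k]

def concatenated_prompt(queries: list[str], examples: list[tuple[str, str]]) -> str:
    """Query concatenation: send one prompt for many queries, so the
    few-shot examples are paid for once instead of once per query."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(queries, start=1))
    return f"{demos}\n\nAnswer each question on its own line:\n{numbered}"
```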
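LLM approximation can be as simple as a completion cache in front of the expensive API. In the sketch below, `expensive_llm` is a hypothetical callable standing in for a paid endpoint such as GPT-4:

```python
import hashlib

class CompletionCache:
    """Completion-cache sketch: pay the expensive LLM once per distinct
    query, then serve repeats for free. `expensive_llm` is a hypothetical
    callable standing in for a paid API such as GPT-4."""

    def __init__(self, expensive_llm):
        self.llm = expensive_llm
        self.store: dict[str, str] = {}

    def _key(self, query: str) -> str:
        # Normalize lightly so trivially different strings share an entry.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key not in self.store:      # cache miss: one paid API call
            self.store[key] = self.llm(query)
        return self.store[key]         # cache hit: zero marginal cost
```

The cached (query, answer) pairs also double as training data for the paper’s other approximation idea: fine-tuning a cheap model on the expensive model’s outputs.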
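The cascade itself fits in a few lines. Note that FrugalGPT learns both which APIs to include and the acceptance thresholds from data, and its scorer is a small trained model; the `scorer` below is a hypothetical stand-in that returns a reliability estimate in [0, 1]:

```python
from typing import Callable

def cascade(query: str,
            tiers: list[tuple[str, float, Callable[[str], str]]],
            scorer: Callable[[str, str], float],
            thresholds: list[float]) -> tuple[str, float]:
    """Query the cheapest model first and escalate only when the scorer
    judges its answer unreliable. Each tier is (name, cost_per_call,
    model_fn); thresholds[i] is the acceptance cutoff for tier i.
    Returns (answer, total cost spent)."""
    spent = 0.0
    answer = ""
    for (_name, cost, model), cutoff in zip(tiers, thresholds):
        answer = model(query)
        spent += cost
        if scorer(query, answer) >= cutoff:   # confident enough: stop
            return answer, spent
    return answer, spent                      # keep the last tier's answer

# Hypothetical usage mirroring the simple two-tier cascade above:
# tiers = [("gpt-j", 0.0002, call_gpt_j), ("gpt-4", 0.03, call_gpt_4)]
# answer, cost = cascade("a hard question", tiers, my_scorer, [0.9, 0.0])
```

Setting the last threshold to 0 means the most capable model’s answer is always accepted, so every query gets some response.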