Hi! I wanted to share LangCheck, an open source toolkit to evaluate LLM applications (GitHub, Quickstart).
It already supports English and Japanese text, with more languages coming soon – contributions welcome!
Core functionality:
langcheck.metrics – metrics to evaluate quality & structure of LLM-generated text (see the sketch after this list)
langcheck.plot – interactive visualizations of text quality
langcheck.augment – text augmentations to perturb prompts, references, etc. (coming soon)
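To give a feel for the API, here's a minimal sketch along the lines of the Quickstart. The metric name (fluency), the threshold comparison, and the langcheck.plot.scatter call reflect my reading of the docs, so treat the exact names and signatures as illustrative rather than authoritative:

    import langcheck

    # LLM outputs you want to evaluate (generated with any LLM library)
    generated_outputs = [
        "Black cat the",
        "The black cat is sitting",
        "The big black cat is sitting on the fence",
    ]

    # Score a text quality metric; the result can be viewed as a DataFrame
    fluency_values = langcheck.metrics.fluency(generated_outputs)

    # Optionally turn raw scores into pass/fail results with a threshold
    print(fluency_values > 0.5)

    # Interactive visualization via the langcheck.plot module mentioned above
    # (exact call signature may differ in the current release)
    langcheck.plot.scatter(fluency_values)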
Super open to feedback & curious how other people think about evaluation for LLM apps.
If you're open to using an open source library, you can use LangCheck to monitor and visualize text quality metrics in production.
For example, you can compute & plot the toxicity of users' prompts and LLM responses from your logs. (A very simple example here.)
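Roughly like this (a sketch only: the hardcoded lists stand in for whatever you pull from your logging pipeline, and the toxicity and plotting calls follow the same hedged pattern as the sketch above):

    import langcheck

    # In production these would be read from your logs; hardcoded for illustration
    user_prompts = [
        "How do I reset my password?",
        "Write an angry reply to this customer email.",
    ]
    llm_responses = [
        "You can reset your password from the account settings page.",
        "Dear customer, I understand your frustration...",
    ]

    # Score toxicity for both sides of the conversation (higher = more toxic)
    prompt_toxicity = langcheck.metrics.toxicity(user_prompts)
    response_toxicity = langcheck.metrics.toxicity(llm_responses)

    # Flag responses above a threshold and visualize the scores
    print(response_toxicity > 0.5)
    langcheck.plot.scatter(response_toxicity)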
(Disclaimer: I'm one of the contributors to LangCheck)