ParsBench

Overview

ParsBench is a toolkit for benchmarking Large Language Models (LLMs) on the Persian language. It includes tasks for evaluating LLMs on different topics, benchmarking tools for comparing and ranking multiple models, and an easy-to-use, fully customizable API for creating custom models, tasks, scores, and benchmarks.

Key Features

Variety of Tasks: Evaluate LLMs across various topics.
Benchmarking Tools: Compare and rank multiple models.
Customizable API: Create custom models, tasks, scores, and benchmarks with ease.
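
To make the idea of tasks, scores, and model ranking concrete, here is a minimal, self-contained sketch in plain Python. It does not use the ParsBench API: `Task`, `run_benchmark`, `echo_model`, and `constant_model` are hypothetical names invented for illustration, and exact-match accuracy is just one assumed scoring choice. See the example notebooks for how the library itself is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; these are NOT ParsBench classes.
Model = Callable[[str], str]  # maps a prompt to a completion


@dataclass
class Task:
    name: str
    samples: List[Tuple[str, str]]  # (prompt, expected answer) pairs

    def evaluate(self, model: Model) -> float:
        """Return the exact-match accuracy of `model` on this task."""
        if not self.samples:
            return 0.0
        hits = sum(model(prompt).strip() == target for prompt, target in self.samples)
        return hits / len(self.samples)


def run_benchmark(models: Dict[str, Model], tasks: List[Task]) -> List[Tuple[str, float]]:
    """Score every model on every task and rank by mean score, best first.

    Assumes `tasks` is non-empty.
    """
    ranking = []
    for model_name, model in models.items():
        scores = [task.evaluate(model) for task in tasks]
        ranking.append((model_name, sum(scores) / len(scores)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Toy stand-ins for real LLMs.
def echo_model(prompt: str) -> str:
    return prompt


def constant_model(prompt: str) -> str:
    return "4"


math_task = Task("toy-math", [("2 + 2 =", "4"), ("1 + 3 =", "4")])
copy_task = Task("toy-copy", [("سلام", "سلام"), ("4", "4")])

ranking = run_benchmark(
    {"echo": echo_model, "constant": constant_model},
    [math_task, copy_task],
)
for model_name, mean_score in ranking:
    print(f"{model_name}: {mean_score:.2f}")
```

In this sketch a model is any callable from prompt to text, so swapping in a real LLM only requires wrapping its generation call in a function with that shape.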

Motivation

I was trying to fine-tune an open-source LLM for the Persian language and needed a way to evaluate its performance and utility. That led me to research the topic and find this paper. The authors did great work preparing datasets and evaluation methods for testing ChatGPT, and they even shared their code in this repository.

So I decided to build a handy framework that includes various tasks and datasets for evaluating LLMs on the Persian language. I used some parts of their work (datasets, metrics, and basic prompt templates) in this library.

Example Notebooks

Benchmark Aya models:
Benchmark Ava models:
Benchmark Dorna models:
Benchmark MaralGPT models:

Contributing

Contributions are welcome! Please refer to the contribution guidelines for more information on how to contribute.

License

ParsBench is distributed under the Apache-2.0 license.

Contact Information

For support or questions, please contact: shahriarshm81@gmail.com