# Benchmarks

Using benchmarks, you can evaluate different models on different tasks and compare their scores.
## Custom Benchmark

You can easily create a benchmark with your desired tasks and models:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from parsbench.benchmarks import CustomBenchmark
from parsbench.models import OpenAIModel, PreTrainedTransformerModel
from parsbench.tasks import (
    ParsiNLUMultipleChoice,
    ParsiNLUReadingComprehension,
    PersianMath,
)

# Create models
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")
qwen2_model = PreTrainedTransformerModel(model=model, tokenizer=tokenizer)

# Any OpenAI-compatible server works; here, a local Ollama instance.
aya_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="aya:latest",
)

# Run benchmark
benchmark = CustomBenchmark(
    models=[qwen2_model, aya_model],
    tasks=[
        ParsiNLUMultipleChoice,
        ParsiNLUReadingComprehension,
        PersianMath,
    ],
)
result = benchmark.run(
    prompt_lang="fa",     # language of the prompt templates
    prompt_shots=[0, 3],  # evaluate with 0-shot and 3-shot prompts
    n_first=100,          # only the first 100 samples of each task
    sort_by_score=True,
)
```
## Full Benchmark

To benchmark your model on all tasks available in the framework, use the `load_all_tasks` function:
```python
from parsbench.benchmarks import CustomBenchmark
from parsbench.models import OpenAIModel
from parsbench.tasks.utils import load_all_tasks

aya_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="aya:latest",
)

# Run benchmark on every task the framework provides
benchmark = CustomBenchmark(
    models=[aya_model],
    tasks=load_all_tasks(),
)
result = benchmark.run(
    prompt_lang="fa",
    prompt_shots=[0, 3],
    n_first=100,
    sort_by_score=True,
)
```
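If you want most tasks but not all of them, you can filter the task list before passing it to the benchmark. A minimal sketch, assuming `load_all_tasks` returns a plain list of task classes:

```python
from parsbench.tasks import PersianMath
from parsbench.tasks.utils import load_all_tasks

# Hypothetical filtering example: keep every task except PersianMath.
tasks = [task for task in load_all_tasks() if task is not PersianMath]
```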
## Benchmark Result

The benchmark result contains all evaluation results for each model. You can use it directly or convert it to a pandas DataFrame with the `to_pandas` function. If you want a pivot table of the benchmark result, use `to_pandas(pivot=True)`:
```python
print(result.to_pandas(pivot=True))
```
Output should be like:

```
                                                                                  score
model_name                                                                 qwen2:latest
n_shots                                                                               0         3
task_category task_name                      sub_task         score_name
classic       ParsiNLU Reading Comprehension NaN              Common Tokens    0.46231  0.588274
knowledge     ParsiNLU Multiple Choice       common_knowledge Exact Match      0.30000  0.000000
                                             literature       Exact Match      0.20000  0.428571
                                             math_and_logic   Exact Match      0.60000  0.285714
math          Persian Math                   NaN              Math Equivalence 0.00000  0.142857
```
Note: The output is rendered more nicely in a Jupyter notebook.
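Since `to_pandas` returns a regular pandas DataFrame, you can post-process it with standard pandas methods, for example to export the scores:

```python
# Plain pandas from here on; no ParsBench-specific API involved.
df = result.to_pandas(pivot=True)
df.to_csv("benchmark_scores.csv")
```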
## Radar Plot (Spider Plot)

For a better comparison of model performance across tasks, you can use `show_radar_plot` to visualize the benchmark:
```python
result.show_radar_plot()
```
The output is a radar (spider) plot of each model's scores per task.
## Save Result

To save the matches, evaluations, and benchmark results during the benchmarking process, set `save_matches`, `save_evaluation`, and `save_benchmark`:
```python
from parsbench.tasks import FarsTailEntailment, PersianMath

# Reuses aya_model and qwen2_model from the earlier example
benchmark = CustomBenchmark(
    models=[aya_model, qwen2_model],
    tasks=[PersianMath, FarsTailEntailment],
)
result = benchmark.run(
    prompt_lang="fa",
    prompt_shots=[0, 5],
    n_first=100,
    save_matches=True,     # save per-sample matches for each shot setting
    save_evaluation=True,  # save per-task evaluation results
    save_benchmark=True,   # save the aggregated benchmark result
    output_path="results",
    sort_by_score=True,
)
```
The output directory structure should look like this:

```
results
├── aya:latest
│   ├── FarsTail_Entailment
│   │   ├── evaluation.jsonl
│   │   ├── matches_0_shot.jsonl
│   │   └── matches_5_shot.jsonl
│   └── Persian_Math
│       ├── evaluation.jsonl
│       ├── matches_0_shot.jsonl
│       └── matches_5_shot.jsonl
├── qwen2:latest
│   ├── FarsTail_Entailment
│   │   ├── evaluation.jsonl
│   │   ├── matches_0_shot.jsonl
│   │   └── matches_5_shot.jsonl
│   └── Persian_Math
│       ├── evaluation.jsonl
│       ├── matches_0_shot.jsonl
│       └── matches_5_shot.jsonl
└── benchmark.jsonl
```
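The saved files are JSON Lines, so you can inspect them with the Python standard library alone. A minimal sketch (the record schema is ParsBench-specific, so treat the contents as opaque here):

```python
import json
from pathlib import Path

# Read one saved matches file; each line is a JSON record.
path = Path("results/aya:latest/Persian_Math/matches_0_shot.jsonl")
with path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)  # inspect the raw match record
```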