Tasks
Tasks are used to test and evaluate model responses. Each task comes with a dataset of questions and expected answers. The task generates prompts from its prompt template and data, gets the model's completion for each prompt, and then scores it using the specified scoring method.
Available Tasks
| Task Name | Score Name | Dataset |
|---|---|---|
| ParsiNLU Sentiment Analysis | Exact Match (F1) | ParsiNLU |
| ParsiNLU Entailment | Exact Match (F1) | ParsiNLU |
| ParsiNLU Machine Translation En -> Fa | Bleu | ParsiNLU |
| ParsiNLU Machine Translation Fa -> En | Bleu | ParsiNLU |
| ParsiNLU Multiple Choice | Exact Match (Accuracy) | ParsiNLU |
| ParsiNLU Reading Comprehension | Common Tokens (F1) | ParsiNLU |
| Persian NER | NER Exact Match (F1) | PersianNER |
| Persian Math | Math Equivalence (Accuracy) | Source |
| ConjNLI Entailment | Exact Match (F1) | Source |
| Persian MMLU (Khayyam Challenge) | Exact Match (Accuracy) | Khayyam Challenge |
| FarsTail Entailment | Exact Match (F1) | FarsTail |
| Persian News Summary | Rouge | PNSummary |
| XL-Sum | Rouge | XLSum |
You can import the classes of the above tasks from `parsbench.tasks` and use them to evaluate your model.
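For example (a minimal sketch; only the two task classes used in the examples on this page are shown, check the `parsbench.tasks` module for the exact class names of the other tasks):

```python
# Import task classes from parsbench.tasks; ParsiNLUMultipleChoice and
# PersianMath are the two tasks used in the examples below.
from parsbench.tasks import ParsiNLUMultipleChoice, PersianMath
```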
Evaluation
The evaluation process has 6 steps:
- Loading Data
- Loading Prompt Template
- Generating Matches (Prompt-Answer)
- Generating Completions
- Scoring Completions
- Storing Result (Optional)
Here is an example of evaluating a pre-trained model on the ParsiNLU Multiple Choice task:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from parsbench.models import PreTrainedTransformerModel
from parsbench.tasks import ParsiNLUMultipleChoice

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

tf_model = PreTrainedTransformerModel(model=model, tokenizer=tokenizer)

with ParsiNLUMultipleChoice() as task:
    results = task.evaluate(
        model=tf_model,
        prompt_lang="fa",
        prompt_shots=[0, 5],
    )
```
You should use the task in a context manager; it handles loading and offloading the dataset for better performance.
Evaluation Result
The `evaluate` function of a task returns a list of `EvaluationResult` data classes, each containing the overall score for every sub-task and n-shot prompt of the task. You can use the class directly or convert it to a Pandas DataFrame with the `to_pandas` function.
```python
eval_result = results[0]
print(eval_result.to_pandas())
```
Output:

```
model_name task_name task_category sub_task n_shots score_name score
0 qwen2:latest PersiNLU Multiple Choice knowladge math_and_logic 0 Exact Match 0.600000
1 qwen2:latest PersiNLU Multiple Choice knowladge math_and_logic 3 Exact Match 0.285714
```
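If you evaluated with several shot settings or tasks, you can merge all results into one table. This is a minimal sketch that only assumes the `to_pandas` method shown above and standard pandas:

```python
import pandas as pd

# Each EvaluationResult converts to its own DataFrame; concatenate them
# into a single table covering all evaluated shot settings.
df = pd.concat([result.to_pandas() for result in results], ignore_index=True)
print(df)
```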
Save Result
You can manually save the result using the `save` function of the `EvaluationResult`, or pass `save_evaluation=True` to the `evaluate` function.
You can also save the task matches, which contain the prompt, completion, target, and score, by passing `save_matches=True` to the `evaluate` function.
```python
from parsbench.tasks import PersianMath

with PersianMath() as task:
    results = task.evaluate(
        model=tf_model,
        prompt_lang="fa",
        prompt_shots=[0, 5],
        save_matches=True,
        save_evaluation=True,
        output_path="results/",
    )
```
The output directory structure should be like this:

```
results
└── qwen2:latest
    └── Persian_Math
        ├── evaluation.jsonl
        ├── matches_0_shot.jsonl
        └── matches_5_shot.jsonl
```
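Both file types are JSON Lines, so they can be inspected with pandas. This is a sketch assuming the directory layout above; the exact columns depend on the task:

```python
import pandas as pd

# Load the saved evaluation summary and the 0-shot matches (JSON Lines files).
evaluation = pd.read_json("results/qwen2:latest/Persian_Math/evaluation.jsonl", lines=True)
matches = pd.read_json("results/qwen2:latest/Persian_Math/matches_0_shot.jsonl", lines=True)

print(evaluation.head())
print(matches.head())
```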