From the course: Applied AI: Building NLP Apps with Hugging Face Transformers

Evaluating Qu-An performance

- [Instructor] Let's use the SQuAD metrics in Hugging Face to evaluate the performance of Qu-An. We first import the evaluate module, which is part of Hugging Face. We then create a squad_metric object using the load method, specifying squad_v2 as the metric to load. For the purposes of demonstrating the function, we will forgo the actual inference process and instead use sample predictions and reference answers. We will use one correct answer, Paris, and three possible predictions: Paris, London, and "Paris is one of the best cities in the world." To use the squad_metric, we need to create the predictions dictionary and the references dictionary in the format shown here. We will do the evaluation individually as well as cumulatively. The squad_metric.compute method returns the evaluation results; we extract just the F1 score and print it to the console. Then we also perform cumulative evaluation and print the numbers. Let's run this code now. First, let us look at the individual numbers. For the exact-match answer, Paris, we get an F1 score of 100. For a non-matching answer, London, we get a score of zero. For a partial match, where the word Paris is part of a long sentence, we get an F1 score of about 22. This gives an idea of how the scoring works. Next, we look at the cumulative metrics across all three predictions. The exact score shows the percentage of exact matches, and the F1 score is the average F1 across all predictions. The output also provides the total count of answers and how many of them are exact matches. This gives an overview of the performance of the model across a large evaluation dataset.
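Below is a minimal sketch of the evaluation described above, assuming the Hugging Face evaluate library is installed. The id values and answer_start offsets are illustrative placeholders rather than values from the course files; the squad_v2 metric expects predictions with a prediction_text and a no_answer_probability, and references with an answers dictionary.

```python
import evaluate

# Load the SQuAD v2 metric from Hugging Face evaluate
squad_metric = evaluate.load("squad_v2")

# One correct answer and three candidate predictions
reference_text = "Paris"
candidate_predictions = [
    "Paris",                                         # exact match
    "London",                                        # no match
    "Paris is one of the best cities in the world",  # partial match
]

# References in the format the squad_v2 metric expects
references = [
    {"id": str(i), "answers": {"text": [reference_text], "answer_start": [0]}}
    for i in range(len(candidate_predictions))
]

# Predictions; squad_v2 also expects a no_answer_probability field
predictions = [
    {"id": str(i), "prediction_text": text, "no_answer_probability": 0.0}
    for i, text in enumerate(candidate_predictions)
]

# Individual evaluation: score each prediction on its own and print the F1
for pred, ref in zip(predictions, references):
    result = squad_metric.compute(predictions=[pred], references=[ref])
    print(pred["prediction_text"], "-> F1:", result["f1"])

# Cumulative evaluation across all three predictions
cumulative = squad_metric.compute(predictions=predictions, references=references)
print(cumulative)
```

Run against these samples, the individual loop should print an F1 near 100 for the exact match, 0 for the non-match, and roughly 22 for the partial match, while the cumulative result reports the overall exact-match percentage, the average F1, and the answer counts described above.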