Evaluating Text Generation in Large Language Models | by Mina Ghashami | Jan, 2024


Metrics to measure the gap between neural text and human text


Recently, large language models have shown a tremendous ability to generate human-like text. There are many metrics that measure how close a text generated by a large language model is to a human-written reference text. In fact, bridging this gap is an active area of research.

In this post, we look into two well-known metrics for automatically evaluating machine-generated text.

Suppose you are given a reference text that is human-generated and a candidate text that is generated by an LLM. To compute the semantic similarity between these two texts, BERTScore computes the pairwise cosine similarity of their token embeddings. See the image below:

Image from [1]

Here the reference text is “the weather is cold today” and the candidate text, which is machine-generated, is “it is freezing today”. If we compute n-gram similarity, these two texts will get a low score because they share very few words. However, we know they are semantically very similar. So BERTScore computes the contextual embedding of each token in both the reference text and the candidate text, and then, based on these embedding vectors, it computes the pairwise cosine similarities.

Image from [1]
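To make the pairwise step concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are made-up placeholders, not real BERT outputs (a real model would produce contextual embeddings with hundreds of dimensions), and the helper names `cosine` and `pairwise_cosine` are hypothetical, chosen only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(ref_emb, cand_emb):
    """Rows index reference tokens, columns index candidate tokens."""
    return [[cosine(r, c) for c in cand_emb] for r in ref_emb]

# Made-up toy "embeddings" (an assumption for illustration only).
ref_tokens = ["the", "weather", "is", "cold", "today"]
ref_emb = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 1.0],
    [0.3, 0.0, 0.9],
]
cand_tokens = ["it", "is", "freezing", "today"]
cand_emb = [
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.3, 1.0],
    [0.3, 0.0, 0.9],
]

sim = pairwise_cosine(ref_emb, cand_emb)

# One row per reference token, one column per candidate token.
for token, row in zip(ref_tokens, sim):
    print(f"{token:>8}", [round(x, 2) for x in row])
```

In this toy setup, “cold” and “freezing” were deliberately given nearby vectors, so their entry in the matrix is high even though the two words share no n-grams.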

Based on the pairwise cosine similarities, we can compute precision, recall and F1 score as follows:

  • Recall: for every token in the reference text, we take its maximum cosine similarity with any candidate token, then average these maxima
  • Precision: for every token in the candidate text, we take its maximum cosine similarity with any reference token, then average these maxima
  • F1 score: the harmonic mean of precision and recall
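The three definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference implementation from [1]: the function name `bertscore_from_similarity` is hypothetical, and the small similarity matrix is made up, with rows standing for reference tokens and columns for candidate tokens.

```python
def bertscore_from_similarity(sim):
    """sim[i][j] is the cosine similarity of reference token i and candidate token j."""
    # Recall: best match for each reference token (row-wise max), averaged.
    recall = sum(max(row) for row in sim) / len(sim)
    # Precision: best match for each candidate token (column-wise max), averaged.
    columns = list(zip(*sim))
    precision = sum(max(col) for col in columns) / len(columns)
    # F1: harmonic mean of precision and recall (assumes precision + recall != 0).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up similarities: 3 reference tokens (rows) x 2 candidate tokens (columns).
sim = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.3],
]

precision, recall, f1 = bertscore_from_similarity(sim)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here the third reference token has no good match in the candidate, which pulls recall down while precision stays high: every candidate token found a close reference token, but not every reference token was covered.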

BERTScore [1] also proposes a modification to the above score called “importance weighting”. Importance weighting takes into account the fact that rare words which are shared between the two sentences are more…

