Meta working on a Self-Taught Evaluator for LLMs | InfoWorld
Facebook parent Meta’s AI research team is developing what it calls a Self-Taught Evaluator for large language models (LLMs), which could help enterprises reduce the time and human effort needed to develop custom LLMs. Earlier this month, the social media giant’s AI research team, dubbed Meta FAIR, published a paper on the technology. The paper claims that such an evaluator could let an LLM create its own training data, known as synthetic data, for evaluation purposes.
Advantages of Self-Taught Evaluator
Typically, models that are used as evaluators, known as LLM-as-a-Judge, are trained on large amounts of human-annotated data, which is costly to produce and becomes stale as the model improves, the researchers explained in the paper. Human annotation is still required or preferred over LLM responses, as the latter cannot always resolve challenging tasks such as coding or mathematics problems, the researchers said. They added that this dependency on human-generated data poses significant challenges for scaling to new tasks or evaluation criteria.
To avoid that dependency, the researchers used only synthetic data generated iteratively by an LLM, so the instructions did not need to be labeled by humans.

Training Process and Results
Starting from unlabeled instructions, the iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. The researchers started with a seed model and used prompt engineering to generate contrasting synthetic preference pairs for a given input. The model, acting as an LLM-as-a-Judge, was then used to generate reasoning traces and judgments for these pairs. Because each pair was constructed so that the preferred response was known in advance, the researchers could label each judgment as correct or incorrect. Through this process, they obtained an LLM-as-a-Judge that improves with each iteration.
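The loop described above can be sketched in Python. This is a minimal illustration, not the researchers' implementation: the names `PreferencePair`, `Judgment`, `collect_training_examples`, and `self_taught_loop` are hypothetical, the `make_pair` and `fine_tune` callables stand in for LLM prompting and model training, and keeping only judgments that match the known label is an assumption about how the correct ones are selected.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    instruction: str
    chosen: str    # response that should win, by construction of the pair
    rejected: str  # response that should lose, by construction of the pair


@dataclass
class Judgment:
    reasoning: str  # the judge's reasoning trace
    winner: str     # "A" or "B"


# Hypothetical interface: a judge compares two responses to one instruction.
Judge = Callable[[str, str, str], Judgment]
Example = Tuple[PreferencePair, str, str, Judgment]


def collect_training_examples(judge: Judge, pairs: List[PreferencePair],
                              samples_per_pair: int,
                              rng: random.Random) -> List[Example]:
    """Sample judgments and keep only those that agree with the known label."""
    kept: List[Example] = []
    for pair in pairs:
        # Randomize which slot holds the preferred response.
        chosen_is_a = rng.random() < 0.5
        a, b = ((pair.chosen, pair.rejected) if chosen_is_a
                else (pair.rejected, pair.chosen))
        expected = "A" if chosen_is_a else "B"
        for _ in range(samples_per_pair):
            judgment = judge(pair.instruction, a, b)
            if judgment.winner == expected:
                kept.append((pair, a, b, judgment))
                break  # one correct trace per pair is enough for this sketch
    return kept


def self_taught_loop(seed_judge: Judge,
                     make_pair: Callable[[str], PreferencePair],
                     fine_tune: Callable[[Judge, List[Example]], Judge],
                     instructions: List[str],
                     iterations: int,
                     samples_per_pair: int = 4,
                     seed: int = 0) -> Tuple[Judge, List[int]]:
    """Iterate: judge synthetic pairs, filter correct judgments, retrain."""
    rng = random.Random(seed)
    judge = seed_judge
    pairs = [make_pair(x) for x in instructions]  # synthetic, no human labels
    kept_counts: List[int] = []
    for _ in range(iterations):
        examples = collect_training_examples(judge, pairs, samples_per_pair, rng)
        kept_counts.append(len(examples))
        judge = fine_tune(judge, examples)  # next iteration uses the new judge
    return judge, kept_counts


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model.
    def toy_pair(x: str) -> PreferencePair:
        return PreferencePair(x, f"A careful answer to: {x}", "Unrelated text")

    def toy_judge(instruction: str, a: str, b: str) -> Judgment:
        return Judgment("Prefers the longer response.",
                        "A" if len(a) > len(b) else "B")

    _, counts = self_taught_loop(toy_judge, toy_pair, lambda j, ex: j,
                                 ["Sort a list", "Add two numbers"], iterations=2)
    print(counts)
```

In a real setting, `make_pair` would prompt the seed model to produce a good and a deliberately worse response, and `fine_tune` would train the judge on the retained reasoning traces and judgments.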
In their experiments, the researchers at Meta claimed that, without any labeled preference data, the Self-Taught Evaluator improved Llama3-70B-Instruct’s score on the RewardBench benchmark from 75.4 to 88.3.
Limitations and Considerations
Despite the promising results, the researchers acknowledged some limitations of their approach. They did not test it on smaller models, and they evaluated only accuracy, leaving computational requirements unexamined. They also noted that generative LLM-as-a-Judge models usually produce longer outputs and incur higher inference costs than reward models.

Additionally, the researchers mentioned that the approach is limited by the assumption of having a capable instruction fine-tuned model aligned to human or policy preferences from the start.