Scoring, Reasoning, and Selecting the Best!
Ensembling Large Language Models via a Peer-Review Process

Beihang University, Beijing, China

Abstract

We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response, reusing the multiple LLMs already at hand; for reasoning, we apply either a principled graphical-model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, outperforming the recent strong ensemble method Smoothie-Global by 6.9 and 7.3 percentage points, respectively.

Overview of LLM-PeerReview

Overview of the LLM-PeerReview framework

The proposed LLM-PeerReview contains three steps:
(1) Scoring: For a given query, after each LLM independently generates a response (analogous to a submitted academic paper), LLM-PeerReview applies the LLM-as-a-Judge technique (and the proposed flipped-triple scoring trick), treating each model as a reviewer to assign scores to all candidate responses;
(2) Reasoning: LLM-PeerReview then uses a truth inference algorithm—analogous to a senior reviewer—to estimate a final score for each response. (Notably, for the variant LLM-PeerReview-Weighted, the inference algorithm is performed using score information across all queries, allowing the model to learn each LLM’s scoring behavior using global information from the dataset, thereby enabling fine-grained, reliability-aware score aggregation);
(3) Selecting the best: Finally, for each query, LLM-PeerReview selects the response with the highest final score as the ensemble output—analogous to how a senior reviewer chooses the best paper from a specific submission pool.
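
To make the three stages concrete, below is a minimal Python sketch of the Average variant, wired to the four models used in our experiments. The generate and judge wrappers are placeholders for whatever inference API is available, and the scoring scale is an illustrative assumption rather than the exact prompt setup from the paper.

# Minimal sketch of the LLM-PeerReview pipeline (Average variant).
# `generate` and `judge` are placeholder wrappers around an LLM API;
# the scoring scale and prompt wording are illustrative assumptions.
from statistics import mean

MODELS = ["Llama-3.1-8B-Instruct", "Mistral-7B-Instruct",
          "Qwen2-7B-Instruct", "Qwen2.5-7B-Instruct"]

def generate(model: str, query: str) -> str:
    """Call `model` on `query` and return its response (placeholder)."""
    raise NotImplementedError

def judge(reviewer: str, query: str, response: str) -> float:
    """Ask `reviewer` to rate `response` on a fixed scale, e.g. 1-5 (placeholder)."""
    raise NotImplementedError

def peer_review_average(query: str) -> str:
    # (1) Scoring: every model answers, then every model reviews every answer.
    responses = {m: generate(m, query) for m in MODELS}
    scores = {m: [judge(r, query, resp) for r in MODELS]
              for m, resp in responses.items()}
    # (2) Reasoning: aggregate the per-reviewer scores (simple averaging here).
    final = {m: mean(s) for m, s in scores.items()}
    # (3) Selecting the best: return the highest-scoring candidate response.
    return responses[max(final, key=final.get)]

The Weighted variant replaces the averaging step with a truth inference procedure that uses score information across all queries, as described in step (2) above.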

Motivation / Introduction

The artificial intelligence domain has undergone a massive transformation recently, driven by the emergence of Large Language Models (LLMs) such as Gemini, GPT-4, Llama, and DeepSeek. The success of these models has triggered a surge in research activity, with over 182,000 models now available on Hugging Face.

Behind this research enthusiasm, two main observations stand out: 1) Persistent performance concerns: although large language models can be easily deployed for zero-shot or in-context few-shot inference, they still face common performance issues such as limited accuracy, hallucinations, and misalignment with human goals. 2) Varying strengths and weaknesses across LLMs: these models display significant behavioral differences, driven primarily by variations in their architectures, scales, training data, vocabularies, tokenization, and training methodology; consequently, their responses to the same prompt often diverge. With these two points in mind, and inspired by the spirit of Ensemble Learning, relying on a single LLM, even one ranked highly on public leaderboards or by other criteria, may not be the optimal strategy for every user query. Instead, it can be more advantageous to consider multiple off-the-shelf LLM candidates simultaneously and leverage their distinct strengths. This idea is the core focus of the burgeoning field of LLM Ensemble.

As LLM Ensemble gains increasing attention, one well-established class of solutions—ensemble-after-inference (also known as post-hoc ensemble) methods—has emerged. These methods include the following two representative approaches:

  • Selection-then-regeneration approaches rely heavily on curated task-specific data and require fine-tuning additional models, which severely restricts their generalization and adaptability across different domains.
  • Similarity-based selection approaches rely on coarse-grained designs and naive strategies (e.g., shallow surface metrics such as BLEU), failing to exploit the deep semantic information required for optimal selection; a minimal sketch of such a baseline is given after this list.
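
For contrast, a typical similarity-based selection baseline can be sketched in a few lines: it returns the candidate that is most similar, on average, to the other candidates. The token-overlap score below stands in for a shallow surface metric such as BLEU; this is a generic illustration, not a reimplementation of any specific published baseline.

# Generic sketch of a similarity-based selection baseline: pick the response
# that agrees most (on average) with the other candidates, using a shallow
# surface-overlap score as a stand-in for metrics like BLEU.

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def select_by_similarity(responses: list[str]) -> str:
    def avg_sim(i: int) -> float:
        sims = [token_overlap(responses[i], r)
                for j, r in enumerate(responses) if j != i]
        return sum(sims) / max(len(sims), 1)
    return responses[max(range(len(responses)), key=avg_sim)]

Such surface-level agreement can miss the semantically best answer, which is exactly the limitation that motivates model-based judging in LLM-PeerReview.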

When we revisit this research problem, we ask the most fundamental question: in the real world, how would humans select the best text from a set of candidate texts? Perhaps the most immediate and relatable real-world example is the academic peer-review process. Motivated by this, we propose a new, fully unsupervised LLM Ensemble method called LLM-PeerReview.

Experiment Setup

Datasets and evaluation. We evaluate on four widely used datasets, grouped into three categories: (1) Factual Recall: TriviaQA evaluates the accuracy of model responses to factual questions across various domains, including history, science, and geography. (2) Arithmetic Reasoning: GSM8K and MATH assess basic arithmetic and more advanced mathematical reasoning, respectively, with accuracy as the evaluation metric, focusing on correct numerical answers. (3) Instruction Following: AlpacaEval tests models' ability to follow diverse instructions; we use GPT-4o-mini as the evaluator, judging whether a model's response is better than the reference answer provided in the dataset.

Baselines. We compare the proposed LLM-PeerReview against two categories of baselines. (1) Single LLMs: four 7B/8B-scale models, Llama-3.1-8B-Instruct, Mistral-7B-Instruct, Qwen2-7B-Instruct, and Qwen2.5-7B-Instruct. (2) LLM Ensemble baselines: (i) Random is a random-selection baseline that simply returns the response of a randomly chosen LLM in the ensemble; as one of the simplest ensemble strategies for large language models, it has previously been applied to dialogue tasks. (ii) Smoothie-Global, Smoothie-Local, and Agent-Forest are recently proposed, strong similarity-based ensemble methods, introduced in detail in Section 1 of the paper. (iii) GaC is a representative token-level ensemble-during-inference approach: it constructs a unified vocabulary that merges the individual vocabularies of multiple LLMs, and during inference samples each token from the models' combined output distributions over this unified vocabulary.
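
To illustrate the token-level idea behind GaC (without the tokenizer-alignment machinery a real implementation needs), a conceptual sketch might average next-token distributions over a unified string-level vocabulary; the function names below are assumptions for illustration only.

# Conceptual sketch of token-level ensembling over a unified vocabulary:
# map each model's next-token distribution to token strings, average, and pick.
# Real systems such as GaC must align tokenizers carefully; this is only an outline.
from collections import defaultdict

def merge_distributions(dists: list[dict[str, float]]) -> dict[str, float]:
    """Each dict maps a token string to its probability under one model."""
    merged: dict[str, float] = defaultdict(float)
    for dist in dists:
        for token, prob in dist.items():
            merged[token] += prob / len(dists)
    return dict(merged)

def next_token(dists: list[dict[str, float]]) -> str:
    merged = merge_distributions(dists)
    return max(merged, key=merged.get)  # greedy choice over the unified vocabulary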

Results

Through comprehensive experiments, we reached the following conclusions:

Our results indicate that both of our variants consistently outperform every single LLM and all LLM Ensemble baselines across all datasets. In terms of average performance, our two variants (67.4% and 67.8%) surpass the strongest single model, Qwen2.5-7B-Instruct, by 4.7 and 5.1 percentage points, respectively, and outperform the strongest ensemble baseline, Smoothie-Global, by 6.9 and 7.3 percentage points. These results directly demonstrate the effectiveness of our method: it achieves superior performance by integrating the collective knowledge of multiple models across factual-recall QA, mathematical reasoning, and instruction-following tasks. Moreover, the ensemble task on these four datasets is challenging, since the performance of the four LLMs varies significantly from dataset to dataset; ensembling four LLMs with similar performance would make it easier to achieve superior results compared to any single LLM.
Analysis of individual LLM performance (such as through radar charts or win-tie-loss comparisons on the challenging AlpacaEval dataset) highlights that models with the best overall performance may underperform on specific tasks compared to those with weaker overall results. In summary, our experimental data demonstrates that a strong LLM does not excel across all datasets. Each model has its strengths and weaknesses, highlighting the substantial practical significance of LLM Ensemble.
When evaluating the performance of using a single LLM as a judge to select the optimal response, we observe that these variants perform quite well (surpassing the overall best model, Qwen2.5, in 3/4 cases). However, when comparing the performance of these variants with that of our prototype LLM-PeerReview-Average, it becomes clear that aggregating and averaging the scores from multiple judges is highly beneficial, compared to relying solely on the score of a single large model.
We find that LLM-PeerReview-Weighted leads to further performance gains compared to simple averaging. Furthermore, subtle variations in the learned transition matrices for each model, combined with positive correlation coefficients observed in our analysis, demonstrate that our method can effectively identify stronger and weaker judges.
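As a rough illustration of reliability-aware aggregation, the sketch below alternates between estimating consensus scores and re-weighting reviewers by their agreement with that consensus across all queries. The actual Weighted variant fits a graphical model with per-reviewer transition matrices; this simplified loop only conveys the intuition.

# Simplified stand-in for truth inference: alternate between (a) consensus scores
# under current reviewer weights and (b) reviewer weights based on agreement with
# that consensus, using score information from all queries.
import numpy as np

def weighted_aggregation(scores: np.ndarray, iters: int = 20):
    """scores[q, r, c]: score that reviewer r assigned to candidate c on query q."""
    n_reviewers = scores.shape[1]
    weights = np.ones(n_reviewers) / n_reviewers
    for _ in range(iters):
        # Weighted consensus score for every (query, candidate) pair.
        consensus = np.tensordot(weights, scores, axes=([0], [1]))       # (Q, C)
        # Reviewers that track the consensus closely receive larger weights.
        err = ((scores - consensus[:, None, :]) ** 2).mean(axis=(0, 2))  # (R,)
        weights = 1.0 / (err + 1e-8)
        weights /= weights.sum()
    return consensus, weights  # argmax over consensus[q] picks the ensemble output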
It is intuitive that, in addition to our recommended flipped-triple scoring strategy, several variant scoring strategies could be employed. Overall, the performance of the four variants follows the order quadruple-half > flipped-triple > double > single, and the de-biased strategies quadruple-half, flipped-triple, and double all offer a noticeable de-biasing advantage over the single-scoring strategy. On the other hand, in terms of theoretical computational complexity, the de-biased strategies double, flipped-triple, and quadruple-half cost O(J²), O(J), and O(J!), respectively, so flipped-triple has the lowest complexity. Furthermore, in measured scoring efficiency, flipped-triple is also the most time-efficient of the de-biased strategies, compared to double and quadruple-half.
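The general de-biasing idea behind these multi-pass strategies can be illustrated with a simple two-ordering version: score the candidates once, score them again with their order in the judging prompt flipped, and average. This is only a generic sketch of position de-biasing under an assumed judge_fn interface; the exact single, double, flipped-triple, and quadruple-half strategies differ in how many orderings and scoring passes they use.

# Generic position de-biasing for LLM-as-a-Judge scoring: score the candidates in
# two different prompt orders and average, so no response benefits from its slot.
# `judge_fn(query, ordered_responses)` is an assumed interface returning one score
# per response in the presented order.

def debiased_scores(judge_fn, query: str, responses: list[str]) -> list[float]:
    forward = judge_fn(query, responses)
    backward = list(reversed(judge_fn(query, list(reversed(responses)))))
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]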
We further analyzed how the number of scoring levels influences the performance of our method and of the four individual scoring models. For each scoring scale, we crafted meaningful level descriptions and corresponding prompts. Using the basic variant LLM-PeerReview-Average for this analysis, our method shows only slight variation in performance, with no consistent tendency across scales of 3, 5, 7, and 10 levels.
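For reference, a scoring prompt for the 5-level setting might look like the hypothetical rubric below; the wording is ours, not the paper's exact prompt.

# Hypothetical 5-level scoring rubric (illustrative wording only).
RUBRIC_5_LEVELS = """You are reviewing a candidate answer to a question.
Rate the answer on a scale of 1 to 5:
5 = fully correct, complete, and well explained
4 = correct with minor omissions
3 = partially correct or incomplete
2 = mostly incorrect but relevant
1 = incorrect or off topic
Question: {query}
Candidate answer: {response}
Reply with a single integer from 1 to 5."""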

BibTeX


@misc{chen2025scoringreasoningselectingbest,
      title={Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process}, 
      author={Zhijun Chen and Zeyu Ji and Qianren Mao and Hao Wu and Junhang Cheng and Bangjie Qin and Zhuoran Li and Jingzheng Li and Kai Sun and Zizhe Wang and Yikun Ban and Zhu Sun and Xiangyang Ji and Hailong Sun},
      year={2025},
      eprint={2512.23213},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.23213}, 
}