We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a principled graphical-model-based truth-inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced method Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
The proposed LLM-PeerReview consists of three steps (a minimal code sketch follows the list):
(1) Scoring: For a given query, after each LLM independently generates a response (analogous to a submitted academic paper), LLM-PeerReview applies the LLM-as-a-Judge technique (and the proposed flipped-triple scoring trick), treating each model as a reviewer to assign scores to all candidate responses;
(2) Reasoning: LLM-PeerReview then uses a truth-inference algorithm—analogous to a senior reviewer—to estimate a final score for each response. (Notably, for the LLM-PeerReview-Weighted variant, the inference algorithm runs over score information from all queries, allowing it to learn each LLM’s scoring behavior from global, dataset-level information and thereby enabling fine-grained, reliability-aware score aggregation);
(3) Selecting the best: Finally, for each query, LLM-PeerReview selects the response with the highest final score as the ensemble output—analogous to how a senior reviewer chooses the best paper from a specific submission pool.
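To make the three steps concrete, here is a minimal Python sketch of the pipeline under simplifying assumptions: `judge_score`-style callables are hypothetical wrappers around an LLM-as-a-Judge prompt (the paper's flipped-triple scoring trick is not reproduced here), and aggregation uses the plain average of the simple variant.

```python
# Minimal sketch of the LLM-PeerReview pipeline: scoring -> reasoning -> selecting.
# Assumption (not the paper's code): each judge is a callable
# judge(query, response) -> float that wraps an LLM-as-a-Judge call.
from typing import Callable, List

import numpy as np


def peer_review_select(
    query: str,
    responses: List[str],                        # one candidate response per ensembled LLM
    judges: List[Callable[[str, str], float]],   # each LLM reused as a reviewer
) -> str:
    # (1) Scoring: every judge scores every candidate response for this query.
    scores = np.array([[judge(query, resp) for resp in responses] for judge in judges])
    # scores has shape (num_judges, num_responses)

    # (2) Reasoning: aggregate the per-judge scores into one final score per response.
    # The simple variant averages; the weighted variant replaces this step with a
    # truth-inference procedure run over all queries (see the next sketch).
    final_scores = scores.mean(axis=0)

    # (3) Selecting the best: the highest-scoring response is the ensemble output.
    return responses[int(np.argmax(final_scores))]
```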
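For the weighted variant, the paper uses a graphical-model-based truth-inference algorithm run over score information from all queries. The sketch below is not that algorithm; it is a simpler iterative reweighting scheme in the same spirit, shown only to illustrate how per-judge reliabilities estimated from the whole dataset can drive the aggregation.

```python
import numpy as np


def reliability_weighted_scores(all_scores: np.ndarray, n_iters: int = 10) -> np.ndarray:
    """Aggregate scores with dataset-level judge reliabilities (illustrative only).

    all_scores: array of shape (num_judges, num_queries, num_responses)
    returns:    final scores of shape (num_queries, num_responses)
    """
    num_judges = all_scores.shape[0]
    weights = np.full(num_judges, 1.0 / num_judges)   # start from uniform reliabilities

    for _ in range(n_iters):
        # Consensus score per (query, response) under the current judge weights.
        consensus = np.tensordot(weights, all_scores, axes=1)
        # A judge that deviates more from the consensus (measured across ALL queries)
        # is treated as less reliable and down-weighted.
        mse = ((all_scores - consensus[None]) ** 2).mean(axis=(1, 2)) + 1e-8
        weights = (1.0 / mse) / (1.0 / mse).sum()

    # Final scores under the learned weights; argmax along the response axis
    # then picks each query's best response.
    return np.tensordot(weights, all_scores, axes=1)
```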
The field of artificial intelligence has recently undergone a massive transformation, driven by the emergence of Large Language Models (LLMs) such as Gemini, GPT-4, Llama, and DeepSeek. The success of these models has triggered a surge in research activity, with over 182,000 models now available on Hugging Face.
Behind this research enthusiasm, two main observations stand out: 1) Persistent performance concerns: although large language models can be easily deployed for zero-shot or in-context few-shot inference, they still face common performance issues such as limited accuracy, hallucinations, and misalignment with human goals; 2) The varying strengths and weaknesses of LLMs: these models display significant behavioral differences, driven primarily by variations in their architecture, scale, training data, vocabulary, tokenization, and training methodology. Consequently, their responses to the same prompt often diverge. With these two points in mind, and inspired by the spirit of Ensemble Learning, it is reasonable to suggest that relying on a single LLM—even one that ranks highly on public leaderboards or by other criteria—may not be the optimal strategy for every user query. Instead, it may be more advantageous to consider multiple LLM candidates (usable out of the box) simultaneously and leverage their distinct strengths. This idea is the core focus of the burgeoning field of LLM Ensemble.
As LLM Ensemble gains increasing attention, one well-established class of solutions—ensemble-after-inference (also known as post-hoc ensemble) methods—has emerged. These methods include the following two representative approaches:
When we revisit this research problem, we ask the most fundamental question: in the real world, how would humans select the best text from a set of candidate texts? Perhaps the most immediate and relatable real-world example is the academic peer-review process. Motivated by this, we propose a new, fully unsupervised LLM Ensemble method called LLM-PeerReview.
Datasets and evaluation. We evaluate on four widely used datasets, grouped into three categories: (1) Factual Recall: TriviaQA evaluates the accuracy of model responses to factual questions across various domains, including history, science, and geography. (2) Arithmetic Reasoning: GSM8k and MATH assess basic arithmetic and more advanced mathematical reasoning, respectively, with accuracy as the evaluation metric, focusing on correct numerical answers. (3) Instruction Following: AlpacaEval tests models' ability to follow diverse instructions. We use GPT-4o-mini as the judge, assessing whether each model's response is better than the reference answer in the dataset.
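As an illustration of the accuracy metric for the factual and arithmetic datasets, the snippet below checks exact match on the extracted final answer; `extract_final_answer` is a hypothetical helper (e.g., taking the last number in a GSM8k-style response) and is not the paper's evaluation code.

```python
import re
from typing import List


def extract_final_answer(response: str) -> str:
    """Hypothetical helper: take the last number in the response as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return numbers[-1] if numbers else ""


def exact_match_accuracy(responses: List[str], references: List[str]) -> float:
    """Fraction of responses whose extracted answer matches the reference's answer."""
    correct = sum(
        extract_final_answer(r) == extract_final_answer(g)
        for r, g in zip(responses, references)
    )
    return correct / max(len(responses), 1)
```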
Baselines. We compare the proposed LLM-PeerReview with two categories of baselines. (1) Single LLMs: four 7B-scale models, Llama-3.1-8B-Instruct, Mistral-7B-Instruct, Qwen2-7B-Instruct, and Qwen2.5-7B-Instruct. (2) LLM Ensemble baselines: (i) Random is a random-selection baseline that simply returns the response from a randomly chosen LLM in the ensemble; as one of the simplest ensemble strategies for large language models, it has previously been applied to dialogue tasks. (ii) Smoothie-Global, Smoothie-Local, and Agent-Forest are recently proposed, strong similarity-based ensemble methods, introduced in detail in Section 1. (iii) GaC is a representative token-level ensemble-during-inference approach: it constructs a unified vocabulary that merges the individual vocabularies of multiple LLMs, and during inference, token sampling is performed by combining the models' output distributions over this unified vocabulary.
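To give a rough sense of how a token-level ensemble like GaC differs from the selection-based baselines, the sketch below averages next-token distributions that have already been mapped onto a shared vocabulary; this illustrates the general idea only and is not GaC's actual implementation.

```python
import numpy as np


def ensemble_next_token(per_model_probs: np.ndarray) -> int:
    """Pick the next token by averaging per-model distributions over a unified vocabulary.

    per_model_probs: shape (num_models, unified_vocab_size); each row is one model's
    next-token distribution already projected onto the shared vocabulary.
    """
    averaged = per_model_probs.mean(axis=0)   # combine the models' distributions
    return int(np.argmax(averaged))           # greedy pick of the next token id
```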
Through comprehensive experiments, we reached the following conclusions:
@misc{chen2025scoringreasoningselectingbest,
title={Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process},
author={Zhijun Chen and Zeyu Ji and Qianren Mao and Hao Wu and Junhang Cheng and Bangjie Qin and Zhuoran Li and Jingzheng Li and Kai Sun and Zizhe Wang and Yikun Ban and Zhu Sun and Xiangyang Ji and Hailong Sun},
year={2025},
eprint={2512.23213},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.23213},
}