Enhancing Best-of-N Decoding by Speculative Rejection and Self-Certainty
Abstract
Controllable text generation techniques such as fine-tuning, reinforcement learning, and prompt engineering have significant potential to enhance reasoning, alignment, and efficiency in Large Language Models. However, these methods often struggle with memory management, generalization across diverse language tasks, and score function design. In contrast, enhancing the decoding process has proven to be an effective way to control generation without requiring additional training or external tools. This work proposes an improved parallel decoding strategy that not only reduces resource requirements but also effectively leverages its guiding reward function.
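To make the proposed combination concrete, the sketch below illustrates how Best-of-N decoding, speculative rejection, and self-certainty can fit together: N candidates are extended in parallel in fixed-size chunks, the partial generations are scored after each chunk, and the weakest ones are rejected early, with a self-certainty-style confidence score standing in for an external reward model. This is a minimal sketch under assumptions of our own, not the paper's implementation: the step function is a toy stand-in for an LLM forward pass, and VOCAB_SIZE, CHUNK_LEN, KEEP_FRACTION, and the simplified KL-from-uniform scoring in self_certainty are illustrative choices.

```python
# Minimal, self-contained sketch of Best-of-N decoding with speculative
# rejection, guided by a self-certainty-style confidence score instead of an
# external reward model. Illustration under stated assumptions, not the
# authors' implementation: `step` is a toy stand-in for an LLM forward pass.
import math
import random

VOCAB_SIZE = 32       # toy vocabulary size (assumption)
CHUNK_LEN = 16        # tokens generated between rejection rounds (assumption)
MAX_LEN = 64          # total generation length (assumption)
KEEP_FRACTION = 0.5   # fraction of candidates kept at each rejection round


def step(prefix, rng):
    """Toy next-token distribution. A real system would run the language
    model on `prefix` and return its softmax over the vocabulary."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_certainty(dists):
    """Confidence proxy in the spirit of self-certainty: average KL divergence
    of each predictive distribution from the uniform distribution, so peaked
    (confident) predictions score higher. The exact formula in the cited work
    may differ."""
    if not dists:
        return 0.0
    kl_terms = [sum(pi * math.log(pi * VOCAB_SIZE) for pi in p if pi > 0.0)
                for p in dists]
    return sum(kl_terms) / len(kl_terms)


def best_of_n_speculative_rejection(n=8, seed=0):
    rng = random.Random(seed)
    # Each candidate keeps its sampled tokens and the distributions it sampled from.
    candidates = [{"tokens": [], "dists": []} for _ in range(n)]
    length = 0
    while length < MAX_LEN:
        # 1) Extend every surviving candidate by one chunk. In a real system
        #    this is a single batched decoding pass on the accelerator.
        for cand in candidates:
            for _ in range(CHUNK_LEN):
                dist = step(cand["tokens"], rng)
                token = rng.choices(range(VOCAB_SIZE), weights=dist, k=1)[0]
                cand["tokens"].append(token)
                cand["dists"].append(dist)
        length += CHUNK_LEN
        # 2) Speculative rejection: score the partial generations and drop the
        #    weakest ones, so memory and compute are spent only on survivors.
        if len(candidates) > 1:
            candidates.sort(key=lambda c: self_certainty(c["dists"]), reverse=True)
            keep = max(1, int(len(candidates) * KEEP_FRACTION))
            candidates = candidates[:keep]
    # 3) Best-of-N selection among whatever survived to full length.
    best = max(candidates, key=lambda c: self_certainty(c["dists"]))
    return best["tokens"], self_certainty(best["dists"])


if __name__ == "__main__":
    tokens, score = best_of_n_speculative_rejection(n=8)
    print(f"selected candidate: length={len(tokens)}, self-certainty={score:.3f}")
```

With these defaults the candidate pool is halved after every chunk, so the full cost of keeping N sequences alive is paid only for the first chunk; an actual implementation would typically batch the surviving sequences and free the memory of rejected ones, which is what makes speculative rejection attractive at inference time.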
References
Beirami, A., Agarwal, A., Berant, J., D’Amour, A., Eisenstein, J., Nagpal, C., and Suresh, A. T. (2025). Theoretical guarantees on the best-of-n alignment policy.
Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and Hashimoto, T. B. (2024). AlpacaFarm: A simulation framework for methods that learn from human feedback.
Kang, Z., Zhao, X., and Song, D. (2025). Scalable best-of-n selection for large language models via self-certainty.
Leviathan, Y., Kalman, M., and Matias, Y. (2023). Fast inference from transformers via speculative decoding.
Snell, C., Lee, J., Xu, K., and Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
Sun, H., Haider, M., Zhang, R., Yang, H., Qiu, J., Yin, M., Wang, M., Bartlett, P., and Zanette, A. (2024). Fast best-of-n decoding via speculative rejection.
Turner, R. E. (2024). An introduction to transformers.
Wang, H. and Shu, K. (2025). Make every token count: A systematic survey on decoding methods for foundation models.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models.
Wang, Y., Zhang, P., Huang, S., Yang, B., Zhang, Z., Huang, F., and Wang, R. (2025). Sampling-efficient test-time scaling: Self-estimating the best-of-n sampling in early decoding.
Published
12/11/2025
How to Cite
GOUVÉA JUNIOR, Jose Lamir; GARCIA, Luan Fonseca; OLIVEIRA, Ewerton de; PAULA, Thomas. Enhancing Best-of-N Decoding by Speculative Rejection and Self-Certainty. In: ESCOLA REGIONAL DE APRENDIZADO DE MÁQUINA E INTELIGÊNCIA ARTIFICIAL DA REGIÃO SUL (ERAMIA-RS), 1., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 284-287. DOI: https://doi.org/10.5753/eramiars.2025.16649.