Enhancing Best-of-N Decoding by Speculative Rejection and Self-Certainty
Abstract
Controllable text generation techniques such as fine-tuning, reinforcement learning, and prompt engineering have significant potential to enhance reasoning, alignment, and efficiency in Large Language Models. However, these methods often struggle with memory management, generalization across diverse language tasks, and score function design. In contrast, enhancing the decoding process has proven to be an effective way to control generation without requiring additional training or external tools. This work proposes an improved parallel decoding strategy that not only reduces resource requirements but also effectively leverages its guiding reward function.
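To make the proposed combination concrete, the sketch below illustrates how Best-of-N decoding, speculative rejection, and self-certainty can fit together: N candidates are extended in parallel in fixed-size chunks, the partial generations are scored after each chunk, and the weakest ones are rejected early, with a self-certainty-style confidence score standing in for an external reward model. This is a minimal sketch under assumptions of our own, not the paper's implementation: the step function is a toy stand-in for an LLM forward pass, and VOCAB_SIZE, CHUNK_LEN, KEEP_FRACTION, and the simplified KL-from-uniform scoring in self_certainty are illustrative choices.

```python
# Minimal, self-contained sketch of Best-of-N decoding with speculative
# rejection, guided by a self-certainty-style confidence score instead of an
# external reward model. Illustration under stated assumptions, not the
# authors' implementation: `step` is a toy stand-in for an LLM forward pass.
import math
import random

VOCAB_SIZE = 32       # toy vocabulary size (assumption)
CHUNK_LEN = 16        # tokens generated between rejection rounds (assumption)
MAX_LEN = 64          # total generation length (assumption)
KEEP_FRACTION = 0.5   # fraction of candidates kept at each rejection round


def step(prefix, rng):
    """Toy next-token distribution. A real system would run the language
    model on `prefix` and return its softmax over the vocabulary."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_certainty(dists):
    """Confidence proxy in the spirit of self-certainty: average KL divergence
    of each predictive distribution from the uniform distribution, so peaked
    (confident) predictions score higher. The exact formula in the cited work
    may differ."""
    if not dists:
        return 0.0
    kl_terms = [sum(pi * math.log(pi * VOCAB_SIZE) for pi in p if pi > 0.0)
                for p in dists]
    return sum(kl_terms) / len(kl_terms)


def best_of_n_speculative_rejection(n=8, seed=0):
    rng = random.Random(seed)
    # Each candidate keeps its sampled tokens and the distributions it sampled from.
    candidates = [{"tokens": [], "dists": []} for _ in range(n)]
    length = 0
    while length < MAX_LEN:
        # 1) Extend every surviving candidate by one chunk. In a real system
        #    this is a single batched decoding pass on the accelerator.
        for cand in candidates:
            for _ in range(CHUNK_LEN):
                dist = step(cand["tokens"], rng)
                token = rng.choices(range(VOCAB_SIZE), weights=dist, k=1)[0]
                cand["tokens"].append(token)
                cand["dists"].append(dist)
        length += CHUNK_LEN
        # 2) Speculative rejection: score the partial generations and drop the
        #    weakest ones, so memory and compute are spent only on survivors.
        if len(candidates) > 1:
            candidates.sort(key=lambda c: self_certainty(c["dists"]), reverse=True)
            keep = max(1, int(len(candidates) * KEEP_FRACTION))
            candidates = candidates[:keep]
    # 3) Best-of-N selection among whatever survived to full length.
    best = max(candidates, key=lambda c: self_certainty(c["dists"]))
    return best["tokens"], self_certainty(best["dists"])


if __name__ == "__main__":
    tokens, score = best_of_n_speculative_rejection(n=8)
    print(f"selected candidate: length={len(tokens)}, self-certainty={score:.3f}")
```

With these defaults the candidate pool is halved after every chunk, so the full cost of keeping N sequences alive is paid only for the first chunk; an actual implementation would typically batch the surviving sequences and free the memory of rejected ones, which is what makes speculative rejection attractive at inference time.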
References
Beirami, A., Agarwal, A., Berant, J., D’Amour, A., Eisenstein, J., Nagpal, C., and Suresh, A. T. (2025). Theoretical guarantees on the best-of-n alignment policy.
Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and Hashimoto, T. B. (2024). AlpacaFarm: A simulation framework for methods that learn from human feedback.
Kang, Z., Zhao, X., and Song, D. (2025). Scalable best-of-n selection for large language models via self-certainty.
Leviathan, Y., Kalman, M., and Matias, Y. (2023). Fast inference from transformers via speculative decoding.
Snell, C., Lee, J., Xu, K., and Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
Sun, H., Haider, M., Zhang, R., Yang, H., Qiu, J., Yin, M., Wang, M., Bartlett, P., and Zanette, A. (2024). Fast best-of-n decoding via speculative rejection.
Turner, R. E. (2024). An introduction to transformers.
Wang, H. and Shu, K. (2025). Make every token count: A systematic survey on decoding methods for foundation models.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models.
Wang, Y., Zhang, P., Huang, S., Yang, B., Zhang, Z., Huang, F., and Wang, R. (2025). Sampling-efficient test-time scaling: Self-estimating the best-of-n sampling in early decoding.
Published
12/11/2025
How to Cite
GOUVÉA JUNIOR, Jose Lamir; GARCIA, Luan Fonseca; OLIVEIRA, Ewerton de; PAULA, Thomas. Enhancing Best-of-N Decoding by Speculative Rejection and Self-Certainty. In: ESCOLA REGIONAL DE APRENDIZADO DE MÁQUINA E INTELIGÊNCIA ARTIFICIAL DA REGIÃO SUL (ERAMIA-RS), 1., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 284-287. DOI: https://doi.org/10.5753/eramiars.2025.16649.