
Performance‑Aware Fine‑Tuning and Inference Optimization for Large Language Model Code Generation: A Unified Framework

John R. Matthews
Department of Computer Science, Global Institute of Technology, London, United Kingdom

Published 2025-11-30

Keywords

  • Large Language Models
  • code generation
  • fine‑tuning
  • inference optimization

How to Cite

John R. Matthews. (2025). Performance‑Aware Fine‑Tuning and Inference Optimization for Large Language Model Code Generation: A Unified Framework. Stanford Database Library of American Journal of Applied Science and Technology, 5(11), 277–282. Retrieved from https://oscarpubhouse.com/index.php/sdlajast/article/view/64

Abstract

As Large Language Models (LLMs) become increasingly central to code generation, there is a critical need not only to improve the syntactic and semantic correctness of generated code but also to optimize for performance metrics such as execution efficiency, inference latency, and overall responsiveness in practical deployment scenarios. This article presents a unified framework that integrates two complementary approaches: performance‑aware fine‑tuning of LLMs for code generation and system‑level inference optimization through scheduling and firmware‑level enhancements. Drawing on recent empirical advances in learning performance‑improving code edits (Shypula et al., 2023) and efficiency‑aware fine‑tuning methods (Huang et al., 2025), we design a fine‑tuning pipeline that emphasizes generating code optimized for run‑time performance without degrading correctness. Concurrently, inspired by scheduling and preemption strategies for inference serving (Kim et al., 2024) and firmware‑level optimization approaches (2025), we incorporate an inference serving infrastructure that reduces latency and improves throughput. Through a series of controlled experiments, we demonstrate that our approach yields code that runs 15–25% faster on common benchmarks than code from baseline models, without sacrificing functional correctness, and cuts average end‑user latency by up to 30% in batch inference settings. We analyze trade‑offs and limitations, and outline a research agenda for broader adoption. The results underscore the importance of co‑designing model fine‑tuning and system‑level serving strategies to achieve real‑world performance gains.
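To make the fine‑tuning signal described above concrete, the Python sketch below shows one way a performance‑aware objective of this kind could be expressed: a candidate program is rewarded in proportion to its measured speedup over a baseline, and the reward is zeroed whenever the candidate fails its unit tests, so faster‑but‑wrong code is never reinforced. This is an illustrative sketch under assumptions, not the authors' released code; the function names (measure_runtime, passes_tests, performance_reward), the workload size, and the specific reward form are hypothetical.

    import time
    from typing import Callable, Sequence, Tuple


    def measure_runtime(program: Callable[[], object], repeats: int = 5) -> float:
        """Best-of-N wall-clock runtime of a candidate workload, in seconds."""
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            program()
            best = min(best, time.perf_counter() - start)
        return best


    def passes_tests(program: Callable[[int], int],
                     tests: Sequence[Tuple[int, int]]) -> bool:
        """Functional-correctness gate: every (input, expected) pair must match."""
        return all(program(x) == y for x, y in tests)


    def performance_reward(candidate: Callable[[int], int],
                           baseline: Callable[[int], int],
                           tests: Sequence[Tuple[int, int]],
                           workload: int = 50_000) -> float:
        """Reward = measured speedup over the baseline, zeroed if correctness is lost.

        A performance-aware fine-tuning pipeline of the kind sketched in the
        abstract could use this value to weight training examples (or as an
        RL-style reward), pushing the model toward faster code without
        sacrificing functional correctness.
        """
        if not passes_tests(candidate, tests):
            return 0.0  # never reward incorrect code, however fast it runs
        t_base = measure_runtime(lambda: [baseline(i) for i in range(workload)])
        t_cand = measure_runtime(lambda: [candidate(i) for i in range(workload)])
        return t_base / max(t_cand, 1e-9)  # > 1.0 means the candidate is faster


    if __name__ == "__main__":
        # Toy example: a linear-time baseline vs. a closed-form candidate.
        baseline = lambda n: sum(range(n % 100))              # slower reference
        candidate = lambda n: (m := n % 100) * (m - 1) // 2   # faster, same result
        tests = [(i, sum(range(i % 100))) for i in range(0, 1000, 37)]
        print(f"reward = {performance_reward(candidate, baseline, tests):.2f}")

In this toy usage the closed-form candidate earns a reward above 1.0 because it reproduces the baseline's outputs on all tests while running faster; a candidate that failed any test would receive a reward of zero regardless of speed, which mirrors the correctness-preserving constraint emphasized in the abstract.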

References

  1. Shypula, A. G.; Madaan, A.; Zeng, Y.; Alon, U.; Gardner, J. R.; Yang, Y.; Hashemi, M.; Neubig, G.; Ranganathan, P.; Bastani, O.; Yazdanbakhsh, A. Learning Performance‑Improving Code Edits. The Twelfth International Conference on Learning Representations, Oct. 2023.
  2. Huang, D.; Zeng, G.; Dai, J.; Luo, M.; Weng, H.; Qing, Y.; Cui, H.; Guo, Z.; Zhang, J. M. SwiftCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning. arXiv, Mar. 2025.
  3. Kim, K.; et al. The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving. November 2024.
  4. Reducing Latency and Enhancing Accuracy in LLM Inference through Firmware-Level Optimization. International Journal of Signal Processing, Embedded Systems and VLSI Design, 2025, 5(2), 26–36.
  5. Feng, H.; Gao, Y. Ad Placement Optimization Algorithm Combined with Machine Learning in Internet E‑Commerce. 2025.
  6. Zhang, T.; Zhang, B.; Zhao, F.; et al. COVID‑19 Localization and Recognition on Chest Radiographs based on Yolov5 and EfficientNet. Proceedings of the 7th International Conference on Intelligent Computing and Signal Processing (ICSP), 2022, 1827–1830.
  7. Gao, Z.; Tian, Y.; Lin, S. C.; et al. A CT Image Classification Network Framework for Lung Tumors Based on Pre-trained MobileNetV2 Model and Transfer Learning, and Its Application and Market Analysis in the Medical Field. arXiv preprint arXiv:2501.04996, 2025.
  8. Wang, Y.; Jia, P.; Shu, Z.; et al. Multidimensional Precipitation Index Prediction Based on CNN‑LSTM Hybrid Framework. arXiv preprint arXiv:2504.20442, 2025.