Rich Data Versus Quantity of Data in Code Generation AI: A Paradigm Shift for Healthcare
DOI: https://doi.org/10.30953/bhty.v8.396

Keywords: Code generative AI, Gen AI in healthcare, large language models, LLM, rich data, self-evolving software, SES, software engineering

Abstract
This paper evaluates the critical trade-offs between "rich data" and "data quantity" approaches in Code Generation AI (Code Gen AI) and autonomous code agents, particularly in high-integrity sectors like healthcare. While Code Gen AI can enhance productivity by up to 55% in controlled environments, systems trained on unfiltered, large-scale datasets often increase code duplication, churn, and error rates. The paper demonstrates that in sectors where accuracy, auditability, and privacy are paramount, data richness consistently outperforms brute-force scaling strategies. Using Self-Evolving Software (SES) as a case study, the research contrasts outcomes from both paradigms and proposes a weighted matrix for data selection in Code Gen AI systems. The findings show that rich, curated, domain-specific datasets produce more reliable, compliant, and sustainable code with significantly reduced technical debt, particularly in regulated environments where quality and ethical considerations are essential. The paper concludes with best practice guidelines for implementing Code Gen AI in sensitive sectors, emphasizing that sustainable software engineering requires scaling ethically and intelligently rather than indiscriminately.
References
GitClear. The Impact of AI on Code Duplication, Churn and Defects [Internet]. 2024 [cited 2025 Apr 20]. Available from: https://arc.dev/talent-blog/impact-of-ai-on-code/
EU AI Act & GDPR Regulations [Internet]. [cited 2025 Apr 20]. Available from: https://artificialintelligenceact.eu/
Tantithamthavorn C, et al. Explainable AI for SE. IEEE Software. 2023.
Self-Evolving Software [Internet]. [cited 2025 Apr 20]. Available from: https://www.selfevolvingsoftware.com
Brundage M, et al. Lessons learned on language model safety and misuse [Internet]. OpenAI; 2022. Available from: https://openai.com/index/language-model-safety-and-misuse/
SES Overview. Internal Whitepaper; 2024.
Nijkamp E, Zhao J, Poesia G, Xiong C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2303.17568. 2023.
Perry N, et al. Do users write more insecure code with AI assistants? arXiv preprint arXiv:2211.03622. 2022.
Bell E. Generative AI vs. Large Language Models (LLMs): What's the Difference? [Internet]. 2024 [cited 2025 Apr 20]. Available from: https://appian.com/blog/acp/process-automation/generative-ai-vs-large-language-models
IBM. AI code-generation software: What it is and how it works? [Internet]. 2023 [cited 2025 Apr 20]. Available from: https://www.ibm.com/think/topics/ai-code-generation
American Psychological Association. Publication manual of the American Psychological Association. 7th ed. 2019. https://doi.org/10.1037/0000165-000
Bommasani R, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021.
Brown TB, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901.
Chen M, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021.
Fried D, et al. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999. 2022.
Li Y, et al. Competition-level code generation with AlphaCode. Science. 2022;378(6624):1092-1097. https://doi.org/10.1126/science.abq1158
Sherje N. Enhancing software development efficiency through AI-powered code generation. Res J Comput Syst Eng. 2024;5(1):1-12.
License
Copyright (c) 2025 Muthu Ramachandran, PhD, Steven Fouracre

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors retain copyright of their work, with first publication rights granted to Blockchain in Healthcare Today (BHTY).