Rich Data Versus Quantity of Data in Code Generation AI: A Paradigm Shift for Healthcare

Authors

  • Muthu Ramachandran, PhD, Research Consultant at Forti5 Tech and at Self-Evolving Software (SES) Systems Group, London, UK; Professor Extraordinarius at University of South Africa (Unisa), Pretoria, South Africa https://orcid.org/0000-0002-5303-3100
  • Steven Fouracre, SES Systems Group

DOI:

https://doi.org/10.30953/bhty.v8.396

Keywords:

Code generative AI, Gen AI in healthcare, large language models, LLM, rich data, self-evolving software, SES, software engineering

Abstract

This paper evaluates the critical trade-offs between "rich data" and "data quantity" approaches in Code Generation AI (Code Gen AI) and autonomous code agents, particularly in high-integrity sectors like healthcare. While Code Gen AI can enhance productivity by up to 55% in controlled environments, systems trained on unfiltered, large-scale datasets often increase code duplication, churn, and error rates. The paper demonstrates that in sectors where accuracy, auditability, and privacy are paramount, data richness consistently outperforms brute-force scaling strategies. Using Self-Evolving Software (SES) as a case study, the research contrasts outcomes from both paradigms and proposes a weighted matrix for data selection in Code Gen AI systems. The findings show that rich, curated, domain-specific datasets produce more reliable, compliant, and sustainable code with significantly reduced technical debt, particularly in regulated environments where quality and ethical considerations are essential. The paper concludes with best practice guidelines for implementing Code Gen AI in sensitive sectors, emphasizing that sustainable software engineering requires scaling ethically and intelligently rather than indiscriminately.
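The abstract's proposed weighted matrix for data selection can be illustrated with a minimal sketch. Note that the criteria, weights, and dataset scores below are hypothetical placeholders for illustration, not values taken from the paper itself:

```python
# Hypothetical sketch of a weighted data-selection matrix for Code Gen AI
# training data. All criteria, weights, and scores are illustrative only.

CRITERIA_WEIGHTS = {
    "domain_relevance": 0.30,    # fit to the target healthcare domain
    "auditability": 0.25,        # provenance and traceability of samples
    "privacy_compliance": 0.25,  # e.g., GDPR / EU AI Act alignment
    "volume": 0.20,              # sheer quantity of data
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one weighted value."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

# Two hypothetical candidate datasets: a small curated clinical corpus
# versus a large unfiltered public-code scrape.
rich_curated = {"domain_relevance": 0.9, "auditability": 0.9,
                "privacy_compliance": 0.95, "volume": 0.3}
bulk_scraped = {"domain_relevance": 0.4, "auditability": 0.2,
                "privacy_compliance": 0.3, "volume": 1.0}

print(f"curated: {weighted_score(rich_curated):.2f}")
print(f"scraped: {weighted_score(bulk_scraped):.2f}")
```

Under these illustrative weights, the curated dataset outscores the larger scraped one, mirroring the paper's finding that richness dominates quantity once accuracy, auditability, and privacy are weighted appropriately.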


References

GitClear. The Impact of AI on Code Duplication, Churn and Defects [Internet]. 2024 [cited 2025 Apr 20]. Available from: https://arc.dev/talent-blog/impact-of-ai-on-code/

EU AI Act & GDPR Regulations [Internet]. [cited 2025 Apr 20]. Available from: https://artificialintelligenceact.eu/

Tantithamthavorn C, et al. Explainable AI for SE. IEEE Software. 2023.

Self-Evolving Software [Internet]. [cited 2025 Apr 20]. Available from: https://www.selfevolvingsoftware.com

Brundage M, et al. Lessons learned on language model safety and misuse [Internet]. OpenAI; 2022. Available from: https://openai.com/index/language-model-safety-and-misuse/

SES Overview. Internal Whitepaper; 2024.

Nijkamp E, Zhao J, Poesia G, Xiong C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2303.17568. 2023.

Perry N, et al. Do users write more insecure code with AI assistants? arXiv preprint arXiv:2211.03622. 2022.

Bell E. Generative AI vs. Large Language Models (LLMs): What's the Difference? [Internet]. 2024 [cited 2025 Apr 20]. Available from: https://appian.com/blog/acp/process-automation/generative-ai-vs-large-language-models

IBM. AI code-generation software: What it is and how it works? [Internet]. 2023 [cited 2025 Apr 20]. Available from: https://www.ibm.com/think/topics/ai-code-generation

American Psychological Association. Publication manual of the American Psychological Association. 7th ed. 2019. https://doi.org/10.1037/0000165-000

Bommasani R, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021.

Brown TB, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901.

Chen M, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021.

Fried D, et al. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999. 2022.

Li Y, et al. Competition-level code generation with AlphaCode. Science. 2022;378(6624):1092-1097. https://doi.org/10.1126/science.abq1158

Sherje N. Enhancing software development efficiency through AI-powered code generation. Res J Comput Syst Eng. 2024;5(1):1-12.

Published

2025-06-15

How to Cite

Ramachandran, M., & Fouracre, S. (2025). Rich Data Versus Quantity of Data in Code Generation AI: A Paradigm Shift for Healthcare. Blockchain in Healthcare Today, 8(1). https://doi.org/10.30953/bhty.v8.396

Issue

Section

Narrative/Systematic Review/Meta-Analysis