Rich Data Versus Quantity of Data in Code Generation AI: A Paradigm Shift for Healthcare
DOI: https://doi.org/10.30953/bhty.v8.396

Keywords: Code generative AI, Gen AI in healthcare, large language models, LLM, rich data, self-evolving software, SES, software engineering

Abstract
This paper evaluates the critical trade-offs between "rich data" and "data quantity" approaches in Code Generation AI (Code Gen AI) and autonomous code agents, particularly in high-integrity sectors like healthcare. While Code Gen AI can enhance productivity by up to 55% in controlled environments, systems trained on unfiltered, large-scale datasets often increase code duplication, churn, and error rates. The paper demonstrates that in sectors where accuracy, auditability, and privacy are paramount, data richness consistently outperforms brute-force scaling strategies. Using Self-Evolving Software (SES) as a case study, the research contrasts outcomes from both paradigms and proposes a weighted matrix for data selection in Code Gen AI systems. The findings show that rich, curated, domain-specific datasets produce more reliable, compliant, and sustainable code with significantly reduced technical debt, particularly in regulated environments where quality and ethical considerations are essential. The paper concludes with best practice guidelines for implementing Code Gen AI in sensitive sectors, emphasizing that sustainable software engineering requires scaling ethically and intelligently rather than indiscriminately.
References
GitClear. The Impact of AI on Code Duplication, Churn and Defects [Internet]. 2024 [cited 2025 Apr 20]. Available from: https://arc.dev/talent-blog/impact-of-ai-on-code/
EU AI Act & GDPR Regulations [Internet]. [cited 2025 Apr 20]. Available from: https://artificialintelligenceact.eu/
Tantithamthavorn C, et al. Explainable AI for SE. IEEE Software. 2023.
Self-Evolving Software [Internet]. [cited 2025 Apr 20]. Available from: https://www.selfevolvingsoftware.com
Brundage M, et al. Lessons learned on language model safety and misuse [Internet]. OpenAI; 2022. Available from: https://openai.com/index/language-model-safety-and-misuse/
SES Overview. Internal Whitepaper; 2024.
Nijkamp E, Zhao J, Poesia G, Xiong C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2303.17568. 2023.
Perry N, et al. Do users write more insecure code with AI assistants? arXiv preprint arXiv:2211.03622. 2022.
Bell E. Generative AI vs. Large Language Models (LLMs): What's the Difference? [Internet]. 2024 [cited 2025 Apr 20]. Available from: https://appian.com/blog/acp/process-automation/generative-ai-vs-large-language-models
IBM. AI code-generation software: What it is and how it works? [Internet]. 2023 [cited 2025 Apr 20]. Available from: https://www.ibm.com/think/topics/ai-code-generation
American Psychological Association. Publication manual of the American Psychological Association. 7th ed. 2019. https://doi.org/10.1037/0000165-000
Bommasani R, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021.
Brown TB, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901.
Chen M, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021.
Fried D, et al. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999. 2022.
Li Y, et al. Competition-level code generation with AlphaCode. Science. 2022;378(6624):1092-1097. https://doi.org/10.1126/science.abq1158
Sherje N. Enhancing software development efficiency through AI-powered code generation. Res J Comput Syst Eng. 2024;5(1):1-12.
License
Copyright (c) 2025 Muthu Ramachandran, PhD, Steven Fouracre

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors retain copyright of their work, with first publication rights granted to Blockchain in Healthcare Today (BHTY).