Unveiling AlphaFold’s Iterative Breakthroughs: Data Strategy Insights from a Scientific Perspective

  • OUYANG ZhengZheng,
  • MA YuCong,
  • KOU YuanTao,
  • XIAN GuoJian,
  • WANG Hui,
  • ZHAO Qun
  • 1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
    2. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299, China
    3. Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, Beijing 100081, China
    4. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
    5. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China

Received date: 2025-10-14

Revised date: 2025-11-09

Online published: 2025-12-26

Abstract

The transformative breakthroughs of the AlphaFold series in structural biology are often attributed to algorithmic advances, yet the critical role of its evolving data strategy remains underexplored. Adopting a data-centric perspective, this paper deconstructs the iterative mechanisms driving AlphaFold’s progress from version 1 through version 3, emphasizing the optimization of data quality attributes, innovations in representation paradigms, and data-model synergy. The analysis indicates that each performance leap stems from the co-evolution of data and model architecture. AlphaFold’s data strategy follows a clear trajectory: from passive data adoption, to proactive data construction, and finally to generative data augmentation. From this trajectory, three core principles emerge: paradigm shifts in data representation are the primary drivers of breakthroughs; data-model co-evolution is a hallmark of system maturity; and the richness of data quality attributes sets the ceiling for an AI system’s learning potential. These principles yield four implications for the AI for Science (AI4S) field: data practices should shift from passive preparation to active design; research should prioritize data-model alignment over purely model-centric or purely data-centric approaches; data ecosystems should focus on enhancing key attributes, such as diversity and quality, rather than on broad multimodal integration; and a new theoretical and evaluation framework is needed to assess the "scientific efficacy" of data. This study provides a theoretical foundation and a practical roadmap for advancing AI-driven scientific discovery.

Cite this article

OUYANG ZhengZheng, MA YuCong, KOU YuanTao, XIAN GuoJian, WANG Hui, ZHAO Qun. Unveiling AlphaFold’s Iterative Breakthroughs: Data Strategy Insights from a Scientific Perspective[J]. Journal of Agricultural Big Data, 2025, 7(4): 485-495. DOI: 10.19788/j.issn.2096-6369.000136
