Data Management

AlphaFold’s Iterative Breakthroughs and Data Strategy Insights from a Scientific Data Perspective

  • OUYANG ZhengZheng ,
  • MA YuCong ,
  • KOU YuanTao ,
  • XIAN GuoJian ,
  • WANG Hui ,
  • ZHAO Qun
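  • 1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
    2. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299, China
    3. Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, Beijing 100081, China
    4. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
    5. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
OUYANG ZhengZheng, Email: oyzz@clas.ac.cn
MA YuCong, Email: mayc@clas.ac.cn
KOU YuanTao, Email: kouyuantao@caas.cn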

Received date: 2025-10-14

  Revised date: 2025-11-09

  Online published: 2025-12-26

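Funding

2024 Open Project Fund of the Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, National Press and Publication Administration (2024KMKS05); 2023 Key Project of the Innovation Fund of the National Science Library (Chengdu), Chinese Academy of Sciences (E3Z0000901)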

Unveiling AlphaFold’s Iterative Breakthroughs: Data Strategy Insights from a Scientific Perspective

  • OUYANG ZhengZheng ,
  • MA YuCong ,
  • KOU YuanTao ,
  • XIAN GuoJian ,
  • WANG Hui ,
  • ZHAO Qun
  • 1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
    2. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299, China
    3. Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, Beijing 100081, China
    4. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
    5. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China

Received date: 2025-10-14

  Revised date: 2025-11-09

  Online published: 2025-12-26

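Abstract

The revolutionary breakthroughs of the AlphaFold series in structural biology are often attributed to algorithmic innovation, but the more fundamental evolution of the scientific data strategy behind them has rarely been analyzed systematically. Taking scientific data as its core perspective, this paper systematically deconstructs the iterative breakthrough mechanisms of AlphaFold generations 1 to 3, focusing on three key levels: optimization of intrinsic data attributes, innovation in representation paradigms, and data-model co-adaptation. It argues that every leap in model performance is in essence the result of data-model co-evolution. The study shows that AlphaFold’s evolution traces its data strategy from passive reuse, through active construction, to generative empowerment. On this basis, the paper distills three core regularities: shifts in representation paradigm are the core driver of breakthroughs; data-model co-evolution is the key mark of maturity; and the richness of intrinsic data attributes determines the upper limit of an AI learning paradigm. These regularities offer four key implications for the AI for Science (AI4S) field: data work should move from passive preparation to active design; R&D should move from being "model-centric" or "data-centric" to being centered on "fit"; the construction of data systems should target the improvement of core attributes rather than blind multimodal aggregation; and the field urgently needs a new theoretical and evaluation framework for measuring the "scientific efficacy" of data, so as to provide theoretical support and a path reference for AI-driven scientific discovery.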

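Cite this article

OUYANG ZhengZheng, MA YuCong, KOU YuanTao, XIAN GuoJian, WANG Hui, ZHAO Qun. Unveiling AlphaFold’s Iterative Breakthroughs: Data Strategy Insights from a Scientific Perspective[J]. Journal of Agricultural Big Data, 2025, 7(4): 485-495. DOI: 10.19788/j.issn.2096-6369.000136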

Abstract

The transformative breakthroughs of the AlphaFold series in structural biology are often attributed to algorithmic advances, yet the critical role of its evolving data strategy remains underexplored. Adopting a data-centric perspective, this paper deconstructs the iterative mechanisms driving AlphaFold’s progress from versions 1 to 3, emphasizing the optimization of data quality attributes, innovations in representation paradigms, and data-model synergy. The analysis reveals that each performance leap stems from the co-evolution of data and model architectures. AlphaFold’s data strategy follows a clear trajectory: from passive data adoption, to proactive data construction, and finally to generative data augmentation. From this, three core principles emerge: paradigm shifts in data representation are the primary drivers of breakthroughs; data-model co-evolution is a hallmark of system maturity; and the richness of data quality attributes sets the ceiling for an AI’s learning potential. These principles yield four implications for the AI for Science (AI4S) field: data practices should shift from passive preparation to active design; research should prioritize data-model alignment over model- or data-centric approaches; data ecosystems should focus on enhancing key attributes, such as diversity and quality, rather than broad multimodal integration; and a new theoretical and evaluation framework is needed to assess the "scientific efficacy" of data. This study provides a theoretical foundation and practical roadmap for advancing AI-driven scientific discovery.
