A Multi-Omics Dataset for Functional Gene Mining in Animals
Received date: 2024-06-06
Accepted date: 2024-09-13
Online published: 2025-02-05
Funding
National Natural Science Foundation of China General Program (32272841); Hubei International Science and Technology Cooperation Project (2022EHB055)
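LIU Hong, DOU Jingwen, WANG Yue, LIAO Yong, LIU Xiaolei, LI Xinyun, ZHAO Shuhong, FU Yuhua. A Multi-Omics Dataset for Functional Gene Mining in Animals[J]. Journal of Agricultural Big Data, 2025, 7(1): 96-106. DOI: 10.19788/j.issn.2096-6369.100039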
Single-omics data alone cannot fully reveal the complex molecular mechanisms by which genes regulate traits, so integrating omics data of different types and levels is important for understanding the molecular networks within organisms. This dataset provides individual-level omics data (WGS, RNA-Seq, ChIP-Seq, and ATAC-Seq) for 61,191 individuals from 21 animal species, together with genome annotation information, with an effective data size of 2.8 TB. It also includes gene and phenotype entity recognition data obtained with deep learning algorithms. The dataset can be used for gene discovery and functional validation for agriculturally important traits, provides a resource for cross-species comparative studies, and supports the construction of models and the development of algorithms for identifying key genes underlying economic traits in animals.
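The total of 61,191 individuals quoted in the abstract equals the sum of the three individual-level components listed in the data summary (genomic variation, gene expression, and epigenetic signal data). The short Python sketch below checks that arithmetic and reports each layer's share. The dictionary keys and function names are illustrative, not part of the dataset, and the pairing of each layer with a sequencing assay in the comments is an assumption. Whether any animal contributes to more than one layer is not stated in the source, so the shares describe records per layer rather than unique animals.

```python
# Individual counts per data layer, copied from the data summary table.
# Keys are illustrative labels; the assay noted beside each is an assumption.
INDIVIDUALS_PER_LAYER = {
    "genomic_variation": 10_835,  # assumed to derive from WGS
    "gene_expression": 44_638,    # assumed to derive from RNA-Seq
    "epigenetic_signal": 5_718,   # assumed to derive from ChIP-Seq / ATAC-Seq
}


def total_individuals(counts):
    """Return the sum of individuals over all data layers."""
    return sum(counts.values())


def layer_shares(counts):
    """Return each layer's fraction of the overall individual count."""
    total = total_individuals(counts)
    return {layer: n / total for layer, n in counts.items()}


if __name__ == "__main__":
    print(f"total individuals: {total_individuals(INDIVIDUALS_PER_LAYER):,}")
    for layer, share in layer_shares(INDIVIDUALS_PER_LAYER).items():
        print(f"{layer}: {share:.1%}")
```

Gene expression data accounts for the largest share of records, which is worth keeping in mind when comparing layers across species.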
Data summary:
| Item | Description |
|---|---|
| Dataset name | A Multi-Omics Dataset for Functional Gene Mining in Animals |
| Specific subject area | Agronomy |
| Research topic | Animal Multi-Omics Dataset |
| Time range | 2000-2022 |
| Data types and technical formats | .txt, .vcf, .ped, .map, .bed, .bim, .fam |
| Dataset structure | The dataset consists of five parts: (1) functional annotation information for 403,216 genes across 21 species; (2) genomic variation data for 10,835 individuals from 21 species, comprising 877.59 million variants; (3) gene expression matrix data for 44,638 individuals from 21 species; (4) epigenetic signal matrix data for 5,718 individuals from 21 species, covering 124 markers such as H3K27ac; (5) pre-labeled gene and phenotype data from 2,794,237 articles covering 21 species. |
| Volume of dataset | 2.8 TB |
| Key indicators in dataset | Gene functional annotation, genomic variation information, gene expression matrices, epigenetic signal matrices, pre-labeled gene and phenotype data |
| Data accessibility | https://cstr.cn/17058.11.sciencedb.agriculture.00024 https://doi.org/10.57760/sciencedb.agriculture.00024 PUBLIC, CC BY-NC 4.0 |
| Financial support | National Natural Science Foundation of China General Program (32272841); Hubei International Science and Technology Cooperation Project (2022EHB055) |