A Multi-Omics Dataset for Functional Gene Mining in Animals

LIU Hong, DOU JingWen, WANG Yue, LIAO Yong, LIU XiaoLei, LI XinYun, ZHAO ShuHong, FU YuHua

doi:10.19788/j.issn.2096-6369.100039

Journal of Agricultural Big Data, 2025, Vol. 7, Issue 1: 96-106


1. Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science & Technology, Huazhong Agricultural University, Wuhan 430070, China
2. Hubei Hongshan Laboratory, Wuhan 430070, China

Received date: 2024-06-06

Accepted date: 2024-09-13

Online published: 2025-02-05


Abstract

Single-omics data alone cannot fully reveal the complex molecular mechanisms by which genes regulate traits. Integrating omics data of different types and at different levels is therefore important for understanding the complex molecular networks within organisms. This dataset provides individual-level omics data (WGS, RNA-Seq, ChIP-Seq, and ATAC-Seq) and genome annotation information for 61,191 individuals from 21 animal species, with an effective data size of 2.8 TB. It also includes gene and phenotype entity recognition data obtained with deep learning algorithms. The dataset can be used for gene discovery and functional validation for agriculturally important traits, and it is a resource for cross-species comparative studies. It also supports building models that identify key genes associated with economic traits in animals, as well as related algorithm research.

Data summary:

Item	Description
Dataset name	A Multi-Omics Dataset for Functional Gene Mining in Animals
Specific subject area	Agronomy
Research topic	Animal Multi-Omics Dataset
Time range	2000-2022
Data types and technical formats	.txt, .vcf, .ped, .map, .bed, .bim, .fam
Dataset structure	The dataset consists of five parts: (1) functional annotation information for 403,216 genes across 21 species; (2) genomic variation data for 10,835 individuals from 21 species, encompassing 877.59 million variants; (3) gene expression matrix data for 44,638 individuals from 21 species; (4) epigenetic signal matrix data for 5,718 individuals from 21 species, covering 124 markers such as H3K27ac; (5) pre-labeled gene and phenotype data from 2,794,237 articles covering 21 species.
Volume of dataset	2.8 TB
Key indicators in dataset	Gene functional annotation, genomic variation information, gene expression matrices, epigenetic signal matrices, pre-labeled gene and phenotype data
Data accessibility	https://cstr.cn/17058.11.sciencedb.agriculture.00024 https://doi.org/10.57760/sciencedb.agriculture.00024 PUBLIC, CC BY-NC 4.0
Financial support	National Natural Science Foundation of China General Program (32272841); Hubei International Science and Technology Cooperation Project (2022EHB055)
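
Two of the formats listed under "Data types and technical formats", .fam and .bim, are PLINK text files with six whitespace-separated columns per line. The following is a minimal sketch of reading them in Python. The sample records, identifiers, and helper names are invented for illustration and do not come from the dataset, and no assumption is made about the dataset's actual file names or directory layout.

```python
from typing import List, NamedTuple


class FamRecord(NamedTuple):
    """One line of a PLINK .fam file (sample information)."""
    family_id: str
    individual_id: str
    father_id: str   # "0" means unknown
    mother_id: str   # "0" means unknown
    sex: int         # 1 = male, 2 = female, 0 = unknown
    phenotype: str   # kept as text; "-9" or "0" commonly mean missing


class BimRecord(NamedTuple):
    """One line of a PLINK .bim file (variant information)."""
    chromosome: str
    variant_id: str
    genetic_distance_cm: float
    position_bp: int
    allele1: str
    allele2: str


def _rows(text: str, n_fields: int) -> List[List[str]]:
    """Split non-empty lines on whitespace and check the column count."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != n_fields:
            raise ValueError(
                f"line {line_no}: expected {n_fields} fields, got {len(fields)}"
            )
        rows.append(fields)
    return rows


def parse_fam(text: str) -> List[FamRecord]:
    return [FamRecord(f[0], f[1], f[2], f[3], int(f[4]), f[5])
            for f in _rows(text, 6)]


def parse_bim(text: str) -> List[BimRecord]:
    return [BimRecord(f[0], f[1], float(f[2]), int(f[3]), f[4], f[5])
            for f in _rows(text, 6)]


# Hypothetical sample content, NOT taken from the dataset.
SAMPLE_FAM = "FAM1 pig_001 0 0 1 -9\nFAM1 pig_002 0 0 2 -9\n"
SAMPLE_BIM = "1\trs_demo_1\t0\t10523\tA\tG\n1\trs_demo_2\t0\t20871\tC\tT\n"

samples = parse_fam(SAMPLE_FAM)
variants = parse_bim(SAMPLE_BIM)
print(len(samples), "samples;", len(variants), "variants")
```

For a real file, the same functions can be applied to the text returned by `pathlib.Path(...).read_text()`. Genome-scale files are usually better handled with dedicated tools such as PLINK itself, since this sketch loads everything into memory.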

Keywords: multi-omics data; cross-species; functional gene mining; individual level; deep learning

Cite this article

LIU Hong, DOU JingWen, WANG Yue, LIAO Yong, LIU XiaoLei, LI XinYun, ZHAO ShuHong, FU YuHua. A Multi-Omics Dataset for Functional Gene Mining in Animals[J]. Journal of Agricultural Big Data, 2025, 7(1): 96-106. DOI: 10.19788/j.issn.2096-6369.100039

