
From DNA Analysis To Gene Expression Prediction And Large-scale Language Modeling For Bioinformatics


Large Language Models

3 main points
✔️ Shows that large-scale language models can predict complex genomic properties, enhancing DNA analysis and our understanding of genome function
✔️ Suggests new possibilities for genetics research by accurately predicting the effects of DNA mutations genome-wide
✔️ Accurately identifies key regulatory sequences such as enhancers and promoters, contributing to research on gene expression regulation

Large language models in bioinformatics: applications and perspectives
written by Jiajia Liu, Mengyuan Yang, Yankai Yu, Haixia Xu, Kang Li, Xiaobo Zhou
(Submitted on 8 Jan 2024)
Comments: Published on arxiv.
Subjects: Quantitative Methods (q-bio.QM); Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The field of natural language processing has evolved rapidly with the advent of large-scale language models such as OpenAI's GPT-X and Google's BERT. These advanced models are pushing the ability to understand and produce human language to new levels, revolutionizing everyday communication and business processes.

Large-scale language models learn the complexity and context of language by studying vast amounts of textual data from the Internet, thereby acquiring a deep understanding of the meaning of text and the ability to respond appropriately. Underpinning these models is an innovative neural network architecture called the "transformer", which enables parallel processing and scalability while capturing long-range dependencies in text.

Of particular note is the "self-attention mechanism" employed by the transformer. It evaluates the importance of each word when interpreting a sentence, allowing for a deeper understanding of context. This technique is key to the model's remarkable performance.
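To make this idea concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. It is illustrative only: real transformer layers use multiple heads, learned projections trained end-to-end, masking, and residual connections.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # how relevant each token is to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                               # context-weighted mixture of values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))                    # 5 token embeddings of dimension 16
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(tokens, w_q, w_k, w_v).shape)   # (5, 8)
```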

Training takes place in two phases: pre-training and fine-tuning. In the pre-training phase, the model acquires grammatical and factual knowledge and reasoning ability from an extensive corpus of text. Through fine-tuning, the model is then optimized for specific tasks (translation, summarization, question answering, etc.). This adaptability allows these models to handle a wide variety of natural language processing tasks without relying on task-specific architectures, and gives them the potential to be applied in a wide variety of fields.
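As a rough illustration of this two-phase recipe, the sketch below loads a publicly available masked-language-model checkpoint with the Hugging Face `transformers` library and fine-tunes a freshly initialized classification head; the checkpoint name and task are generic examples, not taken from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)               # pre-trained encoder + fresh task head

batch = tokenizer(["an example sentence to classify"], return_tensors="pt")
loss = model(**batch, labels=torch.tensor([1])).loss
loss.backward()                                      # one fine-tuning step; the encoder weights are updated too
```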

This article looks at how large-scale language models can be applied to a wide variety of problems in bioinformatics. Because the paper presents so many examples, only excerpts are covered here.

Examples of applications of large-scale language models in bioinformatics

In biological research, deciphering the language embedded in DNA and revealing its hidden code has long been a major goal. In particular, progress has been made in deciphering the universal genetic code that governs the translation of DNA into proteins, using models built on modern architectures such as BERT and GPT.

DNABERT utilizes a robust attention-based transformer architecture recognized for a wide variety of natural language processing tasks. DNABERT-2 introduces the Genome Understanding Evaluation (GUE), a comprehensive dataset for multi-genome classification. The model achieves a 3X improvement in efficiency over the previous model, showing improved results in 23 of the 28 datasets used.

GROVER is a DNA language model that employs byte-pair tokenization to analyze the human genome in detail. The model identifies contextual relationships between tokens and helps identify genomic region structures relevant to functional genomics annotation; GROVER's distinctive approach is valuable to researchers exploring the complexity of the genome.
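As a concrete example of how such models turn DNA into "words", the snippet below shows the overlapping k-mer tokenization used by DNABERT-style models; DNABERT-2 and GROVER instead learn byte-pair vocabularies, which this simple sketch does not reproduce.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Slide a window of size k over the sequence, one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGCGTAC", k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC']
```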

DNAGPT, building on the success of the GPT series, is a GPT-based DNA model that has been pre-trained on over 10 billion base pairs and can be fine-tuned for a variety of DNA sequence analysis tasks. Nucleotide Transformer, meanwhile, provides four language models of different sizes, pre-trained on three datasets covering multiple species.

These pre-trained models have been applied to a wide variety of sequence prediction tasks, including prediction of promoter regions, enhancer regions, cis-regulatory elements, splice sites, and transcription factor binding sites.

One application of large-scale language models in bioinformatics is using DNA sequence language models to predict genome-wide mutation effects. The importance of DNA variation for biodiversity is immeasurable. Genome-wide association studies (GWAS) play an important role in elucidating it, but identifying causal variants remains a major challenge. The Genomic Pre-trained Network (GPN), developed to address this challenge, aims to learn genome-wide mutation effects through unsupervised pre-training. GPN predicts nucleotides from 512 base-pair DNA sequences masked at specific positions and uses these predictions to estimate which mutations are most likely; it is particularly good at accurately capturing the effects of rare mutations. It has demonstrated the ability to predict mutation effects from the DNA sequences of a variety of species, and this technology supports ongoing research into the complex relationship between DNA sequence variation and biological diversity.
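A hedged sketch of this masked-prediction style of variant scoring is shown below: the variant position in a sequence window is masked, the model's nucleotide probabilities at that position are read out, and the variant is scored by a log-likelihood ratio between the alternate and reference alleles. The `dna_lm` object and its `predict_nucleotide_probs` method are placeholders for any masked DNA language model, not GPN's actual API.

```python
import math

def variant_effect_score(dna_lm, window: str, center: int, ref: str, alt: str) -> float:
    """Score a single-nucleotide variant with a masked DNA language model."""
    masked = window[:center] + "[MASK]" + window[center + 1:]
    probs = dna_lm.predict_nucleotide_probs(masked, position=center)  # e.g. {'A': 0.1, 'C': 0.6, 'G': 0.1, 'T': 0.2}
    return math.log(probs[alt]) - math.log(probs[ref])  # strongly negative => alt allele looks unlikely/deleterious

# usage (with a 512 bp window centred on the variant):
# score = variant_effect_score(dna_lm, window_512bp, center=256, ref="C", alt="T")
```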

The second is DNA sequence language models that predict cis-regulatory regions. Identifying the cis-regulatory sequences that control gene expression, particularly enhancers and promoters, is critical given their impact on development and physiological function. This identification is a major challenge, and pre-trained models such as DNABERT and GROVER have been built upon to improve its accuracy. For example, BERT-Promoter uses a pre-trained BERT model to identify promoter activity and applies advanced machine learning algorithms to build the final predictive model. iEnhancer-BERT, on the other hand, takes a transfer-learning approach based on DNABERT to improve enhancer prediction, using convolutional neural networks to classify the feature vectors. These models represent promising progress toward elucidating the mechanisms behind gene expression and identifying new DNA enhancers.
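The general pattern behind models like iEnhancer-BERT, where token embeddings from a pre-trained DNA language model are fed to a small convolutional classifier, might look roughly like the following PyTorch sketch; the layer sizes and pooling choice are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EnhancerHead(nn.Module):
    """Small CNN classifier on top of (frozen) DNA language-model embeddings."""
    def __init__(self, emb_dim: int = 768, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=5, padding=2)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, token_embeddings):            # (batch, seq_len, emb_dim)
        x = token_embeddings.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x)).mean(dim=2)    # pool over sequence positions
        return self.classifier(x)                   # enhancer vs. non-enhancer logits

logits = EnhancerHead()(torch.randn(4, 200, 768))   # 4 sequences, 200 tokens each
print(logits.shape)                                 # torch.Size([4, 2])
```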

The third is the prediction of DNA-protein interactions. Accurately identifying DNA-protein interactions is essential for understanding the regulation of gene expression and evolutionary processes. To address this important task, models such as DNABERT, DNABERT-2, and GROVER have been applied to predict protein-DNA binding from ChIP-seq data. TFBert is another pre-trained model that performs well with minimal fine-tuning; it treats DNA like natural language and effectively extracts contextual information to handle the task efficiently. The MoDNA framework, meanwhile, incorporates common DNA functional motifs and learns genomic representations through self-supervised pre-training, contributing to promoter prediction and transcription factor binding site prediction.

The fourth is the prediction of DNA methylation, a process that plays a central role in the epigenetic regulation of genes. Methylation patterns can be important markers for the diagnosis and treatment of disease. Several models, notably BERT6mA, iDNA-ABT, iDNA-ABF, and MuLan-Methyl, predict different forms of methylation, and these insights may lead to the development of new therapies. These models use advanced feature representations and learning algorithms to analyze complex patterns of DNA methylation.

Examples of applications of large-scale language models in transcriptomes

With the ongoing development of BERT-based language models for DNA, the challenge is to accurately capture evolutionary information from homologous sequences. In particular, two innovative RNA-based models, RNA-FM and RNA-MSM, have emerged to address the less conserved nature of RNA sequences.

RNA-FM utilizes self-supervised learning to predict RNA secondary and 3D structures using an extensive data set containing 23 million non-coding RNA sequences. With this approach, RNA-FM effectively captures a variety of structural information about RNA sequences and provides a comprehensive understanding of the characteristics of these sequences.

RNA-MSM, on the other hand, uses homologous sequences automatically collected with RNAcmap. The model is particularly good at accurately mapping 2D base-pairing probabilities and 1D solvent accessibility, and it can be fine-tuned for a variety of downstream tasks related to RNA structure and function.

One application of large-scale language models in the transcriptome is the prediction of RNA family classification and secondary structure using RNA sequence language models. RNA secondary structure prediction is a major challenge that scientists face in gaining a deeper understanding of RNA folding rules. RNABERT was developed to address this challenge and has the potential to contribute to many applications, including the development of RNA-targeted drugs. The model combines tokenization, positional embedding, and transformer modeling, with a particular focus on predicting RNA secondary structure and classifying RNA families. This ability to understand complex RNA folding rules and rapidly classify unknown RNA sequences into existing families represents an important advance in the study of new RNA molecules; RNABERT is thus an important tool not only for RNA structure prediction but also for a better understanding of RNA biology as a whole.
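To illustrate how a secondary structure can be read out of per-nucleotide embeddings, the sketch below scores every pair of positions with a small bilinear layer to produce a base-pair probability matrix. This shows the general pattern only and is not RNABERT's actual prediction head; the embedding dimension is an assumption.

```python
import torch
import torch.nn as nn

class PairingHead(nn.Module):
    """Score all position pairs (i, j) of an RNA sequence for base pairing."""
    def __init__(self, dim: int = 120):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, emb):                             # (seq_len, dim) per-nucleotide embeddings
        L, d = emb.shape
        left = emb.unsqueeze(1).expand(L, L, d)          # embedding of position i
        right = emb.unsqueeze(0).expand(L, L, d)         # embedding of position j
        return torch.sigmoid(self.bilinear(left, right)).squeeze(-1)  # (L, L) pairing probabilities

probs = PairingHead()(torch.randn(30, 120))              # placeholder embeddings for a 30-nt RNA
print(probs.shape)                                       # torch.Size([30, 30])
```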

The second is an RNA sequence language model used to predict RNA splicing, an essential process for gene expression in eukaryotes. To better understand this process, a pre-trained model called SpliceBERT has been developed. SpliceBERT not only captures the subtle nuances of RNA splicing, but also helps identify potential mutations that disrupt splicing. The model provides a data-driven approach to assessing the impact of mutations and to efficiently identifying and prioritizing important genetic variants, making it a valuable resource for researchers studying effects on RNA splicing.

The third is RNA sequence language models used to predict RNA modifications. Post-transcriptional modifications of RNA play important roles in the cell, especially modifications such as N7-methylguanosine (m7G), which is essential for the regulation of gene expression. High-throughput experiments are accurate but costly and time-consuming. To address this, BERT-m7G has emerged as a computational model that efficiently identifies m7G sites from RNA sequences. This tool reduces the burden of experimental approaches and contributes to a better understanding of how m7G affects gene function. Another RNA modification, 2'-O-methylation (Nm), is also important for cellular processes, and Bert2Ome is an efficient computational tool for predicting these sites directly from the RNA sequence. BERT-based models are combined with convolutional neural networks (CNNs) to identify RNA modification sites and their functional relationships with high accuracy. This approach greatly reduces the time required by experimental methods and aids in a new understanding of post-transcriptional modifications.

The fourth is an RNA sequence language model used to predict protein expression and mRNA degradation. mRNA vaccines are gaining attention for their cost-effectiveness and rapid development potential. CodonBERT is specifically designed to predict protein expression from mRNA sequences and is pre-trained on a wide range of datasets using a multi-head-attention transformer architecture. This pre-training gives CodonBERT superior performance in predicting protein expression and mRNA degradation, and the ability to incorporate new biological information into the design of mRNA vaccines. This model opens up new possibilities in the field of immunization and contributes to more efficient vaccine development.

Applications of large-scale language models in protein research

Proteins are essential molecules for the maintenance of life and are the basis for a wide variety of physiological processes. As science progresses, so does the accumulation of protein data. Large-scale language models have emerged as an effective means of extracting useful information from such data.

These models, known as pre-trained protein language models (PPLMs), learn features from data such as protein sequences, gene ontology annotations, and property descriptions. The learned features are then applied to various tasks such as protein structure prediction, prediction of post-translational modifications (PTMs), and evaluation of biophysical properties.

Antibodies are also a type of protein, but the datasets and tasks used to study them differ from those for general proteins. With the growing amount of data in the Observed Antibody Space (OAS) database, antibody-specific large-scale language models (PALMs) are being developed to study the binding mechanisms of therapeutic antibodies, the evolution of immunity, and the discovery of new antibodies. These models are used for tasks as diverse as predicting specific antibody binding sites (paratopes), analyzing the maturation process of B cells, and classifying antibody sequences.

One application of large-scale language models in protein research is protein language models for protein secondary structure and contact prediction. A protein's structure plays a crucial role in its function and interactions. However, protein structure analysis using traditional laboratory techniques is a time-consuming and labor-intensive process. To solve this problem, advances in deep learning have led to large-scale language models for predicting protein structure. For example, the MSA Transformer uses multiple sequence alignments and applies a distinctive row and column attention mechanism to the input sequences. This model outperforms traditional unsupervised approaches with improved parameter efficiency. ProtTrans, which trains multiple models on data from UniRef and BFD, has likewise achieved remarkable progress in secondary structure prediction.
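The row-and-column attention idea can be sketched as "axial" attention over a multiple sequence alignment: attention is applied along each aligned sequence and then along each column across sequences. The simplified PyTorch module below omits details such as the tied row attention that the actual MSA Transformer uses.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Toy row-then-column attention over an MSA tensor of shape (num_seqs, seq_len, dim)."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, msa):
        rows, _ = self.row_attn(msa, msa, msa)      # attend across positions within each sequence
        cols = rows.transpose(0, 1)                 # (seq_len, num_seqs, dim)
        cols, _ = self.col_attn(cols, cols, cols)   # attend across sequences at each column
        return cols.transpose(0, 1)                 # back to (num_seqs, seq_len, dim)

out = AxialAttention()(torch.randn(8, 50, 64))      # 8 aligned sequences of length 50
print(out.shape)                                    # torch.Size([8, 50, 64])
```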

The second is protein language models for protein sequence generation. Protein generation techniques have a wide range of applications, including drug design and protein engineering. Modern large-scale language models can generate protein sequences that fold into stable three-dimensional structures with specific functional properties. The ProGen model uses UniProtKB keywords as conditional tags, drawing on a vocabulary of more than 1,100 terms, to generate protein sequences. ProtGPT2, in turn, generates proteins that follow natural amino acid principles, many of which exhibit globular properties, suggesting that it has mastered the protein-specific language.

The third is antibody large-scale language models for predicting antigen-receptor binding and antigen-antibody binding. Antigen proteins are degraded in the cytoplasm to form novel antigenic peptides. These peptides bind to the major histocompatibility complex (MHC), form the pMHC complex, and are transported to the cell membrane where they are presented; the T cell receptor (TCR) recognizes them and stimulates B cells to produce specific antibodies, triggering an immune response. An important part of this process is predicting how precisely a peptide will bind to an HLA molecule.

For example, MHCRoBERTa allows differentiation between different alleles based on input amino acid sequences, but this model is specifically focused on pMHC-I binding prediction. BERTMHC, on the other hand, has been trained on data containing 2,413 MHC-peptide pairs and shows progress in filling the previous gap in pMHC-II binding prediction.

Another major goal is predicting the binding specificity of adaptive immune receptors (AIRs) to antigens. This specificity is determined mainly by the flexible loops of the three complementarity-determining regions (CDR1-3). TCR-BERT learns a general representation of the TCR from unlabeled TCR CDR3 sequences to predict antigen specificity; however, it does not capture paired AIR interactions. This problem was effectively addressed by SC-AIR-BERT, a BERT model designed by Jianhua Yao et al. that outperforms other methods in predicting the antigen-binding specificity of TCRs and BCRs.

Recent work on antibody language models has also received much attention. For example, AbLang is built on RoBERTa and focuses on specific challenges, particularly the recovery of residues lost during the sequencing process. This model outperforms other models in its ability to accurately recover missing residues in antibody sequences.

In addition, AntiBERTa uses latent vectors derived from protein sequences to gain some understanding of the "language" of antibodies, effectively performing tasks as diverse as tracing the B-cell origin of an antibody, quantifying immunogenicity, and predicting binding sites. EATLM introduces additional pre-training tasks as a new approach to incorporating specific biological mechanisms.

Applications of Large-Scale Language Models in Drug Discovery

Drug discovery is known to be an expensive, time-consuming process with a low success rate. In its early stages, computer-aided drug discovery, which combines experience and expertise with algorithms, machine learning, and deep learning, is accelerating the generation and screening of drug molecules and lead compounds. This speeds up the entire development process, especially for small-molecule drugs, which account for most (up to 98%) of the drugs on the market.

Small-molecule drugs have well-dispersed spatial structures, and their chemical properties are considered to confer favorable drug-likeness and pharmacokinetics. Advances in deep learning and the introduction of large-scale language models have facilitated the use of these techniques to discover patterns in, and interactions between, small molecules, proteins, RNA, and other molecules and their targets.

Specifically, SMILES strings and chemical fingerprints are commonly used to represent molecules. In addition, the pooling step of graph neural networks (GNNs) can convert small molecules into sequential representations, and large-scale language models act on this information in various aspects of drug discovery. In this way, the efficiency and accuracy of new drug discovery are improved.
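For concreteness, the snippet below shows the two representations mentioned here: a raw SMILES string, which language models treat as text, and a Morgan fingerprint bit vector computed with RDKit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"                     # aspirin, written as a SMILES string
mol = Chem.MolFromSmiles(smiles)                      # parse into an RDKit molecule
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)  # circular fingerprint
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```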

This approach has contributed significantly to reducing costs and speeding up processes in the area of drug discovery, opening up new possibilities for the future of medicine.

One practical challenge for large-scale language models in drug discovery is covering the enormous drug-like chemical space (an estimated 10^63 or more compounds). Traditional virtual screening libraries contain fewer than 10^7 compounds and are sometimes unavailable. To address this problem, deep learning methods have emerged as an effective approach for generating molecules with drug-like properties. In particular, the MolGPT model, inspired by the generative pre-trained GPT family, extends conditional generation by combining the next-token prediction task with an additional conditional-prediction training task. The model not only generates novel and effective molecules but also better captures the statistical properties of the dataset.
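Conditional, next-token-style molecule generation can be sketched as follows: generation starts from a condition token (for example, a desired property bin) and SMILES tokens are sampled one at a time. The `model` and `vocab` objects here are placeholders rather than MolGPT's real interface.

```python
import torch

def generate_smiles(model, condition_id: int, vocab: list[str], max_len: int = 100) -> str:
    """Autoregressively sample SMILES tokens after a conditioning prefix token."""
    tokens = [condition_id]                                   # condition token steers generation
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))[0, -1]         # next-token logits from the language model
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        if vocab[next_id] == "<eos>":                         # stop at end-of-sequence
            break
        tokens.append(next_id)
    return "".join(vocab[t] for t in tokens[1:])              # drop the condition token
```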

The second application of large-scale language models in drug discovery concerns combination therapies for complex diseases such as cancer, infectious diseases, and neurological disorders, which are common and often more effective than single-drug treatments. Accurately predicting the synergistic effects of drug pairs is essential for improving therapeutic efficacy, but it is challenging due to the large number of possible combinations and complex biological interactions. In this area, the DCE-DForest model developed by Wei Zhang and colleagues encodes drug SMILES with a pre-trained drug BERT model and predicts synergy from drug and cell-line embedding vectors using a deep forest approach. In addition, Mengdie Xua and colleagues fine-tune a pre-trained large-scale language model to predict drug-pair synergy effectively using a dual feature-fusion mechanism that combines drug molecular fingerprints, SMILES encodings, and cell-line gene expression data; ablation analysis confirms that the fingerprint input plays an important role in the quality of the synergy predictions.
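The feature-fusion idea behind such synergy predictors can be illustrated with a small regressor that concatenates embeddings of the two drugs with a cell-line expression vector; the dimensions and architecture below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SynergyRegressor(nn.Module):
    """Fuse two drug embeddings and a cell-line expression vector into a synergy score."""
    def __init__(self, drug_dim: int = 256, cell_dim: int = 1000, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * drug_dim + cell_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                   # predicted synergy score
        )

    def forward(self, drug_a, drug_b, cell_expr):
        return self.mlp(torch.cat([drug_a, drug_b, cell_expr], dim=-1))

score = SynergyRegressor()(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 1000))
print(score.shape)                                  # torch.Size([8, 1])
```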

Examples of applications of large-scale language models in single-cell analysis

Single-cell RNA sequencing (scRNA-seq) marks the beginning of a new era in genomics and biomedical research. Unlike traditional bulk RNA sequencing, scRNA-seq can unravel the details of gene expression at the single cell level, which has led to unprecedented insights and many breakthroughs [127-130]. One of the most notable changes resulting from this technology is the ability to detail the diversity of cells within a tissue or organism. Through scRNA-seq, diverse cell types and rare cell states that are often overlooked by conventional methods can be revealed.

As mentioned earlier, large-scale language models have been used successfully in a variety of fields such as genomics, transcriptomics, proteomics, and drug discovery. Here we show how these models are being applied in the field of single cell analysis. Single-cell language models can be used for a wide variety of downstream tasks such as identifying cell types and states, discovering new cell populations, estimating gene regulatory networks, and even integrating single-cell multi-omics data.

One application of large-scale language models in single-cell analysis is single-cell language models for clustering based on scRNA-seq data. Cell clustering with single-cell RNA sequencing (scRNA-seq) is an important method for deciphering cellular diversity within a biological sample; it divides individual cells into clusters based on their gene expression profiles. Large-scale language models allow efficient clustering using a wide range of scRNA-seq data from different tissues and species. For example, the tGPT model learns feature representations based on highly expressed genes and has been applied to cell clustering on large datasets such as the Human Cell Atlas and Tabula Muris. scFoundation uses a transformer-based encoder-decoder structure to learn cell embeddings from unmasked, non-zero gene data, which are then used for clustering.
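Once a single-cell language model has produced one embedding per cell, the clustering step itself is standard; a minimal sketch using k-means on placeholder embeddings is shown below (real pipelines often build a neighborhood graph and run Leiden clustering instead).

```python
import numpy as np
from sklearn.cluster import KMeans

cell_embeddings = np.random.rand(1000, 128)         # placeholder for model-derived cell embeddings
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(cell_embeddings)
print(np.bincount(labels))                           # number of cells assigned to each cluster
```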

The second is single-cell language models for gene function analysis based on scRNA-seq data. Large-scale language models have also been applied to gene function analysis. These models learn relationships between genes using the transformer attention mechanism and generate gene embeddings through pre-training and fine-tuning. These embeddings can be used for gene expression prediction and genetic perturbation prediction. scGPT acts as a feature extractor using zero-shot learning and contributes to the inference of gene regulatory networks. Geneformer, meanwhile, is trained on extensive single-cell transcriptome data and fine-tuned for a variety of downstream tasks, including prediction of chromatin dynamics and network dynamics. These models provide highly accurate predictions by transferring pre-trained weights to task-specific models with limited data.

The third is single-cell language models for single-cell multi-omics data. The study of single-cell multi-omics data offers many advantages over single-omics data by integrating information from different omics technologies, such as genomics, transcriptomics, epigenomics, and proteomics, at the single-cell level. In analyzing such data, large-scale language models leverage their adaptability, generalization, and feature extraction capabilities to address data variability, scarcity, and cellular heterogeneity.

The scGPT model handles dataset diversity by using a set of tokens that represent the different sequencing methods when integrating scMulti-omics data. These modality tokens are associated with input features such as genes or proteins and incorporated into the transformer output to increase the accuracy of data processing. This design allows features from different modalities to be evaluated properly while avoiding excessive attention to features within the same modality.
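A rough sketch of this modality-token idea is shown below: each gene or protein feature receives an embedding, and a learned token identifying its assay is added before the transformer sees it. The vocabulary sizes and the additive composition are assumptions for illustration, not scGPT's exact implementation.

```python
import torch
import torch.nn as nn

n_features, n_modalities, dim = 20000, 3, 128        # e.g. RNA genes, ATAC peaks, protein panel
feature_emb = nn.Embedding(n_features, dim)          # one embedding per gene/protein feature
modality_emb = nn.Embedding(n_modalities, dim)       # one embedding per assay/modality

feature_ids = torch.tensor([[12, 305, 18000]])       # three features observed in one cell
modality_ids = torch.tensor([[0, 0, 2]])             # first two from RNA, last from the protein panel
tokens = feature_emb(feature_ids) + modality_emb(modality_ids)
print(tokens.shape)                                  # torch.Size([1, 3, 128]) -> input to the transformer
```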

Of particular note is a tool called scMVP, specifically designed to integrate single-cell RNA-seq and ATAC-seq data so that gene expression and chromatin accessibility are analyzed in the same cell. scMVP projects these data into a latent space and uses a cell-type-guided attention mechanism to compute correlations between the data. DeepMAPS, on the other hand, is a graph-transformer-based method for biological network inference and data integration from scMulti-omics data, including scRNA-seq, scATAC-seq, and CITE-seq. The method constructs graphs with genes and cells as nodes and learns local and global features to build relationships between cells and genes.
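The graph construction described for DeepMAPS can be illustrated by building a bipartite graph from an expression matrix, with cells and genes as nodes and expression values as edge weights; the sketch below uses NetworkX and random placeholder data, not the tool's actual code.

```python
import numpy as np
import networkx as nx

expr = np.random.poisson(0.3, size=(50, 200))        # placeholder cells-by-genes count matrix
graph = nx.Graph()
graph.add_nodes_from((f"cell_{i}" for i in range(expr.shape[0])), kind="cell")
graph.add_nodes_from((f"gene_{j}" for j in range(expr.shape[1])), kind="gene")

cells, genes = np.nonzero(expr)                      # connect a cell to every gene it expresses
graph.add_weighted_edges_from(
    (f"cell_{i}", f"gene_{j}", float(expr[i, j])) for i, j in zip(cells, genes))
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```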

In addition, scTranslator enables the conversion of single-cell transcriptome data into proteome data, accurately inferring protein abundance by minimizing the difference between predicted and measured proteins. scMoFormer not only converts gene expression into protein abundance but also predicts multi-omics data to elucidate the dynamic interactions between different layers of biological information.

Thus, large-scale language models are playing an important role in the field of single-cell analysis, expanding research possibilities. These evolving tools suggest new possibilities for unraveling biological complexity and paving the way for precision medicine.

Summary

Pre-trained large-scale language models are revolutionizing diverse challenges in biology. This paper surveys applications of large-scale language models in genomics, transcriptomics, proteomics, single-cell analysis, drug discovery, and many other areas.

Large-scale language models analyze DNA and RNA sequences and can predict modifications and regulation on this basis. Significant progress has also been made in the field of proteomics, including the prediction of protein structures and interactions. In particular, information from scRNA-seq and scMulti-omics data has contributed to the identification of cell types, the integration of datasets, and the prediction of gene-related functions.

In drug discovery, large-scale language models are also used to predict molecular properties, to generate new molecules, and to predict drug interactions. For example, DNABERT is trained specifically for DNA analysis and may also be applied to RNA analysis, while models such as M6A-BERT-Stacking are specialized for identifying RNA modification sites and can make highly accurate predictions.

In the field of protein research, protein language models based on sequence data provide detailed analysis of protein function and useful information for researchers. However, these models require a large number of parameters, making deployment a challenge. A partial solution is to use large-scale models online or to adopt distillation-based approaches.

Thus, large-scale language models are opening up new possibilities as a powerful tool for analyzing complex problems in molecular biology, from analyzing DNA mutations and mRNA abundances to discovering new causal relationships.

In addition, the development of large-scale language models has brought new challenges in integrating diverse information modalities, such as protein 3D structural information. Approaches to convert this information into sequence-based formats and methods to integrate multiple large models to capture multimodal information are being investigated. The selection of multimodal fusion techniques and timing is critical to this.

In the field of drug discovery, the use of large-scale language models has led to demand for predictions based not only on molecular sequence information but also on spatial structure, and new models are being built to improve prediction accuracy. For example, large-scale graphical models using the CrossDocked2020 dataset have been developed.

In addition, large-scale language models can apply the insights gained from predicting protein-protein interactions (PPI) and cell-cell interactions (CCI) to the prediction of drug-target interactions (DTI). This technology is also evolving in the generation of drug molecules, taking into account properties such as efficacy and novelty.

The application of large-scale language models in single-cell analysis reduces the scarcity problem, especially for scRNA-seq data, and streamlines the training of models on large gene expression datasets. Defining gene positions and overcoming batch effects when integrating data from different sequencing technologies are also important challenges. The combination of graph neural networks (GNNs) and transformers is driving innovative advances in single-cell analysis, contributing to the analysis of complex cell-gene interactions.

DeepMAPS is also a model that uses a graph transformer to assess the importance of genes within a cell in order to understand interactions between cells and genes. The technology combines graph neural networks (GNNs) and transformers to comprehensively represent the complex relationships and dependencies within single-cell data; GNNs are better suited to capturing local interactions in the neighborhood of a cell, while transformers more effectively capture broader dependencies.

This synergy helps to understand the overall cellular landscape and improve feature learning. As a result, large-scale language models can effectively learn relevant information such as gene expression patterns and cell types from raw data without prior domain-specific knowledge.

Today's large-scale language models have reached a very sophisticated level in their ability to model the complexities of molecular biology. The evolution of single cell technology and the development of omics science, including proteomics, metabolomics, and lipidomics, are enabling more efficient measurement techniques. This has enhanced our ability to unravel the complexity of molecular layers from DNA to the details of human physiology.

Further exploration of the realm of cutting-edge technologies is expected to yield new insights that will lead to a comprehensive understanding of dynamic interactions at the molecular level.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
