RNA modifications are critical post-transcriptional events in which RNA-binding proteins chemically modify specific nucleotides, altering RNA activity, localisation, and stability1-2. To date, over 100 types of RNA modification have been identified, some of which are implicated in the development of cancers, cardiovascular disorders, and other diseases3-5. Although recent technological advances have significantly increased our capacity to identify these modifications, existing analysis pipelines are restricted to known modification motifs6. In this study, we present a deep learning framework capable of accurately identifying RNA sites likely to undergo any of seven modification types: N6-methyladenosine (m6A), pseudouridine (ψ), 1-methyladenosine (m1A), 2'-O-methyladenosine (Am), 2'-O-methylcytidine (Cm), 2'-O-methylguanosine (Gm), and 2'-O-methyluridine (Um).
We curated publicly available experimental datasets7-8 and characterised the modification sites from three perspectives: RNA sequence, evolutionary conservation, and position within the transcript. RNA sequence descriptors were generated using one-hot encoding, iFeatures9, and RNABERT10, an optimised transformer-based model built on machine learning techniques from natural language processing. PhyloP and PhastCons scores were used to capture the conservation of each modification site and its adjacent sites. We also encoded the position of each modification site relative to transcript structure to improve prediction performance.
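The three feature groups described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `one_hot_window` and `site_features`, the window size `flank`, the zero-padding at transcript ends, and the use of a single relative-position value as a stand-in for the transcript-structure features are all assumptions made for illustration, and the conservation scores are passed in as a plain list rather than read from PhyloP or PhastCons tracks.

```python
# Illustrative sketch only; names, window size, and padding are assumptions.
NUCLEOTIDES = "ACGU"


def one_hot_window(sequence, site, flank=2):
    """One-hot encode positions site-flank .. site+flank (0-based).

    Positions outside the sequence and non-ACGU characters are encoded
    as all-zero vectors.
    """
    seq = sequence.upper().replace("T", "U")
    encoded = []
    for pos in range(site - flank, site + flank + 1):
        base = seq[pos] if 0 <= pos < len(seq) else "N"
        encoded.append([1 if base == nt else 0 for nt in NUCLEOTIDES])
    return encoded


def site_features(sequence, site, conservation, flank=2):
    """Concatenate sequence, conservation, and position features for one site.

    `conservation` is a hypothetical per-nucleotide score list aligned to
    `sequence`; out-of-range positions are padded with 0.0.
    """
    window = range(site - flank, site + flank + 1)
    one_hot = [bit for vec in one_hot_window(sequence, site, flank) for bit in vec]
    cons = [conservation[p] if 0 <= p < len(conservation) else 0.0 for p in window]
    # Relative position along the transcript (0.0 = first base, 1.0 = last base).
    rel_pos = site / (len(sequence) - 1) if len(sequence) > 1 else 0.0
    return one_hot + cons + [rel_pos]
```

For a window of `2 * flank + 1` nucleotides, the resulting vector has four one-hot values and one conservation score per position, plus one positional value. A vector of this kind could be fed to a classifier alongside learned embeddings such as those from RNABERT.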
The model performed well in both cross-validation and independent blind tests, providing a powerful tool for analysing RNA modification sites and enabling genome-wide predictive mapping. This framework expands our ability to identify RNA modifications and has the potential to facilitate advances in therapeutic applications.