This article lists resources and tools that are useful for conducting real biological research. All of them are free to use, though some may require a school email address to create a free account.
Resource Platforms
This list contains bioinformatics resource platforms available online. Each platform offers a large number of tools; only a few examples are included for each, for brevity.
BioLib: BioLib is a versatile platform designed to facilitate the use of bioinformatics tools through a user-friendly graphical interface or direct integration with existing data pipelines. It enables researchers to perform complex data analyses securely and efficiently, leveraging both local and cloud computing resources. Here are some key features and components of BioLib:
DNA Analyzer: This tool enables the analysis of DNA sequences, including options for reverse complement analysis. It is designed to be accessible without requiring advanced programming skills, making it useful for a wide range of genomic studies.
BioLib SAMtools: A library for handling BAM/SAM formats, essential for next-generation sequencing data. It supports various bioinformatics languages, including Perl, Python, and Ruby, facilitating seamless integration into diverse workflows.
BioLib Emboss: Part of the European Molecular Biology Open Software Suite, this module provides a comprehensive set of tools for sequence analysis, phylogenetics, and molecular evolution, enhancing the capabilities of BioLib for detailed biological data analysis.
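As a rough illustration of the reverse complement operation mentioned for DNA Analyzer above, here is a minimal Python sketch. It is not BioLib's implementation; the function name `reverse_complement` is chosen here for illustration, and the sketch assumes plain DNA input containing only A, C, G, T, or N.

```python
# Translation table mapping each base to its Watson-Crick partner.
# Assumption: input contains only A, C, G, T, N (upper or lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCGT"))  # ACGCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when working with reads from either strand.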
Galaxy: Galaxy is an open-source, web-based platform designed for accessible, reproducible, and transparent computational biomedical research. It enables users, particularly biologists, to perform complex data analyses without requiring programming skills. Galaxy has an extensive suite of tools, from genomic analysis to protein modeling and epigenetics. Here are a few examples:
MAFFT: This is a multiple sequence alignment tool used in bioinformatics for aligning DNA, RNA, or protein sequences. It's particularly useful for genomics and phylogenetic studies.
BWA (Burrows-Wheeler Aligner): Software for mapping low-divergent sequences against large reference genomes, crucial for next-generation sequencing data analysis and variant calling.
GATK (Genome Analysis Toolkit): A set of tools for variant discovery in high-throughput sequencing data, essential for identifying SNPs and indels in genomic studies and personalized medicine.
HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts): Fast alignment program for mapping RNA-seq reads to reference genomes, designed to handle spliced alignments for comprehensive transcriptome analysis.
Job Dispatcher: Job Dispatcher is a robust tool developed by EMBL's European Bioinformatics Institute (EBI) designed to manage and distribute computational jobs across various bioinformatics tools and resources. It ensures efficient utilization of computational resources and streamlines the execution of complex bioinformatics workflows. Here are some key features and components of the Job Dispatcher:
Clustal Omega: A high-performance multiple sequence alignment tool that handles large datasets with ease, making it ideal for comparative genomics and phylogenetic analysis.
InterProScan: A powerful tool for protein sequence analysis and functional annotation, integrating multiple databases to provide comprehensive insights into protein domains and families.
HMMER: A software suite used for searching sequence databases for homologs of protein sequences and for making sequence alignments, essential for protein family analysis and evolutionary studies.
Softberry: Softberry is a leading developer of bioinformatics software tools for genomic and proteomic research, offering a comprehensive suite of programs for sequence analysis, gene prediction, and functional annotation. Their tools are widely used in academic, research, and commercial settings for various biomedical and agricultural applications. Here are some key features and components of Softberry's bioinformatics toolkit:
Fgenesh++: A highly accurate gene prediction pipeline for eukaryotic genomes, capable of identifying 91% of coding nucleotides with 90% specificity. It incorporates mRNA and protein support to improve prediction accuracy.
FGENESB: A bacterial gene and operon finding program with custom-made parameters for over 240 prokaryotic genomes, including 25 archaea. It's particularly useful for analyzing bacterial communities and metagenomics data.
Fprom: A promoter prediction program that can identify 80% of TATA promoters with high specificity and 50% of TATA-less promoters. It's valuable for identifying transcription start sites upstream of predicted genes.
MolQuest: An integrated bioinformatics software package that provides a user-friendly interface for various analyses, including primer design, gene prediction, promoter identification, and phylogenetic reconstruction. It's designed to be accessible for both researchers and students.
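Figures such as "91% of coding nucleotides with 90% specificity" are nucleotide-level accuracy measures. The sketch below shows how such numbers could be computed for a toy example. It assumes the convention common in gene-prediction benchmarks, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); the function name and the coordinates are hypothetical and are not taken from Softberry's documentation.

```python
def nucleotide_accuracy(true_coding: set, predicted_coding: set) -> tuple:
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed convention (common in gene-prediction benchmarks):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted but not coding
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Hypothetical toy data: positions 100-199 are truly coding,
# and the prediction is shifted downstream by 10 nucleotides.
true_coding = set(range(100, 200))
predicted_coding = set(range(110, 210))

sn, sp = nucleotide_accuracy(true_coding, predicted_coding)
print(f"sensitivity={sn:.2f}, specificity={sp:.2f}")  # sensitivity=0.90, specificity=0.90
```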
GeneMark: GeneMark is a suite of gene prediction programs developed at the Georgia Institute of Technology, designed to identify protein-coding regions in genomic sequences. It utilizes sophisticated statistical models to accurately predict genes in various organisms, from prokaryotes to eukaryotes. Here are some key components of the GeneMark suite:
GeneMark-ES: An ab initio gene prediction program for eukaryotic genomes that uses an unsupervised training algorithm. It's particularly useful for newly sequenced genomes where little is known about the organism's gene structure.
GeneMarkS-2: A self-training gene finding program for prokaryotic genomes. It can automatically adjust to different gene expression patterns and GC content variations, making it highly adaptable for diverse bacterial and archaeal species.
MetaGeneMark: Specifically designed for metagenomic sequences, this tool can predict genes in mixed genomic samples from multiple organisms, crucial for environmental and microbiome studies.
GeneMark.hmm: This version uses hidden Markov models to improve prediction accuracy, especially for identifying short genes and precise gene starts in prokaryotic genomes.
GeneMark-EP+: An advanced eukaryotic gene prediction tool that integrates protein similarity information to enhance the accuracy of gene structure prediction in complex eukaryotic genomes.
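GC content, mentioned above as something gene finders must adapt to, is simply the fraction of G and C bases in a sequence. The following minimal Python sketch computes it for a whole sequence and over fixed windows. It only illustrates the quantity itself and says nothing about how any GeneMark program works internally; the function names and the window parameters are assumptions made for this example.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list:
    """Return GC content for each full window of the given size."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGC"))               # 0.5
print(windowed_gc("AAAAGGGG", 4, 4))    # [0.0, 1.0]
```

Comparing windowed values along a genome is one simple way to see how much base composition varies between regions or between species.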
Expasy: Expasy is the bioinformatics resource portal operated by the SIB Swiss Institute of Bioinformatics, providing access to over 160 databases and software tools across various life science domains. It supports a wide range of research areas, including genomics, proteomics, structural biology, systems biology, and medicinal chemistry. Here are some key tools and resources available on Expasy:
UniProtKB: A comprehensive protein sequence and functional information database, essential for proteomics research and protein annotation.
Swiss-Model: An automated protein structure homology-modeling server, aiding in the prediction of 3D structures of proteins based on known templates.
ISMARA: A web service for the analysis of gene expression and epigenetic data, enabling the identification of regulatory motifs and transcription factors.
RCSB PDB: The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) is a comprehensive repository of experimentally determined 3D structures of biological macromolecules, including proteins, nucleic acids, and complex assemblies. It serves as the primary resource for structural biology data worldwide, offering tools for data deposition, search, analysis, and visualization. Here are some key features and tools available through the RCSB PDB:
Structure Search: Allows users to search for protein structures using various criteria such as sequence, structure similarity, or chemical components.
Ligand Explorer: A tool for visualizing and analyzing ligand-protein interactions within macromolecular structures.
Protein Feature View: Provides an integrated view of protein sequence, structure, and functional annotations.
Mol* (Mol Star): A modern web-based molecular visualization tool for exploring 3D structures interactively.
BLAST: Enables sequence similarity searches against the structures in the PDB.
PDB-101: An educational resource offering structural views of biology for teachers, students, and the general public.
Individual Tools and Databases
AlphaFold Server: AlphaFold is an advanced artificial intelligence (AI) program developed by DeepMind, a subsidiary of Alphabet, designed to predict the 3D structure of proteins from their amino acid sequences. It has revolutionized the field of structural biology by achieving unprecedented accuracy in protein structure prediction.
PyMOL: PyMOL is a powerful molecular visualization system developed by Warren Lyford DeLano and currently maintained by Schrödinger, Inc. It is widely used in structural biology for producing high-quality 3D images of small molecules and biological macromolecules.
GOR: The GOR (Garnier-Osguthorpe-Robson) method is an algorithm for predicting the secondary structure of globular proteins from amino acid sequences. It predicts alpha helices, beta sheets, and random coils.
DeepTMHMM: DeepTMHMM is a deep learning-based tool for predicting transmembrane protein structure from amino acid sequences. It utilizes a combination of convolutional and recurrent neural networks to achieve high accuracy in identifying transmembrane helices, their orientation, and signal peptides in membrane proteins.
I-TASSER: I-TASSER (Iterative Threading ASSEmbly Refinement) is a computational tool for predicting the 3D structures of proteins from their amino acid sequences. It utilizes a hierarchical approach that combines threading, fragment assembly, and iterative refinement to generate accurate protein models, along with functional annotations derived from structural similarities.
InterPro: InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. It combines protein family, domain, and functional site information from multiple databases into a single resource, providing comprehensive insights into protein function, structure, and evolution.
CB-Dock2: CB-Dock2 is an advanced protein-ligand blind docking tool that combines structure-based and template-based approaches for improved accuracy. It automatically identifies binding sites, calculates optimal docking parameters, and utilizes information from similar known protein-ligand complexes to guide the docking process, achieving high success rates in predicting correct binding poses.
ZINC15: ZINC15 is a database offering a comprehensive collection of over 230 million purchasable compounds for virtual screening in drug discovery. It provides 3D structures of commercially available compounds, along with advanced search capabilities for properties, biological activities, and substructures, making it a valuable resource for researchers in medicinal chemistry and computational drug design.
KEGG: KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive database resource that integrates genomic, chemical, and systemic functional information. It provides a collection of manually curated pathway maps, representing molecular interaction networks for various cellular processes and organism-specific variations. KEGG enables researchers to interpret omics data in the context of biological systems.