IN THIS LESSON

Starting our exploration of machine learning

Machine learning (ML) is transforming how we analyze biological data. From predicting disease outcomes to discovering drug candidates, ML models can uncover hidden patterns and make predictions based on complex biological data. In this lesson, we’ll walk through how to get started with Python for ML by introducing Google Colab and various essential Python packages.

Introduction

This lesson shows how to get started with Python for ML in computational biology. We’ll use Google Colab, a free cloud-based platform, and explore four essential Python libraries: NumPy, Pandas, Scikit-learn, and Matplotlib. For each one, we’ll explain how it is used both in biological contexts and in machine learning.

Setting Up: Google Colab

Google Colab runs Python notebooks in your web browser, so you can write and execute code without installing anything on your computer.

Getting Started with Google Colab

To start programming with Python today, open Google Colab (colab.research.google.com) and sign in with a Google account. Create a new notebook by selecting New notebook. You can then type code into a code cell and run it by pressing Shift + Enter.

Installing and Using Key Python Packages

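To work with machine learning in biology, Python offers several powerful libraries. We will introduce four of them: NumPy, Pandas, Scikit-learn (imported as sklearn), and Matplotlib. Together they cover processing data, building ML models, and visualizing results. Colab runtimes come with all four preinstalled, so the !pip install commands shown below are mainly needed when working in a fresh environment elsewhere.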
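Before installing anything, it can help to check which of these libraries the current runtime already provides. The following is a minimal sketch, assuming Python 3.8 or newer; the helper name installed_versions is hypothetical and is used only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The four libraries covered in this lesson (pip distribution names)
versions = installed_versions(["numpy", "pandas", "scikit-learn", "matplotlib"])
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any library reported as not installed can then be added with the matching !pip install command.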

1. NumPy: Handling Numerical Data Efficiently

Biological Applications: NumPy is well suited to storing and processing large numerical biological datasets, such as gene expression matrices, numerically encoded DNA sequences, or atomic coordinates of protein structures. Its array operations run far faster than equivalent Python loops, which matters when ML is applied to datasets of this size.

Machine Learning Applications: In machine learning, NumPy is commonly used for handling arrays of data (like feature matrices) and for mathematical computations such as linear algebra, which underpins many ML algorithms (e.g., matrix multiplication, eigenvectors).

Installing NumPy:
!pip install numpy
Example Usage in Biology and ML:
import numpy as np

# Create a dataset representing gene expression levels for 100 genes across 50 samples
gene_data = np.random.rand(50, 100)

# In ML, this could be the feature matrix (X)
X = gene_data

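# Perform a common ML preprocessing step: z-score standardization of each gene
# (column), so every gene has mean 0 and standard deviation 1 across samples
X_normalized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

print("Normalized gene expression data:", X_normalized)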

Summary: In biology, NumPy helps you handle and manipulate large datasets. In ML, it’s used for preprocessing, normalizing data, and performing operations like matrix multiplication, which are critical for training models.
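The matrix multiplication and eigenvector operations mentioned above can be sketched as follows. This is an illustrative example on simulated data, not a method prescribed by this lesson: it builds a gene-by-gene covariance matrix with a matrix product and uses its top eigenvectors to project the samples into two dimensions, which is the core idea behind principal component analysis. All variable names are assumptions.

```python
import numpy as np

# Simulated expression matrix: 50 samples (rows) x 100 genes (columns)
rng = np.random.default_rng(0)
X = rng.random((50, 100))

# Center each gene so that it has mean 0 across samples
X_centered = X - X.mean(axis=0)

# Gene-by-gene covariance matrix via matrix multiplication (100 x 100)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Keep the two eigenvectors with the largest eigenvalues (largest first)
top2 = eigenvectors[:, ::-1][:, :2]

# Project every sample onto those two directions (50 x 2)
projected = X_centered @ top2
print("Projected shape:", projected.shape)
```

The variance of the first projected column equals the largest eigenvalue, which is a quick way to check that the decomposition was applied correctly.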

2. Pandas: Data Manipulation and Exploration

Biological Applications: Pandas is ideal for working with structured biological data, such as patient records or genomic information in CSV or Excel formats. It allows you to load, manipulate, and clean data before feeding it into machine learning models.

Machine Learning Applications: Pandas is used to organize data into a structure (like a DataFrame) that is easy to manipulate. In ML, you often need to preprocess data by removing missing values, encoding categorical variables, or splitting data into training and test sets.

Installing Pandas:
!pip install pandas
Example Usage in Biology and ML:
import pandas as pd

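# Load a CSV file containing biological data, such as patient information
# ('patient_data.csv' is a placeholder name; upload your own file to Colab first)
patient_data = pd.read_csv('patient_data.csv')

# Preprocess the data for ML: drop rows with missing values, then encode the
# categorical 'disease' column as integer codes.
# .copy() guards against pandas' SettingWithCopyWarning on the assignment below.
patient_data_clean = patient_data.dropna().copy()
patient_data_clean['disease'] = patient_data_clean['disease'].astype('category').cat.codes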

# In ML, this is used to prepare the dataset for training
X = patient_data_clean.drop('disease', axis=1)
y = patient_data_clean['disease']
print(X.head(), y.head())

Summary: Pandas is essential for cleaning and preparing biological data, and in ML, it’s used for preprocessing and structuring datasets before they are used to train models.
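Because the example above depends on a CSV file that you must supply yourself, here is a self-contained sketch of the same preprocessing steps on a small made-up table. The column names and values are invented purely for illustration. It also shows one way to encode a categorical feature column with pd.get_dummies.

```python
import numpy as np
import pandas as pd

# A tiny, made-up patient table with two missing values
patients = pd.DataFrame({
    "age": [54, 61, np.nan, 47, 38, 70],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "biomarker": [1.2, 3.4, 2.2, np.nan, 0.9, 4.1],
    "disease": ["healthy", "cancer", "cancer", "healthy", "healthy", "cancer"],
})

# Drop rows that contain any missing value
clean = patients.dropna().copy()

# Encode the label as integers; codes follow the sorted category order,
# so here "cancer" becomes 0 and "healthy" becomes 1
clean["disease"] = clean["disease"].astype("category").cat.codes

# One-hot encode the categorical feature column "sex"
X = pd.get_dummies(clean.drop(columns="disease"), columns=["sex"])
y = clean["disease"]

print(X)
print(y)
```

Note that cat.codes assigns integers alphabetically rather than by meaning, so it is worth checking which code corresponds to which class before interpreting a model's predictions.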

3. Scikit-learn (Sklearn): Building Machine Learning Models

Biological Applications: In computational biology, Scikit-learn helps with tasks like classifying disease types, predicting drug responses, or clustering protein families based on structural similarities.

Machine Learning Applications: Scikit-learn is a core library for building, training, and evaluating machine learning models. It provides algorithms for classification, regression, clustering, and dimensionality reduction, making it indispensable for ML tasks in biology.

Installing Scikit-learn:
!pip install scikit-learn
Example Usage in Biology and ML:

Let’s classify patients based on gene expression data into different disease types using a Support Vector Machine (SVM) classifier.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Simulated gene expression data (100 samples, 20 features)
X = np.random.rand(100, 20)
# Simulated disease labels (binary classification: 0 = healthy, 1 = disease)
y = np.random.choice([0, 1], size=100)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a Support Vector Machine classifier
model = SVC()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model
# (the labels here are random, so expect accuracy close to 50%, i.e. chance level)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Summary: Scikit-learn is the workhorse for building and training ML models in biology. Whether predicting cancer progression or clustering protein structures, it provides the necessary algorithms for supervised and unsupervised learning tasks.
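The unsupervised side mentioned above (clustering and dimensionality reduction) can be sketched in a similar way. The example below is an illustration on simulated data, with two artificial groups of samples standing in for two biological conditions. It standardizes the features, reduces them to two dimensions with PCA, and groups the samples with k-means. The group sizes, means, and variable names are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulate two groups of 30 samples with 20 features each;
# the second group has a shifted mean so that the groups are separable
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 20))
group_b = rng.normal(loc=3.0, scale=1.0, size=(30, 20))
X = np.vstack([group_a, group_b])

# Standardize features, then reduce 20 dimensions to 2 with PCA
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Cluster the samples into two groups without using any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print("Samples per cluster:", np.bincount(labels))
```

Unlike the SVM example, no labels are passed to the model here; k-means recovers the grouping from the structure of the data alone.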

4. Matplotlib: Visualizing Biological and Machine Learning Data

Biological Applications: Visualization is critical in biology for understanding patterns, such as changes in gene expression over time, or visualizing clusters of patients based on their genetic data.

Machine Learning Applications: In ML, Matplotlib is used to visualize model results, plot decision boundaries, and create graphs of accuracy, loss, or feature importance.

Installing Matplotlib:
!pip install matplotlib
Example Usage in Biology and ML:
import numpy as np
import matplotlib.pyplot as plt

# Simulate gene expression data for 3 genes across 10 samples
gene_expression = np.random.rand(10, 3)

# Plot the gene expression data
plt.plot(gene_expression)
plt.title("Gene Expression Levels")
plt.xlabel("Sample")
plt.ylabel("Expression Level")
plt.legend(["Gene 1", "Gene 2", "Gene 3"])
plt.show()

# In ML, visualize model accuracy
# (illustrative values only, not taken from a real training run)
accuracy_history = [0.8, 0.85, 0.88, 0.9, 0.92]
plt.plot(accuracy_history)
plt.title("Model Accuracy over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.show()

Summary: Matplotlib helps visualize biological and machine learning data. In ML, it’s used to plot data distributions, evaluate models visually, and understand results in a more intuitive way.
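Visualizing groups of patients, as mentioned above, can be sketched with a scatter plot. This is a minimal illustration on simulated data, with two made-up features per patient. It uses Matplotlib's object-oriented interface (fig, ax), which tends to scale better than chained plt calls once a figure has several panels.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate two groups of 40 patients described by two features each
rng = np.random.default_rng(1)
healthy = rng.normal(loc=0.0, scale=1.0, size=(40, 2))
disease = rng.normal(loc=2.5, scale=1.0, size=(40, 2))

# One scatter call per group so that each group gets its own color and legend entry
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(healthy[:, 0], healthy[:, 1], label="Healthy (simulated)", alpha=0.7)
ax.scatter(disease[:, 0], disease[:, 1], label="Disease (simulated)", alpha=0.7)
ax.set_title("Simulated patients in a 2D feature space")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.legend()
plt.show()
```

With real data, the two axes would typically be two measured features or two components from a dimensionality reduction step.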

Conclusion

By mastering these libraries, you can preprocess, visualize, and model biological data using machine learning. Whether handling large gene expression datasets with NumPy and Pandas or building predictive models with Scikit-learn, these tools empower you to apply machine learning to solve complex biological problems. In the next lessons, we’ll explore specific ML algorithms and their applications in real biological case studies.

We strongly encourage you to explore the user manuals linked below in "Additional Resources" as they will provide you with a better understanding of how these packages are used in practice.