De Novo Antibody Design with RFantibody in Google Colab

A practical guide to designing novel antibodies with diffusion.

Colab Notebook Link

Explore 15 novel SARS-CoV-2 spike protein antibodies generated by the methodology described in this post.

Introduction

The traditional approach to drug discovery relies on screening vast molecular libraries, testing millions of compounds in hopes of finding one that works. This process is both expensive and time-consuming, often taking years to identify promising candidates. De novo protein design offers a different paradigm entirely: instead of searching through existing molecules, we can design new ones from scratch, tailored for specific therapeutic purposes.

Recent advances in generative machine learning have begun to transform this vision into reality. By learning the underlying rules governing protein structure and function, models can conditionally generate entirely novel molecules with desirable properties. This post demonstrates this process through a practical example, showing how to design a new antibody against the SARS-CoV-2 spike protein using RFantibody, a state-of-the-art diffusion model specifically engineered for antibody design.

Generative Protein Design

The revolution in computational biology began with structure prediction. DeepMind's AlphaFold demonstrated that machine learning could predict protein structures from sequence with accuracy approaching that of experiment, a major step on one of the hardest open problems in biology. This breakthrough showed that neural networks could internalize the complex physical and chemical principles governing protein structure.

Generative models represent the next step. While predictive models answer the question "What will this protein look like?", generative models tackle the inverse problem: designing a protein that adopts a desired structure. This shift from analysis to synthesis opens new possibilities for therapeutic development, materials science, and basic biology research.

Among the various generative approaches, diffusion models have emerged as particularly powerful tools for protein design. These models work by learning to reverse a controlled noise-adding process. During training, the model observes how a clear signal (such as a protein structure) gradually degrades as random noise is systematically added. The neural network learns how to perform denoising, reversing each step of this degradation process. To generate a new structure, the trained model starts with pure random noise and applies the learned denoising process step by step. Each iteration removes a small amount of noise while preserving the underlying structural principles learned during training. The result is a novel, coherent design that follows the same physical and chemical rules as natural proteins.
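The noise-adding half of this process can be illustrated with a toy example. The sketch below assumes a simple Gaussian (DDPM-style) schedule applied to raw 3D points; RFdiffusion itself operates on residue frames rather than bare coordinates, so this is an illustration of the idea and not the model's actual noising procedure. The function names and schedule values are hypothetical.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; returns the remaining signal fraction per step."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)  # alpha_bar[t], decreasing from ~1 toward 0

def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

def recover_x0(x_t, noise, t, alpha_bar):
    """Invert add_noise when the noise is known (what a denoiser tries to learn)."""
    a = alpha_bar[t]
    return (x_t - np.sqrt(1.0 - a) * noise) / np.sqrt(a)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((10, 3))        # toy "structure": 10 points in 3D
alpha_bar = make_schedule(1000)
x_early, _ = add_noise(x0, 10, alpha_bar, rng)    # still very close to x0
x_late, _ = add_noise(x0, 999, alpha_bar, rng)    # essentially pure noise
```

During training, a network sees `x_t` and `t` and is asked to predict the noise (or, equivalently, the clean structure). At generation time no clean structure exists, so the network's prediction stands in for it at every step.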

RFdiffusion, developed at the University of Washington's Institute for Protein Design, represents a major achievement in computational protein design. This work from the group of David Baker, who shared the 2024 Nobel Prize in Chemistry for computational protein design, demonstrated that entirely novel protein backbones could be generated through a diffusion-based approach. Unlike earlier methods that typically modified existing protein scaffolds, RFdiffusion generates structures de novo: the model transforms random noise into a protein backbone that is complementary to a specified target site. This capability to create binders from scratch, rather than modifying existing ones, represents a significant advance in protein engineering.

The model's training used thousands of experimentally determined protein structures, allowing it to internalize the geometric and energetic principles that govern protein stability and function. By conditioning the generation process on specific constraints (such as binding to a particular target), RFdiffusion can produce designs that satisfy both structural plausibility and functional requirements.

Antibody Structure and Function

Antibodies represent one of evolution's most sophisticated molecular recognition systems. These Y-shaped proteins serve as the targeted defense mechanisms of the adaptive immune system, capable of recognizing and neutralizing almost any foreign substance. Beyond their natural role in immunity, antibodies have become one of the most important classes of therapeutic drugs, with treatments like adalimumab (Humira) for autoimmune diseases and numerous antibody-based cancer therapies demonstrating their clinical potential.

The characteristic Y-structure consists of two identical heavy chains and two identical light chains, held together by disulfide bonds. Each arm of the Y contains an antigen-binding site formed by the pairing of variable domains from the heavy and light chains (VH and VL, respectively). The smallest functional unit that retains full binding capability is the Fragment variable (Fv), comprising these paired variable domains.

The exquisite specificity of antibody binding arises from a set of hypervariable loops called Complementarity-Determining Regions (CDRs). These loops, three on each chain (CDR-H1, CDR-H2, CDR-H3 on the heavy chain and CDR-L1, CDR-L2, CDR-L3 on the light chain), form the direct interface with the target antigen. Unlike the relatively rigid framework regions that provide structural stability, the CDRs are conformationally flexible, allowing them to adapt to the shape and chemical features of their targets.

The diversity of the antibody repertoire stems primarily from genetic recombination processes that generate countless variations in CDR sequences and structures. In natural immune responses, B cells undergo somatic hypermutation and affinity maturation to optimize their binding properties. This evolutionary process provides the template for computational antibody design: generate diverse CDR structures and select those with optimal binding properties.

The molecular targets of antibodies, known as antigens, can be any foreign substance, but in therapeutic contexts, they are typically specific proteins on pathogens, cancer cells, or other disease-relevant targets. The precise location where an antibody binds is called the epitope, which may be a continuous stretch of amino acids or a conformational site formed by distant residues brought together by protein folding.

Epitope selection represents a critical decision in therapeutic antibody design. Functional epitopes, such as those that block viral entry or disrupt protein-protein interactions, often provide better therapeutic efficacy compared to purely structural targets. For neutralizing antibodies against viruses, epitopes in receptor-binding domains are particularly valuable because they directly interfere with the infection mechanism.

RFantibody: Specialized Design for Immune Recognition

While RFdiffusion excels at generating stable protein structures based on regular secondary structure elements like alpha-helices and beta-sheets, antibody binding presents unique challenges. The recognition interface is dominated by the flexible CDR loops, which adopt conformations dictated more by target complementarity than by intrinsic stability. These loops often exist in multiple conformational states and may undergo significant structural rearrangement upon binding.

General protein design models, trained primarily on single-chain proteins with well-defined folds, struggle with the conformational flexibility and inter-chain interactions that characterize antibody-antigen complexes. The binding interface involves subtle geometric complementarity, optimal electrostatic interactions, and carefully balanced hydrophobic and hydrophilic regions.

To address these limitations, researchers developed RFantibody, a specialized version of RFdiffusion fine-tuned specifically on antibody-antigen complexes (Bennett et al., 2024). The training dataset consisted of thousands of experimentally determined structures of antibodies bound to their targets, providing the model with extensive examples of successful molecular recognition.

RFantibody's architecture incorporates several key innovations. The model learns to generate CDR loop conformations while keeping the antibody framework and target antigen fixed, focusing its generative capacity on the variable regions responsible for binding specificity. The conditioning mechanism allows users to specify target epitopes precisely, guiding the generation process toward regions of functional importance.

The model's training encompassed diverse antibody formats and target types, from small molecule haptens to large protein antigens. This breadth enables the generation of antibodies against a wide range of therapeutic targets while maintaining the structural principles that govern stable antibody-antigen interactions.

The RFantibody pipeline extends beyond backbone generation to include sequence design and validation steps. After generating the 3D coordinates of the CDR loops, the next challenge is determining which amino acid sequence would adopt this desired structure.

ProteinMPNN, a specialized neural network for sequence design, addresses this inverse folding problem (Dauparas et al., 2022). Trained on thousands of protein structures and their corresponding sequences, ProteinMPNN learns the relationship between backbone geometry and amino acid identity. Given a desired backbone structure, the model predicts which sequence is most likely to fold into that conformation while maintaining the necessary binding interactions.

The final component involves structural validation using a specialized version of RoseTTAFold2. This step serves as a consistency check: starting from the designed sequence, the structure prediction model generates a 3D structure that should closely match the original designed backbone. Significant deviations indicate potential problems with the design, such as sequences that prefer alternative conformations or structures that violate physical constraints.
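The consistency check described above comes down to comparing two sets of backbone coordinates. Here is a minimal sketch, assuming the designed and re-predicted backbones are available as matching (N, 3) arrays of atom positions. It uses the standard Kabsch superposition to compute RMSD; the 2.0 Å default cutoff is an arbitrary placeholder, not a value taken from the RFantibody pipeline.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)               # remove translation
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)   # SVD of the 3x3 covariance matrix
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc                  # rotate P onto Q and compare
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def is_self_consistent(designed, predicted, max_rmsd=2.0):
    """True when the re-predicted backbone stays close to the designed one."""
    return kabsch_rmsd(designed, predicted) <= max_rmsd
```

In practice the comparison is often restricted to the designed loops, or computed after aligning on the fixed framework, so that deviations in the CDRs are not averaged away by the rest of the structure.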

Practical Implementation: The Colab Walkthrough

The Colab notebook provides a complete, executable implementation of the RFantibody pipeline. Fifteen new SARS-CoV-2 spike protein antibodies, generated with the notebook using 100 diffusion steps, can be explored in the interactive viewer at the top of the page. The initial setup installs the necessary computational libraries, including packages for protein structure manipulation and the machine learning frameworks the diffusion model depends on. It also downloads the pre-trained model weights, which encode the patterns learned from thousands of antibody-antigen structures.

The core inputs to the system consist of a target antigen and an antibody framework. For this tutorial, we focus on the receptor-binding domain (RBD) of the SARS-CoV-2 spike protein (PDB: 6M0J), a medically relevant target that has been extensively characterized both structurally and functionally. This domain is responsible for the virus's initial interaction with human cells, making it an ideal target for neutralizing antibodies.

The antibody framework comes from adalimumab (PDB: 4NYL), the active component of Humira. Using a well-characterized therapeutic antibody provides several advantages: the framework has proven stability and developability properties, extensive experimental validation exists for comparison, and the structure represents a clinically validated design template.

The notebook employs a specialized HLT file format, a modified PDB structure where heavy chain, light chain, and target chains are explicitly labeled. This annotation is crucial for the model to distinguish between regions that should be held fixed (the framework and target) and those that should be redesigned (the CDR loops).
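To make the chain-labeling idea concrete, the sketch below relabels the chains of a PDB file and emits them in heavy, light, target order. It assumes the labels are the single-letter chain IDs H, L and T, and it only rewrites the chain column of ATOM/HETATM records. The actual HLT format may carry further annotations (for example, which residues belong to each CDR) that are not reproduced here, and the input chain IDs in the example are made up.

```python
def to_hlt(pdb_lines, heavy, light, target):
    """Relabel three chains as H, L, T and return them in that order."""
    mapping = {heavy: "H", light: "L", target: "T"}
    grouped = {"H": [], "L": [], "T": []}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue                      # skip headers, TER, etc.
        chain = line[21]                  # chain ID lives in PDB column 22
        if chain in mapping:
            new_id = mapping[chain]
            grouped[new_id].append(line[:21] + new_id + line[22:])
    return grouped["H"] + grouped["L"] + grouped["T"]

def atom_line(serial, chain):
    """Build a minimal, column-aligned CA record for the demo."""
    return (f"ATOM  {serial:5d}  CA  ALA {chain}{serial:4d}    "
            "   0.000   0.000   0.000  1.00  0.00           C")

# Hypothetical input: target is chain E, light chain is B, heavy chain is A.
raw = [atom_line(1, "E"), atom_line(2, "B"), atom_line(3, "A")]
hlt = to_hlt(raw, heavy="A", light="B", target="E")
```

After relabeling, any downstream step can tell the three roles apart from the chain column alone, regardless of how the chains were named in the original PDB entries.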

The design process involves several critical parameters that control both the scope and quality of the generated antibodies:

The ppi.hotspot_res parameter specifies the target epitope by identifying specific residues on the antigen surface. This guidance mechanism ensures that the generated antibodies focus on functionally relevant sites rather than arbitrary surface regions. For the SARS-CoV-2 RBD, selecting residues involved in ACE2 receptor binding creates the potential for neutralizing activity.

The antibody.design_loops parameter controls the structural diversity of the generated designs by setting allowed length ranges for each CDR. Natural antibodies show significant variation in loop lengths, particularly for CDR-H3, which can range from 4 to over 20 residues. By sampling from these realistic ranges, the model explores the conformational space accessible to natural immune systems.

The inference.num_designs parameter determines the size of the generated library. Larger libraries increase the likelihood of finding high-quality candidates but require additional computational resources and more sophisticated filtering strategies.

Diffusion-specific parameters (T and final_step) control the noise schedule and generation process. These technical parameters affect the balance between structural diversity and design quality, with longer diffusion processes generally producing more refined structures at the cost of computational time.
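The parameters above are passed as configuration overrides. The following sketch assembles them into a list of `key=value` strings. The key names `ppi.hotspot_res`, `antibody.design_loops` and `inference.num_designs` come from the text. The `diffuser.` and `inference.` prefixes for `T` and `final_step`, the bracketed list syntax, and every value shown (hotspot residues, loop ranges, step counts) are assumptions for illustration and may differ from what the notebook uses.

```python
def format_loops(loop_ranges):
    """{'H3': (5, 13), 'L2': (7, 7)} -> '[H3:5-13,L2:7]'."""
    parts = []
    for name, (low, high) in loop_ranges.items():
        if low > high:
            raise ValueError(f"{name}: minimum length exceeds maximum")
        parts.append(f"{name}:{low}" if low == high else f"{name}:{low}-{high}")
    return "[" + ",".join(parts) + "]"

def build_overrides(hotspots, loop_ranges, num_designs, num_steps, final_step):
    """Collect the design parameters as key=value override strings."""
    return [
        f"ppi.hotspot_res=[{','.join(hotspots)}]",
        f"antibody.design_loops={format_loops(loop_ranges)}",
        f"inference.num_designs={num_designs}",
        f"diffuser.T={num_steps}",               # assumed key prefix
        f"inference.final_step={final_step}",    # assumed key prefix
    ]

# Illustrative values only.
overrides = build_overrides(
    hotspots=["T486", "T493", "T501"],           # placeholder epitope residues
    loop_ranges={"H1": (7, 7), "H2": (6, 6), "H3": (5, 13),
                 "L1": (8, 13), "L2": (7, 7), "L3": (9, 11)},
    num_designs=15,
    num_steps=100,
    final_step=1,
)
```

Keeping the parameters in ordinary Python data structures until the last moment makes it easy to sweep over epitopes or loop-length ranges without hand-editing long command strings.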

During generation, RFantibody operates within a carefully constrained framework. The antibody scaffold and target antigen remain fixed, providing geometric and chemical context for CDR design. The model's task focuses exclusively on generating 3D coordinates for the six CDR loops, exploring the vast conformational space to identify structures that optimize complementarity with the specified epitope.

The diffusion process begins with random noise in place of the CDR coordinates and gradually refines these positions over a series of denoising steps (100 for the designs shown in this post). At each iteration, the model considers the current state of all six loops simultaneously, allowing for cooperative optimization of the binding interface. This holistic approach captures the interdependencies between different CDRs that characterize high-affinity antibody-antigen interactions.
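The division of labor between fixed context and designed loops can be sketched with a toy refinement loop. This is not RFantibody's sampler: `predict_x0` is a stand-in for the trained network, the update rule is a simple interpolation with shrinking noise, and all names are hypothetical. What it does show is the masking pattern, where only positions flagged as designable are initialized from noise and updated.

```python
import numpy as np

def masked_refinement(coords, design_mask, predict_x0, num_steps, rng):
    """Iteratively refine only the masked rows of an (N, 3) coordinate array."""
    x = np.array(coords, dtype=float)
    n_design = int(design_mask.sum())
    x[design_mask] = rng.standard_normal((n_design, 3))   # loops start as noise
    for step in range(num_steps, 0, -1):
        x0_hat = predict_x0(x)                 # model's guess at the clean structure
        frac = 1.0 / step                      # reaches 1.0 on the final step
        noise_scale = 0.1 * (step - 1) / num_steps   # shrinks to 0 at the end
        proposal = x + frac * (x0_hat - x) + noise_scale * rng.standard_normal(x.shape)
        x[design_mask] = proposal[design_mask]  # framework and antigen never move
    return x

rng = np.random.default_rng(0)
start = rng.standard_normal((8, 3))
mask = np.array([False] * 5 + [True] * 3)      # last three positions are "CDR"
target = np.ones((8, 3))                       # toy "clean" structure
result = masked_refinement(start, mask, lambda x: target, 20, rng)
```

Because the whole coordinate array is passed to `predict_x0` at every step, a real network can use the fixed framework and antigen as context while proposing new loop positions.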

Evaluation and Validation Strategies

The primary challenge in computational protein design lies in distinguishing promising candidates from the thousands of generated structures. Experimental validation remains expensive and time-consuming, making effective computational filtering essential for practical applications.

The notebook employs mean predicted Local Distance Difference Test (pLDDT) scores as the primary filtering metric. Originally developed for AlphaFold, pLDDT provides a per-residue confidence estimate for predicted structures (Jumper et al., 2021). Higher pLDDT scores indicate greater model confidence in the local structural environment, suggesting that the designed geometry is consistent with known protein structures.

While pLDDT scores correlate with structural plausibility, they do not directly predict binding affinity or functional activity. A high-confidence structure may still fail to bind effectively due to suboptimal chemical complementarity, inappropriate electrostatic interactions, or other factors not captured by geometric considerations alone. Nevertheless, pLDDT filtering provides valuable initial screening to eliminate obviously problematic designs.
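As a rough sketch of this filtering step, the function below averages per-residue pLDDT values for each design, drops designs under a cutoff, and ranks the rest. The design names, scores, and the cutoff of 75 are made-up toy values; the appropriate threshold (and whether scores run from 0 to 100 or 0 to 1) depends on the prediction model used.

```python
from statistics import mean

def rank_by_mean_plddt(per_residue_scores, min_mean=0.0):
    """Return (name, mean pLDDT) pairs at or above min_mean, best first."""
    ranked = [(name, mean(scores)) for name, scores in per_residue_scores.items()]
    kept = [(name, score) for name, score in ranked if score >= min_mean]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Toy per-residue confidence values for three hypothetical designs.
toy_scores = {
    "design_a": [90.0, 80.0, 70.0],
    "design_b": [60.0, 50.0, 40.0],
    "design_c": [95.0, 95.0, 92.0],
}
kept = rank_by_mean_plddt(toy_scores, min_mean=75.0)
```

A mean over all residues can hide a poorly predicted loop inside an otherwise confident framework, so it can be worth computing the same statistic over the CDR residues alone.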

The final evaluation step involves qualitative analysis through 3D visualization using py3Dmol. This visual inspection serves multiple purposes: it allows assessment of surface complementarity between the designed CDRs and target epitope, identification of potential atomic clashes that would indicate physically unrealistic structures, and evaluation of the chemical environment at the binding interface.

Effective antibody-antigen interactions typically show extensive surface complementarity, with the CDR loops conforming closely to the target surface topology. Large gaps or poor shape matching suggest suboptimal binding potential. Conversely, atomic clashes, where atoms approach closer than their van der Waals radii allow, indicate unstable or impossible structures that require redesign.
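Visual inspection can be backed up with a simple distance check. The sketch below counts atom pairs between two coordinate sets that fall under a distance cutoff, which can serve both as a crude clash count (small cutoff) and a crude contact count (larger cutoff). The cutoffs in the example are arbitrary placeholders; a careful check would use element-specific van der Waals radii.

```python
import numpy as np

def count_close_pairs(coords_a, coords_b, cutoff):
    """Number of (a, b) atom pairs closer than cutoff, in the input units."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    # Pairwise distance matrix of shape (len(a), len(b)) via broadcasting.
    distances = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return int((distances < cutoff).sum())

# Toy coordinates in angstroms: two "CDR" atoms and two "antigen" atoms.
cdr_atoms = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
antigen_atoms = [[1.0, 0.0, 0.0], [14.0, 0.0, 0.0]]
clashes = count_close_pairs(cdr_atoms, antigen_atoms, cutoff=2.0)
contacts = count_close_pairs(cdr_atoms, antigen_atoms, cutoff=5.0)
```

A design with many contacts and no clashes is at least geometrically reasonable, while a design with few contacts hints at the large gaps described above.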

The chemical composition of the interface also provides important insights. Successful designs often show complementary charge distributions, appropriate hydrophobic contacts, and opportunities for hydrogen bonding. While these features can be difficult to optimize explicitly during generation, visual inspection can identify particularly promising or problematic chemical environments.

Future Directions and Limitations

Despite the impressive capabilities demonstrated by RFantibody, several important limitations remain. The model's training data, while extensive, may not capture all possible binding modes or unusual antibody architectures. Designs targeting novel epitopes or employing non-canonical binding mechanisms may fall outside the model's experience.

The evaluation metrics currently available provide limited insight into functional properties beyond structural plausibility. Binding affinity, specificity, stability, and developability all represent crucial characteristics for therapeutic antibodies, but these properties remain difficult to predict computationally with high accuracy.

Experimental validation continues to represent a bottleneck in the design cycle. While computational generation can produce thousands of candidates rapidly, laboratory testing remains essential for confirming designed properties and identifying unexpected issues.

Future developments in computational antibody design will likely focus on several key areas. Improved evaluation metrics that better predict experimental success would dramatically enhance the efficiency of the design process. Integration of additional design objectives, such as stability, specificity, and manufacturability, would produce more developable therapeutic candidates.

The extension to more complex antibody formats represents another important frontier. Current methods focus primarily on conventional antibodies, but therapeutic applications increasingly employ engineered formats such as bispecific antibodies, antibody-drug conjugates, and single-domain antibodies. Adapting generative models to these alternative architectures would expand the scope of computational design.

Conclusion

This walkthrough demonstrates a complete de novo antibody design workflow, from target selection through generation to initial evaluation. By leveraging RFantibody's capabilities within an accessible Colab environment, researchers can explore the cutting edge of computational protein design without requiring extensive technical infrastructure.

The process illustrates both the remarkable capabilities of current generative models and the challenges that remain in translating computational designs into functional therapeutics. The ability to generate structurally plausible antibody candidates against specified epitopes represents a significant advance over traditional discovery approaches, potentially accelerating the development of new treatments for infectious diseases, cancer, and other medical conditions.

The tutorial provides a foundation for further exploration and experimentation. Students and researchers are encouraged to adapt the approach by targeting different epitopes, exploring alternative antibody frameworks, or investigating other antigens (many structures are available in the Protein Data Bank). By engaging with these computational tools, the scientific community can contribute to advancing the next generation of protein-based therapeutics while gaining insights into the fundamental principles governing molecular recognition.

As generative models continue to improve and experimental validation becomes more efficient, computational antibody design will likely play an increasingly central role in therapeutic development. The mathematical foundations already demonstrate the potential for creating novel molecular entities with designed properties, pointing toward a future where therapeutic discovery is limited more by imagination than by the constraints of existing molecular libraries.