ABSTRACT
Biomolecular structure drives function, and computational capabilities have progressed such that the prediction and computational design of biomolecular structures is increasingly feasible. Because computational biophysics attracts students from many different backgrounds and with different levels of resources, teaching the subject can be challenging. One strategy to teach diverse learners is with interactive multimedia material that promotes self-paced, active learning. We have created a hands-on education strategy with a set of 16 modules that teach topics in biomolecular structure and design, from fundamentals of conformational sampling and energy evaluation to applications, such as protein docking, antibody design, and RNA structure prediction. Our modules are based on PyRosetta, a Python library that encapsulates all computational modules and methods in the Rosetta software package. The workshop-style modules are implemented as Jupyter Notebooks that can be executed in the Google Colaboratory, allowing learners access with just a Web browser. The digital format of Jupyter Notebooks allows us to embed images, molecular visualization movies, and interactive coding exercises. This multimodal approach may better reach students from different disciplines and experience levels, as well as attract more researchers from smaller labs and cognate backgrounds to leverage PyRosetta in science and engineering research. All materials are freely available at https://github.com/RosettaCommons/PyRosetta.notebooks.
I. INTRODUCTION
Structural models of proteins and other biomolecules help explain their functions and properties. Methods for computational structure prediction (i.e., protein folding and docking, as well as interactions with nucleic acids, carbohydrates, and other biomolecules) have been successful in many cases and are certainly useful for driving structural and functional research hypotheses (1). Design of biomolecules (i.e., protein design, prediction of mutational effects, and molecular complex design) has also exhibited many successes, with potential impacts in medicine, biology, biotechnology, materials, and chemistry (2). Thus, there is a need to disseminate these interdisciplinary methods to a broader audience. Here, we present a set of workshops for teaching or self-study of biomolecular structure prediction and design. The use of student feedback and course evaluations in this study was reviewed and approved by the Homewood Institutional Review Board at Johns Hopkins University (HIRB 11185).
II. SCIENTIFIC AND PEDAGOGIC BACKGROUND
Computational methods are a relatively inexpensive way to predict and manipulate biomolecular structures, especially when experimental methods prove difficult. There is a long history in biophysics of using computational modeling to better understand structure, dynamics, and function. In fact, the 2013 Nobel Prize in Chemistry was awarded for the pioneering contributions in quantum and molecular mechanics of complex chemical systems (3). There are now many available dynamic simulation tools for observing the behavior of biomolecules over time and predicting thermodynamic and kinetic properties from estimates of the system's partition function. Some of these tools include CHARMM, Schrödinger software suite, Molecular Operating Environment (MOE), NAMD, Amber, and Gromacs (4–9). A complementary approach to model biomolecules is with so-called structure prediction approaches. Instead of seeking a full description of all the states and kinetic rates of the system, these approaches seek the dominant, low-energy conformational state that is most relevant in biologic conditions (10). These methods often accelerate calculations with approximations, such as constant bond lengths and angles, implicit solvent models, and empirically tuned energy functions. In exchange for these approximations, structure prediction approaches can capture the structure of large biomolecules in equilibrium without necessitating simulations over long timescales. These approaches are fundamentally based on optimization of an energy function in a very large conformational space. The same algorithmic components can then be used in reverse to design biomolecules by optimizing the energy function across different biomolecular sequences.
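The optimization view described in the preceding paragraph can be written compactly. The notation in this sketch is an assumption introduced for illustration and does not come from the original text: s is a sequence, x a conformation, \mathcal{X} the conformational space, \mathcal{S} the set of candidate sequences, and E the energy function.

```latex
% Structure prediction: the sequence s is fixed; search over conformations x.
x^{*}(s) = \operatorname*{arg\,min}_{x \in \mathcal{X}} E(x, s)

% Design: the same energy function, now also optimized over sequences s.
s^{*} = \operatorname*{arg\,min}_{s \in \mathcal{S}} \; \min_{x \in \mathcal{X}} E(x, s)
```

In practice, design protocols often restrict the conformational search, for example to a fixed backbone; that restriction is omitted here for brevity.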
One leading structure prediction and design software suite is Rosetta, a collection of algorithms for protein structure prediction, docking, and design (10–13), as well as protein interactions with small molecules (14), nucleic acids (15), and carbohydrates in solution or in a lipid bilayer (16). Rosetta has been a scientific leader in several blind structure prediction challenges (17–21) and has shown proof of principle for many design goals, including de novo folds (22–24), loop design, interface design (25–28), symmetric assembly (29, 30), and mineral binding (31, 32). In addition to its success in science and engineering, Rosetta is suited for teaching structure prediction and design for several reasons. The Rosetta methods are available as a Python library called PyRosetta (33), which makes them easier to learn and combine with other scientific code libraries. PyRosetta allows access to low-level data and has a range of prebuilt protocols for many tasks in biophysical research. Students can measure and manipulate protein conformations, dock proteins and small molecules, run folding algorithms, and explore other emerging topics in biomolecular structure prediction and design, such as RNA modeling and noncanonical amino acids. Furthermore, students can learn how to use these tools by creating and testing their own algorithms.
For about a decade now, structure prediction and design has been taught with PyRosetta, primarily through the use of a set of workshops that are available both as a printed book (34) and as downloadable Portable Document Format files (35). These workshops have been used to teach a course for undergraduate and graduate students at Johns Hopkins University for over 10 years and intermittently at other schools, including the Massachusetts Institute of Technology, Stanford University, The University of Kansas, and the University of North Carolina. Workshops have been downloaded over 120,000 times (several tutorials over 1,000 times per year), and a complementary set of online lecture videos has registered over 14,000 views, reflecting a fast-growing interest in biomolecular structure prediction and design. In addition, these workshops have been an important resource for the Rosetta community, with the workshops being the primary learning tool for many now senior core developers.
Despite the strong demand for educational resources, there have been several challenges in teaching with these materials. One problem in this interdisciplinary field has been how to train students at all levels and with different skill sets. To address this challenge, the RosettaCommons has established several programs and resources for students and researchers who are interested in Rosetta and PyRosetta, such as the PyRosetta and C++ code academies and the Rosetta research experience for undergraduates (36). Other salient resources from the Rosetta community include Extensible Markup Language (XML) documentation (37), the Rosetta user guide (38), code manual (39), and the active, managed user forum (40). Although these resources have helped expose the field to a broad audience, hurdles remain. Learning new software can be challenging for beginners, and the available beginner academies have limits on the annual cohort size. In addition, most of the resources currently available are in the form of code documentation or static text, which lacks the interactive components that would enable active learning. Multimodal environments (e.g., including visualizations in addition to text) enhance students' mental representations of fundamental concepts (41, 42), and in at least one coding class, interactive Web-based content increased student time engaging with material and improved quiz scores (43).
In addition, there are technologic challenges that pose problems for new learners. Because the capabilities of Rosetta are constantly changing and expanding, as methods are modified and new algorithms added, educational resources need to evolve in parallel with the main Rosetta software. Since the original PyRosetta workshops, some commands and protocols have been deprecated or replaced, and many new frameworks have become standard. Static text workshops have been difficult to maintain because they require manual testing and updating. A related challenge is that PyRosetta is difficult to configure on Windows, making the barrier to entry for some beginners and self-learners prohibitively high.
In this work, we describe our latest contribution to address the pedagogic and technologic limitations of previous educational resources by creating an accessible, multimedia platform for teaching biomolecular structure prediction and design methods with PyRosetta. Our solution combines the accessibility of Jupyter Notebooks, a shareable Web application that supports live code, equations, visualization, and text (44), with the free computing power of Google Colaboratory (45) to develop a way for students of all experience levels to use PyRosetta on the cloud. Starting with our existing static workshops, we created a new, expanded set of interactive, multimedia Jupyter Notebooks with coding examples and conceptual questions that engage students with the material and let them test their understanding. We discuss in the following how this approach may improve engagement and retention and how the technical implementation removes barriers to entry and enables the materials to stay current with emerging Rosetta methods.
III. RESULTS
A. PyRosetta workshops cover a broad range of basic and advanced topics
To make a broad range of topics in the field accessible to the public, we have created a diverse set of PyRosetta workshops within Jupyter Notebooks (i.e., PyRosetta notebooks) and shared them in a public, open-source GitHub repository (https://github.com/RosettaCommons/PyRosetta.notebooks). These notebooks aim to teach both the fundamentals and the applications of biomolecular structure prediction and design. The set of notebooks currently includes 16 modules and is split into two parts. Part I introduces the basics of PyRosetta (Chapters 2 to 9), and Part II explores advanced applications (Chapters 10 to 16), such as antibody design and membrane protein modeling (Fig 1). Chapter 1 walks students through the process of setting up PyRosetta in Google Colaboratory with step-by-step instructions and guiding screenshots.
Part I focuses on the two main scientific capabilities of Rosetta: sampling and scoring biomolecular conformations. The notebooks explain the technical basics of PyRosetta, starting with how to create a Pose object, which is the container that holds all atoms, molecules, coordinates, energies, and other details about the system. Next, students learn how to make a ScoreFunction object to approximate free energies and how to combine Mover objects to manipulate the Pose conformations. In addition to these technical skills, Part I builds the fundamental theoretic concepts that frame the challenges of sampling and scoring conformations. Students are introduced to the Levinthal paradox, the idea that the conformational space available to proteins is exponentially large and thus impossible to search comprehensively (46). They also learn about Anfinsen's dogma, the idea that a folded protein is at a thermodynamic minimum free energy state (47). The workshops teach students how to use various potential functions, which can be physics based (van der Waals, Coulomb) or knowledge based (hydrogen bonding, side-chain energies). Students learn that these functions can be empirically optimized for protein-scale phenomena, such as folding and design. Most exercises in Part I are short and provide detailed guidance. Moving GIF animations, schematics, and images of biomolecules (created in PyMOL software) are used to illustrate general concepts (48). For example, students are guided through the application of a TrialMover, which tests a conformational change, evaluates the new energy, and uses the Metropolis Monte Carlo criterion to either accept or reject the change (49). In addition to general concepts, some visualizations also depict expected outcomes, such as a PyMOL movie of a basic folding algorithm. The learning objectives for the workshops in Part I can be found in Table 1.
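The propose, evaluate, and accept-or-reject cycle that a TrialMover performs can be illustrated without PyRosetta. The following is a minimal sketch in plain Python, not the PyRosetta implementation: the one-dimensional toy_energy function, the step size, and the temperature kT are hypothetical choices made only for illustration.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng):
    """Metropolis criterion: always accept downhill moves; accept uphill
    moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def toy_energy(x):
    # Hypothetical 1-D "energy landscape" with its minimum at x = 2.
    return (x - 2.0) ** 2


def run_trials(n_steps=2000, kT=1.0, step=0.5, seed=0):
    """Repeat the trial cycle and return the lowest-energy state visited."""
    rng = random.Random(seed)
    x = 10.0                      # arbitrary starting "conformation"
    e = toy_energy(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)   # propose a small change
        e_new = toy_energy(x_new)              # evaluate the new energy
        if metropolis_accept(e_new - e, kT, rng):
            x, e = x_new, e_new                # accept the change
            if e < best_e:
                best_x, best_e = x, e
        # on rejection the previous state is simply kept
    return best_x, best_e


best_x, best_e = run_trials()
print(f"lowest-energy state found: x = {best_x:.3f}, E = {best_e:.4f}")
```

Occasionally accepting uphill moves is what lets the search escape local minima; lowering kT makes the procedure behave more like a strict downhill minimizer.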
Part II guides learners through advanced applications of PyRosetta, relying on the basic skills and concepts introduced in Part I. Chapter 12, for example, explores how PyRosetta can be used to model and design antibodies, which is an important challenge faced by pharmaceutic companies (50). In Chapter 14, students learn how to apply the same approaches to predict RNA structures, which are increasingly recognized for critical roles in catalysis and regulation (51). Chapter 15 explores the tools for investigating membrane proteins, which account for approximately 60% of drug targets (52). A larger emphasis is placed on workshop exercises to introduce learners to a variety of questions and methods that are currently used in the field. For advanced students, Chapter 16 reviews more intensive tasks that can be executed outside of Google Colaboratory, such as parallelization with GNU parallel (www.gnu.org) and dask libraries (53). The learning objectives for the workshops in Part II can be found in Table 2.
B. Students can access the multimedia PyRosetta workshops on the Google Colaboratory platform
Google Colaboratory is an online Web environment for Jupyter Notebooks on a cloud-based virtual machine, accessible with any browser. Google Colaboratory provides students with powerful computational resources, including 13 GB of random-access memory, 33 GB of disk space, a 2.30-GHz central processing unit, and continuous sessions of up to 12 h (45). Although Jupyter Notebooks have been used for engineering education (54, 55), Google Colaboratory offers a few advantages for studying biomolecular modeling, starting with free cloud computing power. Students can complete most of the PyRosetta notebooks in the Google Colaboratory environment (Fig 2). They can open notebook files and store different versions directly in Google Drive. The initial configuration of the PyRosetta software package in Google Colaboratory is automated and takes approximately 10 min. Afterwards, students simply import the supporting pip package pyrosettacolabsetup (56) and the configured PyRosetta package. Students can complete the provided exercises to build their own solutions and modify any line of code in the workshops, which pair introductory passages, concepts, and exercises with supporting PyMOL images, movies, and diagrams (Fig 3).
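A first notebook cell after setup might resemble the sketch below, which checks whether PyRosetta is importable and, if so, builds a Pose from a sequence and scores it with the default full-atom ScoreFunction. It is a sketch under stated assumptions: the helper names module_available, start_pyrosetta, and score_sequence are hypothetical, the peptide sequence is arbitrary, and the pyrosettacolabsetup call shown in the comments reflects one version of that package and may differ in others.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def start_pyrosetta():
    """Initialize PyRosetta if it is installed; return True on success."""
    if not module_available("pyrosetta"):
        # In Google Colaboratory one would first run something like the
        # following (assumption: the helper name may vary by version):
        #   !pip install pyrosettacolabsetup
        #   import pyrosettacolabsetup
        #   pyrosettacolabsetup.install_pyrosetta()
        return False
    import pyrosetta
    pyrosetta.init("-mute all")   # start Rosetta with quiet logging
    return True


def score_sequence(sequence):
    """Build an extended-chain Pose from a sequence and score it."""
    import pyrosetta
    pose = pyrosetta.pose_from_sequence(sequence)
    scorefxn = pyrosetta.get_fa_scorefxn()   # default full-atom ScoreFunction
    return pose.total_residue(), scorefxn(pose)


if start_pyrosetta():
    n_res, energy = score_sequence("ACDEFGHIK")   # arbitrary test peptide
    print(f"{n_res} residues, total score {energy:.2f}")
else:
    print("PyRosetta is not installed; skipping the scoring demo.")
```

Guarding the import keeps the cell from failing outright on a machine where the setup step has not yet been run.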
C. Jupyter Notebooks enable features for students and instructors
To create both student and instructor versions of assignments in the notebooks, we incorporated nbgrader (57). The nbgrader module enables developers and instructors to create and maintain a single master copy of each workshop. The master copy includes solutions to all exercises, and the student version of the workshop is automatically generated without selected solutions (Fig 4). Thus, developers can write PyRosetta coding examples and problems for students to attempt on their own. To help students locate examples of specific concepts and commands, we also incorporated nbpages (58), which enables the automatic generation of the table of contents and a searchable keyword index in notebook and markdown form (Fig 5). These tools are activated by the provided make-student-nb.bash script, which developers can use to update the student notebooks, table of contents, and keywords index with a single command.
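The idea of generating a student notebook from a single master copy can be sketched in a few lines of plain Python. This is not nbgrader's own code: it only mimics the marker convention in which solution lines are wrapped between BEGIN SOLUTION and END SOLUTION comments and replaced by a stub. The function names and the example cell are hypothetical.

```python
import copy

BEGIN = "### BEGIN SOLUTION"
END = "### END SOLUTION"
STUB = "# YOUR CODE HERE\nraise NotImplementedError()"


def strip_solutions(source):
    """Replace each marked solution region in a code cell with a stub."""
    out, in_solution = [], False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            in_solution = True
            # keep the indentation of the marker so the stub stays valid
            indent = line[: len(line) - len(line.lstrip())]
            out.extend(indent + stub_line for stub_line in STUB.splitlines())
        elif stripped == END:
            in_solution = False
        elif not in_solution:
            out.append(line)
    return "\n".join(out)


def make_student_notebook(master):
    """Return a copy of a notebook dict with solutions and outputs removed."""
    student = copy.deepcopy(master)
    for cell in student["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            text = "".join(src) if isinstance(src, list) else src
            cell["source"] = strip_solutions(text)
            cell["outputs"] = []
            cell["execution_count"] = None
    return student


# Hypothetical master cell containing one exercise with its solution.
master = {
    "cells": [{
        "cell_type": "code",
        "source": ("def chain_length(pose):\n"
                   "    ### BEGIN SOLUTION\n"
                   "    return pose.total_residue()\n"
                   "    ### END SOLUTION\n"),
        "outputs": [],
        "execution_count": 1,
    }],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}
student = make_student_notebook(master)
print(student["cells"][0]["source"])
```

Because the student copy is derived mechanically, fixes made to the master propagate to both versions the next time the generation step is run.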
Instructors can make changes to the original set of Jupyter Notebook workshops by forking the main public repository (59). This allows instructors to tailor the workshops for specific curriculums. In addition, any changes to the repository files that would benefit the public can be incorporated directly into the main repository via GitHub pull requests, which can be reviewed and approved by a RosettaCommons member. Figure 6A shows a workflow for instructors who simply want to use the workshops in courses, and Figure 6B illustrates a workflow for instructors who wish to make changes to the material.
Further, in Chapter 2, we showcase the ability to visualize macromolecules directly within the Jupyter notebooks by using py3Dmol, a Web-based Jupyter widget encompassing an interactive 3Dmol.js molecular viewer (60). The py3Dmol bindings (in the pyrosetta.distributed.viewer namespace) facilitate on-the-fly, interactive visualization of PyRosetta ResidueSelector objects, which allow students to choose subsets of residues on the basis of the sequence, chemistry, or structural properties (Fig 3C). For those who install PyRosetta on a local computer, the motions of a protein in a protocol can be watched in an external PyMOL window by using the PyRosetta PyMOLObserver (61).
Chapter 16 demonstrates how to scale up simulations to high-performance computing resources by using the Slurm workload manager (https://slurm.schedmd.com/) with GNU parallel and the dask and distributed modules (53, 62). We additionally introduce the pyrosetta.distributed.dask namespace for PyRosetta integration with the dask-jobqueue module. This namespace provides a user-friendly interface for PyRosetta preinitialization of worker machines, allowing options-based configuration of macromolecular modeling tasks in distributed computing and cloud computing environments. These developments will enable future pedagogic programs to encompass advanced macromolecular modeling exercises and allow for additional educational content to be added with ease.
D. Learning outcomes
We piloted early versions of the notebooks in 4 separate teaching contexts. They were used in a formal university course in spring 2019 (a combined graduate and undergraduate elective), in a code academy for new graduate students and postdoctoral fellows in RosettaCommons labs, and in a 1-week code school for a class of undergraduate summer interns. Finally, we have shared the GitHub link with several individuals learning PyRosetta on their own.
In the spring of 2019, the formal university course, ChemBE414/614: Protein Structure Prediction and Design, enrolled 9 undergraduate students and 11 master's or PhD students. Over the years, students who have taken this course have come from departments of chemical and biomolecular engineering, biomedical engineering, biophysics, chemistry, computer science, and applied math. In the precourse survey, roughly half (45%) of the spring 2019 class indicated that they had “good” or “expert-level” familiarity with Python. Nearly all of these students had programming experience in some language prior to the course. However, experience levels varied greatly in programming, biology, chemistry, and math. Following the course, students gave high reviews (quality of course 4.65 of 5.00 and teaching effectiveness 4.53 of 5.00). The biomolecular computation skills gained by students were evidenced by a range of successful course projects on topics including “Structure-Based Prediction of Peptide-MHC Binding,” “Finding the Relationship between Epistasis and Score in Sequentially Mutated TEM-1 β-Lactamase,” and, on the methodologic side, “Comparison of Optimization Methods Used in Protein Structure Prediction.” This course used the notebooks on local Linux (https://www.linux.org) or Mac (Apple, Inc., Cupertino, CA) installations, without the use of the Google Colaboratory platform. In the pilot, multiple students mentioned in their course evaluations the technical challenges of using PyRosetta on their own computers.
Similar to ChemBE414/614, code academy trainees varied in programming and scientific experience. In the precourse survey, 7 of 19 trainees (37%) indicated that they had “very little to no programming experience.” After the course, trainees were asked to respond to a postcourse survey. When asked whether Jupyter Notebooks were effective teaching tools, 13 of 15 respondents selected “agree” or “strongly agree.” Furthermore, all respondents agreed or strongly agreed that the course gave them confidence to write more advanced protocols for research. Code academy trainees completed miniprojects such as “Antibody Design for Ebola” and “Modeling Intrinsically Disordered Proteins for Cell Signaling.”
The summer interns went on to complete successful research projects in 10 different academic labs and one industry research site (36). Finally, some self-paced learners who tested the complete multimedia workshops shared comments including: “these notebooks make PyRosetta more approachable to non-experts,” “you can install PyRosetta in your Google Drive and use it from many different machines,” and “attempting problems myself allowed me to pinpoint gaps in my understanding.”
IV. DISCUSSION
Protein structure prediction and design tools are powerful and have the potential to impact biophysics and many cognate disciplines, but students face several challenges, including access to the tools and widely varied backgrounds. Here, we have described a set of interactive notebooks for learning biomolecular structure prediction and design that can be used in a classroom context or for individual self-study. Educators within the Rosetta community have already used these notebooks extensively, and we hope that educators teaching high school, undergraduate, and graduate courses will also benefit from using and adapting these notebooks. A good starting point for new instructors to develop the necessary background to teach these workshops is a recent review article by Kuhlman and Bradley (63). New instructors can complete the workshops themselves and read the associated primary literature linked within each notebook.
Students are advised to have some familiarity with basic Python capabilities, including creating and calling variables, functions, and classes. For a classroom setting, reviewing these skills prior to attempting the workshops may be beneficial. In ChemBE414/614 at Johns Hopkins, the instructor spent 1 week reviewing the necessary skills and assigned 1 homework exercise for practicing Python.
One of the advantages of this platform is that it is free and publicly available on GitHub (https://github.com/RosettaCommons/PyRosetta.notebooks). Another advantage is that PyRosetta can be accessed through the Google Colaboratory online in a Web browser, which requires no local computer installation and can quickly integrate open-source packages. In addition to its accessibility, Google Colaboratory provides students with free and powerful computational resources (45). These advantages address the technologic challenges of the existing resources for PyRosetta. This online platform also provides an environment for multimodal learning material, such as molecular visualization movies and coding examples. With a broad scope of topics from contributors of different areas of expertise, students are also able to gain exposure to the different applications of PyRosetta and develop the skills to pursue more in-depth applications. For instructors, this set of modules can be easily adapted to a course syllabus by modifying workshops or adding relevant examples.
In addition to the advantages, the platform has some limits. The Google Colaboratory platform complicates communication with the visualization software PyMOL (48), and we have so far been unable to make this connection simple. The PyMOLObserver remains the archetypal tool for real-time visualization of PyRosetta modeling trajectories (61), and students with a local installation of PyRosetta can still view the algorithms in real time in PyMOL on the local computer. Within the Google Colaboratory, the pyrosetta.distributed.viewer (with py3Dmol bindings) currently supports dynamic visualization updates upon biomolecular conformational changes, which is convenient for viewing intermediate steps of biomolecular modeling tasks, such as between PyRosetta movers, directly within Jupyter Notebooks. Although the pyrosetta.distributed.viewer mimics only a subset of PyMOL functionalities, it accepts ResidueSelector-based user inputs, thus allowing a more streamlined interface to interactive biomolecular modeling and design. However, because py3Dmol does not have multithreading or communication between threads, Google Colaboratory users cannot continuously update an instance of the 3Dmol.js molecular viewer as a PyRosetta protocol trajectory is calculated.
Our notebooks may be compared with other educational materials for computational molecular biophysics. There are several textbook-style software resources, such as the beginner's guide to CHARMM (64) and the Web-based lessons on CHARMM (65). Recently, MOE has been used for an integrated engineering curriculum (66). Additionally, Foldit (67) and EteRNA (68) have been used in an interdisciplinary week-long program for undergraduate and high school students (69). Our contribution of PyRosetta notebooks is complementary to these and has advantages with its active involvement of students, multimedia integration, and engagement with viable and leading tools that can be used flexibly in new, innovative research. In addition, unlike many other electronic workshop materials, the PyRosetta notebooks are used to automatically test new versions of the PyRosetta software. Each Jupyter Notebook is converted into a simple Python script that is run continually on the community's testing servers, and any malfunctions from new or modified code must be fixed before that code is accepted into the main Rosetta repository. The automated notebook testing and GitHub pull request practices help ensure that the workshops remain functional for users.
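The notebook-to-script testing idea can be sketched with the Python standard library alone. The code below is an illustrative stand-in and not the tooling used on the community's testing servers: it reads the notebook JSON, keeps only the code cells, drops shell escapes and IPython magics, and runs the result so that any exception signals a broken notebook. The function name notebook_to_script and the demo notebook are hypothetical.

```python
import json


def notebook_to_script(nb_json):
    """Concatenate the code cells of a notebook (JSON text) into a script."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue                      # skip markdown and raw cells
        src = cell.get("source", "")
        text = "".join(src) if isinstance(src, list) else src
        # Drop lines that only work inside Jupyter ("!pip ...", "%matplotlib").
        lines = [ln for ln in text.splitlines()
                 if not ln.lstrip().startswith(("!", "%"))]
        if lines:
            chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


# Hypothetical three-cell notebook used only to exercise the function.
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code",
         "source": ["!pip install something\n", "values = [1, 2, 3]\n"]},
        {"cell_type": "code", "source": "total = sum(values)"},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})

script = notebook_to_script(demo)
namespace = {}
# Running the script acts as the test: an exception means the notebook broke.
exec(compile(script, "<notebook>", "exec"), namespace)
print(script)
```

A continuous-integration job could apply the same steps to every notebook in the repository and fail the build on the first exception.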
Overall, the PyRosetta notebooks are designed to be a gateway tool to introduce students to the fundamentals of biomolecular structure prediction and design. This platform could potentially be used in high school lab courses; science, technology, engineering, and mathematics summer programs; advanced undergraduate courses; and graduate courses. We also encourage students to seek additional support from the distributed, collaborative network by posting on the RosettaCommons forum (https://www.rosettacommons.org/forum). Moreover, Google Colaboratory is compatible with TensorFlow (https://www.tensorflow.org/) and is heavily used for developing machine learning models because of the free graphics processing unit resources. In the future, machine learning could be incorporated with the capabilities of PyRosetta to explore new areas of research in the field (70). Furthermore, the GitHub archive is a living collection that will continually expand to include other applications of PyRosetta and aspects of macromolecular modeling to introduce a broader audience to the field.