Tutorial 1: Training a Diffusion Model¶
This tutorial explains how to configure and run a training job for a diffusion model from scratch. We will focus on using a single configuration file for your experiment to override the project’s default settings.
The “Override-Only” Workflow¶
This project uses a powerful configuration framework, Hydra. The easiest and cleanest way to use it is to keep a single YAML file per experiment in which you define all of your custom settings.
Step 1: Create Your Experiment File¶
Your experiment file is your personal workspace. You can create it anywhere, but for this tutorial, we will create it in the current directory. Start by copying an example template from the project:
```shell
cp configs/example_diffusion_config.yaml my_first_run.yaml
```
Now, open my_first_run.yaml. This is the only file you’ll need to edit.
Step 2: Understand the defaults List¶
The defaults list at the top of the file loads a set of pre-defined “templates” for each part of your experiment (data, model, trainer, etc.) that are bundled with the package.
```yaml
defaults:
  - data: mol_dataset
  - tasks: diffusion
  - logger: default
  - trainer: default
  - _self_
```
Think of these default files as a reference manual. You can find the original base configurations in the configs/ directory of the repository (e.g., in configs/data/, configs/tasks/) to see what parameters are available, but you should not edit them directly. All changes are made in your local my_first_run.yaml.
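Because `_self_` comes last in the defaults list, the values in your experiment file win over the bundled templates. Conceptually, this merge behaves like a recursive dictionary update. Below is a minimal, self-contained sketch: plain dicts stand in for the real Hydra/OmegaConf config objects, and the parameter names and values are illustrative.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override values win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Values loaded from the bundled templates (e.g. configs/trainer/default.yaml)
defaults = {"trainer": {"num_epochs": 100, "lr": 1e-3, "output_path": "outputs"}}

# Your experiment file only lists what you want to change
experiment = {"trainer": {"lr": 2e-4}}

config = deep_merge(defaults, experiment)
print(config["trainer"]["lr"])          # 0.0002
print(config["trainer"]["num_epochs"])  # 100 (untouched defaults survive)
```

This is why you only need to write the handful of parameters you want to change: everything you leave out keeps its template value.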
Step 3: Set Your Key Parameters¶
This is the most important step. You will override the default parameters to configure your specific experiment. Below are the most common parameters you will want to set.
Essential Paths¶
| Parameter | Example Override in `my_first_run.yaml` | Description |
|---|---|---|
| `trainer.output_path` | `output_path: "training_outputs/my_cfg_model"` | CRITICAL: Where all logs and checkpoints are saved. |
| | | The CSV file with molecule information. |
| | | The directory containing |
| | | Path to an ASE database file ( |
Data Processing and Caching¶
The first time you run a training job, the script processes your raw dataset (.xyz files, etc.) into a format suitable for training. This processed data is saved as a file named processed_data_{dataset_name}.pt inside the directory specified by data.root. On subsequent runs, if this file exists, it will be loaded directly to save time.
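The load-or-process logic described above can be sketched as follows. This is a conceptual illustration, not the project's actual code: the helper names are hypothetical, and only the `processed_data_{dataset_name}.pt` naming scheme comes from the paragraph above.

```python
import os

def processed_cache_path(root: str, dataset_name: str) -> str:
    # Naming scheme from above: processed_data_{dataset_name}.pt inside data.root
    return os.path.join(root, f"processed_data_{dataset_name}.pt")

def load_or_process(root: str, dataset_name: str, process_fn):
    """Return cached processed data if present, else run the slow processing step.

    `process_fn` stands in for the raw-dataset processing (.xyz parsing etc.);
    the real project presumably uses torch.save/torch.load for the .pt file.
    """
    cache = processed_cache_path(root, dataset_name)
    if os.path.exists(cache):
        return f"loaded {cache}"  # placeholder for torch.load(cache)
    data = process_fn()           # slow first-time processing
    return data                   # real code would also cache: torch.save(data, cache)

print(processed_cache_path("data/my_dataset", "my_dataset_v1"))
```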
| Parameter | Example Override | Description |
|---|---|---|
| `data.root` | | The directory where processed dataset files are stored. |
| `data.dataset_name` | | A unique name for your processed dataset. This becomes part of the saved filename (`processed_data_{dataset_name}.pt`). |
| | | Sets the maximum molecular size (number of atoms). Larger molecules will be discarded. If not specified, the maximum size is determined automatically by scanning the dataset, which can be slow. |
Best Practice: Set a descriptive `dataset_name` for each new dataset you work with. This ensures that you can easily manage and reuse your processed data without accidentally overwriting or loading the wrong file.
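In your experiment file, these overrides sit under the `data:` key. The values shown here are illustrative placeholders, not defaults from the project:

```yaml
data:
  root: "data/my_dataset"        # processed_data_{dataset_name}.pt is cached here
  dataset_name: "my_dataset_v1"  # pick a new name for each dataset variant
```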
Core Training Hyperparameters¶
| Parameter | Example Override | Description |
|---|---|---|
| `trainer.num_epochs` | `num_epochs: 200` | How long to train. |
| `trainer.lr` | `lr: 0.0002` | The learning rate. |
| `data.batch_size` | `batch_size: 64` | Number of molecules per batch. |
| `seed` | `seed: 42` | Top-level parameter for reproducibility. |
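In `my_first_run.yaml`, these overrides look like the following (the values are taken from the worked example in Step 4):

```yaml
trainer:
  num_epochs: 200  # how long to train
  lr: 0.0002       # learning rate

data:
  batch_size: 64   # molecules per batch

seed: 42           # top-level, for reproducibility
```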
Model & Task Hyperparameters¶
This section defines the model architecture and the specifics of the diffusion task. You can configure several training modes:

- Unconditional Mode: The model learns the general distribution of molecules. To use this mode, ensure `tasks.condition_names` is an empty list (`[]`).
- Conditional Mode: The model learns to generate molecules given certain properties (e.g., energy, size). To use this mode, you must provide a list of property names in `tasks.condition_names` and ensure your dataset contains columns with these exact names.
- Classifier-Free Guidance (CFG) Training: A special case of conditional training. Set `tasks.context_mask_rate > 0` (e.g., `0.1`). This randomly hides the condition during training, enabling CFG during generation.
- Self-Paced Learning: A curriculum learning strategy where the model learns from "easier" examples first. Enable it with `tasks.sp_regularizer_deploy: True`.
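The CFG masking described above amounts to a per-sample coin flip. The sketch below is a conceptual illustration, not the project's actual implementation:

```python
import random

def maybe_mask_condition(condition, context_mask_rate: float):
    """With probability `context_mask_rate`, hide the condition for this sample.

    Masked samples train the model's unconditional branch, which is what
    makes classifier-free guidance possible at generation time.
    """
    if random.random() < context_mask_rate:
        return None  # condition hidden for this training sample
    return condition

# rate 0.0 never masks; rate 1.0 always masks
print(maybe_mask_condition({"S1_exc": 3.1}, 0.0))  # {'S1_exc': 3.1}
print(maybe_mask_condition({"S1_exc": 3.1}, 1.0))  # None
```

With `context_mask_rate: 0.1`, roughly 10% of training samples see no condition, so a single model learns both the conditional and unconditional distributions.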
Key Model & Task Parameters to Override:
| Parameter | Example Override | Description |
|---|---|---|
| `tasks.condition_names` | `condition_names: ["S1_exc", "T1_exc"]` | List of property names for conditional training. Leave empty (`[]`) for unconditional training. |
| `tasks.context_mask_rate` | `context_mask_rate: 0.1` | (Conditional Only) Probability of masking the condition. |
| `tasks.hidden_size` | `hidden_size: 256` | The main dimension/width of the model. |
| | | The number of layers (depth) in the model. |
| | | Number of steps in the diffusion process. |
| `tasks.sp_regularizer_deploy` | `sp_regularizer_deploy: True` | Set to `True` to enable Self-Paced Learning. |
| | | The pacing function. Options: |
| | | A key parameter that controls the learning pace. |
Experiment Logging¶
You can control how results are logged by overriding parameters under the logger: key. The most important choice is whether to log to local files or to Weights & Biases (wandb).
To switch between loggers, modify the defaults list in my_first_run.yaml:
- For simple local file logging: keep `- logger: default` in the `defaults` list.
- For Weights & Biases: use `- logger: wandb` instead.
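For example, a `defaults` list that selects the W&B logger (matching the worked example in Step 4):

```yaml
defaults:
  - data: mol_dataset
  - tasks: diffusion
  - logger: wandb      # switched from "default" to log to Weights & Biases
  - trainer: default
  - _self_
```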
Key Logging Parameters to Override:
| Parameter | Example Override | Description |
|---|---|---|
| | | How often (in training steps) to log metrics like loss. |
| `logger.project_wandb` | `project_wandb: "My_Diffusion_Project"` | (W&B only) The name of the project on your W&B dashboard. |
| `name` | `name: "my_cfg_model_training"` | The top-level experiment name. |
Step 4: Putting It All Together¶
Here is what a complete my_first_run.yaml for a CFG-ready conditional model might look like:
```yaml
# Inherit from the default templates
defaults:
  - data: mol_dataset
  - tasks: diffusion
  - logger: wandb
  - trainer: default
  - _self_

# Set top-level experiment parameters
name: "my_cfg_model_training"
seed: 42

# Override Essential Paths and Hyperparameters
trainer:
  output_path: "training_outputs/my_cfg_model"
  num_epochs: 200
  lr: 0.0002

logger:
  project_wandb: "My_Diffusion_Project"

data:
  batch_size: 64

tasks:
  condition_names: ["S1_exc", "T1_exc"]  # Specify properties from our dataset
  context_mask_rate: 0.1                 # Enable CFG training
  hidden_size: 256
```
Step 5: Run Your Training¶
Launch the training using the MolCraftDiff command-line tool. Provide the train command followed by the name of your configuration file.
Command:
```shell
MolCraftDiff train my_first_run
```
The tool will automatically find my_first_run.yaml in your current directory, build the full configuration from your defaults and overrides, and start the training. All results will be saved in the trainer.output_path you specified.
Next Steps: Generation¶
Once your diffusion model is trained, what’s next? You can use it to generate new 3D molecules!
Check out the generation and guidance tutorials to learn how to deploy your trained models.