Tutorial 1: Training a Diffusion Model

This tutorial explains how to configure and run a training job for a diffusion model from scratch. We will focus on using a single configuration file for your experiment to override the project’s default settings.

The “Override-Only” Workflow

This project uses Hydra, a powerful configuration framework. The easiest and cleanest way to use it is to keep one single YAML file per experiment in which you define all of your custom settings.

Step 1: Create Your Experiment File

Your experiment file is your personal workspace. You can create it anywhere, but for this tutorial, we will create it in the current directory. Start by copying an example template from the project:

cp configs/example_diffusion_config.yaml my_first_run.yaml

Now, open my_first_run.yaml. This is the only file you’ll need to edit.

Step 2: Understand the defaults List

The defaults list at the top of the file loads a set of pre-defined “templates” for each part of your experiment (data, model, trainer, etc.) that are bundled with the package.

defaults:
  - data: mol_dataset
  - tasks: diffusion
  - logger: default
  - trainer: default
  - _self_

Think of these default files as a reference manual. You can find the original base configurations in the configs/ directory of the repository (e.g., in configs/data/, configs/tasks/) to see what parameters are available, but you should not edit them directly. All changes are made in your local my_first_run.yaml.
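Based on the defaults above, the relevant part of the repository's configs/ directory looks roughly like this (the exact filenames may differ in your version of the repository):

configs/
├── data/
│   └── mol_dataset.yaml
├── tasks/
│   └── diffusion.yaml
├── logger/
│   ├── default.yaml
│   └── wandb.yaml
└── trainer/
    └── default.yaml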

Step 3: Set Your Key Parameters

This is the most important step. You will override the default parameters to configure your specific experiment. Below are the most common parameters you will want to set.

Essential Paths

| Parameter | Example Override in my_first_run.yaml | Description |
| --- | --- | --- |
| trainer.output_path | trainer: {output_path: "results/my_run"} | CRITICAL: Where all logs and checkpoints are saved. |
| data.filename | data: {filename: "molecules.csv"} | The CSV file with molecule information. |
| data.xyz_dir | data: {xyz_dir: "xyz_files/"} | The directory containing .xyz geometry files. |
| data.ase_db_path | data: {ase_db_path: "data/qm9.db"} | Path to an ASE database file (.db) or a directory containing .db files. An alternative to data.filename and data.xyz_dir. |
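In your own file, these flow-style overrides merge under a single block per top-level key. A minimal sketch for the CSV + .xyz route (the paths are placeholders for your own data):

trainer:
  output_path: "results/my_run"   # all logs and checkpoints land here

data:
  filename: "molecules.csv"       # CSV with molecule information
  xyz_dir: "xyz_files/"           # directory of .xyz geometry files
  # Alternatively, point at an ASE database instead of the two keys above:
  # ase_db_path: "data/qm9.db"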

Data Processing and Caching

The first time you run a training job, the script processes your raw dataset (.xyz files, etc.) into a format suitable for training. This processed data is saved as a file named processed_data_{dataset_name}.pt inside the directory specified by data.root. On subsequent runs, if this file exists, it will be loaded directly to save time.

| Parameter | Example Override | Description |
| --- | --- | --- |
| data.root | data: {root: "data/processed"} | The directory where processed dataset files are stored. |
| data.dataset_name | data: {dataset_name: "my_molecule_set"} | A unique name for your processed dataset. It becomes part of the saved filename (processed_data_my_molecule_set.pt) and is crucial for preventing conflicts when you work with multiple datasets. |
| data.max_atom | data: {max_atom: 50} | Sets the maximum molecular size (number of atoms); larger molecules are discarded. If not specified, the maximum size is determined automatically by scanning the dataset, which can be slow. |

Best Practice:

  • Set a descriptive dataset_name for each new dataset you work with; this ensures you can easily manage and reuse your processed data without accidentally overwriting or loading the wrong file.
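For example, with the overrides below (the names are placeholders), the first run writes the cache file and later runs load it instead of reprocessing:

data:
  root: "data/processed"
  dataset_name: "my_molecule_set"
# First run writes:  data/processed/processed_data_my_molecule_set.pt
# Later runs find this file and skip raw-data processing.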

Core Training Hyperparameters

| Parameter | Example Override | Description |
| --- | --- | --- |
| trainer.num_epochs | trainer: {num_epochs: 200} | The number of epochs to train for. |
| trainer.lr | trainer: {lr: 0.0001} | The learning rate. |
| data.batch_size | data: {batch_size: 64} | Number of molecules per batch. |
| seed | seed: 42 | Top-level parameter for reproducibility. |
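Note that seed sits at the top level of the file, while the other settings are nested under trainer: and data:. Combined, the overrides above look like this:

seed: 42   # top-level, not nested

trainer:
  num_epochs: 200
  lr: 0.0001

data:
  batch_size: 64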

Model & Task Hyperparameters

This section defines the model architecture and the specifics of the diffusion task. You can configure several training modes (a configuration sketch follows this list):

  • Unconditional Mode: The model learns the general distribution of molecules. To use this mode, ensure tasks.condition_names is an empty list [].

  • Conditional Mode: The model learns to generate molecules given certain properties (e.g., energy, size). To use this mode, you must provide a list of property names in tasks.condition_names and ensure your dataset contains columns with these exact names.

  • Classifier-Free Guidance (CFG) Training: A variant of conditional training. Set tasks.context_mask_rate > 0 (e.g., 0.1). This randomly hides the condition during training, enabling CFG during generation.

  • Self-Pace Learning: A curriculum learning strategy where the model learns from “easier” examples first. Enable it with tasks.sp_regularizer_deploy: True.
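As a sketch, the three diffusion modes above translate into my_first_run.yaml like this (the property names are examples and must match columns in your dataset):

# Unconditional: no properties
tasks:
  condition_names: []

# Conditional: generate molecules given named properties
# tasks:
#   condition_names: [energy, size]

# Conditional + CFG: additionally mask the condition 10% of the time
# tasks:
#   condition_names: [energy, size]
#   context_mask_rate: 0.1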

Key Model & Task Parameters to Override:

| Parameter | Example Override | Description |
| --- | --- | --- |
| tasks.condition_names | tasks: {condition_names: [prop1, prop2]} | List of property names for conditional training. Leave empty [] for unconditional. |
| tasks.context_mask_rate | tasks: {context_mask_rate: 0.1} | (Conditional only) Probability of masking the condition: 0 for standard conditional training, > 0 for CFG training. |
| tasks.hidden_size | tasks: {hidden_size: 256} | The main dimension (width) of the model. |
| tasks.num_layers | tasks: {num_layers: 9} | The number of layers (depth) in the model. |
| tasks.diffusion_steps | tasks: {diffusion_steps: 500} | Number of steps in the diffusion process. |
| tasks.sp_regularizer_deploy | tasks: {sp_regularizer_deploy: True} | Set to True to enable Self-Pace Learning. |
| tasks.sp_regularizer_regularizer | tasks: {sp_regularizer_regularizer: 'logaritmic'} | The pacing function. Options: hard (default), linear, logaritmic, logistic. |
| tasks.sp_regularizer_lambda_ | tasks: {sp_regularizer_lambda_: 1} | A key parameter that controls the learning pace. |
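To enable Self-Pace Learning, combine the three sp_regularizer_* overrides from the table, for example:

tasks:
  sp_regularizer_deploy: True
  sp_regularizer_regularizer: 'logaritmic'   # or: hard (default), linear, logistic
  sp_regularizer_lambda_: 1                  # controls the learning pace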

Experiment Logging

You can control how results are logged by overriding parameters under the logger: key. The most important choice is whether to log to local files or to Weights & Biases (wandb).

To switch between loggers, modify the defaults list in my_first_run.yaml:

  • For simple local file logging, keep - logger: default in the defaults list.

  • For Weights & Biases, change that entry to - logger: wandb.
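For example, after switching to W&B the defaults list in my_first_run.yaml reads:

defaults:
  - data: mol_dataset
  - tasks: diffusion
  - logger: wandb      # was: - logger: default
  - trainer: default
  - _self_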

Key Logging Parameters to Override:

| Parameter | Example Override | Description |
| --- | --- | --- |
| logger.log_interval | logger: {log_interval: 10} | How often (in training steps) to log metrics like loss. |
| logger.project_wandb | logger: {project_wandb: "My_Project"} | (W&B only) The name of the project on your W&B dashboard. |
| name | name: "complex_mols_run_1" | The top-level name parameter is used as the run name for both local logs and W&B. |
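Like seed, the run name sits at the top level, while the other settings are nested under logger:. A brief sketch with illustrative values:

name: "complex_mols_run_1"    # run name for local logs and W&B

logger:
  log_interval: 10            # log metrics every 10 training steps
  project_wandb: "My_Project" # (W&B only) project on your dashboard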

Step 4: Putting It All Together

Here is what a complete my_first_run.yaml for a CFG-ready conditional model might look like:

# Inherit from the default templates
defaults:
  - data: mol_dataset
  - tasks: diffusion
  - logger: wandb
  - trainer: default
  - _self_

# Set top-level experiment parameters
name: "my_cfg_model_training"
seed: 42

# Override Essential Paths and Hyperparameters
trainer:
  output_path: "training_outputs/my_cfg_model"
  num_epochs: 200
  lr: 0.0002

logger:
  project_wandb: "My_Diffusion_Project"

data:
  batch_size: 64

tasks:
  condition_names: ["S1_exc", "T1_exc"] # Specify properties from our dataset
  context_mask_rate: 0.1 # Enable CFG training
  hidden_size: 256

Step 5: Run Your Training

Launch the training using the MolCraftDiff command-line tool: pass the train command followed by the name of your configuration file, without the .yaml extension.

Command:

MolCraftDiff train my_first_run

The tool will automatically find my_first_run.yaml in your current directory, build the full configuration from your defaults and overrides, and start the training. All results will be saved in the trainer.output_path you specified.


Next Steps: Generation

Once your diffusion model is trained, what’s next? You can use it to generate new 3D molecules!

Check out the generation and guidance tutorials to learn how to deploy your trained models.