Tutorial 2: Training a Regressor Model

This tutorial explains how to train a regressor that predicts specific molecular properties (e.g., energy, solubility). The trained regressor can be used as a standalone predictor or, more powerfully, as a guidance model that steers molecule generation toward desired property values (as we will see in Tutorial 7).

Configuration

We use the same override-only configuration workflow introduced in Tutorial 1: Training a Diffusion Model, but load the regression templates through the `defaults` list.

# Inside my_regressor_run.yaml
defaults:
  - data: mol_dataset
  - tasks: regression      # Use the regression task configuration
  - logger: wandb
  - trainer: regression    # Use the regression-specific trainer settings
  - _self_

Key Parameters for Regression

Below are the key parameters and recommended settings to override when training a regression model.

Parameter	Example Override	Description
`trainer.output_path`	`trainer: {output_path: "results/my_regressor"}`	CRITICAL: Where your trained regressor model is saved.
`data.ase_db_path`	`data: {ase_db_path: "data/my_dataset.db"}`	Path to your compiled ASE database containing molecular properties.

(Note: Ensure you have prepared your database as described in Tutorial 0: Data Preparation & Management.)
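The two overrides above can also be written in block style inside the run file. This is a minimal sketch that reuses the example values from the table; both paths are placeholders to replace with your own.

```yaml
# Inside my_regressor_run.yaml (paths are the placeholder values from the table above)
trainer:
  output_path: "results/my_regressor"   # where the trained regressor is saved

data:
  ase_db_path: "data/my_dataset.db"     # compiled ASE database with molecular properties
```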

Data Settings

Parameter	Example Override	Notes / Recommendations
`data.batch_size`	`data: {batch_size: 128}`	Regression can often use a larger batch size than diffusion training.
`data.data_type`	`data: {data_type: "pyg"}`	CRITICAL: For regression and guidance tasks, the data type must be set to `pyg`.

Regression Task Hyperparameters

Parameter	Example Override	Notes / Recommendations
`tasks.task_learn`	`tasks: {task_learn: ["S1_exc"]}`	CRITICAL: Tell the model which property from your dataset to predict.
`tasks.hidden_size`	`tasks: {hidden_size: 512}`	Regressors often benefit from being wider than diffusion models. `512` is a good starting point.
`tasks.act_fn`	`tasks: {act_fn: {_target_: torch.nn.ReLU}}`	`ReLU` is a common and effective activation function for regression tasks.
`tasks.num_layers`	`tasks: {num_layers: 1}`	For property prediction, a single EGCL block is preferred.
`tasks.num_sublayers`	`tasks: {num_sublayers: 4}`	Inside the single EGCL block, use multiple sublayers for a deeper model.
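Collected into one block, the task overrides from this table might look as follows. This is a sketch that assumes `act_fn` nests under `tasks`, as the parameter path `tasks.act_fn` suggests. The values are the table's example values, not tuned results.

```yaml
tasks:
  task_learn: ["S1_exc"]        # property (or properties) from the dataset to predict
  hidden_size: 512              # starting point suggested in the table
  act_fn:
    _target_: torch.nn.ReLU     # class path of the activation function
  num_layers: 1                 # a single EGCL block
  num_sublayers: 4              # depth comes from sublayers inside that block
```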

Trainer Settings for Regression

Parameter	Example Override	Notes / Recommendations
`trainer.optimizer_choice`	`trainer: {optimizer_choice: "adam"}`	`adam` is a solid default optimizer for regression.
`trainer.lr`	`trainer: {lr: 0.0005}`	Regression can often be trained with a slightly higher learning rate than diffusion models.
`trainer.scheduler`	`trainer: {scheduler: "reducelronplateau"}`	`reducelronplateau` is highly recommended. It automatically lowers the learning rate when validation loss stops improving.
`trainer.ema_decay`	`trainer: {ema_decay: 0.0}`	Important: Exponential Moving Average (EMA) is typically disabled for regressor training by setting the decay to `0.0`.
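The complete example later in this tutorial does not set these trainer options, so they come from the `regression` trainer template. If you prefer to set them explicitly, the overrides from the table can be sketched as one block. The values are the table's examples; whether they match the template defaults is not stated here.

```yaml
trainer:
  optimizer_choice: "adam"
  lr: 0.0005
  scheduler: "reducelronplateau"   # lowers the learning rate when validation loss stops improving
  ema_decay: 0.0                   # a decay of 0.0 disables EMA
```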

Experiment Logging

Parameter	Example Override	Description
`logger.project_wandb`	`logger: {project_wandb: "My_Regressor_Project"}`	(W&B only) The name of the project on your W&B dashboard.
`name`	`name: "s1_t1_regressor"`	The top-level `name` is used as the run name for logs.

Putting It All Together

Here is a complete my_regressor_run.yaml example:

defaults:
  - data: mol_dataset
  - tasks: regression
  - logger: wandb
  - trainer: regression
  - _self_

name: "my_s1_t1_regressor"
seed: 42

trainer:
  output_path: "training_outputs/my_s1_t1_regressor"
  num_epochs: 100

logger:
  project_wandb: "My_Regressor_Project"

data:
  data_type: "pyg"
  batch_size: 128

tasks:
  task_learn: ["S1_exc", "T1_exc"]
  hidden_size: 512
  num_layers: 1
  num_sublayers: 4

Launch the training as usual:

MolCraftDiff train my_regressor_run

Next Steps: Property Prediction and Guidance

Once your regression model is trained, you can use it in two main ways:

As a Standalone Predictor: Use the predict module to predict properties for batches of existing 3D molecules.
As a Guidance Model: Use its predictions to steer the generation of new molecules toward desired property values.

Learn how to use your regressor to guide diffusion generation in Tutorial 7: Property-Directed Generation (CFG/GG).