Tutorial 10: Generation Parameter Sweeps

Prerequisites: Tutorial 6 — Structure-Guided / Tutorial 7 — Property-Directed · You’ll learn: automating grid and Bayesian sweeps over controlled-generation parameters with automatic metric collection · Next: Workflows

This tutorial explains how to run generation sweeps with the packaged sweep module:

MolecularDiffusion.runmodes.generate.sweep

The command-line entry point is:

MolCraftDiff generate-sweep <sweep_config.yaml>

Use sweeps when you want to optimize controlled generation settings for inpainting, outpainting, classifier-free guidance (CFG), gradient guidance (GG), or hybrid CFG/GG without manually launching each run.

1. What a Sweep Does

A sweep repeatedly runs:

MolCraftDiff generate <base_config> <Hydra overrides...>

For each parameter combination, the sweep creates a run-specific output directory and then calls your evaluation workflow:

bash <eval_script> <run_output_dir>

After the workflow finishes, the sweep reads registered CSV files from the run output directory and appends one row to:

<sweep_dir>/summary.csv

This gives you a single table containing the generation parameters and the metrics for each run.
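As a rough illustration of the loop described above, the following Python sketch assembles the generation and evaluation commands for a single run. The helper name `build_run_commands` and the run-directory tag format are hypothetical and only serve the example; the tag the sweep module actually produces may look different.

```python
import shlex
from pathlib import PurePosixPath


def build_run_commands(base_config, eval_script, sweep_dir, name, params):
    """Return (generate_cmd, eval_cmd) for one parameter combination."""
    # Hypothetical tag format: last key segment and value, joined by underscores.
    tag = "_".join(f"{key.split('.')[-1]}-{value}" for key, value in params.items())
    run_dir = PurePosixPath(sweep_dir) / f"{name}_{tag}"

    # Swept values become Hydra overrides, followed by the run output path.
    overrides = [f"{key}={value}" for key, value in params.items()]
    overrides.append(f"interference.output_path={run_dir}")

    generate_cmd = ["MolCraftDiff", "generate", base_config, *overrides]
    eval_cmd = ["bash", eval_script, str(run_dir)]
    return generate_cmd, eval_cmd


gen, ev = build_run_commands(
    "docs/cfg_examples/gen_cfg.yaml",
    "scripts/eval_workflow.sh",
    "sweep_results/example_sweep",
    "example_sweep",
    {"interference.condition_configs.cfg_scale": 1.5, "diffusion_steps": 300},
)
print(shlex.join(gen))
print(shlex.join(ev))
```

The two printed lines correspond to the `MolCraftDiff generate ...` and `bash <eval_script> <run_output_dir>` calls shown earlier in this section.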

2. Minimal Command

Start with a dry run. This prints the exact generation and evaluation commands without running them or writing any files:

MolCraftDiff generate-sweep configs/sweep_example.yaml --dry-run --max-runs 1

Then launch real runs:

MolCraftDiff generate-sweep configs/sweep_example.yaml --max-runs 5

Useful options:

Option	Meaning
`--dry-run`	Print commands only; no generation, eval, logs, summary, or copied config.
`--max-runs N`	Run at most `N` new parameter combinations.
`--skip-gen`	Skip generation and only run evaluation/collection.
`--skip-eval`	Skip evaluation and only run generation/collection.
`--retry-failed`	Retry rows marked failed or interrupted in `summary.csv`.
`--early-fail-batches N`	Stop a trial after `N` consecutive failed generation batches before any success (overrides `search.early_fail_batches`).

3. Sweep Config Anatomy

A complete commented template is available at:

configs/sweep_example.yaml

The main fields are:

base_config: docs/cfg_examples/gen_cfg.yaml
eval_script: scripts/eval_workflow.sh
sweep_dir: sweep_results/example_sweep
name: example_sweep

base_config is the normal generation YAML you would pass to MolCraftDiff generate.

eval_script is a shell workflow that receives the run output directory as its only argument:

bash scripts/eval_workflow.sh sweep_results/example_sweep/<run_name>

sweep_dir contains:

sweep_results/example_sweep/
├── sweep_config.yaml
├── summary.csv
└── example_sweep_<parameter_tag>/

Each run directory is also passed to generation as:

interference.output_path=<run_output_dir>

4. Varying Parameters

Use parameters for scalar Hydra overrides:

parameters:
  interference.condition_configs.cfg_scale: [1.5, 2.0, 2.5]
  interference.condition_configs.outpaint_cfgs.t_start: "0.55:0.95:0.20"
  interference.condition_configs.outpaint_cfgs.spread: [0.5, 1.0, 1.5]
  diffusion_steps: [50, 300]

The range string is read as start:stop:step, and the stop value is included. This:

"0.55:0.95:0.20"

expands to:

0.55, 0.75, 0.95
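The expansion above can be reproduced with a short Python sketch. It assumes the string is `start:stop:step`, as the example suggests, and uses `Decimal` so that repeated addition does not drift past the stop value. The function name `expand_range` is hypothetical, and the sweep module may implement the expansion differently.

```python
from decimal import Decimal


def expand_range(spec):
    """Expand an inclusive "start:stop:step" string into a list of floats."""
    start, stop, step = (Decimal(part) for part in spec.split(":"))
    if step <= 0:
        raise ValueError("step must be positive")

    values = []
    current = start
    # Decimal arithmetic keeps 0.55 + 0.20 + 0.20 exactly equal to 0.95.
    while current <= stop:
        values.append(float(current))
        current += step
    return values


print(expand_range("0.55:0.95:0.20"))  # [0.55, 0.75, 0.95]
```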

Use list_parameters when the Hydra value itself is a list:

list_parameters:
  interference.mol_size:
    - [23, 45]
    - [23, 55]
    - [23, 65]
  interference.negative_target_values:
    - [-10, 0, -1]
    - [-15, 0, -1]

Every scalar value and every list-valued choice is combined by Cartesian product.
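To get a feel for how quickly the grid grows, the sketch below enumerates the Cartesian product with `itertools.product`, assuming the `parameters` and `list_parameters` examples from this section appear together in one sweep config. The function name `iter_combinations` is illustrative and not part of the package.

```python
from itertools import product

parameters = {
    "interference.condition_configs.cfg_scale": [1.5, 2.0, 2.5],
    "interference.condition_configs.outpaint_cfgs.t_start": [0.55, 0.75, 0.95],
    "interference.condition_configs.outpaint_cfgs.spread": [0.5, 1.0, 1.5],
    "diffusion_steps": [50, 300],
}
list_parameters = {
    "interference.mol_size": [[23, 45], [23, 55], [23, 65]],
    "interference.negative_target_values": [[-10, 0, -1], [-15, 0, -1]],
}


def iter_combinations(parameters, list_parameters):
    """Yield one dict per combination of scalar and list-valued choices."""
    axes = {**parameters, **list_parameters}
    keys = list(axes)
    for choice in product(*(axes[key] for key in keys)):
        yield dict(zip(keys, choice))


combos = list(iter_combinations(parameters, list_parameters))
print(len(combos))  # 3 * 3 * 3 * 2 * 3 * 2 = 324
print(combos[0])
```

With 324 candidate runs from six short lists, `--max-runs` or a Bayesian search becomes worth considering.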

Use extra_overrides for settings applied to every run:

extra_overrides:
  - "interference.num_generate=60"
  - "interference.batch_size=2"

Override order is:

1. varied sweep parameters,
2. generated interference.output_path,
3. extra_overrides.

Later overrides win if the same Hydra key appears more than once, so a value set in extra_overrides takes precedence over a swept value for the same key.
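The ordering rule can be modelled in a few lines of Python. This is only a sketch of the documented "later wins" behavior, not of Hydra's own resolution code, and the helper name `resolve_overrides` is hypothetical.

```python
def resolve_overrides(swept, run_output_dir, extra_overrides):
    """Return the ordered override list and the effective value per key."""
    # 1. varied sweep parameters, 2. generated output path, 3. extra_overrides
    ordered = [f"{key}={value}" for key, value in swept.items()]
    ordered.append(f"interference.output_path={run_output_dir}")
    ordered.extend(extra_overrides)

    effective = {}
    for item in ordered:
        key, _, value = item.partition("=")
        effective[key] = value  # a later entry replaces an earlier one
    return ordered, effective


ordered, effective = resolve_overrides(
    {"interference.batch_size": 8, "diffusion_steps": 300},
    "sweep_results/example_sweep/run_0",
    ["interference.num_generate=60", "interference.batch_size=2"],
)
print(effective["interference.batch_size"])  # 2
```

In this example the swept `interference.batch_size=8` is silently overridden, which is a reason to avoid listing the same key under both `parameters` and `extra_overrides`.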

5. Controlled Generation Parameters to Sweep

Inpainting

For inpainting, sweep the parameters that control how much of the masked region changes and how strongly it is constrained:

base_config: configs/my_inpaint.yaml

parameters:
  interference.condition_configs.inpaint_cfgs.denoising_strength: [0.4, 0.6, 0.8, 1.0]
  interference.condition_configs.inpaint_cfgs.constraint_strength: [0.6, 0.8, 1.0]
  interference.condition_configs.inpaint_cfgs.scale_factor: [1.0, 1.1, 1.2]
  diffusion_steps: [300, 600]

Use this when you want to find the best balance between scaffold retention and novelty.

Outpainting

For outpainting, sweep seed placement and denoising length:

base_config: configs/my_outpaint.yaml

parameters:
  interference.condition_configs.outpaint_cfgs.t_start: [0.55, 0.75, 0.95]
  interference.condition_configs.outpaint_cfgs.seed_dist: [1.0, 1.5, 2.0]
  interference.condition_configs.outpaint_cfgs.min_dist: [1.5, 2.0, 2.5]
  interference.condition_configs.outpaint_cfgs.spread: [0.5, 1.0, 1.5]
  interference.condition_configs.outpaint_cfgs.constraint_strength: [0.6, 0.8]

Use this when some runs clash with the scaffold, fail to connect, or generate too little diversity.

CFG

For classifier-free guidance, the central parameter is cfg_scale:

base_config: docs/cfg_examples/gen_cfg.yaml

parameters:
  interference.condition_configs.cfg_scale: [0.5, 1.0, 1.5, 2.0, 3.0]
  diffusion_steps: [300, 600]

list_parameters:
  interference.target_values:
    - [3.0, 1.5]
    - [3.2, 1.4]
  interference.negative_target_values:
    - [1.0, 3.0]
    - [0.5, 3.5]

Use this to tune how strongly the conditional model is pushed toward the desired property regime.

Gradient Guidance

For GG, sweep the gradient scale, clipping, and guidance window:

base_config: docs/cfg_examples/gen_gradient_guidance.yaml

parameters:
  interference.condition_configs.gg_scale: [0.0001, 0.0005, 0.001]
  interference.condition_configs.max_norm: [0.0005, 0.001, 0.005]
  interference.condition_configs.guidance_at: [1, 50, 100]
  interference.condition_configs.guidance_stop: [0, 50]
  interference.condition_configs.n_backwards: [0, 1, 3]

Use this when guidance is too weak, unstable, or over-optimizes at the cost of molecular quality.

Hybrid CFG/GG

For hybrid guidance, sweep both CFG and GG strengths:

base_config: docs/cfg_examples/gen_hybrid_cfg_gg.yaml

parameters:
  interference.condition_configs.cfg_scale: [0.5, 1.0, 1.5]
  interference.condition_configs.gg_scale: [0.0001, 0.0005, 0.001]
  interference.condition_configs.max_norm: [0.0005, 0.001]
  interference.condition_configs.n_backwards: [0, 1, 3]

This is useful when CFG gives the correct broad direction but GG is needed to sharpen a numeric objective.

6. Registering Workflow CSVs

Your evaluation workflow should write CSVs under the run output directory:

<run_output_dir>/optimized_xyz/merged_hits.csv
<run_output_dir>/postbuster_metrics.csv
<run_output_dir>/custom_metrics.csv

Register each file, and the metrics to extract from it, under collect:

collect:
  - path: optimized_xyz/merged_hits.csv
    metrics:
      n_total:
        agg: len
      hit_rate_both:
        column: both_hit
        agg: mean
        round: 4
      n_both_hit:
        column: both_hit
        agg: sum

  - path: custom_metrics.csv
    metrics:
      mean_custom_score:
        column: score
        agg: mean
        round: 4
      n_custom_pass:
        column: passed
        agg: sum

path is always relative to the individual run output directory.

Supported aggregations:

Aggregation	Meaning
`len`	Total number of rows. `column` is optional.
`mean`	Mean of a column after dropping missing values.
`sum`	Sum of a column after dropping missing values.
`count`	Non-null count of a column.

You can drop sentinel values before aggregation:

mean_eff_dihedral:
  column: eff_dihedral
  agg: mean
  exclude_value: -1.0
  round: 3
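The aggregation rules in the table can be approximated with the standard-library `csv` module. The sketch below is an illustration under several assumptions: columns are numeric (flags such as `both_hit` are stored as 0/1), empty cells count as missing, and `exclude_value` is applied before every column aggregation. The function name `aggregate` and the sample data are made up for the example.

```python
import csv
import io


def aggregate(rows, spec):
    """Compute one metric from CSV rows using a collect-style metric spec."""
    agg = spec["agg"]
    if agg == "len":
        return len(rows)  # total number of rows, no column needed

    column = spec["column"]
    # Drop missing values (empty cells) before aggregating.
    values = [float(row[column]) for row in rows if (row.get(column) or "").strip()]
    if "exclude_value" in spec:
        values = [v for v in values if v != spec["exclude_value"]]

    if agg == "count":
        result = len(values)
    elif agg == "sum":
        result = sum(values)
    elif agg == "mean":
        result = sum(values) / len(values) if values else None
    else:
        raise ValueError(f"unsupported aggregation: {agg}")

    if result is not None and "round" in spec:
        result = round(result, spec["round"])
    return result


csv_text = "\n".join([
    "eff_dihedral,both_hit",
    "10.0,1",
    "-1.0,0",
    "20.5,1",
    ",0",
])
rows = list(csv.DictReader(io.StringIO(csv_text)))

metrics = {
    "n_total": {"agg": "len"},
    "hit_rate_both": {"column": "both_hit", "agg": "mean", "round": 4},
    "n_both_hit": {"column": "both_hit", "agg": "sum"},
    "mean_eff_dihedral": {
        "column": "eff_dihedral", "agg": "mean", "exclude_value": -1.0, "round": 3,
    },
}
summary = {name: aggregate(rows, spec) for name, spec in metrics.items()}
print(summary)
```

Here the sentinel `-1.0` and the empty cell are both left out of `mean_eff_dihedral`, so the mean is taken over the two remaining values.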

Important

If collect is omitted, the sweep uses built-in defaults for:

optimized_xyz/merged_hits.csv
postbuster_metrics.csv
xyz_analysis.csv
scscore_results.csv
u.csv

If collect is present, it replaces the built-in defaults. Include every metric you want in summary.csv.

7. Bayesian Optimization Objective

Grid search evaluates combinations in deterministic order. Bayesian search chooses from the same candidate grid using previous rows in summary.csv.

search:
  method: bayesian
  include_base_config: true
  objective:
    metric: hit_rate_both
    mode: max
  bayesian:
    n_initial: 3
    random_state: 0

The objective metric must be collected into summary.csv. For example, this objective:

objective:
  metric: hit_rate_both
  mode: max

requires:

collect:
  - path: optimized_xyz/merged_hits.csv
    metrics:
      hit_rate_both:
        column: both_hit
        agg: mean

For minimization:

objective:
  metric: mean_rmsd
  mode: min
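A quick pre-flight check can catch an objective that is never collected. The sketch below works on the sweep config after it has been parsed into a plain dict; the function name `validate_objective` is hypothetical. It only checks metrics listed under `collect`, because the names produced by the built-in defaults are not spelled out in this tutorial.

```python
def validate_objective(config):
    """Return (metric, mode) if the objective is consistent with collect."""
    objective = config["search"]["objective"]
    metric, mode = objective["metric"], objective["mode"]
    if mode not in ("max", "min"):
        raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")

    # Without an explicit collect section the built-in defaults apply,
    # so there is nothing to cross-check here.
    if "collect" not in config:
        return metric, mode

    collected = {name for entry in config["collect"] for name in entry["metrics"]}
    if metric not in collected:
        raise ValueError(f"objective metric {metric!r} is not registered under collect")
    return metric, mode


config = {
    "search": {"method": "bayesian", "objective": {"metric": "hit_rate_both", "mode": "max"}},
    "collect": [
        {
            "path": "optimized_xyz/merged_hits.csv",
            "metrics": {"hit_rate_both": {"column": "both_hit", "agg": "mean"}},
        }
    ],
}
print(validate_objective(config))  # ('hit_rate_both', 'max')
```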

Bayesian mode requires Optuna. If Optuna is unavailable, install it or switch to:

search:
  method: grid

8. Resume Behavior

Each completed or failed run is recorded in:

<sweep_dir>/summary.csv

By default, any recorded configuration is skipped on the next launch:

MolCraftDiff generate-sweep configs/sweep_example.yaml --max-runs 10

To retry failed or interrupted rows:

MolCraftDiff generate-sweep configs/sweep_example.yaml --retry-failed --max-runs 10

The configuration identity is based on the swept parameter values, not the output directory name.
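The skip-and-retry logic can be pictured as follows. This is a sketch under assumptions, not the module's code: it supposes that `summary.csv` has one column per swept parameter plus a `status` column whose values include `failed` and `interrupted`. The column name `status`, the value `ok`, and the helper names are all hypothetical.

```python
import csv
import io


def config_identity(params):
    """Identity of a run: its swept parameter values, not its directory name."""
    return tuple(sorted((key, str(value)) for key, value in params.items()))


def pending_runs(candidates, summary_rows, param_keys, retry_failed=False):
    """Return the candidate combinations that should still be launched."""
    recorded = {}
    for row in summary_rows:
        identity = config_identity({key: row[key] for key in param_keys})
        recorded[identity] = row.get("status", "")

    pending = []
    for params in candidates:
        status = recorded.get(config_identity(params))
        if status is None:
            pending.append(params)  # never recorded
        elif retry_failed and status in ("failed", "interrupted"):
            pending.append(params)  # recorded, but eligible for retry
    return pending


CFG = "interference.condition_configs.cfg_scale"
summary_text = "\n".join([
    f"{CFG},diffusion_steps,status",
    "1.5,300,ok",
    "2.0,300,failed",
])
summary_rows = list(csv.DictReader(io.StringIO(summary_text)))
candidates = [{CFG: v, "diffusion_steps": 300} for v in (1.5, 2.0, 2.5)]

print(pending_runs(candidates, summary_rows, [CFG, "diffusion_steps"]))
print(pending_runs(candidates, summary_rows, [CFG, "diffusion_steps"], retry_failed=True))
```

The first call returns only the unrecorded `cfg_scale=2.5` run; the second also returns the failed `cfg_scale=2.0` run, mirroring `--retry-failed`.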

9. Practical Workflow

1. Create a normal generation config for the controlled generation mode.
2. Create a sweep config that points to it as base_config.
3. Add only the parameters you actually want to vary.
4. Make the eval workflow write CSV files under <run_output_dir>.
5. Register those CSV files and metrics under collect.
6. Start with a dry run:

MolCraftDiff generate-sweep configs/sweep_example.yaml --dry-run --max-runs 1

7. Launch a small batch:

MolCraftDiff generate-sweep configs/sweep_example.yaml --max-runs 3

8. Inspect:

sweep_results/example_sweep/summary.csv

9. Continue the sweep when ready:

MolCraftDiff generate-sweep configs/sweep_example.yaml --max-runs 20

The sweep will skip configurations already present in summary.csv.
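When inspecting `summary.csv` between batches, a small script that ranks runs by the objective metric can help. The sketch below uses only the standard library. The `run_name` column and the sample rows are invented for the example, and `hit_rate_both` is assumed to be one of the collected metrics.

```python
import csv
import io


def rank_runs(summary_text, metric, mode="max"):
    """Return summary rows that have the metric, best first."""
    rows = [row for row in csv.DictReader(io.StringIO(summary_text)) if row.get(metric)]
    return sorted(rows, key=lambda row: float(row[metric]), reverse=(mode == "max"))


# In practice, read the text from <sweep_dir>/summary.csv instead.
summary_text = "\n".join([
    "run_name,hit_rate_both",
    "run_a,0.25",
    "run_b,0.40",
    "run_c,",
])
for row in rank_runs(summary_text, "hit_rate_both"):
    print(row["run_name"], row["hit_rate_both"])
```

Rows without a value for the metric, such as a failed run, are left out of the ranking rather than treated as zero.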