Tutorial: Sampling

Sampling generates molecules from a prior (or a previously trained agent) without any training. It is the quickest way to inspect what a model produces and to verify your setup before running transfer learning (TL) or reinforcement learning (RL).

Key Parameters

| Parameter | Description |
|---|---|
| model_file | Path to the model file (.prior or .chkpt). |
| smiles_file | Path to the input SMILES file (required for LibInvent, LinkInvent, and Mol2Mol). |
| output_file | Path to the output CSV file where results will be saved. |
| num_smiles | Number of SMILES to generate. For generators with input (LibInvent, LinkInvent, Mol2Mol), this is the number generated per input SMILES. |
| unique_molecules | Remove duplicates and canonicalize SMILES before writing output. |
| randomize_smiles | Randomly shuffle atom order in the input SMILES before passing to the model. Improves diversity, especially for Mol2Mol. |
| sample_strategy | multinomial (default, stochastic) or beamsearch (deterministic, always unique). Mol2Mol only. |
| temperature | Controls randomness in multinomial sampling. < 1 more deterministic, > 1 more random. Default 1.0. Mol2Mol only. |
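
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled multinomial sampling over next-token scores. It shows the standard textbook technique (divide the scores by the temperature, apply softmax, then draw); whether the Mol2Mol sampler is implemented exactly this way is an assumption, and the function names and scores are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, rescaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token scores for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)

# The top token gets the most probability mass at low temperature.
print(f"T=0.5: {cold[0]:.3f}  T=1.0: {neutral[0]:.3f}  T=2.0: {hot[0]:.3f}")
```

At temperature 0.5 the distribution concentrates on the highest-scoring token, and at 2.0 it flattens, which matches the "< 1 more deterministic, > 1 more random" description.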

Configuration

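All four generators (Reinvent, LibInvent, LinkInvent, and Mol2Mol) share the same run_type; only the [parameters] block differs. The examples below cover Reinvent, LibInvent, and Mol2Mol.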

Reinvent — de novo, no input required

run_type = "sampling"
device = "cuda:0"  # or "cpu"

[parameters]
model_file = "priors/reinvent.prior"
output_file = "sampling.csv"
num_smiles = 100
unique_molecules = true
randomize_smiles = true

LibInvent — decorate a scaffold

smiles_file contains one scaffold SMILES per line with attachment points marked as [*:0], [*:1], etc. Example: doc/data/scaffolds.smi.

run_type = "sampling"
device = "cuda:0"

[parameters]
model_file = "priors/libinvent.prior"
smiles_file = "doc/data/scaffolds.smi"
output_file = "sampling.csv"
num_smiles = 100
unique_molecules = true
randomize_smiles = true

Mol2Mol — generate analogues

smiles_file contains one reference SMILES per line (optional name in the second column). Example: doc/data/mol2mol.smi. Beam search is deterministic; multinomial sampling adds randomness via temperature.

run_type = "sampling"
device = "cuda:0"

[parameters]
model_file = "priors/mol2mol_medium_similarity.prior"
smiles_file = "doc/data/mol2mol.smi"
sample_strategy = "multinomial"  # or "beamsearch"
temperature = 1.0
output_file = "sampling.csv"
num_smiles = 100
unique_molecules = true
randomize_smiles = true
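
Because only the [parameters] block changes between generators, the configurations above can also be produced from a script, for example when sweeping num_smiles or temperature. The sketch below is one possible way to do that with plain string formatting; render_sampling_config and toml_value are hypothetical helpers, not part of any REINVENT API, and they handle only strings, booleans, and numbers.

```python
def toml_value(value):
    """Format a Python value as a TOML literal (str, bool, int, float only)."""
    if isinstance(value, bool):  # check bool before int: True is also an int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def render_sampling_config(parameters, device="cuda:0"):
    """Return the text of a sampling config with the given [parameters]."""
    lines = [
        'run_type = "sampling"',
        f"device = {toml_value(device)}",
        "",
        "[parameters]",
    ]
    lines += [f"{key} = {toml_value(val)}" for key, val in parameters.items()]
    return "\n".join(lines) + "\n"


# Settings shared by all examples in this tutorial.
shared = {
    "output_file": "sampling.csv",
    "num_smiles": 100,
    "unique_molecules": True,
    "randomize_smiles": True,
}
# The LibInvent example: shared settings plus model and input file.
libinvent = {
    "model_file": "priors/libinvent.prior",
    "smiles_file": "doc/data/scaffolds.smi",
    **shared,
}
config_text = render_sampling_config(libinvent)
print(config_text)
```

Writing config_text to a file such as sampling.toml gives the same content as the hand-written LibInvent example.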

Running

reinvent sampling.toml
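
When sampling is one step in a larger script, the same command can be launched from Python. This is a minimal sketch that assumes the reinvent executable is on PATH and signals failure with a non-zero exit code; build_command and run_sampling are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_command(config_path):
    """Return the argument list for the command shown above."""
    return ["reinvent", str(config_path)]


def run_sampling(config_path):
    """Run the sampling job; raise if the config is missing or the run fails."""
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"config not found: {config}")
    # check=True raises CalledProcessError on a non-zero exit code.
    return subprocess.run(build_command(config), check=True)
```

Calling run_sampling("sampling.toml") is then equivalent to typing the command in a shell, with an early error if the config file does not exist.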

Output

The output CSV contains one row per generated molecule:

| Column | Description |
|---|---|
| SMILES | Generated molecule in canonical SMILES |
| Input_SMILES | Seed SMILES (LibInvent, LinkInvent, Mol2Mol only) |
| NLL | Negative log-likelihood; lower means the model considers this molecule more probable under the prior |
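
As a sketch of how these columns might be read, the snippet below parses a CSV with the column names listed above and ranks molecules from most to least probable. The rows are made up for illustration, and converting NLL to a likelihood with exp(-NLL) assumes the NLL is computed with the natural logarithm.

```python
import csv
import io
import math

# Illustrative rows only; the SMILES and NLL values are made up for this sketch.
SAMPLE_CSV = """SMILES,Input_SMILES,NLL
CCO,CC,12.5
CCN,CC,14.0
c1ccccc1O,c1ccccc1,9.75
"""


def load_rows(handle):
    """Parse the sampling CSV and convert the NLL column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["NLL"] = float(row["NLL"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
# Lower NLL = more probable, so an ascending sort puts the likeliest first.
ranked = sorted(rows, key=lambda r: r["NLL"])
for row in ranked:
    likelihood = math.exp(-row["NLL"])  # assumes NLL uses the natural log
    print(f'{row["SMILES"]:<12} NLL={row["NLL"]:<6} p={likelihood:.2e}')
```

To read a real run, replace io.StringIO(SAMPLE_CSV) with an open file handle on the output_file path, for example open("sampling.csv", newline="").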

Example Input Files

Example input files for conditional generators are provided in doc/data/:

  • scaffolds.smi — LibInvent: azanaphthalene scaffolds with two attachment points

  • warheads.smi — LinkInvent: three warhead pairs

  • mol2mol.smi — Mol2Mol: two ChEMBL compounds and celecoxib
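
Before sampling with your own input file, a quick format check can catch problems early. The sketch below assumes whitespace-separated columns (SMILES first, optional name second) and attachment points written as [*:0], [*:1], and so on, as described above; the file contents are hypothetical and not copied from doc/data/.

```python
import re

# Matches attachment points such as [*:0] and captures the numeric label.
ATTACHMENT_POINT = re.compile(r"\[\*:(\d+)\]")


def parse_smi(text):
    """Yield (smiles, name) pairs; name is None when there is no second column."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(maxsplit=1)
        smiles = fields[0]
        name = fields[1] if len(fields) > 1 else None
        yield smiles, name


def attachment_labels(smiles):
    """Return the numeric labels of attachment points, in order of appearance."""
    return [int(label) for label in ATTACHMENT_POINT.findall(smiles)]


# Hypothetical file contents for illustration.
scaffolds = "[*:0]c1ccc2ncccc2c1[*:1]\n"
references = "c1ccccc1 benzene\nCCO\n"

for smiles, _ in parse_smi(scaffolds):
    print(smiles, attachment_labels(smiles))
print(list(parse_smi(references)))
```

A scaffold line with an empty label list has no attachment points and is likely not usable as LibInvent input.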

What to Check

  • Valid SMILES rate: if many rows are empty or invalid, the prior may not match the generator type.

  • NLL distribution: a narrow NLL range suggests low diversity. For Mol2Mol, consider increasing temperature; for any generator, temporarily disable unique_molecules to see how often the model repeats itself.

  • Chemical diversity: scan the sampled output for repeated or near-identical structures. This is a cheap sanity check before committing to a TL or RL run.
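
The checks above can be partly automated. The following sketch computes the fraction of empty rows, the fraction of unique SMILES, and basic NLL statistics from an output CSV, using made-up rows. It compares SMILES as plain strings, which relies on the output being canonical, and it only detects empty rows; a real validity check would need a cheminformatics toolkit (for example RDKit, where Chem.MolFromSmiles returns None for a SMILES it cannot parse). The duplicate count is informative only when unique_molecules is disabled.

```python
import csv
import io
import statistics

# Illustrative rows only; values are made up, and one SMILES is empty on purpose.
SAMPLE_CSV = """SMILES,NLL
CCO,12.5
CCO,12.5
CCN,14.0
,30.0
c1ccccc1O,9.75
"""


def summarize(handle):
    """Return simple health metrics for a sampling output CSV."""
    rows = list(csv.DictReader(handle))
    smiles = [r["SMILES"].strip() for r in rows]
    non_empty = [s for s in smiles if s]
    # Only rows with a SMILES contribute to the NLL statistics.
    nll = [float(r["NLL"]) for r in rows if r["SMILES"].strip()]
    return {
        "rows": len(rows),
        "empty_fraction": (len(rows) - len(non_empty)) / len(rows) if rows else 0.0,
        "unique_fraction": len(set(non_empty)) / len(non_empty) if non_empty else 0.0,
        "nll_min": min(nll) if nll else None,
        "nll_max": max(nll) if nll else None,
        "nll_mean": statistics.fmean(nll) if nll else None,
        "nll_stdev": statistics.stdev(nll) if len(nll) > 1 else None,
    }


report = summarize(io.StringIO(SAMPLE_CSV))
for key, value in report.items():
    print(f"{key}: {value}")
```

A high empty_fraction points to the first check, a low unique_fraction or a small nll_stdev to the second.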