Tutorial 9: Analyze Module - 3D Molecular Structure Analysis
This tutorial covers the analyze module, which provides tools for post-generation analysis and validation of 3D molecular structures.
Overview
The analyze module includes six subcommands:
| Command | Description |
|---|---|
| optimize | xTB geometry optimization |
| metrics | Validity/connectivity metrics |
| compare | RMSD and energy comparison |
| xyz2mol | XYZ to SMILES + fingerprints |
| xtb-electronic | xTB electronic properties |
| featurize | Fixed-size molecular feature vectors (SOAP / UMA / SSL3D) |
Access the CLI with:
MolCraftDiff analyze --help
Part 1: xTB Geometry Optimization
Optimize generated structures using xTB (GFN1, GFN2, GFN-FF) or MMFF94.
Usage
MolCraftDiff analyze optimize gen_xyz/ --level gfn2 --charge 0
Options
| Option | Default | Description |
|---|---|---|
| -o |  | Output directory |
| -l, --level |  | Optimization level: gfn1, gfn2, gfn-ff, mmff94 |
| --charge |  | Molecular charge |
|  |  | Timeout per molecule (seconds) |
|  |  | Covalent radii scale factor |
Output
Optimized XYZ files are saved to output_dir/ under the same filenames as the inputs.
Part 2: Validity & Connectivity Metrics
Compute structural validation metrics for generated molecules.
Usage
MolCraftDiff analyze metrics gen_xyz/ --metrics all
Metric Types
| Type | Description |
|---|---|
|  | Basic validity (connectivity, atom stability) |
|  | Bond lengths, angles, clashes |
|  | Aromatic-aware stability metrics |
| all | All of the above |
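A connectivity check of the kind listed under basic validity can be pictured with a distance-based bond rule: two atoms count as bonded when their distance is at most a scale factor times the sum of their covalent radii, and a molecule is connected when every atom is reachable through such bonds. The following is a minimal sketch under stated assumptions, not the module's implementation: `COVALENT_RADII` is a small illustrative subset, `bonded_pairs` and `is_connected` are hypothetical helper names, and `scale=1.3` is an assumed tolerance.

```python
import numpy as np

# Illustrative subset of covalent radii in Å (assumption: extend as needed).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def bonded_pairs(symbols, coords, scale=1.3):
    """Return index pairs (i, j), i < j, closer than scale * (r_i + r_j)."""
    coords = np.asarray(coords, dtype=float)
    radii = np.array([COVALENT_RADII[s] for s in symbols])
    # Pairwise distance matrix, shape (n_atoms, n_atoms).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cutoff = scale * (radii[:, None] + radii[None, :])
    # Keep the strict upper triangle so each pair appears once, no self-pairs.
    i_idx, j_idx = np.where(np.triu(dist <= cutoff, k=1))
    return list(zip(i_idx.tolist(), j_idx.tolist()))


def is_connected(n_atoms, pairs):
    """True if all atoms belong to one connected component of the bond graph."""
    if n_atoms == 0:
        return False
    neighbours = {i: set() for i in range(n_atoms)}
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:  # depth-first traversal starting from atom 0
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == n_atoms


# Water: both O-H contacts pass the rule, the H...H contact does not.
water_symbols = ["O", "H", "H"]
water_coords = [[0.0, 0.0, 0.0], [0.9572, 0.0, 0.0], [-0.2400, 0.9266, 0.0]]
water_bonds = bonded_pairs(water_symbols, water_coords)
print(water_bonds, is_connected(3, water_bonds))
```

A fragmented structure, for example one hydrogen moved several ångström away, yields fewer bonded pairs and fails the connectivity test.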
Options
| Option | Default | Description |
|---|---|---|
| -o | None | Output CSV file |
| --metrics |  | Metric type to compute |
|  | False | Recheck topology using RDKit |
|  | False | Check strain via xTB optimization |
|  |  | XYZ to mol converter |
Part 3: Compare to Optimized Geometries
Compare generated structures with their optimized counterparts.
Prerequisites
Run optimization first to create the optimized_xyz/ subdirectory:
MolCraftDiff analyze optimize gen_xyz/
Usage
MolCraftDiff analyze compare gen_xyz/ --level gfn2
Computed Metrics
RMSD: Root-mean-square deviation between the original and optimized geometries
Energy Difference: xTB energy change upon optimization
Bond Geometry: Bond length and angle deviations
Output
Results are saved to a CSV file with per-molecule metrics.
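To make the RMSD metric concrete, here is a minimal NumPy sketch of an RMSD after optimal rigid superposition (the Kabsch algorithm). It assumes both geometries list the same atoms in the same order; `kabsch_rmsd` is a hypothetical helper, and the compare subcommand may align structures differently.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between two (n_atoms, 3) coordinate sets after optimal superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape or p.ndim != 2 or p.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (n_atoms, 3)")
    # Remove translation by centring both structures on their centroids.
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    # SVD of the 3x3 covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(p_c.T @ q_c)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p_c @ rot.T
    return float(np.sqrt(((p_rot - q_c) ** 2).sum() / len(p)))


# A rotated and translated copy of a structure should give an RMSD near zero.
rng = np.random.default_rng(0)
original = rng.normal(size=(6, 3))
theta = 0.7
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = original @ rot_z.T + np.array([1.0, -2.0, 0.5])
print(kabsch_rmsd(original, moved))
```

Without the superposition step, a pure rotation or translation of the molecule would be counted as a geometric change, which is why alignment usually precedes the RMSD.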
Part 4: XYZ to SMILES Conversion
Convert 3D XYZ files to 2D SMILES and extract molecular fingerprints.
Usage
MolCraftDiff analyze xyz2mol gen_xyz/ --bits 2048
Output Files (in xyz_dir/2d_reprs/)
| File | Description |
|---|---|
|  | Filename → SMILES mapping |
|  | Morgan fingerprints array |
|  | Murcko scaffolds |
|  | Substructure counts |
Part 5: xTB Electronic Properties
Compute quantum-chemical descriptors at the GFN-xTB level using morfeus.
Usage
# Basic energy properties
MolCraftDiff analyze xtb-electronic gen_xyz/ -p energy
# All properties with JSON output
MolCraftDiff analyze xtb-electronic gen_xyz/ -p all -f json -o results.json
# ASE database for downstream analysis
MolCraftDiff analyze xtb-electronic gen_xyz/ -p all -f ase -o results.db
Property Groups
Molecular-level:
| Group | Properties |
|---|---|
|  | HOMO, LUMO, HOMO-LUMO gap |
|  | Dipole vector and magnitude |
|  | Ionization potential, electron affinity |
|  | Electrophilicity, nucleophilicity, fugalities |
Atomic-level:
| Group | Properties |
|---|---|
|  | Mulliken atomic charges |
|  | Fukui indices (f⁺, f⁻, radical, dual) |
|  | Wiberg bond orders |
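For orientation, several of the molecular-level quantities are simple functions of each other. The sketch below shows the HOMO-LUMO gap and one common definition of the global electrophilicity index, ω = μ²/(2η), with μ = -(IP + EA)/2 and η = IP - EA. Both function names are hypothetical, and conventions for η differ by a factor of two between references, so values reported by the tool may follow a different convention.

```python
def homo_lumo_gap(homo_ev, lumo_ev):
    """Orbital energy gap in eV (LUMO minus HOMO)."""
    return lumo_ev - homo_ev


def global_electrophilicity(ip_ev, ea_ev):
    """Electrophilicity index in eV from ionization potential and electron affinity.

    Assumed convention: mu = -(IP + EA) / 2, eta = IP - EA, omega = mu**2 / (2 * eta).
    """
    eta = ip_ev - ea_ev  # chemical hardness in this convention
    if eta <= 0:
        raise ValueError("IP must exceed EA for a positive hardness")
    mu = -(ip_ev + ea_ev) / 2.0  # electronic chemical potential
    return mu ** 2 / (2.0 * eta)


# Illustrative numbers only, not results from any calculation.
print(homo_lumo_gap(-6.0, -1.5))
print(global_electrophilicity(10.0, 2.0))
```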
Output Formats
| Format | Description |
|---|---|
| csv | Molecular-level properties (one row per molecule) |
| json | Full data including atomic-level properties |
| ase | ASE database with properties in atoms.info/arrays |
|  | Generate all three formats |
Options
| Option | Default | Description |
|---|---|---|
|  |  | xTB method: 1=GFN1, 2=GFN2, ptb=PTB |
|  |  | Molecular charge |
| -p |  | Property groups to compute |
| -f |  | Output format |
|  | True | Apply empirical IP/EA correction |
|  |  | Parallel jobs |
Part 6: Featurize - Fixed-Size Molecular Vectors
Convert a directory of XYZ files into a fixed-size feature matrix for downstream machine learning (clustering, regression, dimensionality reduction, etc.).
Three backends are available:
| Backend | Description | GPU needed |
|---|---|---|
| soap | SOAP descriptor via dscribe | No |
| uma | UMA backbone embeddings from a pretrained fairchem model | Optional |
| ssl3d | Embeddings from a trained SSL3D checkpoint | Optional |
Usage
# SOAP (default) — uses the built-in species list
MolCraftDiff analyze featurize gen_xyz/
# SOAP — auto-detect species from the files
MolCraftDiff analyze featurize gen_xyz/ --autodetect
# SOAP — specify species explicitly
MolCraftDiff analyze featurize gen_xyz/ --species C --species H --species N --species O
# SOAP — custom descriptor parameters
MolCraftDiff analyze featurize gen_xyz/ --n-max 12 --l-max 9 --r-cut 8.0
# UMA — CPU inference
MolCraftDiff analyze featurize gen_xyz/ --backend uma --device cpu
# UMA — GPU inference with custom checkpoint
MolCraftDiff analyze featurize gen_xyz/ --backend uma --device cuda \
    --checkpoint training_outputs/uma-s-1p2.pt
# UMA — all spherical components (higher-dimensional embedding)
MolCraftDiff analyze featurize gen_xyz/ --backend uma --all-components
# SSL3D — use a trained SSL3D checkpoint
MolCraftDiff analyze featurize gen_xyz/ --backend ssl3d \
    --ssl3d-checkpoint runs/last.ckpt --device cuda
SOAP Options
| Option | Default | Description |
|---|---|---|
| --autodetect | False | Detect element species from files (overrides the species list) |
| --species | See below | Element symbols; repeatable, e.g. --species C --species H |
| --r-cut |  | Cutoff radius in Å |
| --n-max |  | Radial basis functions |
| --l-max |  | Angular basis functions |
|  |  | Gaussian smearing width |
|  |  | Atom pooling mode: |
|  |  | Parallel workers |
Default species list:
H B C N O F Al Si P S Cl As Se Br I Hg Bi
Elements found in the files that are not in the species list are added automatically with a warning, so the run never fails silently on unseen atoms.
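The auto-extension behaviour described above can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not the module's code: `elements_in_xyz` and `merge_species` are hypothetical helpers, and the parser assumes a single-frame XYZ file with the element symbol in the first column.

```python
import warnings

DEFAULT_SPECIES = ["H", "B", "C", "N", "O", "F", "Al", "Si", "P", "S",
                   "Cl", "As", "Se", "Br", "I", "Hg", "Bi"]


def elements_in_xyz(text):
    """Return the set of element symbols in a single-frame XYZ string."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    # Line 0 is the atom count, line 1 the comment, atoms start at line 2.
    return {line.split()[0] for line in lines[2:2 + n_atoms]}


def merge_species(species, found):
    """Append elements missing from the species list and warn about them."""
    merged = list(species)
    extra = sorted(set(found) - set(species))
    if extra:
        warnings.warn("Adding elements missing from the species list: "
                      + ", ".join(extra))
        merged.extend(extra)
    return merged


example_xyz = "2\nillustrative fragment\nC 0.0 0.0 0.0\nFe 0.0 0.0 1.9\n"
found = elements_in_xyz(example_xyz)
species = merge_species(DEFAULT_SPECIES, found)
print(species[-1], len(species))
```

Because the descriptor length depends on the number of species, appending elements changes the feature dimension, so feature matrices are only comparable between runs that share the same final species list.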
UMA Options
| Option | Default | Description |
|---|---|---|
| --checkpoint |  | Path to UMA checkpoint |
|  |  | UMA task name |
| --device | auto |  |
|  |  | Molecules per UMA forward pass |
|  |  | Total molecular charge applied to all structures |
|  |  | Spin multiplicity applied to all structures |
| --all-components | False | Use all spherical components instead of L=0 scalars only |
|  |  | Atom pooling mode: |
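Atom pooling is what turns a variable number of per-atom vectors into one fixed-size molecular vector. A minimal NumPy sketch follows; the mode names "mean", "sum" and "max" are assumptions chosen for illustration (the accepted values are not listed here), and `pool_atoms` is a hypothetical helper.

```python
import numpy as np


def pool_atoms(node_embeddings, mode="mean"):
    """Reduce an (n_atoms, dim) array to a single (dim,) molecular vector."""
    x = np.asarray(node_embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] == 0:
        raise ValueError("expected a non-empty (n_atoms, dim) array")
    if mode == "mean":  # size-independent average over atoms
        return x.mean(axis=0)
    if mode == "sum":  # grows with the number of atoms
        return x.sum(axis=0)
    if mode == "max":  # strongest activation per feature
        return x.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode!r}")


nodes = np.array([[1.0, 2.0], [3.0, 4.0]])  # two atoms, two features each
print(pool_atoms(nodes, "mean"), pool_atoms(nodes, "sum"))
```

Mean pooling keeps vectors comparable across molecules of different size, whereas sum pooling retains size information; which is preferable depends on the downstream task.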
The UMA backend requires the vendored fairchem source tree at <repo_root>/fairchem/src.
If it is not found, clone it with:
git clone https://github.com/pregHosh/fairchem fairchem
Then run from the repository root, or set:
export MOLCRAFT_REPO_ROOT=/path/to/MolCraftDiffusion
SSL3D Options
| Option | Default | Description |
|---|---|---|
| --ssl3d-checkpoint | None | Required path to a trained SSL3D checkpoint |
|  |  | Radius graph cutoff in Å |
| --device | auto |  |
|  |  | Molecules per SSL3D forward pass |
|  |  | Atom pooling mode: |
Output Files
Three files are written using the output stem (default: input_dir/features):
| File | Content |
|---|---|
| <stem>.npy | Feature matrix, shape (N, D), float32 |
| <stem>.csv | Row index → source file + frame mapping |
| <stem>_meta.json | Backend, all parameters, feature dim, timestamp |
# Custom output stem
MolCraftDiff analyze featurize gen_xyz/ -o results/soap_features
# writes: results/soap_features.npy / .csv / _meta.json
Loading the Output
import numpy as np
import pandas as pd
features = np.load("gen_xyz/features.npy") # (N, D) float32
index = pd.read_csv("gen_xyz/features.csv") # maps row → file/frame
print(features.shape) # e.g. (160, 81396) for SOAP, (160, 128) for UMA
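The large SOAP dimension in the shape comment above follows from how the power spectrum is laid out. For dscribe's SOAP output with cross-species terms and no compression, the feature count is, to my understanding, n(n+1)/2 · (l_max+1) with n = n_species · n_max. The helper below is a hypothetical sketch of that arithmetic; the parameter combination in the usage line is simply one that reproduces 81396 and is not a confirmed default.

```python
def soap_feature_count(n_species, n_max, l_max):
    """Length of an uncompressed SOAP power-spectrum vector."""
    n = n_species * n_max  # radial channels across all species
    # Unordered channel pairs (with repetition) times angular degrees 0..l_max.
    return n * (n + 1) // 2 * (l_max + 1)


# One combination consistent with the 81396 columns quoted above.
print(soap_feature_count(n_species=19, n_max=8, l_max=6))  # 81396
```

The quadratic growth in the number of species is the reason a narrow species list (or auto-detection) can shrink the matrix considerably.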
Python API
from MolecularDiffusion.runmodes.analyze.featurize import run_featurize
# SOAP
features = run_featurize(
    input_dir="gen_xyz/",
    backend="soap",
    output_path="gen_xyz/soap_features",
    n_max=12,
    l_max=9,
)
# UMA: pass a pre-loaded list of ASE Atoms directly
from MolecularDiffusion.runmodes.analyze.uma_embeddings import get_uma_molecule_embeddings
from ase.io import read
atoms_list = [read("gen_xyz/molecule_0000.xyz"), read("gen_xyz/molecule_0001.xyz")]
results = get_uma_molecule_embeddings(
    source=atoms_list,
    checkpoint_path="training_outputs/uma-s-1p2.pt",
    device="cpu",
    charge=0,
    spin=1,
)
mol_emb = results[0]["molecule_embedding"]  # torch.Tensor, shape (128,)
node_emb = results[0]["node_embedding"]  # torch.Tensor, shape (n_atoms, 128)
Example Workflow
A typical post-generation analysis workflow:
# 1. Generate molecules
MolCraftDiff generate gen_config.yaml
# 2. Optimize geometries
MolCraftDiff analyze optimize gen_xyz/ -l gfn2 -o gen_xyz/optimized_xyz
# 3. Compute validity metrics
MolCraftDiff analyze metrics gen_xyz/optimized_xyz -o metrics.csv
# 4. Compare to optimized structures
MolCraftDiff analyze compare gen_xyz/
# 5. Convert to SMILES for downstream analysis
MolCraftDiff analyze xyz2mol gen_xyz/optimized_xyz
# 6. Compute electronic properties
MolCraftDiff analyze xtb-electronic gen_xyz/optimized_xyz -p all -f ase -o electronic.db
# 7. Featurize for downstream ML (SOAP)
MolCraftDiff analyze featurize gen_xyz/optimized_xyz -o gen_xyz/soap_features
# 7b. Or use UMA embeddings (requires fairchem checkout + checkpoint)
MolCraftDiff analyze featurize gen_xyz/optimized_xyz --backend uma --device cuda \
    -o gen_xyz/uma_features
# 7c. Or use SSL3D embeddings from a trained SSL3D checkpoint
MolCraftDiff analyze featurize gen_xyz/optimized_xyz --backend ssl3d \
    --ssl3d-checkpoint runs/last.ckpt -o gen_xyz/ssl3d_features
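When the workflow is run repeatedly, the analysis steps can be driven from a small Python script. The sketch below only assembles and executes the same commands and flags shown in steps 2 to 7 above; `build_pipeline` and `run_pipeline` are hypothetical helpers, and the script assumes the MolCraftDiff executable is on PATH.

```python
import subprocess
from pathlib import Path


def build_pipeline(xyz_dir):
    """Return the analysis commands (steps 2-7) as argument lists."""
    root = Path(xyz_dir)
    opt = str(root / "optimized_xyz")
    base = ["MolCraftDiff", "analyze"]
    return [
        base + ["optimize", str(root), "-l", "gfn2", "-o", opt],
        base + ["metrics", opt, "-o", "metrics.csv"],
        base + ["compare", str(root)],
        base + ["xyz2mol", opt],
        base + ["xtb-electronic", opt, "-p", "all", "-f", "ase",
                "-o", "electronic.db"],
        base + ["featurize", opt, "-o", str(root / "soap_features")],
    ]


def run_pipeline(xyz_dir):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_pipeline(xyz_dir):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Inspect the commands without executing anything.
for cmd in build_pipeline("gen_xyz"):
    print(" ".join(cmd))
```

Passing argument lists instead of a shell string avoids quoting problems with unusual directory names, and `check=True` keeps later steps from running on the output of a failed one.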
Python API
All analyze functions are also available programmatically:
from MolecularDiffusion.runmodes.analyze import (
    optimize_molecule,
    get_xtb_optimized_xyz,
    compute_xtb_electronic,
    batch_xtb_electronic,
    run_compare_analysis,
    run_xyz2mol,
)
from MolecularDiffusion.runmodes.analyze.featurize import run_featurize
from MolecularDiffusion.runmodes.analyze.uma_embeddings import get_uma_molecule_embeddings
# Compute electronic properties for a single file
result = compute_xtb_electronic(
    "molecule.xyz",
    method=2,
    properties=["energy", "charges"],
)
print(result["homo"], result["lumo"])
# Batch processing
df = batch_xtb_electronic(
    input_dir="gen_xyz/",
    output_path="results.csv",
    output_format="csv",
    properties=["energy", "reactivity"],
)