MolecularDiffusion.utils.geom_metrics

Attributes

Functions

check_chem_validity(mol_list[, skip_idx, verbose])

Analyze a list of RDKit molecules for chemical validity based on atom valencies and identify broken or disconnected fragments.

check_neutrality(filename)

Checks if a molecule (specified by an XYZ file) is neutral by running an xTB calculation.

check_validity_v0(data[, angle_relax, scale_factor, ...])

Validate a molecular structure based on atomic distances and angles (Version 0).

check_validity_v1(data[, score_threshold, ...])

Validate a molecular structure based on atomic distances and angles (Version 1).

compare_graph_topology(graph1, graph2)

Compare the topology of two PyG graphs by checking their edge indices.

compute_drug_likeness(→ dict)

Compute RDKit-based drug-likeness metrics including SAScore and QED.

compute_esp_similarity(→ float)

Compute ESP similarity using Gaussian-weighted overlap (ESP Tanimoto).

compute_mol_esp_values(→ numpy.ndarray)

Compute electrostatic potential (ESP) at surface points or per-atom charges.

compute_mol_pharmacophores(→ tuple)

Extract pharmacophore points and vectors using shepherd-score implementation.

compute_mol_surface_pts(→ numpy.ndarray)

Gets the point cloud representation of a molecule's van der Waals surface using mesh-based surface generation.

compute_pharm_similarity(→ float)

Direction-aware Gaussian overlap Tanimoto for pharmacophores.

compute_shape_similarity(→ float)

Compute shape similarity using Gaussian-weighted volume overlap (Tanimoto).

is_fully_connected(edge_index, num_nodes)

Determines if the graph is fully connected.

kabsch_align(→ numpy.ndarray)

Standard Kabsch alignment for 1:1 corresponding point sets.

load_molecules_from_xyz(xyz_dir)

Converts all XYZ files in a directory to RDKit Mol objects.

optimize_shape_alignment(pts_gen, pts_ref[, alpha, ...])

Optimize alignment of non-corresponding point clouds to maximize Gaussian overlap.

pca_align(→ numpy.ndarray)

Rotate pts_fit to align its principal axes with those of pts_ref.

run_postbuster(mols[, timeout, batch_size])

Run PoseBusters on a list of RDKit molecules, optionally in batches, with a timeout per batch.

runner(args)

Main runner function to process a directory of XYZ files, compute geometric metrics, check validity, and optionally run strain and diversity checks.

smilify_wrapper(xyzs, xyz2mol)

Convert a list of XYZ files to SMILES strings using a provided xyz2mol function.

xyz_to_pdb(xyz_file_path, pdb_file_path)

Converts an XYZ file to a PDB file using OpenBabel (via pybel).

xyz_to_rdkit_mol(→ rdkit.Chem.Mol)

Convert an XYZ file to an RDKit molecule with perceived bonds.

Module Contents

MolecularDiffusion.utils.geom_metrics.check_chem_validity(mol_list, skip_idx=[], verbose=0)

Analyze a list of RDKit molecules for chemical validity based on atom valencies and identify broken or disconnected fragments.

Parameters:
  • mol_list (list of rdkit.Chem.Mol) – A list of RDKit Mol objects to be checked.

  • skip_idx (list of int, optional) – Atom indices to skip when counting and checking (currently unused in logic; provided for future extension). Default is an empty list.

  • verbose (int, default=0) – If > 0, prints detailed information about each valency violation as: “<SMILES> has invalid valency: <AtomSymbol> <TotalValence> <FormalCharge>”.

Returns:

  • natom_stability_dicts (dict[str, int]) – Counts of atoms whose total electron count (valence minus formal charge) matches a known valid valency (i.e. “stable” atoms), keyed by atom symbol.

  • natom_tot_dicts (dict[str, int]) – Total counts of atoms encountered, keyed by atom symbol.

  • good_smiles (list of str) – Canonical SMILES strings for molecules deemed chemically valid and not containing disconnected fragments (“.”).

  • bad_smiles_broken (list of str) – Canonical SMILES for molecules that are chemically valid but contain disconnected fragments (i.e. salts or mixtures with “.” in the SMILES).

  • bad_smiles_chem (list of str) – Canonical SMILES for molecules that failed the valency checks.

Notes

  • This function relies on a pre-defined mapping valid_valencies:
    >>> valid_valencies = {
    ...     'H': {1}, 'C': {4}, 'N': {3, 5}, 'O': {2}, ...
    ... }
    

    which maps each element symbol to the set of permitted electron counts.

  • The skip_idx argument is currently not applied in the loop; if you wish to ignore certain atoms (for example, metals or explicit hydrogens), you could uncomment and adapt the skip logic.
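
The stability rule described above (total valence minus formal charge must appear in valid_valencies) can be sketched without RDKit. This is a minimal illustration, not the module's code: atoms are represented as hypothetical (symbol, total_valence, formal_charge) tuples, count_stable_atoms is an illustrative name, and the table holds only the four entries shown in the note, since the rest is elided there.

```python
from collections import Counter

# Only the entries visible in the docstring; the full table is elided there.
VALID_VALENCIES = {"H": {1}, "C": {4}, "N": {3, 5}, "O": {2}}


def count_stable_atoms(atoms):
    """Count stable and total atoms per element.

    `atoms` is an iterable of (symbol, total_valence, formal_charge) tuples,
    a hypothetical stand-in for values read from RDKit atoms.
    """
    stable, total = Counter(), Counter()
    for symbol, total_valence, formal_charge in atoms:
        total[symbol] += 1
        # An atom counts as stable when valence minus charge is a permitted value.
        if (total_valence - formal_charge) in VALID_VALENCIES.get(symbol, set()):
            stable[symbol] += 1
    return dict(stable), dict(total)


# Tetramethylammonium nitrogen (valence 4, charge +1) passes: 4 - 1 = 3.
# The trivalent neutral oxygen fails: 3 is not in {2}.
atoms = [("N", 4, 1), ("C", 4, 0), ("C", 4, 0), ("O", 3, 0)]
stable, total = count_stable_atoms(atoms)
```

Under these assumptions, the nitrogen and both carbons are counted as stable, while the oxygen appears only in the totals.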

Example

>>> from rdkit.Chem import MolFromSmiles
>>> smiles_list = ['CCO', 'C[N+](C)(C)C', 'C.C']  # ethanol, tetramethylammonium, disconnected C and C
>>> mols = [MolFromSmiles(s) for s in smiles_list]
>>> valid, total, good, broken, bad = check_chem_validity(mols, verbose=1)
# would print any valency errors if present

MolecularDiffusion.utils.geom_metrics.check_neutrality(filename)

Checks if a molecule (specified by an XYZ file) is neutral by running an xTB calculation.

It runs ‘xtb <filename> --ptb’ and checks the output log for messages indicating a mismatch between electrons and spin multiplicity.

Parameters:

filename (str) – Path to the XYZ file containing the molecule.

Returns:

True if the molecule is neutral (no mismatch found), False otherwise.

Return type:

bool

MolecularDiffusion.utils.geom_metrics.check_validity_v0(data, angle_relax=10, scale_factor=1.3, verbose=False)

Validate a molecular structure based on atomic distances and angles (Version 0).

Parameters:
  • data (dict) – A dictionary containing molecular information:
      - ‘atomic_numbers’: list of integers giving the atomic number of each atom.
      - ‘positions’: list of (x, y, z) tuples giving the position of each atom.

  • angle_relax (float) – Tolerance allowed for bond angles in degrees. Default is 10.0.

  • scale_factor (float) – The scaling factor to apply to the covalent radii. Default is 1.3.

  • verbose (bool) – Whether to print debug messages during validation. Default is False.

Returns:

Contains the following elements:
  • is_valid (bool): Boolean indicating if the structure is valid.

  • percent_atom_valid (float): Percentage of atoms that meet the criteria.

  • num_components (int): Number of connected components in the molecular graph.

  • bad_atoms (list): List of indices for atoms that do not meet the criteria.

  • needs_rechecking (bool): Whether further checks are needed due to borderline cases or special conditions.

Return type:

tuple

Notes

This function assumes that ‘data’ contains valid atomic numbers and positions. The validation process involves checking bond distances and angles against predefined reference values. Special handling is applied for atoms with certain atomic numbers, such as carbon (atomic number 6), which may require different criteria due to their bonding behavior.
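
As an illustration of how scale_factor might enter the distance check, the sketch below treats two atoms as bonded when their separation is below scale_factor times the sum of their covalent radii. This is an assumption about the general technique, not the module's implementation: perceive_bonds is a hypothetical helper, and the radii table lists only a few commonly tabulated values (in Å) rather than whatever reference data the function uses.

```python
import math

# Commonly tabulated single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADII = {1: 0.31, 6: 0.76, 7: 0.71, 8: 0.66}


def perceive_bonds(data, scale_factor=1.3):
    """Return index pairs (i, j), i < j, whose distance suggests a bond."""
    numbers = data["atomic_numbers"]
    positions = data["positions"]
    bonds = []
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            cutoff = scale_factor * (COVALENT_RADII[numbers[i]] + COVALENT_RADII[numbers[j]])
            if math.dist(positions[i], positions[j]) < cutoff:
                bonds.append((i, j))
    return bonds


# Water: the two O-H pairs fall under the cutoff, the H-H pair does not.
water = {
    "atomic_numbers": [8, 1, 1],
    "positions": [(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.24, 0.9266, 0.0)],
}
bonds = perceive_bonds(water)
```

A larger scale_factor makes the check more permissive, so stretched bonds in generated structures are still recognized as bonds.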

MolecularDiffusion.utils.geom_metrics.check_validity_v1(data, score_threshold=3, scale_factor=1.3, skip_indices=[], verbose=False)

Validate a molecular structure based on atomic distances and angles (Version 1).

Parameters:
  • data (dict) – A dictionary containing molecular information:
      - ‘atomic_numbers’: list of integers giving the atomic number of each atom.
      - ‘positions’: list of (x, y, z) tuples giving the position of each atom.

  • score_threshold (float) – Tolerance allowed for shape values. Default is 3.0.

  • scale_factor (float) – The scaling factor to apply to the covalent radii. Default is 1.3.

  • skip_indices (list) – List of atom indices to skip during validation.

  • verbose (bool) – Whether to print debug messages during validation. Default is False.

Returns:

Contains the following elements:
  • is_valid (bool): Boolean indicating if the structure is valid.

  • percent_atom_valid (float): Percentage of atoms that meet the criteria.

  • num_components (int): Number of connected components in the molecular graph.

  • bad_atom_chem (list): List of indices for atoms that do not meet chemical valency criteria.

  • bad_atom_distort (list): List of indices for atoms that are geometrically distorted.

Return type:

tuple

Notes

This function assumes that ‘data’ contains valid atomic numbers and positions. The validation process involves checking bond distances and angles against predefined reference values. Special handling is applied for atoms with certain atomic numbers, such as carbon (atomic number 6), which may require different criteria due to their bonding behavior.

MolecularDiffusion.utils.geom_metrics.compare_graph_topology(graph1, graph2)

Compare the topology of two PyG graphs by checking their edge indices.

Parameters:
  • graph1 (torch_geometric.data.Data) – The first PyG graph.

  • graph2 (torch_geometric.data.Data) – The second PyG graph.

Returns:

True if the graphs have the same topology, False otherwise.

Return type:

bool
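
One way to compare topology from edge indices is to reduce each graph to a set of undirected edges and test the sets for equality. The sketch below assumes that interpretation; whether the actual function ignores edge order or direction is not stated above. Plain nested lists stand in for the 2×E edge_index tensor (for a tensor, edge_index.tolist() gives the same layout), and same_topology is an illustrative name.

```python
def undirected_edge_set(edge_index):
    """Convert a 2xE edge list into a set of order-independent node pairs."""
    sources, targets = edge_index
    return {tuple(sorted((u, v))) for u, v in zip(sources, targets)}


def same_topology(edge_index_1, edge_index_2):
    """True when both edge lists describe the same undirected edges."""
    return undirected_edge_set(edge_index_1) == undirected_edge_set(edge_index_2)


# The same path graph 0-1-2, listed in a different order and direction.
a = [[0, 1, 1, 2], [1, 0, 2, 1]]
b = [[2, 1], [1, 0]]
match = same_topology(a, b)
```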

MolecularDiffusion.utils.geom_metrics.compute_drug_likeness(mol: rdkit.Chem.Mol) → dict

Compute RDKit-based drug-likeness metrics including SAScore and QED.

MolecularDiffusion.utils.geom_metrics.compute_esp_similarity(pts_gen, esp_gen, pts_ref, esp_ref, center=True, align=True) → float

Compute ESP similarity using Gaussian-weighted overlap (ESP Tanimoto).

MolecularDiffusion.utils.geom_metrics.compute_mol_esp_values(mol, surf_pts: numpy.ndarray = None) → numpy.ndarray

Compute electrostatic potential (ESP) at surface points or per-atom charges.

Matches shepherd-score implementation: computes Coulomb potential at surface points if provided, otherwise returns per-atom Gasteiger charges.

Parameters:
  • mol (rdkit.Chem.Mol) – RDKit molecule with conformer.

  • surf_pts (np.ndarray, optional) – Surface point coordinates (M, 3). If provided, computes ESP at these points. If None, returns per-atom Gasteiger charges.

Returns:

ESP values. If surf_pts provided: (M,) potential at surface points. If surf_pts is None: (N,) charges for each atom.

Return type:

np.ndarray
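
The Coulomb potential at a set of surface points can be sketched as a sum of charge over distance across all atoms. The helper below is an assumption-labeled illustration: coulomb_potential and its scaling argument are hypothetical (the value of COULOMB_SCALING is not shown in this reference), and charges and coordinates are passed in directly instead of being read from an RDKit molecule.

```python
import numpy as np


def coulomb_potential(charges, atom_pos, surf_pts, scaling=1.0):
    """Potential at each surface point from point charges on the atoms.

    charges: (N,), atom_pos: (N, 3), surf_pts: (M, 3); returns (M,).
    `scaling` is a placeholder for a unit-conversion constant.
    """
    charges = np.asarray(charges, dtype=float)
    atom_pos = np.asarray(atom_pos, dtype=float)
    surf_pts = np.asarray(surf_pts, dtype=float)
    # Pairwise distances between every surface point and every atom: (M, N).
    dists = np.linalg.norm(surf_pts[:, None, :] - atom_pos[None, :, :], axis=-1)
    return scaling * (charges[None, :] / dists).sum(axis=1)


# Two opposite unit charges; the midpoint sees zero potential.
esp = coulomb_potential(
    charges=[1.0, -1.0],
    atom_pos=[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    surf_pts=[[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
)
```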

MolecularDiffusion.utils.geom_metrics.compute_mol_pharmacophores(mol, multi_vector: bool = True, exclude: list = [], check_access: bool = False, scale: float = 1.0) → tuple

Extract pharmacophore points and vectors using shepherd-score implementation.

Returns (pharm_pos, pharm_types, pharm_vecs):
  • pharm_pos: (M, 3) float32 anchor positions

  • pharm_types: (M,) int32 type indices

  • pharm_vecs: (M, 3) float32 direction vectors (optional)

Pharmacophore types:

0: Acceptor, 1: Donor, 2: Aromatic, 3: Hydrophobe, 4: Halogen, 5: Cation, 6: Anion, 7: ZnBinder, 8: Dummy

Parameters:
  • mol (rdkit.Chem.Mol) – RDKit molecule with conformer.

  • multi_vector (bool (default = True)) – Whether to represent pharmacophores with multiple vectors.

  • exclude (list (default = [])) – List of hydrogen indices to exclude from HBD detection.

  • check_access (bool (default = False)) – Check if HBD/HBA are accessible to molecular surface.

  • scale (float (default = 1.0)) – Length of pharmacophore vector in Angstroms.

Returns:

(pharm_positions, pharm_types, pharm_vectors), where:
  • pharm_positions: (M, 3) anchor positions

  • pharm_types: (M,) type indices (matching P_TYPES)

  • pharm_vectors: (M, 3) direction vectors

Return type:

tuple

MolecularDiffusion.utils.geom_metrics.compute_mol_surface_pts(mol, num_pts: int = 75, probe_radius: float = 1.2) → numpy.ndarray

Gets the point cloud representation of a molecule’s van der Waals surface using mesh-based surface generation (matching shepherd-score implementation).

Uses Open3D’s ball-pivoting algorithm for accurate surface mesh generation. Takes into account the vdW radii of different atoms. Removes overlapping points within vdW radii of neighboring atoms.

Parameters:
  • mol (rdkit.Chem.Mol object) – RDKit molecule object with a conformer.

  • num_pts (int (default = 75)) – The total number of points in the final point cloud.

  • probe_radius (float (default = 1.2)) – Radius of the probe atom used to define a solvent-accessible surface. The default of 1.2 Å is the van der Waals radius of a hydrogen atom.

Returns:

Coordinates of points representing the molecular surface, shape (num_pts, 3).

Return type:

np.ndarray

MolecularDiffusion.utils.geom_metrics.compute_pharm_similarity(pts_gen, types_gen, pts_ref, types_ref, vecs_gen=None, vecs_ref=None, center=True, align=True) → float

Direction-aware Gaussian overlap Tanimoto for pharmacophores.

MolecularDiffusion.utils.geom_metrics.compute_shape_similarity(pts_gen, pts_ref, center=True, align=True) → float

Compute shape similarity using Gaussian-weighted volume overlap (Tanimoto).
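
A Gaussian-overlap Tanimoto score is commonly formed as O_AB / (O_AA + O_BB - O_AB), where each overlap sums a Gaussian of the pairwise distances between two point sets. The sketch below assumes that first-order form with the width taken from ALPHA_DEFAULT (0.81); it omits the centering and alignment steps, and the function names are illustrative. Any constant prefactor on the overlap cancels in the ratio, so it is left out.

```python
import numpy as np


def gaussian_overlap(pts_a, pts_b, alpha=0.81):
    """Sum of pairwise Gaussians exp(-alpha/2 * d^2) between two point sets."""
    diff = pts_a[:, None, :] - pts_b[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    return np.exp(-0.5 * alpha * sq_dist).sum()


def shape_tanimoto(pts_gen, pts_ref, alpha=0.81):
    """Tanimoto ratio of cross-overlap to the union of self-overlaps."""
    pts_gen = np.asarray(pts_gen, dtype=float)
    pts_ref = np.asarray(pts_ref, dtype=float)
    o_ab = gaussian_overlap(pts_gen, pts_ref, alpha)
    o_aa = gaussian_overlap(pts_gen, pts_gen, alpha)
    o_bb = gaussian_overlap(pts_ref, pts_ref, alpha)
    return o_ab / (o_aa + o_bb - o_ab)


ref = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
identical = shape_tanimoto(ref, ref)        # identical clouds score 1
shifted = shape_tanimoto(ref + 1.0, ref)    # a displaced copy scores lower
```

Because the score depends on the relative pose of the two clouds, an alignment step (as controlled by the align flag) is needed before the value reflects shape alone.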

MolecularDiffusion.utils.geom_metrics.is_fully_connected(edge_index, num_nodes)

Determines if the graph is fully connected.

Parameters:
  • edge_index (torch.Tensor) – The edge indices of the graph.

  • num_nodes (int) – The number of nodes in the graph.

Returns:

(bool, int)
  • bool: True if the graph is fully connected, False otherwise.

  • int: The number of connected components in the graph.

Return type:

tuple
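
Since the function also returns a component count, "fully connected" here appears to mean that the graph forms a single connected component. A minimal union-find sketch of that check follows, using plain lists in place of the edge_index tensor; it is an illustration of the idea and not the module's code.

```python
def count_components(edge_index, num_nodes):
    """Return (is_single_component, number_of_components) for a 2xE edge list."""
    parent = list(range(num_nodes))

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in zip(edge_index[0], edge_index[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    n_components = len({find(i) for i in range(num_nodes)})
    return n_components == 1, n_components


connected = count_components([[0, 1], [1, 2]], num_nodes=3)
with_isolated_node = count_components([[0, 1], [1, 2]], num_nodes=4)
```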

MolecularDiffusion.utils.geom_metrics.kabsch_align(pts_fit: numpy.ndarray, pts_ref: numpy.ndarray) → numpy.ndarray

Standard Kabsch alignment for 1:1 corresponding point sets. Returns pts_fit rotated and translated to align with pts_ref.
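
For reference, the textbook Kabsch procedure centers both point sets, takes the SVD of their cross-covariance matrix, and corrects for a possible reflection. The sketch below follows that recipe under the assumption that rows of the two arrays correspond one to one; kabsch_align_sketch is an illustrative name and may differ in detail from the function documented here.

```python
import numpy as np


def kabsch_align_sketch(pts_fit, pts_ref):
    """Rotate and translate pts_fit onto pts_ref (rows must correspond)."""
    pts_fit = np.asarray(pts_fit, dtype=float)
    pts_ref = np.asarray(pts_ref, dtype=float)
    c_fit, c_ref = pts_fit.mean(axis=0), pts_ref.mean(axis=0)
    p, q = pts_fit - c_fit, pts_ref - c_ref

    # Cross-covariance and its SVD.
    h = p.T @ q
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    return p @ rot.T + c_ref


theta = np.deg2rad(30.0)
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
fit = ref @ rz.T + np.array([1.0, -2.0, 0.5])   # rotated and shifted copy
aligned = kabsch_align_sketch(fit, ref)          # recovers ref up to round-off
```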

MolecularDiffusion.utils.geom_metrics.load_molecules_from_xyz(xyz_dir)

Converts all XYZ files in a directory to RDKit Mol objects.

It first converts XYZ files to PDB using OpenBabel, then loads the PDBs into RDKit.

Parameters:

xyz_dir (str) – Directory containing XYZ files.

Returns:

(valid_molecules, pass_xyz_files)
  • valid_molecules (list of RDKit Mol): Successfully loaded molecules.

  • pass_xyz_files (list of str): Filenames of the successfully loaded molecules.

Return type:

tuple

MolecularDiffusion.utils.geom_metrics.optimize_shape_alignment(pts_gen: numpy.ndarray, pts_ref: numpy.ndarray, alpha=ALPHA_DEFAULT, max_steps=100)

Optimize alignment of non-corresponding point clouds to maximize Gaussian overlap. Matches the ShEPhERD-score ROCS-style optimization logic. Returns (aligned_points, rotation_matrix).

MolecularDiffusion.utils.geom_metrics.pca_align(pts_fit: numpy.ndarray, pts_ref: numpy.ndarray) → numpy.ndarray

Rotate pts_fit to align its principal axes with those of pts_ref.

Works for point clouds of different sizes (no correspondence required). Tries all 4 axis-flip combinations and returns the rotation that maximises the sum of dot products between corresponding principal axes (best sign match).

Both inputs must already be centered at the origin.

Parameters:
  • pts_fit (np.ndarray, shape (N, 3)) – Points to rotate (pre-centered).

  • pts_ref (np.ndarray, shape (M, 3)) – Reference points (pre-centered).

Returns:

pts_fit_rotated

Return type:

np.ndarray (N, 3)
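
The description above can be sketched as follows, under stated assumptions: principal axes come from the eigenvectors of each cloud's covariance matrix, sorted by decreasing variance, and the four sign combinations that keep the axes right-handed are scored by the summed dot products with the reference axes. The helper names are hypothetical and the actual scoring may differ in detail.

```python
import itertools

import numpy as np


def principal_axes(pts):
    """Right-handed principal axes (as columns), sorted by decreasing variance."""
    cov = pts.T @ pts / len(pts)
    evals, evecs = np.linalg.eigh(cov)
    axes = evecs[:, np.argsort(evals)[::-1]]
    if np.linalg.det(axes) < 0:
        axes[:, 2] *= -1.0
    return axes


def pca_align_sketch(pts_fit, pts_ref):
    """Rotate pre-centered pts_fit so its principal axes match those of pts_ref."""
    pts_fit = np.asarray(pts_fit, dtype=float)
    pts_ref = np.asarray(pts_ref, dtype=float)
    a_fit, a_ref = principal_axes(pts_fit), principal_axes(pts_ref)

    best_rot, best_score = None, -np.inf
    for signs in itertools.product((1.0, -1.0), repeat=3):
        if np.prod(signs) < 0:
            continue  # skip reflections, leaving 4 proper combinations
        flipped = a_fit * np.array(signs)      # flip selected axes (columns)
        score = np.sum(flipped * a_ref)        # summed dot products of paired axes
        if score > best_score:
            best_score, best_rot = score, a_ref @ flipped.T
    return pts_fit @ best_rot.T


ref = np.array([[3.0, 0, 0], [-3.0, 0, 0], [0, 2.0, 0], [0, -2.0, 0], [0, 0, 1.0], [0, 0, -1.0]])
theta = np.deg2rad(20.0)
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
aligned = pca_align_sketch(ref @ rz.T, ref)
```

Principal axes are only well defined when the three variances are distinct, so near-spherical clouds can yield an arbitrary orientation with this approach.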

MolecularDiffusion.utils.geom_metrics.run_postbuster(mols, timeout=60, batch_size=1)

Run PoseBusters on a list of RDKit molecules, optionally in batches, with a timeout per batch.

This function processes molecules using the PoseBusters library to compute various geometric checks. Processing happens in separate processes to enforce timeouts.

Parameters:
  • mols (list of RDKit Mol) – List of molecules to evaluate.

  • timeout (int, optional) – Maximum time (in seconds) allowed for each batch calculation. Default is 60.

  • batch_size (int, optional) – Number of molecules to process in a single batch. If None, processes all molecules in one batch. Default is 1.

Returns:

DataFrame containing PoseBusters results for all processed molecules.

Returns None if no results could be obtained.

Return type:

pd.DataFrame or None
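
The batching behaviour described for batch_size can be sketched as a simple list split. The helper below is hypothetical and covers only how batches might be formed; the per-batch timeout and the separate worker processes are not shown.

```python
def make_batches(items, batch_size=1):
    """Split items into consecutive batches; None means one batch with everything."""
    items = list(items)
    if batch_size is None:
        return [items]
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Strings stand in for RDKit Mol objects.
mols = ["mol_a", "mol_b", "mol_c"]
one_per_batch = make_batches(mols)             # three batches of one
pairs = make_batches(mols, batch_size=2)       # last batch may be shorter
single = make_batches(mols, batch_size=None)   # everything in one batch
```

With one molecule per batch, a molecule that exceeds the timeout costs only its own result; larger batches trade that isolation for less process overhead.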

MolecularDiffusion.utils.geom_metrics.runner(args)

Main runner function to process a directory of XYZ files, compute geometric metrics, check validity, and optionally run strain and diversity checks.

Parameters:

args (argparse.Namespace) – Arguments containing:
  • input (str): Input directory path containing .xyz files.

  • output (str, optional): Output CSV file path.

  • recheck_topo (bool): Whether to recheck topology.

  • check_strain (bool): Whether to check strain energy.

  • check_diversity (bool): Whether to compute diversity scores.

  • skip_atoms (list of int, optional): Atom indices to skip during checks.
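
When calling runner from Python instead of the command line, a namespace with the documented fields can be built directly. The paths below are placeholders, and the call to runner is left commented out because it requires the package and its optional dependencies.

```python
import argparse

# Field names follow the parameter description; the values are placeholders.
args = argparse.Namespace(
    input="generated_xyz/",     # hypothetical directory of .xyz files
    output="metrics.csv",       # hypothetical output CSV path
    recheck_topo=False,
    check_strain=False,
    check_diversity=False,
    skip_atoms=None,
)

# from MolecularDiffusion.utils.geom_metrics import runner
# runner(args)
```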

MolecularDiffusion.utils.geom_metrics.smilify_wrapper(xyzs, xyz2mol)

Convert a list of XYZ files to SMILES strings using a provided xyz2mol function.

Parameters:
  • xyzs (list of str) – List of paths to XYZ files.

  • xyz2mol (callable) – A function that takes an XYZ file path and returns (smiles, mol).

Returns:

(validity, smiles_list, mol_list, dicts)
  • validity (float): Fraction of successful conversions.

  • smiles_list (list of str): List of SMILES strings (None for failures).

  • mol_list (list of RDKit Mol): List of RDKit Mol objects (None for failures).

  • dicts (dict): Dictionary with ‘smiles’ and ‘filename’ lists.

Return type:

tuple
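
The return structure can be illustrated with a small stand-in that loops over the files, calls the supplied xyz2mol, and records None on failure. This is a sketch under assumptions: smilify_sketch and fake_xyz2mol are hypothetical, and whether the real dicts entry also lists failed files is not stated above (here it holds successes only).

```python
def smilify_sketch(xyzs, xyz2mol):
    """Convert XYZ paths to SMILES with a user-supplied xyz2mol callable."""
    smiles_list, mol_list = [], []
    dicts = {"smiles": [], "filename": []}
    for path in xyzs:
        try:
            smiles, mol = xyz2mol(path)
        except Exception:
            smiles, mol = None, None   # record the failure and keep going
        smiles_list.append(smiles)
        mol_list.append(mol)
        if smiles is not None:
            dicts["smiles"].append(smiles)
            dicts["filename"].append(path)
    n_ok = sum(s is not None for s in smiles_list)
    validity = n_ok / len(xyzs) if xyzs else 0.0
    return validity, smiles_list, mol_list, dicts


def fake_xyz2mol(path):
    """Hypothetical converter used only for this illustration."""
    if "bad" in path:
        raise ValueError("could not perceive bonds")
    return "CCO", object()


validity, smiles, mols, info = smilify_sketch(["a.xyz", "bad.xyz"], fake_xyz2mol)
```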

MolecularDiffusion.utils.geom_metrics.xyz_to_pdb(xyz_file_path, pdb_file_path)

Converts an XYZ file to a PDB file using OpenBabel (via pybel).

Parameters:
  • xyz_file_path (str) – Path to input XYZ file.

  • pdb_file_path (str) – Path to output PDB file.

MolecularDiffusion.utils.geom_metrics.xyz_to_rdkit_mol(xyz_path: str) → rdkit.Chem.Mol

Convert an XYZ file to an RDKit molecule with perceived bonds. Tries multiple charges to find a valid structure without radicals.

MolecularDiffusion.utils.geom_metrics.ALPHA_DEFAULT = 0.81
MolecularDiffusion.utils.geom_metrics.COULOMB_SCALING
MolecularDiffusion.utils.geom_metrics.EDGE_THRESHOLD = 4
MolecularDiffusion.utils.geom_metrics.LAM_SCALING
MolecularDiffusion.utils.geom_metrics.P_ALPHAS
MolecularDiffusion.utils.geom_metrics.P_TYPES = ['Acceptor', 'Donor', 'Aromatic', 'Hydrophobe', 'Halogen', 'Cation', 'Anion', 'ZnBinder', 'Dummy']
MolecularDiffusion.utils.geom_metrics.SCALE_FACTOR = 1.2
MolecularDiffusion.utils.geom_metrics.SCORES_THRESHOLD = 3.0
MolecularDiffusion.utils.geom_metrics.is_cosymlib_available = True
MolecularDiffusion.utils.geom_metrics.logger