MolecularDiffusion.utils.geom_metrics
Attributes
Functions
check_chem_validity | Analyze a list of RDKit molecules for chemical validity based on atom valencies and identify broken or disconnected fragments.
check_neutrality | Checks if a molecule (specified by an XYZ file) is neutral by running an xTB calculation.
check_validity_v0 | Validate a molecular structure based on atomic distances and angles (Version 0).
check_validity_v1 | Validate a molecular structure based on atomic distances and angles (Version 1).
compare_graph_topology | Compare the topology of two PyG graphs by checking their edge indices.
is_fully_connected | Determines if the graph is fully connected.
load_molecules_from_xyz | Converts all XYZ files in a directory to RDKit Mol objects.
run_postbuster | Run PoseBusters on a list of RDKit molecules, optionally in batches, with a timeout per batch.
runner | Main runner function to process a directory of XYZ files, compute geometric metrics, check validity, and optionally run strain and diversity checks.
smilify_wrapper | Convert a list of XYZ files to SMILES strings using a provided xyz2mol function.
xyz_to_pdb | Converts an XYZ file to a PDB file using OpenBabel (via pybel).
Module Contents
- MolecularDiffusion.utils.geom_metrics.check_chem_validity(mol_list, skip_idx=[], verbose=0)
Analyze a list of RDKit molecules for chemical validity based on atom valencies and identify broken or disconnected fragments.
- Parameters:
mol_list (list of rdkit.Chem.Mol) – A list of RDKit Mol objects to be checked.
skip_idx (list of int, optional) – Atom indices to skip when counting and checking (currently unused in logic; provided for future extension). Default is an empty list.
verbose (int, default=0) – If > 0, prints detailed information about each valency violation as: “<SMILES> has invalid valency: <AtomSymbol> <TotalValence> <FormalCharge>”.
- Returns:
natom_stability_dicts (dict[str, int]) – Counts of atoms whose total electron count (valence minus formal charge) matches a known valid valency (i.e. “stable” atoms), keyed by atom symbol.
natom_tot_dicts (dict[str, int]) – Total counts of atoms encountered, keyed by atom symbol.
good_smiles (list of str) – Canonical SMILES strings for molecules deemed chemically valid and not containing disconnected fragments (“.”).
bad_smiles_broken (list of str) – Canonical SMILES for molecules that are chemically valid but contain disconnected fragments (i.e. salts or mixtures with “.” in the SMILES).
bad_smiles_chem (list of str) – Canonical SMILES for molecules that failed the valency checks.
Notes
- This function relies on a pre-defined mapping valid_valencies:
>>> valid_valencies = {
...     'H': {1}, 'C': {4}, 'N': {3, 5}, 'O': {2},
...     # further elements elided
... }
which maps each element symbol to the set of permitted electron counts.
The skip_idx argument is currently not applied in the loop; if you wish to ignore certain atoms (for example, metals or explicit hydrogens), you could uncomment and adapt the skip logic.
Example
>>> from rdkit.Chem import MolFromSmiles
>>> from MolecularDiffusion.utils.geom_metrics import check_chem_validity
>>> smiles_list = ['CCO', 'C[N+](C)(C)C', 'C.C']  # ethanol, tetramethylammonium, disconnected C and C
>>> mols = [MolFromSmiles(s) for s in smiles_list]
>>> valid, total, good, broken, bad = check_chem_validity(mols, verbose=1)  # prints any valency errors found
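The bookkeeping behind the two count dictionaries can be sketched without RDKit. This is an illustration only, not the module's implementation: the `tally_stability` helper, the `(symbol, total_valence, formal_charge)` tuple input, and the truncated `VALID_VALENCIES` table are assumptions made for the example.

```python
from collections import Counter

# Assumed, truncated table: element symbol -> permitted electron counts.
VALID_VALENCIES = {"H": {1}, "C": {4}, "N": {3, 5}, "O": {2}}


def tally_stability(atoms):
    """Count stable and total atoms per element symbol.

    `atoms` is an iterable of (symbol, total_valence, formal_charge) tuples.
    An atom counts as stable when valence minus charge is a permitted value.
    """
    stable, total = Counter(), Counter()
    for symbol, valence, charge in atoms:
        total[symbol] += 1
        if valence - charge in VALID_VALENCIES.get(symbol, set()):
            stable[symbol] += 1
    return dict(stable), dict(total)


# Ammonium-type nitrogen (valence 4, charge +1 -> 3) passes;
# a three-valent neutral oxygen does not.
atoms = [("N", 4, 1), ("C", 4, 0), ("O", 3, 0)]
stable, total = tally_stability(atoms)
```

Dividing a stable count by the matching total gives a per-element stability fraction.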
- MolecularDiffusion.utils.geom_metrics.check_neutrality(filename)
Checks if a molecule (specified by an XYZ file) is neutral by running an xTB calculation.
It runs ‘xtb <filename> --ptb’ and checks the output log for messages indicating a mismatch between electrons and spin multiplicity.
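The run-then-scan pattern described above might look roughly like the sketch below. The helper names are hypothetical, and the marker phrase is a placeholder rather than a verified xTB message; the actual strings the module searches for are not documented here.

```python
import subprocess


def run_xtb_ptb(filename):
    """Run `xtb <filename> --ptb` and return stdout and stderr as one string.

    Requires the xtb executable on PATH; it is not called in this sketch.
    """
    proc = subprocess.run(
        ["xtb", filename, "--ptb"],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout + proc.stderr


def log_is_clean(log_text, mismatch_markers):
    """Return True when none of the marker phrases occurs in the log."""
    lowered = log_text.lower()
    return not any(marker.lower() in lowered for marker in mismatch_markers)


# Placeholder phrase, NOT a verified xTB message.
markers = ["<mismatch message to look for>"]
sample_log = "made-up log line without any marker"
neutral = log_is_clean(sample_log, markers)
```

Keeping the log scan separate from the subprocess call makes the check testable without an xTB installation.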
- MolecularDiffusion.utils.geom_metrics.check_validity_v0(data, angle_relax=10, scale_factor=1.3, verbose=False)
Validate a molecular structure based on atomic distances and angles (Version 0).
- Parameters:
data – A dictionary containing molecular information:
- ‘atomic_numbers’: List of integers representing the atomic number of each atom.
- ‘positions’: List of tuples (x, y, z) representing the position of each atom.
angle_relax (float) – Tolerance allowed for bond angles in degrees. Default is 10.0.
scale_factor (float) – The scaling factor to apply to the covalent radii. Default is 1.3.
verbose (bool) – Whether to print debug messages during validation. Default is False.
- Returns:
- Contains the following elements:
is_valid (bool): Boolean indicating if the structure is valid.
percent_atom_valid (float): Percentage of atoms that meet the criteria.
num_components (int): Number of connected components in the molecular graph.
bad_atoms (list): List of indices for atoms that do not meet the criteria.
needs_rechecking (bool): Whether further checks are needed due to borderline cases or special conditions.
- Return type:
tuple
Notes
This function assumes that ‘data’ contains valid atomic numbers and positions. The validation process involves checking bond distances and angles against predefined reference values. Special handling is applied for atoms with certain atomic numbers, such as carbon (atomic number 6), which may require different criteria due to their bonding behavior.
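As a rough illustration of distance- and angle-based checks on the `data` dictionary, the sketch below treats two atoms as bonded when their distance is at most `scale_factor` times the sum of their covalent radii, and measures the angle at a central atom. The helper names, the small radii table, and the exact bonding criterion are assumptions for the example; the module's reference values and rules may differ.

```python
import math

# Covalent radii in angstrom for a few elements (assumed reference table).
COVALENT_RADII = {1: 0.31, 6: 0.76, 7: 0.71, 8: 0.66}


def find_bonds(atomic_numbers, positions, scale_factor=1.3):
    """Return index pairs (i, j) whose distance is within the scaled cutoff."""
    bonds = []
    for i in range(len(atomic_numbers)):
        for j in range(i + 1, len(atomic_numbers)):
            cutoff = scale_factor * (
                COVALENT_RADII[atomic_numbers[i]] + COVALENT_RADII[atomic_numbers[j]]
            )
            if math.dist(positions[i], positions[j]) <= cutoff:
                bonds.append((i, j))
    return bonds


def angle_deg(a, center, b):
    """Angle a-center-b in degrees."""
    v1 = [p - q for p, q in zip(a, center)]
    v2 = [p - q for p, q in zip(b, center)]
    cos_t = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


# Water: oxygen at the origin with two hydrogens.
data = {
    "atomic_numbers": [8, 1, 1],
    "positions": [(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.2400, 0.9266, 0.0)],
}
bonds = find_bonds(data["atomic_numbers"], data["positions"])
pos = data["positions"]
hoh = angle_deg(pos[1], pos[0], pos[2])
```

A tolerance such as `angle_relax` would then be compared against the deviation of each measured angle from its reference value.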
- MolecularDiffusion.utils.geom_metrics.check_validity_v1(data, score_threshold=3, scale_factor=1.3, skip_indices=[], verbose=False)
Validate a molecular structure based on atomic distances and angles (Version 1).
- Parameters:
data – A dictionary containing molecular information:
- ‘atomic_numbers’: List of integers representing the atomic number of each atom.
- ‘positions’: List of tuples (x, y, z) representing the position of each atom.
score_threshold (float) – Tolerance allowed for shape values. Default is 3.0.
scale_factor (float) – The scaling factor to apply to the covalent radii. Default is 1.3.
skip_indices (list) – List of atom indices to skip during validation. Default is an empty list.
verbose (bool) – Whether to print debug messages during validation. Default is False.
- Returns:
- Contains the following elements:
is_valid (bool): Boolean indicating if the structure is valid.
percent_atom_valid (float): Percentage of atoms that meet the criteria.
num_components (int): Number of connected components in the molecular graph.
bad_atom_chem (list): List of indices for atoms that do not meet chemical valency criteria.
bad_atom_distort (list): List of indices for atoms that are geometrically distorted.
- Return type:
tuple
Notes
This function assumes that ‘data’ contains valid atomic numbers and positions. The validation process involves checking bond distances and angles against predefined reference values. Special handling is applied for atoms with certain atomic numbers, such as carbon (atomic number 6), which may require different criteria due to their bonding behavior.
- MolecularDiffusion.utils.geom_metrics.compare_graph_topology(graph1, graph2)
Compare the topology of two PyG graphs by checking their edge indices.
- Parameters:
graph1 (torch_geometric.data.Data) – The first PyG graph.
graph2 (torch_geometric.data.Data) – The second PyG graph.
- Returns:
True if the graphs have the same topology, False otherwise.
- Return type:
bool
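One way to compare topologies from edge indices is to reduce each graph to a set of undirected edges and test the sets for equality. The sketch below works on plain 2 x E nested lists instead of PyG tensors, and treating edges as undirected and order-insensitive is an assumption; the module may compare the tensors differently.

```python
def edge_set(edge_index):
    """Turn a 2 x E edge list into a set of undirected edges."""
    sources, targets = edge_index
    return {frozenset((s, t)) for s, t in zip(sources, targets)}


def same_topology(edge_index_a, edge_index_b):
    """True when both edge lists describe the same undirected edges."""
    return edge_set(edge_index_a) == edge_set(edge_index_b)


# The same path graph 0-1-2 with edges listed in a different order.
g1 = [[0, 1, 1, 2], [1, 0, 2, 1]]
g2 = [[1, 2, 0, 1], [2, 1, 1, 0]]
# A different graph: atoms 0 and 1 both attached to atom 2.
g3 = [[0, 1], [2, 2]]

match = same_topology(g1, g2)
mismatch = same_topology(g1, g3)
```

With a `torch_geometric.data.Data` object, `graph.edge_index.tolist()` yields the nested-list form used here.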
- MolecularDiffusion.utils.geom_metrics.is_fully_connected(edge_index, num_nodes)
Determines if the graph is fully connected.
- Parameters:
edge_index (torch.Tensor) – The edge indices of the graph.
num_nodes (int) – The number of nodes in the graph.
- Returns:
- (bool, int)
bool: True if the graph is fully connected, False otherwise.
int: The number of connected components in the graph.
- Return type:
tuple
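Because the function also reports the number of components, "fully connected" is read here as "forms a single connected component". Under that assumption, a union-find pass over the edge list gives both return values. The sketch uses plain lists in place of a `torch.Tensor` and is not the module's implementation.

```python
def connected_components(edge_index, num_nodes):
    """Return (is_single_component, number_of_components) via union-find."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sources, targets = edge_index
    for s, t in zip(sources, targets):
        root_s, root_t = find(s), find(t)
        if root_s != root_t:
            parent[root_s] = root_t

    n_components = len({find(i) for i in range(num_nodes)})
    return n_components == 1, n_components


chain = connected_components([[0, 1], [1, 2]], num_nodes=3)
fragments = connected_components([[0], [1]], num_nodes=4)
```

In the second call, nodes 2 and 3 have no edges, so each counts as its own component.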
- MolecularDiffusion.utils.geom_metrics.load_molecules_from_xyz(xyz_dir)
Converts all XYZ files in a directory to RDKit Mol objects.
It first converts XYZ files to PDB using OpenBabel, then loads the PDBs into RDKit.
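The two-step route (XYZ to PDB with pybel, then PDB into RDKit) can be sketched as follows. The helper names are hypothetical, writing each PDB next to its XYZ file is an assumption, and the third-party imports are deferred so the path-planning part runs without OpenBabel or RDKit installed.

```python
from pathlib import Path


def plan_conversions(xyz_dir):
    """List (xyz_path, pdb_path) pairs for every .xyz file, sorted by name."""
    return [(p, p.with_suffix(".pdb")) for p in sorted(Path(xyz_dir).glob("*.xyz"))]


def convert_and_load(xyz_dir):
    """Convert each XYZ file to PDB with pybel, then read it with RDKit."""
    from openbabel import pybel  # deferred import: needs OpenBabel
    from rdkit import Chem  # deferred import: needs RDKit

    mols = []
    for xyz_path, pdb_path in plan_conversions(xyz_dir):
        ob_mol = next(pybel.readfile("xyz", str(xyz_path)))
        ob_mol.write("pdb", str(pdb_path), overwrite=True)
        # MolFromPDBFile returns None when RDKit cannot parse the file.
        mols.append(Chem.MolFromPDBFile(str(pdb_path), removeHs=False))
    return mols
```

Entries that come back as `None` should be filtered out or counted as failures before further analysis.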
- MolecularDiffusion.utils.geom_metrics.run_postbuster(mols, timeout=60, batch_size=1)
Run PoseBusters on a list of RDKit molecules, optionally in batches, with a timeout per batch.
This function processes molecules using the PoseBusters library to compute various geometric checks. Processing happens in separate processes to enforce timeouts.
- Parameters:
mols (list of RDKit Mol) – List of molecules to evaluate.
timeout (int, optional) – Maximum time (in seconds) allowed for each batch calculation. Default is 60.
batch_size (int, optional) – Number of molecules to process in a single batch. If None, processes all molecules in one batch. Default is 1.
- Returns:
- DataFrame containing PoseBusters results for all processed molecules.
Returns None if no results could be obtained.
- Return type:
pd.DataFrame or None
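The batching and per-batch timeout pattern can be illustrated with the standard library alone. In this sketch a stand-in worker script that merely sleeps replaces the PoseBusters call, the items are plain numbers rather than RDKit molecules, and all names are hypothetical; it shows only how a hung batch can be dropped while the others are kept.

```python
import json
import subprocess
import sys

# Stand-in worker: sleeps for each number in the batch, then echoes the batch.
WORKER_CODE = (
    "import json, sys, time\n"
    "batch = json.loads(sys.argv[1])\n"
    "for seconds in batch:\n"
    "    time.sleep(seconds)\n"
    "print(json.dumps(batch))\n"
)


def make_batches(items, batch_size=None):
    """Split items into batches; None means a single batch with everything."""
    if batch_size is None:
        return [list(items)]
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def run_batches(items, timeout=60, batch_size=None):
    """Run each batch in its own process, skipping batches that hang or crash."""
    results = []
    for batch in make_batches(items, batch_size):
        try:
            proc = subprocess.run(
                [sys.executable, "-c", WORKER_CODE, json.dumps(batch)],
                capture_output=True,
                text=True,
                timeout=timeout,
                check=True,
            )
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            continue  # drop this batch and move on to the next one
        results.extend(json.loads(proc.stdout))
    return results or None


batches = make_batches([1, 2, 3], batch_size=2)
```

Smaller batches lose fewer molecules when one of them stalls, at the cost of more process start-ups.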
- MolecularDiffusion.utils.geom_metrics.runner(args)
Main runner function to process a directory of XYZ files, compute geometric metrics, check validity, and optionally run strain and diversity checks.
- Parameters:
args (argparse.Namespace) – Arguments containing:
- input (str): Input directory path containing .xyz files.
- output (str, optional): Output CSV file path.
- recheck_topo (bool): Whether to recheck topology.
- check_strain (bool): Whether to check strain energy.
- check_diversity (bool): Whether to compute diversity scores.
- skip_atoms (list of int, optional): Atom indices to skip during checks.
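When calling the runner from Python rather than from the command line, the namespace can be built by hand with the documented fields. The paths below are hypothetical, and the call itself is left commented out because it needs MolecularDiffusion installed.

```python
import argparse

args = argparse.Namespace(
    input="generated_xyz/",   # hypothetical directory containing .xyz files
    output="metrics.csv",     # hypothetical output CSV path
    recheck_topo=False,
    check_strain=False,
    check_diversity=False,
    skip_atoms=None,
)

# from MolecularDiffusion.utils.geom_metrics import runner
# runner(args)
```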
- MolecularDiffusion.utils.geom_metrics.smilify_wrapper(xyzs, xyz2mol)
Convert a list of XYZ files to SMILES strings using a provided xyz2mol function.
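- Parameters:
xyzs – The XYZ files to convert.
xyz2mol – The function used to convert each XYZ structure to a molecule.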
- Returns:
- (validity, smiles_list, mol_list, dicts)
validity (float): Fraction of successful conversions.
smiles_list (list of str): List of SMILES strings (None for failures).
mol_list (list of RDKit Mol): List of RDKit Mol objects (None for failures).
dicts (dict): Dictionary with ‘smiles’ and ‘filename’ lists.
- Return type:
tuple
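The shape of the returned tuple can be shown with a stand-in converter. In this sketch `convert` is a hypothetical callable that maps a filename to `(smiles, mol)` and raises on failure, which is not necessarily the signature of the real xyz2mol function. Recording only successful conversions in `dicts` is also an assumption.

```python
def smilify_sketch(xyzs, convert):
    """Apply `convert` to each file and collect results, with None for failures."""
    smiles_list, mol_list = [], []
    dicts = {"smiles": [], "filename": []}
    for filename in xyzs:
        try:
            smiles, mol = convert(filename)
        except Exception:
            smiles, mol = None, None
        smiles_list.append(smiles)
        mol_list.append(mol)
        if smiles is not None:
            dicts["smiles"].append(smiles)
            dicts["filename"].append(filename)
    n_ok = sum(s is not None for s in smiles_list)
    validity = n_ok / len(xyzs) if xyzs else 0.0
    return validity, smiles_list, mol_list, dicts


def fake_convert(filename):
    """Stand-in converter: fails for one file, returns ethanol otherwise."""
    if filename == "bad.xyz":
        raise ValueError("conversion failed")
    return "CCO", object()


validity, smiles_list, mol_list, dicts = smilify_sketch(["a.xyz", "bad.xyz"], fake_convert)
```

Because failures are stored as `None`, `smiles_list` and `mol_list` stay aligned with the input order.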
- MolecularDiffusion.utils.geom_metrics.xyz_to_pdb(xyz_file_path, pdb_file_path)
Converts an XYZ file to a PDB file using OpenBabel (via pybel).
- MolecularDiffusion.utils.geom_metrics.EDGE_THRESHOLD = 4
- MolecularDiffusion.utils.geom_metrics.SCALE_FACTOR = 1.2
- MolecularDiffusion.utils.geom_metrics.SCORES_THRESHOLD = 3.0
- MolecularDiffusion.utils.geom_metrics.is_cosymlib_available = True
- MolecularDiffusion.utils.geom_metrics.logger