MolecularDiffusion.runmodes.analyze.xyz2mol

Functions

`extract_scaffold_and_fingerprints`(smiles_iter[, fp_bits])	Sanitizes SMILES strings and computes Morgan fingerprints, Bemis-Murcko scaffolds, and BRICS substructure counts.
`load_file_list_from_dir`(→ pandas.DataFrame)	Lists all .xyz files in the given directory and returns a DataFrame with a single ‘xyz_file’ column.
`main`()	Parses command-line arguments and starts the processing of XYZ files into SMILES strings.
`run_processing`(→ None)	Processes each row of a DataFrame to generate SMILES strings from XYZ files.
`sanitize_smiles`(→ Optional[str])	Sanitizes a single SMILES string and returns its canonical form.

Module Contents

MolecularDiffusion.runmodes.analyze.xyz2mol.extract_scaffold_and_fingerprints(smiles_iter, fp_bits=2048)

Sanitizes SMILES strings and computes molecular descriptors: Morgan fingerprints, Bemis-Murcko scaffolds, and BRICS substructure counts.

Parameters:

smiles_iter – An iterable of SMILES strings.
fp_bits (int) – Number of bits for the Morgan fingerprint. Defaults to 2048.

Returns:

A tuple containing:

fps (np.ndarray): Array of Morgan fingerprints.
scaffolds (list): List of Bemis-Murcko scaffold SMILES.
clean_smiles (list): List of canonicalized and sanitized SMILES.
n_fail (int): Number of SMILES strings that failed processing.
substruct_counts (dict): Dictionary of BRICS substructure counts.

Return type:

tuple
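The five return values can be illustrated with a minimal sketch built directly on RDKit. This is not the module's implementation: the function name `describe_smiles`, the Morgan radius of 2, and the choice to count BRICS fragments across all molecules are assumptions made for illustration only.

```python
from collections import Counter

import numpy as np
from rdkit import Chem
from rdkit.Chem import BRICS, rdFingerprintGenerator
from rdkit.Chem.Scaffolds import MurckoScaffold


def describe_smiles(smiles_iter, fp_bits=2048, radius=2):
    """Hypothetical stand-in mirroring the documented return tuple."""
    # radius=2 is an assumption; the source only documents fp_bits.
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_bits)
    fps, scaffolds, clean_smiles, n_fail = [], [], [], 0
    substruct_counts = Counter()
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)  # returns None if parsing or sanitization fails
        if mol is None:
            n_fail += 1
            continue
        clean_smiles.append(Chem.MolToSmiles(mol))  # canonical SMILES
        fps.append(gen.GetFingerprintAsNumPy(mol))  # bit vector as a uint8 array
        scaffolds.append(MurckoScaffold.MurckoScaffoldSmiles(mol=mol))
        # Assumption: count each BRICS fragment once per molecule it appears in.
        substruct_counts.update(BRICS.BRICSDecompose(mol))
    fps = np.array(fps, dtype=np.uint8).reshape(-1, fp_bits)
    return fps, scaffolds, clean_smiles, n_fail, dict(substruct_counts)


# "C1CC" has an unclosed ring and is expected to fail parsing.
fps, scaffolds, clean_smiles, n_fail, substruct_counts = describe_smiles(
    ["Cc1ccccc1", "C1CC", "CCO"]
)
print(fps.shape, scaffolds, clean_smiles, n_fail)
```

Molecules that fail to parse are skipped rather than padded, so in this sketch `fps`, `scaffolds`, and `clean_smiles` stay aligned with each other but not with the input order.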

MolecularDiffusion.runmodes.analyze.xyz2mol.load_file_list_from_dir(xyz_dir: str) → pandas.DataFrame

Lists all .xyz files in the given directory and returns a DataFrame with a single column ‘xyz_file’.

Parameters:

xyz_dir (str) – Path to the directory containing .xyz files.

Returns:

DataFrame with ‘xyz_file’ column.

Return type:

pd.DataFrame
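As a rough illustration of the described behaviour, the sketch below builds such a DataFrame with `pathlib` and pandas. The name `list_xyz_files` is hypothetical, and two details are assumptions not stated above: that the column holds bare file names rather than full paths, and that the names are sorted.

```python
import tempfile
from pathlib import Path

import pandas as pd


def list_xyz_files(xyz_dir: str) -> pd.DataFrame:
    """Hypothetical stand-in for load_file_list_from_dir."""
    # Assumption: store bare file names, sorted for a reproducible order.
    names = sorted(p.name for p in Path(xyz_dir).glob("*.xyz"))
    return pd.DataFrame({"xyz_file": names})


# Demonstrate on a throwaway directory containing two .xyz files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.xyz", "a.xyz", "notes.txt"):
        (Path(tmp) / name).write_text("")
    df = list_xyz_files(tmp)

print(df)
```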

MolecularDiffusion.runmodes.analyze.xyz2mol.main()

Parses command-line arguments and starts the conversion of XYZ files to SMILES strings, followed by fingerprint and scaffold extraction. All 2D representation outputs are saved in a ‘2d_reprs’ subdirectory within xyz_dir.

MolecularDiffusion.runmodes.analyze.xyz2mol.run_processing(df: pandas.DataFrame, xyz_dir: str, label: str | None, output_csv_filepath: pathlib.Path, timeout: int = 30, verbose: bool = True) → None

Processes each row of a DataFrame to generate SMILES strings from XYZ files. Uses multiprocessing to handle files, with a timeout for each conversion. Saves the results to a CSV file at the specified output_csv_filepath.

Parameters:

df (pd.DataFrame) – DataFrame containing a column with XYZ file names.
xyz_dir (str) – Path to the directory containing the XYZ files.
label (Optional[str]) – Label to assign to the processed files.
output_csv_filepath (Path) – The full path where the processed SMILES CSV will be saved.
timeout (int) – Maximum time in seconds for each XYZ to SMILES conversion process. Defaults to 30.
verbose (bool) – If True, print per-row messages and a summary. Defaults to True.
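The per-conversion timeout can be pictured with the following sketch, which runs each call in its own child process and terminates it if it overruns. It is one possible pattern, not necessarily how run_processing is written. The names `call_with_timeout` and `fake_convert` are hypothetical, and the "fork" start method is an assumption that limits the sketch to POSIX systems.

```python
import multiprocessing as mp
import queue
import time


def _worker(func, arg, out):
    # Runs in the child process; reports the result, or None if func raises.
    try:
        out.put(func(arg))
    except Exception:
        out.put(None)


def call_with_timeout(func, arg, timeout):
    """Run func(arg) in a child process; return None on failure or timeout."""
    ctx = mp.get_context("fork")  # assumption: POSIX only; avoids pickling func
    out = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, arg, out))
    proc.start()
    try:
        result = out.get(timeout=timeout)
    except queue.Empty:
        proc.terminate()  # stop a conversion that exceeded its time budget
        result = None
    proc.join()
    return result


def fake_convert(name):
    # Hypothetical stand-in for an XYZ-to-SMILES conversion.
    if name == "slow.xyz":
        time.sleep(30)
    return f"SMILES-for-{name}"


results = {
    name: call_with_timeout(fake_convert, name, timeout=2)
    for name in ("a.xyz", "slow.xyz")
}
print(results)
```

Running each conversion in a separate process means a hung call can be killed outright, which a thread-based timeout cannot do.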

MolecularDiffusion.runmodes.analyze.xyz2mol.sanitize_smiles(smiles: str) → str | None

Sanitizes a single SMILES string and returns its canonical form.

Parameters:

smiles (str) – The SMILES string to sanitize.

Returns:

Canonical SMILES string if the input is valid and sanitization succeeds, otherwise None.

Return type:

Optional[str]
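A minimal sketch of the same contract using RDKit directly is shown below. The name `sanitize_smiles_sketch` is hypothetical, and the actual function may perform additional checks.

```python
from typing import Optional

from rdkit import Chem


def sanitize_smiles_sketch(smiles: str) -> Optional[str]:
    """Hypothetical stand-in for sanitize_smiles."""
    # MolFromSmiles sanitizes by default and returns None on failure.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical SMILES by default


print(sanitize_smiles_sketch("OCC"))
print(sanitize_smiles_sketch("C1CC"))  # unclosed ring bond, so parsing fails
```

Because the output is canonical, different spellings of one molecule (for example `c1ccccc1` and `C1=CC=CC=C1`) map to the same string, which makes the result usable as a deduplication key.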