MolecularDiffusion.runmodes.analyze.xyz2mol

Functions

extract_scaffold_and_fingerprints(smiles_iter[, fp_bits])

Sanitizes SMILES strings and computes molecular descriptors (Morgan fingerprints, Bemis-Murcko scaffolds, and BRICS substructure counts).

load_file_list_from_dir(xyz_dir) → pandas.DataFrame

Lists all .xyz files in the given directory and returns a DataFrame with a single ‘xyz_file’ column.

main()

Parses command-line arguments and runs XYZ-to-SMILES processing, followed by fingerprint and scaffold extraction.

run_processing(df, xyz_dir, label, output_csv_filepath[, timeout, verbose]) → None

Processes each row of a DataFrame to generate SMILES strings from XYZ files.

sanitize_smiles(smiles) → Optional[str]

Sanitizes a single SMILES string and returns its canonical form.

Module Contents

MolecularDiffusion.runmodes.analyze.xyz2mol.extract_scaffold_and_fingerprints(smiles_iter, fp_bits=2048)

Sanitizes SMILES strings and computes molecular descriptors: Morgan fingerprints, Bemis-Murcko scaffolds, and BRICS substructure counts.

Parameters:
  • smiles_iter – An iterable of SMILES strings.

  • fp_bits (int) – Number of bits for the Morgan fingerprint. Defaults to 2048.

Returns:

A tuple containing:
  • fps (np.ndarray): Array of Morgan fingerprints.

  • scaffolds (list): List of Bemis-Murcko scaffold SMILES.

  • clean_smiles (list): List of canonicalized and sanitized SMILES.

  • n_fail (int): Number of SMILES strings that failed processing.

  • substruct_counts (dict): Dictionary of BRICS substructure counts.

Return type:

tuple

MolecularDiffusion.runmodes.analyze.xyz2mol.load_file_list_from_dir(xyz_dir: str) → pandas.DataFrame

Lists all .xyz files in the given directory and returns a DataFrame with a single column ‘xyz_file’.

Parameters:

xyz_dir (str) – Path to the directory containing .xyz files.

Returns:

DataFrame with ‘xyz_file’ column.

Return type:

pd.DataFrame
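
The listing step can be approximated with the standard library alone. This is a minimal sketch, not the module's implementation: the helper name `list_xyz_files` is hypothetical, and it assumes the column stores bare file names in sorted order and that only the top level of the directory is scanned, none of which the documentation above confirms.

```python
from pathlib import Path


def list_xyz_files(xyz_dir: str) -> list[str]:
    """Return the names of all .xyz files directly inside xyz_dir (hypothetical helper)."""
    # Assumption: subdirectories are not searched and names are sorted for determinism.
    return sorted(
        p.name
        for p in Path(xyz_dir).iterdir()
        if p.is_file() and p.suffix == ".xyz"
    )
```

Wrapping the result as `pandas.DataFrame({"xyz_file": names})` would then give a frame with the documented single column.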

MolecularDiffusion.runmodes.analyze.xyz2mol.main()

Command-line entry point. Parses the arguments, converts the XYZ files to SMILES strings, then extracts fingerprints and scaffolds. All 2D representation outputs are saved in a ‘2d_reprs’ subdirectory of xyz_dir.

MolecularDiffusion.runmodes.analyze.xyz2mol.run_processing(df: pandas.DataFrame, xyz_dir: str, label: str | None, output_csv_filepath: pathlib.Path, timeout: int = 30, verbose: bool = True) → None

Processes each row of a DataFrame to generate SMILES strings from XYZ files. Uses multiprocessing to handle files, with a timeout for each conversion. Saves the results to a CSV file at the specified output_csv_filepath.

Parameters:
  • df (pd.DataFrame) – DataFrame containing a column with XYZ file names.

  • xyz_dir (str) – Path to the directory containing the XYZ files.

  • label (Optional[str]) – Label to assign to the processed files.

  • output_csv_filepath (Path) – The full path where the processed SMILES CSV will be saved.

  • timeout (int) – Maximum time in seconds for each XYZ to SMILES conversion process. Defaults to 30.

  • verbose (bool) – If True, print per-row messages and a summary. Defaults to True.
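
The per-conversion timeout described above can be sketched with the standard library. This is an illustrative pattern under stated assumptions, not the code `run_processing` actually uses: `run_with_timeout` and `_worker` are hypothetical names, and each task is assumed to run in its own child process that is terminated once the time limit passes.

```python
import multiprocessing as mp
import queue


def _worker(func, arg, out):
    """Run func(arg) in the child process and send back the result (None on error)."""
    try:
        out.put(func(arg))
    except Exception:
        out.put(None)


def run_with_timeout(func, arg, timeout):
    """Return func(arg), or None if it fails or exceeds `timeout` seconds."""
    # Assumption: prefer the "fork" start method when the platform offers it,
    # so the sketch also works in a plain script without a __main__ guard.
    methods = mp.get_all_start_methods()
    ctx = mp.get_context("fork") if "fork" in methods else mp.get_context()
    out = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, arg, out))
    proc.start()
    try:
        # Read the result before joining so a large payload cannot block the child.
        result = out.get(timeout=timeout)
    except queue.Empty:
        proc.terminate()  # stop a task that is still running past the limit
        result = None
    proc.join()
    return result
```

In this sketch a real conversion function would take an XYZ path and return a SMILES string, and `None` results could be counted as failures in the summary.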

MolecularDiffusion.runmodes.analyze.xyz2mol.sanitize_smiles(smiles: str) → str | None

Sanitizes a single SMILES string and returns its canonical form.

Parameters:

smiles (str) – The SMILES string to sanitize.

Returns:

Canonical SMILES string if the input is valid and sanitization succeeds, otherwise None.

Return type:

Optional[str]