MolecularDiffusion.runmodes.analyze.xyz2mol
Functions
- extract_scaffold_and_fingerprints: Sanitizes SMILES strings and computes molecular descriptors.
- load_file_list_from_dir: Lists all .xyz files in the given directory and returns a DataFrame.
- main: Parses command-line arguments and initiates the processing.
- run_processing: Processes each row of a DataFrame to generate SMILES strings from XYZ files.
- Sanitizes a single SMILES string and returns its canonical form.
Module Contents
- MolecularDiffusion.runmodes.analyze.xyz2mol.extract_scaffold_and_fingerprints(smiles_iter, fp_bits=2048)
Sanitizes SMILES strings and computes molecular descriptors: Morgan fingerprints, Bemis-Murcko scaffolds, and BRICS substructure counts.
- Parameters:
smiles_iter – An iterable of SMILES strings.
fp_bits (int) – Number of bits for the Morgan fingerprint.
- Returns:
A tuple containing:
fps (np.ndarray): Array of Morgan fingerprints.
scaffolds (list): List of Bemis-Murcko scaffold SMILES.
clean_smiles (list): List of canonicalized and sanitized SMILES.
n_fail (int): Number of SMILES strings that failed processing.
substruct_counts (dict): Dictionary of BRICS substructure counts.
- Return type:
tuple
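The five-element return value described above can be illustrated with a minimal sketch. It assumes the descriptors are computed with RDKit, a Morgan radius of 2, and BRICS counts keyed by fragment SMILES; none of these details are confirmed by this page, and `describe_smiles` is a hypothetical name, not the module's function.

```python
from collections import Counter

import numpy as np
from rdkit import Chem
from rdkit.Chem import BRICS, rdFingerprintGenerator
from rdkit.Chem.Scaffolds import MurckoScaffold


def describe_smiles(smiles_iter, fp_bits=2048, radius=2):
    """Hypothetical stand-in that mirrors the documented return tuple."""
    # Assumption: radius=2 (the page only documents fp_bits).
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_bits)
    fps, scaffolds, clean_smiles = [], [], []
    substruct_counts = Counter()
    n_fail = 0
    for smi in smiles_iter:
        # MolFromSmiles returns None when parsing or sanitization fails.
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            n_fail += 1
            continue
        clean_smiles.append(Chem.MolToSmiles(mol))  # canonical SMILES
        fps.append(generator.GetFingerprintAsNumPy(mol))
        scaffolds.append(MurckoScaffold.MurckoScaffoldSmiles(mol=mol))
        # Assumption: count each BRICS fragment once per molecule it occurs in.
        substruct_counts.update(BRICS.BRICSDecompose(mol))
    # Reshape so that an empty input still yields a (0, fp_bits) array.
    fps_array = np.array(fps, dtype=np.uint8).reshape(len(fps), fp_bits)
    return fps_array, scaffolds, clean_smiles, n_fail, dict(substruct_counts)
```

Calling `describe_smiles(["c1ccccc1O", "not_a_smiles"])` would return one fingerprint row and report one failure, since the second string cannot be parsed.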
- MolecularDiffusion.runmodes.analyze.xyz2mol.load_file_list_from_dir(xyz_dir: str) → pandas.DataFrame
Lists all .xyz files in the given directory and returns a DataFrame with a single column ‘xyz_file’.
- Parameters:
xyz_dir (str) – Path to the directory containing .xyz files.
- Returns:
DataFrame with ‘xyz_file’ column.
- Return type:
pd.DataFrame
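As a rough illustration of the directory listing, the sketch below collects the records that such a DataFrame would hold, using only the standard library. It assumes the column stores bare file names rather than full paths and that entries are sorted; `list_xyz_records` is a hypothetical helper, not part of the module.

```python
from pathlib import Path


def list_xyz_records(xyz_dir):
    """Return one {'xyz_file': name} record per .xyz file in xyz_dir."""
    directory = Path(xyz_dir)
    # Assumption: only regular files with the exact suffix ".xyz" are listed,
    # and the column holds file names, not full paths.
    names = sorted(
        p.name for p in directory.iterdir() if p.is_file() and p.suffix == ".xyz"
    )
    return [{"xyz_file": name} for name in names]
```

Passing the result to `pandas.DataFrame(records, columns=["xyz_file"])` gives a single-column frame of the documented shape, including when the directory contains no .xyz files.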
- MolecularDiffusion.runmodes.analyze.xyz2mol.main()
Parses command-line arguments, converts the XYZ files to SMILES strings, and then runs fingerprint and scaffold extraction. All 2D representation outputs are saved in a ‘2d_reprs’ subdirectory within xyz_dir.
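The output location can be sketched as follows. Only the ‘2d_reprs’ directory name comes from the description above; `prepare_output_dir` is a hypothetical helper, and creating the directory on demand is an assumption.

```python
from pathlib import Path


def prepare_output_dir(xyz_dir):
    """Create (if needed) and return the '2d_reprs' subdirectory of xyz_dir."""
    out_dir = Path(xyz_dir) / "2d_reprs"
    # Assumption: the directory is created when missing and reused otherwise.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

A path built from this directory is the kind of value that run_processing expects as output_csv_filepath.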
- MolecularDiffusion.runmodes.analyze.xyz2mol.run_processing(df: pandas.DataFrame, xyz_dir: str, label: str | None, output_csv_filepath: pathlib.Path, timeout: int = 30, verbose: bool = True) → None
Processes each row of a DataFrame to generate SMILES strings from XYZ files. Uses multiprocessing to handle files, with a timeout for each conversion. Saves the results to a CSV file at the specified output_csv_filepath.
- Parameters:
df (pd.DataFrame) – DataFrame containing a column with XYZ file names.
xyz_dir (str) – Path to the directory containing the XYZ files.
label (Optional[str]) – Label to assign to the processed files.
output_csv_filepath (Path) – The full path where the processed SMILES CSV will be saved.
timeout (int) – Maximum time in seconds for each XYZ to SMILES conversion process. Defaults to 30.
verbose (bool) – If True, print per-row messages and a summary. Defaults to True.
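The per-conversion timeout can be approximated with one child process per file, as in the sketch below. This is not the module's implementation: `fake_convert` and `run_with_timeout` are hypothetical names, the conversion itself is faked, and the "fork" start method is assumed, which limits the sketch to POSIX systems.

```python
import multiprocessing as mp
import queue
import time


def fake_convert(xyz_name):
    """Hypothetical stand-in for an XYZ-to-SMILES conversion."""
    if "slow" in xyz_name:
        time.sleep(60)  # simulates a conversion that hangs
    return f"SMILES_for_{xyz_name}"


def _worker(func, arg, out_queue):
    # Runs in the child process and ships the outcome back to the parent.
    try:
        out_queue.put(("ok", func(arg)))
    except Exception as exc:
        out_queue.put(("error", repr(exc)))


def run_with_timeout(func, arg, timeout):
    """Return (status, value), where status is 'ok', 'error', or 'timeout'."""
    ctx = mp.get_context("fork")  # assumption: POSIX-only start method
    out_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, arg, out_queue))
    proc.start()
    try:
        # Wait for the result first; joining before reading can block
        # when the child is still flushing data into the queue.
        result = out_queue.get(timeout=timeout)
    except queue.Empty:
        result = ("timeout", None)
    if proc.is_alive():
        proc.terminate()  # stop a child that is hung or has not exited yet
    proc.join()
    return result
```

Looping over the ‘xyz_file’ column and calling `run_with_timeout(fake_convert, name, timeout=30)` for each row would yield one status per file, which could then be written to the CSV.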