MolecularDiffusion.runmodes.data.ase_ops

ASE database operations module. Handles merging, inspecting, splitting, and sampling.

Attributes

`Chem`
`logger`

Functions

`inspect_db`(db_path[, output_dir, keys_to_plot, ...])	Inspects an ASE DB, printing stats and optionally plotting distributions.
`is_clean`(row)	Verifies that the atom order in the ASE Atoms and in the RDKit Mol built from mol_block is identical.
`merge_dbs`(input_dir, output_db[, recursive, pattern, ...])	Merges multiple ASE databases into one.
`rename_db_attribute`(db_path, old_name, new_name)	Renames a data attribute for all rows in an ASE database.
`sample_db`(input_db, output[, output_type, fraction, ...])	Samples a random fraction or number of entries from an ASE database.
`split_db`(db_path, output_dir[, n_splits])	Splits a DB into N smaller DBs.
`verify_datapoint`(atoms, mol_block)	Verifies that ASE Atoms match RDKit Mol block.

Module Contents

MolecularDiffusion.runmodes.data.ase_ops.inspect_db(db_path: pathlib.Path, output_dir: pathlib.Path = None, keys_to_plot: List[str] = None, sample_size: int = 5000, limit_print: int = 10, check_nan: bool = False, nan_key: str = None, discard_nan: bool = False, detect_outliers: bool = False, outlier_threshold: float = 3.0, discard_outliers: bool = False, outlier_key: str = None, clean_db_path: pathlib.Path = None): Inspects an ASE DB, printing stats and optionally plotting distributions. Can also identify NaN values and outliers and optionally save a cleaned DB.
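The signature does not say how `outlier_threshold` is interpreted. As an illustration only, the sketch below assumes it is a z-score cutoff (values more than `outlier_threshold` standard deviations from the mean are flagged) and that NaN entries are set aside before the statistics are computed. `flag_outliers` is a hypothetical helper written for this example, not a function of this module.

```python
import math
import statistics
from typing import Dict, List


def flag_outliers(values: List[float], threshold: float = 3.0) -> Dict[str, List[int]]:
    """Hypothetical helper: indices of NaN entries and of z-score outliers."""
    nan_idx = [i for i, v in enumerate(values) if math.isnan(v)]
    finite = [(i, v) for i, v in enumerate(values) if not math.isnan(v)]
    finite_vals = [v for _, v in finite]
    if len(finite_vals) < 2:
        return {"nan": nan_idx, "outliers": []}

    mean = statistics.fmean(finite_vals)
    std = statistics.pstdev(finite_vals)  # population standard deviation
    if std == 0.0:
        return {"nan": nan_idx, "outliers": []}

    # Assumed rule: flag values whose |z-score| exceeds the threshold.
    outliers = [i for i, v in finite if abs(v - mean) / std > threshold]
    return {"nan": nan_idx, "outliers": outliers}


# Toy property column: twenty ordinary values, one NaN, one extreme value.
energies = [1.0] * 20 + [float("nan"), 50.0]
report = flag_outliers(energies, threshold=3.0)
```

Under these assumptions, `discard_nan` and `discard_outliers` would correspond to dropping the rows listed under `"nan"` and `"outliers"` before writing the cleaned DB.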

MolecularDiffusion.runmodes.data.ase_ops.is_clean(row): Verifies that the atom order in the ASE Atoms and in the RDKit Mol built from mol_block is identical.

MolecularDiffusion.runmodes.data.ase_ops.merge_dbs(input_dir: pathlib.Path, output_db: pathlib.Path, recursive: bool = False, pattern: str = '*.db', verify: bool = True): Merges multiple ASE databases into one.
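How `pattern` and `recursive` select the input files is not documented here. A plausible reading, sketched below with the standard library only, is that `pattern` is a glob applied to `input_dir`, and that `recursive=True` also descends into subdirectories. `find_db_files` is a hypothetical helper for illustration and may not match the module's actual behavior.

```python
import tempfile
from pathlib import Path
from typing import List


def find_db_files(input_dir: Path, pattern: str = "*.db", recursive: bool = False) -> List[Path]:
    """Hypothetical helper: collect candidate database files in sorted order."""
    # Path.rglob walks subdirectories; Path.glob stays in the top-level directory.
    matches = input_dir.rglob(pattern) if recursive else input_dir.glob(pattern)
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for rel in ("a.db", "b.db", "notes.txt", "nested/c.db"):
        (root / rel).touch()

    flat = [p.name for p in find_db_files(root)]
    deep = [p.name for p in find_db_files(root, recursive=True)]
```

Sorting the matches keeps the merge order deterministic across runs, which makes the merged database easier to reproduce.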

MolecularDiffusion.runmodes.data.ase_ops.rename_db_attribute(db_path: pathlib.Path, old_name: str, new_name: str): Renames a data attribute for all rows in an ASE database.

MolecularDiffusion.runmodes.data.ase_ops.sample_db(input_db: pathlib.Path, output: pathlib.Path, output_type: str = 'db', fraction: float = None, number: int = None, seed: int = None, verify_clean: bool = False)

Samples a random fraction or number of entries from an ASE database.

output_type:

`'db'` – write to an ASE SQLite database (default)

`'xyz'` – write one XYZ file per molecule into the output directory

`'npy'` – write positions.npy (M,N,3), numbers.npy (M,N), and natoms.npy (M,) arrays into the output directory, where M is the number of sampled entries and N is padded to the maximum atom count in the sample.
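To make the sampling parameters and the padded array layout concrete, here is a minimal sketch using NumPy and toy molecules instead of a real database. It rests on several assumptions that the documentation does not confirm: exactly one of `fraction` or `number` is given, a fraction is truncated to an integer count, and padded slots are filled with zeros. `choose_indices` is a hypothetical helper, not part of this module.

```python
import random
from typing import List, Optional

import numpy as np


def choose_indices(total: int, fraction: Optional[float] = None,
                   number: Optional[int] = None,
                   seed: Optional[int] = None) -> List[int]:
    """Hypothetical helper: pick row positions reproducibly."""
    if (fraction is None) == (number is None):
        raise ValueError("pass exactly one of fraction or number")
    count = int(total * fraction) if fraction is not None else number
    count = max(0, min(count, total))
    rng = random.Random(seed)  # the same seed gives the same selection
    return sorted(rng.sample(range(total), count))


# Toy stand-in for database rows: (atomic numbers, Cartesian coordinates).
dataset = [
    (np.array([1, 1]),
     np.array([[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])),
    (np.array([8, 1, 1]),
     np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])),
    (np.array([6, 1, 1, 1, 1]),
     np.array([[0.0, 0.0, 0.0], [0.63, 0.63, 0.63], [-0.63, -0.63, 0.63],
               [-0.63, 0.63, -0.63], [0.63, -0.63, -0.63]])),
]

picked = choose_indices(len(dataset), number=2, seed=0)
sample = [dataset[i] for i in picked]

M = len(sample)                      # number of sampled entries
N = max(len(z) for z, _ in sample)   # largest atom count in the sample

numbers = np.zeros((M, N), dtype=np.int64)         # assumed padding value: 0
positions = np.zeros((M, N, 3), dtype=np.float64)  # assumed padding value: 0.0
natoms = np.array([len(z) for z, _ in sample], dtype=np.int64)

for i, (z, xyz) in enumerate(sample):
    numbers[i, : len(z)] = z
    positions[i, : len(z)] = xyz

# natoms lets a reader strip the padding again for any entry.
first_xyz = positions[0, : natoms[0]]
```

Storing `natoms` alongside the padded arrays is what makes the fixed-shape layout lossless: each molecule can be recovered by slicing off its first `natoms[i]` rows.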

MolecularDiffusion.runmodes.data.ase_ops.split_db(db_path: pathlib.Path, output_dir: pathlib.Path, n_splits: int = 2): Splits a DB into N smaller DBs.
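The documentation does not specify how rows are divided among the `n_splits` outputs. One common scheme, shown here purely as an assumption, is contiguous chunks whose sizes differ by at most one. `split_ranges` is a hypothetical helper that works on zero-based row positions rather than on an actual database.

```python
from typing import List


def split_ranges(total: int, n_splits: int = 2) -> List[range]:
    """Hypothetical helper: contiguous, nearly equal chunks of row positions."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    base, extra = divmod(total, n_splits)
    ranges, start = [], 0
    for i in range(n_splits):
        # The first `extra` chunks absorb the remainder, one row each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


parts = split_ranges(10, n_splits=3)
sizes = [len(r) for r in parts]  # [4, 3, 3]
```

Every row lands in exactly one chunk, so merging the split databases again would reproduce the original row set under this scheme.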

MolecularDiffusion.runmodes.data.ase_ops.verify_datapoint(atoms, mol_block): Verifies that ASE Atoms match RDKit Mol block.

MolecularDiffusion.runmodes.data.ase_ops.Chem = None

MolecularDiffusion.runmodes.data.ase_ops.logger