MolecularDiffusion.data.component.pharmacophore

Pharmacophore Dataset for ShEPhERD integration.

Faithful port of shepherd/src/shepherd/datasets.py HeteroDataset. Handles loading, processing, and forward-noising of four molecular modalities:

- x1: Molecular structure (atoms + bonds)
- x2: Shape (surface point cloud)
- x3: Electrostatics (surface point cloud + ESP)
- x4: Pharmacophores (points + directions)

Forward-noising is applied per-sample in __getitem__ using the noise schedule from the ShEPhERD checkpoint hyperparameters.
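As an illustration of what per-sample forward-noising typically looks like, here is a minimal sketch of the standard Gaussian forward step x_t = alpha_t * x_0 + sigma_t * eps. It is not the module's code: `forward_noise` is a hypothetical helper, the pairing sigma_t = sqrt(1 - alpha_t^2) is an assumption, and the real dataset operates on tensors rather than nested lists.

```python
import math
import random


def forward_noise(x0, alpha_t, sigma_t, rng=random):
    """Hypothetical helper: return (x_t, eps) with x_t = alpha_t * x0 + sigma_t * eps.

    x0 is a list of points (each a list of floats); eps is standard Gaussian noise
    of the same shape.
    """
    eps = [[rng.gauss(0.0, 1.0) for _ in point] for point in x0]
    x_t = [
        [alpha_t * c + sigma_t * e for c, e in zip(point, noise)]
        for point, noise in zip(x0, eps)
    ]
    return x_t, eps


# Toy three-point "molecule" (illustrative coordinates only).
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]]
alpha_t = 0.8
sigma_t = math.sqrt(1.0 - alpha_t ** 2)  # assumed variance-preserving pairing
noisy, eps = forward_noise(coords, alpha_t, sigma_t, random.Random(0))
```

With alpha_t = 1 and sigma_t = 0 the sample is returned unchanged, and with alpha_t = 0 and sigma_t = 1 it is pure noise, which are the two ends of a diffusion schedule.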

Attributes

Classes

PharmacophoreDataModule

DataModule wrapper for PharmacophoreDataset.

PharmacophoreDataset

Dataset for ShEPhERD pharmacophore diffusion.

Functions

build_default_noise_schedule([T])

Build the default ShEPhERD cosine+linear noise schedule.

pharmacophore_collate_fn(batch) → torch_geometric.data.HeteroData

Collate HeteroData samples into a batch.

sample_timestep_biased(ts)

Biased timestep sampling matching ShEPhERD.

Module Contents

class MolecularDiffusion.data.component.pharmacophore.PharmacophoreDataModule(pkl_files: List[str], root: str = 'data/pharmacophore_cache', dataset_name: str = 'pharmacophore', num_samples: int | None = None, data_fraction: float = 1.0, batch_size: int = 32, num_workers: int = 0, train_ratio: float = 0.8, task_type: str = 'diffusion_pharmacophore', checkpoint_path: str | None = None, **kwargs)

DataModule wrapper for PharmacophoreDataset.

Loads molblocks+charges from pkl files, builds noise schedule (from checkpoint or default), and creates train/val/test splits.
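The page does not say how the samples outside `train_ratio` are divided. The sketch below shows one common convention, offered purely as an assumption: shuffle the indices, take `train_ratio` of them for training, and split the remainder evenly between validation and test. `split_indices` and its `seed` argument are hypothetical and are not part of this module.

```python
import random


def split_indices(n, train_ratio=0.8, seed=0):
    """Hypothetical helper: shuffle range(n) and split it into train/valid/test.

    Assumption: the non-training remainder is halved between validation and test.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_ratio)
    n_valid = (n - n_train) // 2
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(100, train_ratio=0.8)
```

Check the actual `load()` implementation before relying on any particular validation/test proportion.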

load()

Load pkl data and create dataset with proper noise schedule.

batch_size = 32
checkpoint_path = None
collate_fn
data_fraction = 1.0
dataset_name = 'pharmacophore'
kwargs
num_samples = None
num_workers = 0
pkl_files
root = 'data/pharmacophore_cache'
task_type = 'diffusion_pharmacophore'
test_set = None
train_ratio = 0.8
train_set = None
valid_set = None
class MolecularDiffusion.data.component.pharmacophore.PharmacophoreDataset(molblocks_and_charges: List, noise_schedule_dict: Dict[str, Dict], compute_x1: bool = True, compute_x2: bool = False, compute_x3: bool = True, compute_x4: bool = True, recenter_x1: bool = True, add_virtual_node_x1: bool = True, remove_noise_COM_x1: bool = True, atom_types_x1: List = None, charge_types_x1: List = None, bond_types_x1: List = None, scale_atom_features_x1: float = 0.25, scale_bond_features_x1: float = 1.0, formal_charge_diffusion: bool = True, independent_timesteps_x2: bool = False, recenter_x2: bool = False, add_virtual_node_x2: bool = True, remove_noise_COM_x2: bool = False, num_points_x2: int = 75, independent_timesteps_x3: bool = False, recenter_x3: bool = False, add_virtual_node_x3: bool = True, remove_noise_COM_x3: bool = False, num_points_x3: int = 75, scale_node_features_x3: float = 2.0, independent_timesteps_x4: bool = False, recenter_x4: bool = False, add_virtual_node_x4: bool = True, remove_noise_COM_x4: bool = False, max_node_types_x4: int = 10, scale_node_features_x4: float = 2.0, scale_vector_features_x4: float = 2.0, multivectors: bool = False, check_accessibility: bool = False, probe_radius: float = 0.6)

Bases: torch.utils.data.Dataset

Dataset for ShEPhERD pharmacophore diffusion.

Each __getitem__ call returns a torch_geometric.data.HeteroData with forward-noised fields for x1, x2, x3, and x4, matching the output of HeteroDataset in shepherd/src/shepherd/datasets.py.

get_x1_data(mol, t, alpha_dash_t, sigma_dash_t)
get_x2_data(radii, atom_centers, num_points, recenter, add_virtual_node, remove_noise_COM, t, alpha_dash_t, sigma_dash_t, virtual_node_pos=None)
get_x3_data_electrostatics_only(charges, charge_centers, data, pos, virtual_node_mask, t, alpha_dash_t, sigma_dash_t)
get_x4_data(mol, recenter, add_virtual_node, remove_noise_COM, t, alpha_dash_t, sigma_dash_t, virtual_node_pos=None)
add_virtual_node_x1 = True
add_virtual_node_x2 = True
add_virtual_node_x3 = True
add_virtual_node_x4 = True
atom_types_x1 = [None, 'H', 'C', 'N', 'O', 'F', 'Cl', 'Br', 'I', 'S', 'P', 'Si']
bond_types_x1 = [None, 'SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC']
charge_types_x1
check_accessibility = False
compute_x1 = True
compute_x2 = False
compute_x3 = True
compute_x4 = True
formal_charge_diffusion = True
independent_timesteps_x2 = False
independent_timesteps_x3 = False
independent_timesteps_x4 = False
max_node_types_x4 = 10
molblocks_and_charges
multivectors = False
noise_schedule_dict
num_points_x2 = 75
num_points_x3 = 75
probe_radius = 0.6
recenter_x1 = True
recenter_x2 = False
recenter_x3 = False
recenter_x4 = False
remove_noise_COM_x1 = True
remove_noise_COM_x2 = False
remove_noise_COM_x3 = False
remove_noise_COM_x4 = False
scale_atom_features_x1 = 0.25
scale_bond_features_x1 = 1.0
scale_node_features_x3 = 2.0
scale_node_features_x4 = 2.0
scale_vector_features_x4 = 2.0
MolecularDiffusion.data.component.pharmacophore.build_default_noise_schedule(T=400)

Build the default ShEPhERD cosine+linear noise schedule. Matches the schedule defined in shepherd/training/parameters/params_x1x3x4_diffusion_mosesaq_20240824.py.
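For orientation, the sketch below builds a generic cosine schedule of the widely used form alpha_bar(t) = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2). It does not reproduce the linear component or the exact ShEPhERD parameters. The function name `cosine_alpha_bar`, the offset `s = 0.008`, and the reading of `alpha_dash_t` and `sigma_dash_t` as sqrt(alpha_bar) and sqrt(1 - alpha_bar) are all assumptions.

```python
import math


def cosine_alpha_bar(T=400, s=0.008):
    """Hypothetical helper: cumulative signal level alpha_bar[t] for t = 0..T."""

    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    return [f(t) / f(0) for t in range(T + 1)]


alpha_bar = cosine_alpha_bar()
# Assumed meaning of the per-timestep coefficients passed to the get_x*_data methods.
alpha_dash = [math.sqrt(a) for a in alpha_bar]
sigma_dash = [math.sqrt(1.0 - a) for a in alpha_bar]
```

In this generic form the signal coefficient falls monotonically from 1 at t = 0 to nearly 0 at t = T, while the noise coefficient rises correspondingly.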

MolecularDiffusion.data.component.pharmacophore.pharmacophore_collate_fn(batch: List[torch_geometric.data.HeteroData]) → torch_geometric.data.HeteroData

Collate HeteroData samples into a batch. Uses torch_geometric’s built-in Batch.from_data_list for HeteroData.

MolecularDiffusion.data.component.pharmacophore.sample_timestep_biased(ts)

Biased timestep sampling matching ShEPhERD:

- 7.5% from high-noise end (t=1..50)
- 75% from middle (t=50..250)
- 17.5% from low-noise start (t=250..400)
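The three-range split above can be sketched as a two-stage draw: first pick a range with the stated probabilities, then pick a timestep uniformly inside it. This is an illustration rather than the module's implementation. The name `sample_timestep_biased_sketch` is hypothetical, and because the listed ranges share the endpoints 50 and 250, the sketch assumes the boundaries t <= 50, 50 < t <= 250, and t > 250.

```python
import random


def sample_timestep_biased_sketch(ts, rng=random):
    """Hypothetical helper: draw one timestep from ts with a 7.5% / 75% / 17.5% bias.

    Assumes ts contains at least one timestep in each of the three ranges.
    """
    first = [t for t in ts if t <= 50]
    middle = [t for t in ts if 50 < t <= 250]
    last = [t for t in ts if t > 250]
    # Stage 1: choose a range; stage 2: choose uniformly within that range.
    bucket = rng.choices([first, middle, last], weights=[0.075, 0.75, 0.175], k=1)[0]
    return rng.choice(bucket)


timesteps = list(range(1, 401))
t = sample_timestep_biased_sketch(timesteps, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the draw reproducible, which is convenient when checking that the empirical range frequencies approach the stated percentages.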

MolecularDiffusion.data.component.pharmacophore.Chem = None
MolecularDiffusion.data.component.pharmacophore.logger
MolecularDiffusion.data.component.pharmacophore.o3d = None