torchani.datasets

Functions and classes for creating batched and map-like datasets.

Backends for the on-disk map-like datasets include HDF5, Apache Parquet, and Zarr. For a tutorial introduction to this API, consult the User guide.

Built-in datasets are downloaded on first instantiation. Each built-in dataset is calculated at a specific level of theory (LoT), generally specified as functional/basis_set or, where appropriate, wavefunction_method/basis_set.
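The default lot arguments in the signatures on this page (for example 'wb97x-631gd' or 'ccsd(t)star-cbs') appear to join a method label and a basis-set label with a hyphen. The following is a minimal sketch of splitting such a label, assuming the last hyphen is the separator. split_lot is a hypothetical helper for illustration only and is not part of TorchANI.

```python
from typing import Tuple


def split_lot(lot: str) -> Tuple[str, str]:
    """Split a level-of-theory label into (method, basis_set).

    Assumption: the label has the form '<method>-<basis_set>' and the last
    hyphen is the separator. Illustrative helper, not a TorchANI API.
    """
    method, sep, basis_set = lot.rpartition("-")
    if not sep or not method or not basis_set:
        raise ValueError(f"expected '<method>-<basis_set>', got {lot!r}")
    return method, basis_set


if __name__ == "__main__":
    for label in ("wb97x-631gd", "b973c-def2mtzvp", "ccsd(t)star-cbs"):
        print(label, "->", split_lot(label))
```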

Some of the built-in datasets were published in ANI papers; others are freely available external datasets, published elsewhere, that have been reformatted to conform to TorchANI’s API. If you use any of these datasets in your work, please cite the relevant article(s).

Functions

create_batched_dataset

batch_all_in_ram

concatenate

Combine all the backing stores in a given ANIDataset into one

TestData

GDB subset, only for debugging and code test purposes

TestDataIons

Only for debugging and test purposes, includes forces, dipoles and charges

TestDataForcesDipoles

Only for debugging and code testing purposes, includes forces and dipoles

IonsVeryHeavy

Dataset that includes ions, with H, C, N, O, F, S, Cl elements and at least one of Si, As, Br, Se, P, B, I (disjoint from IonsLight and IonsHeavy). This dataset is not meant to be used for training on its own.

IonsHeavy

WARNING: This dataset may have incorrect energies and/or forces.

IonsLight

WARNING: This dataset may have incorrect energies and/or forces.

ANI1q

Very limited subset of ANI-1x for which 'atomic CM5 charges' are available.

ANI2qHeavy

Subset of ANI-2x 'heavy' for which 'atomic CM5 charges' are available.

ANI1ccx

This dataset also has Hartree-Fock (HF) energies, RI-MP2 energies and forces, and DLPNO-CCSD(T) energies for different basis sets and PNO settings.

ANI1x

Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules.

ANI2x

In all cases the 'v2' and '2x' datasets are supersets of the 'v1' and '1x' datasets: everything in the v1/1x datasets is also in the v2/2x datasets, which contain extra structures. The exception is some wb97X/def2-TZVPP data points, for which there are 'v1' values but no 'v2' values.

COMP6v1

Test set, not meant for direct training.

COMP6v2

Test set, not meant for direct training.

ANI1e

Structures corresponding to all SMILES extracted from the ANI-1x dataset, embedded in 3D space and optimized with PM7.

Classes

BatchedDataset

ANIBatchedDataset

ANIBatchedInMemoryDataset

This dataset does not support multiprocessing (num_workers > 0) or pin_memory=True in the dataloader.

ANIDataset

Dataset that supports multiple stores and manages them as one single entity.

Batcher

class torchani.datasets.BatchedDataset
class torchani.datasets.ANIBatchedDataset(store_dir, split='division', transform=Identity(), limit=1.0, properties=(), drop_last=False)
cache(verbose=True, pin_memory=None)

Saves the full dataset into RAM

class torchani.datasets.ANIBatchedInMemoryDataset(batches, transform=Identity(), limit=1.0, split='division', drop_last=False)

This dataset does not support multiprocessing (num_workers > 0) or pin_memory=True in the dataloader.

class torchani.datasets.ANIDataset(locations, names=None, **kwargs)

Dataset that supports multiple stores and manages them as one single entity.

Datasets have a “grouping” for the different conformers, which can be “by_formula”, “by_num_atoms”, or “legacy”. Regrouping to one of the standard groupings can be done using ‘regroup_by_formula’ or ‘regroup_by_num_atoms’.
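To make the two standard groupings concrete, here is a minimal pure-Python sketch that groups molecules by chemical formula and by atom count. It does not use TorchANI: the helper names and the key format (for example 'H2O1') are assumptions made for illustration, and the group names the library actually writes may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def formula_key(symbols: Sequence[str]) -> str:
    """Build a key such as 'C1H4' from element symbols, in alphabetical order."""
    counts = Counter(symbols)
    return "".join(f"{element}{counts[element]}" for element in sorted(counts))


def group_by_formula(molecules: List[Sequence[str]]) -> Dict[str, List[int]]:
    """Map each formula key to the indices of the molecules that share it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, symbols in enumerate(molecules):
        groups[formula_key(symbols)].append(index)
    return dict(groups)


def group_by_num_atoms(molecules: List[Sequence[str]]) -> Dict[int, List[int]]:
    """Map each atom count to the indices of the molecules with that many atoms."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for index, symbols in enumerate(molecules):
        groups[len(symbols)].append(index)
    return dict(groups)


if __name__ == "__main__":
    molecules = [
        ["C", "H", "H", "H", "H"],
        ["H", "H", "O"],
        ["O", "H", "H"],
    ]
    print(group_by_formula(molecules))
    print(group_by_num_atoms(molecules))
```

Grouping by atom count produces fewer, larger groups, because molecules with different formulas but the same size share a group.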

Conformers can be extracted as {property: Tensor} or {property: ndarray} dicts, and can also be appended to or deleted from the backing stores.

All conformers in a dataset must have the same properties and the first dimension in all Tensors/arrays is the same for all conformer groups (it is the batch dimension). Property manipulation (renaming, deleting, adding) is also supported.
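One reading of this requirement is that, within a conformer group, every property has the same length along its first (batch) dimension. The sketch below checks that condition on a plain {property: sequence} dict. The function name and the property names in the usage example are placeholders, not TorchANI identifiers.

```python
from typing import Dict, Sequence


def check_conformer_group(group: Dict[str, Sequence]) -> int:
    """Return the number of conformers in a group.

    Raises ValueError if the group is empty or if its properties disagree on
    the length of the first (batch) dimension.
    """
    if not group:
        raise ValueError("a conformer group needs at least one property")
    sizes = {name: len(values) for name, values in group.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"inconsistent first dimension: {sizes}")
    return next(iter(sizes.values()))


if __name__ == "__main__":
    # Two water conformers; the property names are illustrative placeholders.
    water = {
        "species": [[8, 1, 1], [8, 1, 1]],
        "energies": [-76.4, -76.3],
    }
    print(check_conformer_group(water))
```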

classmethod from_dir(dir_, **kwargs)

Initializes datasets from all files in a given directory

File backends are inferred from the suffixes. All files must have different names.
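As a rough illustration of suffix-based backend inference, the sketch below maps file paths to backend names and rejects duplicate names. The suffix table is an assumption (the suffixes TorchANI actually recognizes may differ), "different names" is interpreted here as different file stems, and infer_backends is a hypothetical helper rather than the library's implementation.

```python
from pathlib import Path
from typing import Dict, Iterable

# Assumed mapping for illustration; TorchANI's real suffix table may differ.
SUFFIX_TO_BACKEND = {".h5": "hdf5", ".pq": "parquet", ".zarr": "zarr"}


def infer_backends(paths: Iterable[Path]) -> Dict[str, str]:
    """Map each store name (file stem) to a backend inferred from its suffix."""
    backends: Dict[str, str] = {}
    for path in paths:
        backend = SUFFIX_TO_BACKEND.get(path.suffix)
        if backend is None:
            raise ValueError(f"cannot infer a backend for {path.name!r}")
        if path.stem in backends:
            raise ValueError(f"duplicate store name {path.stem!r}")
        backends[path.stem] = backend
    return backends


if __name__ == "__main__":
    print(infer_backends([Path("ani1x.h5"), Path("extra.pq")]))
```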

torchani.datasets.concatenate(source, dest_location, verbose=True, backend='hdf5', delete_originals=False)

Combine all the backing stores in a given ANIDataset into one

torchani.datasets.TestData(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

GDB subset, only for debugging and code test purposes

torchani.datasets.TestDataIons(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)

Only for debugging and test purposes, includes forces, dipoles and charges

torchani.datasets.TestDataForcesDipoles(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)

Only for debugging and code testing purposes, includes forces and dipoles

torchani.datasets.IonsVeryHeavy(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)

Dataset that includes ions, with H, C, N, O, F, S, Cl elements and at least one of Si, As, Br, Se, P, B, I (disjoint from IonsLight and IonsHeavy). This dataset is not meant to be used for training on its own.

torchani.datasets.IonsHeavy(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)

WARNING: This dataset may have incorrect energies and/or forces. Dataset that includes ions, with H, C, N, O elements and at least one of F, S, Cl (disjoint from IonsLight). This dataset is not meant to be used for training on its own.

torchani.datasets.IonsLight(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)

WARNING: This dataset may have incorrect energies and/or forces. Dataset that includes ions, with H, C, N, O elements only. This dataset is not meant to be used for training on its own.

torchani.datasets.ANI1q(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

Very limited subset of ANI-1x for which ‘atomic CM5 charges’ are available. This dataset is not meant to be used for training on its own. Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. DOI: ‘10.1038/s41597-020-0473-z’

torchani.datasets.ANI2qHeavy(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

Subset of ANI-2x ‘heavy’ for which ‘atomic CM5 charges’ are available. This dataset is not meant to be used for training on its own. Originally published in TODO. DOI: ‘TODO’

torchani.datasets.ANI1ccx(lot='ccsd(t)star-cbs', verbose=True, download=True, dummy_properties=None, skip_check=False)

This dataset also has Hartree-Fock (HF) energies, RI-MP2 energies and forces, and DLPNO-CCSD(T) energies for different basis sets and PNO settings. This dataset was originally used for transfer learning, not direct training. Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. DOI: ‘10.1038/s41597-020-0473-z’

torchani.datasets.ANI1x(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. DOI: ‘10.1038/s41597-020-0473-z’

torchani.datasets.ANI2x(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

In all cases the ‘v2’ and ‘2x’ datasets are supersets of the ‘v1’ and ‘1x’ datasets: everything in the v1/1x datasets is also in the v2/2x datasets, which contain extra structures. The exception is some wb97X/def2-TZVPP data points, for which there are ‘v1’ values but no ‘v2’ values. Originally published in TODO. DOI: ‘TODO’

torchani.datasets.COMP6v1(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

Test set, not meant for direct training. In all cases the ‘v2’ and ‘2x’ datasets are supersets of the ‘v1’ and ‘1x’ datasets: everything in the v1/1x datasets is also in the v2/2x datasets, which contain extra structures. The exception is some wb97X/def2-TZVPP data points, for which there are ‘v1’ values but no ‘v2’ values. Note that the ANI-BenchMD, S66x8 and the 13 molecules (with 13 heavy atoms) of GDB-10to13 were recalculated using ORCA 5.0 instead of 4.2, with default integration grids. The numerical difference is not significant for the purposes of training. Originally published in TODO. DOI: ‘TODO’

torchani.datasets.COMP6v2(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

Test set, not meant for direct training. Note that the ANI-BenchMD, S66x8 and the 13 molecules (with 13 heavy atoms) of GDB-10to13 were recalculated using ORCA 5.0 instead of 4.2, with default integration grids. The numerical difference is not significant for the purposes of training. Originally published in TODO. DOI: ‘TODO’

torchani.datasets.ANI1e(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)

Structures corresponding to all SMILES extracted from the ANI-1x dataset, embedded in 3D space and optimized with PM7. This dataset does not have forces, but it has other physical properties: rotational constants A, B, C (GHz); dipole and polarizability magnitudes (Debye and a_0^3 respectively); energy of HOMO and LUMO, and HOMO-LUMO gap (Ha); average <r^2> (spatial extent, a_0^2); zero point vibrational energies (ZPVE, Ha); zero Kelvin internal energy (Ha); thermal quantities, U, H, G, C_v, at 298.15 K (C_v in cal/K/mol, rest in Ha). Originally published in ANI-1E: An equilibrium database from the ANI-1 database. DOI: ‘TODO’
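Several of the quantities listed above are reported in Hartree (Ha), while many chemistry workflows use kcal/mol or eV. A minimal conversion sketch follows. The helper names are hypothetical and the conversion factors are standard CODATA-derived values rounded to six decimal places; nothing here is a TorchANI API.

```python
# Standard conversion factors, rounded to six decimal places.
HARTREE_TO_KCAL_PER_MOL = 627.509474
HARTREE_TO_EV = 27.211386


def hartree_to_kcal_per_mol(energy_ha: float) -> float:
    """Convert an energy from Hartree to kcal/mol."""
    return energy_ha * HARTREE_TO_KCAL_PER_MOL


def hartree_to_ev(energy_ha: float) -> float:
    """Convert an energy from Hartree to electronvolts."""
    return energy_ha * HARTREE_TO_EV


if __name__ == "__main__":
    gap_ha = 0.25  # made-up HOMO-LUMO gap, for illustration only
    print(hartree_to_ev(gap_ha), "eV")
    print(hartree_to_kcal_per_mol(gap_ha), "kcal/mol")
```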

Modules

filters

Filters to remove unwanted structures from datasets