torchani.datasets
Functions and classes for creating batched and map-like datasets.
Backends for the on-disk map-like datasets include HDF5, Apache Parquet and Zarr. For a tutorial introduction to this API, consult the User Guide.
Built-in datasets are downloaded on first instantiation. Each built-in dataset provided by this module was calculated at a specific level of theory (LoT), generally specified as functional/basis_set or, where appropriate, wavefunction_method/basis_set.
Some of the built-in datasets have been published in ANI papers; others are freely available external datasets, published elsewhere, that have been reformatted to conform to TorchANI’s API. If you use any of these datasets in your work, please cite the relevant article(s).
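As an illustration of the LoT convention, the sketch below splits a `lot` key into its method and basis-set parts and shows how a built-in dataset could be requested at a given LoT. The helpers `split_lot` and `load_ani1x` are hypothetical names introduced here, and the assumption that every key has the form `<method>-<basis_set>` is inferred only from the default `lot` values listed on this page.

```python
from typing import Tuple


def split_lot(lot: str) -> Tuple[str, str]:
    """Split a LoT key such as 'wb97x-631gd' into (method, basis_set).

    Assumption: the key is '<method>-<basis_set>', with the first hyphen
    acting as the separator. This is not a TorchANI function.
    """
    method, sep, basis_set = lot.partition("-")
    if not sep or not method or not basis_set:
        raise ValueError(f"expected '<method>-<basis_set>', got {lot!r}")
    return method, basis_set


def load_ani1x(lot: str = "wb97x-631gd"):
    """Instantiate the built-in ANI-1x dataset at the requested LoT."""
    # Imported lazily so split_lot can be used without TorchANI installed.
    import torchani

    # The data is downloaded the first time the dataset is instantiated.
    return torchani.datasets.ANI1x(lot=lot, verbose=True, download=True)


# Default `lot` values that appear in the signatures on this page.
for key in ("wb97x-631gd", "b973c-def2mtzvp", "ccsd(t)star-cbs"):
    print(key, "->", split_lot(key))
```

Calling `load_ani1x()` requires TorchANI and network access on first use, so it is left uncalled here.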
Functions
concatenate | Combine all the backing stores in a given ANIDataset into one
TestData | GDB subset, only for debugging and code testing purposes
TestDataIons | Only for debugging and code testing purposes, includes forces, dipoles and charges
TestDataForcesDipoles | Only for debugging and code testing purposes, includes forces and dipoles
IonsVeryHeavy | Dataset that includes ions, with H, C, N, O, F, S, Cl elements and at least one of Si, As, Br, Se, P, B, I (disjoint from IonsLight and IonsHeavy). Not meant to be trained on by itself
IonsHeavy | WARNING: This dataset may have incorrect energies and/or forces.
IonsLight | WARNING: This dataset may have incorrect energies and/or forces.
ANI1q | Very limited subset of ANI-1x for which 'atomic CM5 charges' are available.
ANI2qHeavy | Subset of ANI-2x 'heavy' for which 'atomic CM5 charges' are available.
ANI1ccx | This dataset also has Hartree-Fock (HF) energies, RI-MP2 energies and forces, and DLPNO-CCSD(T) energies for different basis sets and PNO settings.
ANI1x | Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules.
ANI2x | In all cases the 'v2' and '2x' datasets are supersets of the 'v1' and '1x' datasets, so everything that is in the v1/1x datasets is also in the v2/2x datasets, which contain extra structures, except for some wb97X/def2-TZVPP data points for which there are 'v1' values but not 'v2' values.
COMP6v1 | Test set, not meant for direct training.
COMP6v2 | Test set, not meant for direct training.
ANI1e | Structures corresponding to all SMILES extracted from the ANI-1x dataset, embedded in 3D space and optimized with PM7.
Classes
ANIBatchedDataset |
ANIBatchedInMemoryDataset | This dataset does not support multiprocessing (num_workers > 0) or pin_memory=True in a DataLoader
ANIDataset | Dataset that supports multiple stores and manages them as a single entity.
- class torchani.datasets.ANIBatchedDataset(store_dir, split='division', transform=Identity(), limit=1.0, properties=(), drop_last=False)
- class torchani.datasets.ANIBatchedInMemoryDataset(batches, transform=Identity(), limit=1.0, split='division', drop_last=False)
This dataset does not support multiprocessing (num_workers > 0) or pin_memory=True in a DataLoader.
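The restriction above translates into specific `torch.utils.data.DataLoader` arguments. The following is a minimal sketch, not TorchANI code: `make_loader` is a hypothetical helper, and `batch_size=None` rests on the assumption that each item of a batched dataset is already a full batch, so PyTorch's automatic batching should be disabled.

```python
def make_loader(batched_dataset, shuffle: bool = True):
    """Wrap an in-memory batched dataset in a single-process DataLoader."""
    # Imported lazily so this sketch can be defined without PyTorch installed.
    from torch.utils.data import DataLoader

    return DataLoader(
        batched_dataset,
        batch_size=None,   # assumption: items are already batches
        shuffle=shuffle,   # shuffles the order of the batches
        num_workers=0,     # no worker processes (multiprocessing unsupported)
        pin_memory=False,  # pinned memory is unsupported for this dataset
    )
```

With `num_workers=0` all loading happens in the main process, which is the only mode the in-memory dataset is documented to support.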
- class torchani.datasets.ANIDataset(locations, names=None, **kwargs)
Dataset that supports multiple stores and manages them as a single entity.
Datasets have a “grouping” for the different conformers, which can be “by_formula”, “by_num_atoms” or “legacy”. Regrouping to one of the standard groupings can be done with ‘regroup_by_formula’ or ‘regroup_by_num_atoms’.
Conformers can be extracted as {property: Tensor} or {property: ndarray} dicts, and can also be appended to or deleted from the backing stores.
All conformers in a dataset must have the same properties, and within each conformer group all Tensors/arrays share the same first dimension (the batch dimension). Property manipulation (renaming, deleting, adding) is also supported.
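To make the idea of a grouping concrete, here is a toy model in plain Python. It is not TorchANI's implementation: the helper names, the group-key formats (such as `C1H4` or `5`) and the example energies are all made up for illustration. It groups conformers by formula or by number of atoms, then checks that every property in a group has the same first (batch) dimension.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def formula_key(species: List[str]) -> str:
    """Build a key such as 'C1H4' from element symbols (hypothetical format)."""
    counts = Counter(species)
    return "".join(f"{element}{counts[element]}" for element in sorted(counts))


def group_conformers(conformers: List[Dict[str, Any]], grouping: str):
    """Stack conformers into {group_key: {property: [values...]}}."""
    groups: Dict[str, Dict[str, list]] = defaultdict(lambda: defaultdict(list))
    for conformer in conformers:
        species = conformer["species"]
        if grouping == "by_formula":
            key = formula_key(species)
        elif grouping == "by_num_atoms":
            key = str(len(species))
        else:
            raise ValueError(f"unsupported grouping: {grouping!r}")
        for prop, value in conformer.items():
            groups[key][prop].append(value)
    return {key: dict(props) for key, props in groups.items()}


def batch_size(group: Dict[str, list]) -> int:
    """Return the shared first dimension, or fail if properties disagree."""
    lengths = {len(values) for values in group.values()}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent batch dimension: {sorted(lengths)}")
    return lengths.pop()


# Made-up conformers: two methanes, one water, one hydrogen cyanide.
conformers = [
    {"species": ["C", "H", "H", "H", "H"], "energies": -40.5},
    {"species": ["C", "H", "H", "H", "H"], "energies": -40.4},
    {"species": ["O", "H", "H"], "energies": -76.4},
    {"species": ["H", "C", "N"], "energies": -93.4},
]

by_formula = group_conformers(conformers, "by_formula")
by_num_atoms = group_conformers(conformers, "by_num_atoms")
print(sorted(by_formula), sorted(by_num_atoms))
```

Grouping by formula keeps water and hydrogen cyanide apart, while grouping by number of atoms places both in the same three-atom group.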
- torchani.datasets.concatenate(source, dest_location, verbose=True, backend='hdf5', delete_originals=False)
Combine all the backing stores in a given ANIDataset into one.
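A possible way to use this function, based only on the signatures shown on this page, is sketched below. The helper `merge_stores` is a hypothetical name. Passing a list of paths as `locations` and reopening the merged store with a one-element list are assumptions, and nothing is assumed about the return value of `concatenate`.

```python
def merge_stores(locations, dest_location, backend: str = "hdf5"):
    """Merge several backing stores into a single store at dest_location."""
    # Imported lazily so this sketch can be defined without TorchANI installed.
    from torchani.datasets import ANIDataset, concatenate

    # One dataset object that manages all the original stores.
    source = ANIDataset(locations=locations)

    # Write the combined store; the originals are kept on disk.
    concatenate(
        source,
        dest_location,
        verbose=True,
        backend=backend,
        delete_originals=False,
    )

    # Assumption: the merged store can be opened like any other location.
    return ANIDataset(locations=[dest_location])
```

Keeping `delete_originals=False` leaves the source files untouched, which is the safer choice while checking that the merged store is complete.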
- torchani.datasets.TestData(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
GDB subset, only for debugging and code testing purposes.
- torchani.datasets.TestDataIons(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)
Only for debugging and code testing purposes, includes forces, dipoles and charges.
- torchani.datasets.TestDataForcesDipoles(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)
Only for debugging and code testing purposes, includes forces and dipoles.
- torchani.datasets.IonsVeryHeavy(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)
Dataset that includes ions, with H, C, N, O, F, S, Cl elements and at least one of Si, As, Br, Se, P, B, I (disjoint from IonsLight and IonsHeavy). This dataset is not meant to be trained on by itself.
- torchani.datasets.IonsHeavy(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)
WARNING: This dataset may have incorrect energies and/or forces. Dataset that includes ions, with H, C, N, O elements and at least one of F, S, Cl (disjoint from IonsLight). This dataset is not meant to be trained on by itself.
- torchani.datasets.IonsLight(lot='b973c-def2mtzvp', verbose=True, download=True, dummy_properties=None, skip_check=False)
WARNING: This dataset may have incorrect energies and/or forces. Dataset that includes ions, with H, C, N, O elements only. This dataset is not meant to be trained on by itself.
- torchani.datasets.ANI1q(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
Very limited subset of ANI-1x for which ‘atomic CM5 charges’ are available. This dataset is not meant to be trained on by itself. Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. DOI: ‘10.1038/s41597-020-0473-z’
- torchani.datasets.ANI2qHeavy(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
Subset of ANI-2x ‘heavy’ for which ‘atomic CM5 charges’ are available. This dataset is not meant to be trained on by itself. Originally published in TODO. DOI: ‘TODO’
- torchani.datasets.ANI1ccx(lot='ccsd(t)star-cbs', verbose=True, download=True, dummy_properties=None, skip_check=False)
This dataset also has Hartree-Fock (HF) energies, RI-MP2 energies and forces, and DLPNO-CCSD(T) energies for different basis sets and PNO settings. This dataset was originally used for transfer learning, not direct training. Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. DOI: ‘10.1038/s41597-020-0473-z’
- torchani.datasets.ANI1x(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
Originally published in The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. DOI: ‘10.1038/s41597-020-0473-z’
- torchani.datasets.ANI2x(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
In all cases the ‘v2’ and ‘2x’ datasets are supersets of the ‘v1’ and ‘1x’ datasets, so everything that is in the v1/1x datasets is also in the v2/2x datasets, which contain extra structures, except for some wb97X/def2-TZVPP data points for which there are ‘v1’ values but not ‘v2’ values. Originally published in TODO. DOI: ‘TODO’
- torchani.datasets.COMP6v1(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
Test set, not meant for direct training. In all cases the ‘v2’ and ‘2x’ datasets are supersets of the ‘v1’ and ‘1x’ datasets, so everything that is in the v1/1x datasets is also in the v2/2x datasets, which contain extra structures, except for some wb97X/def2-TZVPP data points for which there are ‘v1’ values but not ‘v2’ values. Note that the ANI-BenchMD, S66x8 and the ‘13’ molecules (with 13 heavy atoms) of GDB-10to13 were recalculated using ORCA 5.0 instead of 4.2, with default integration grids. The numerical difference is not significant for the purposes of training. Originally published in TODO. DOI: ‘TODO’
- torchani.datasets.COMP6v2(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
Test set, not meant for direct training. Note that the ANI-BenchMD, S66x8 and the ‘13’ molecules (with 13 heavy atoms) of GDB-10to13 were recalculated using ORCA 5.0 instead of 4.2, with default integration grids. The numerical difference is not significant for the purposes of training. Originally published in TODO. DOI: ‘TODO’
- torchani.datasets.ANI1e(lot='wb97x-631gd', verbose=True, download=True, dummy_properties=None, skip_check=False)
Structures corresponding to all SMILES extracted from the ANI-1x dataset, embedded in 3D space and optimized with PM7. This dataset does not have forces, but it has other physical properties: rotational constants A, B, C (GHz); dipole and polarizability magnitudes (Debye and a_0^3 respectively); HOMO and LUMO energies and the HOMO-LUMO gap (Ha); average <r^2> (spatial extent, a_0^2); zero-point vibrational energies (ZPVE, Ha); zero-Kelvin internal energy (Ha); thermal quantities U, H, G, C_v at 298.15 K (C_v in cal/K/mol, the rest in Ha). Originally published in ANI-1E: An equilibrium database from the ANI-1 database. DOI: ‘TODO’
Modules
Filters to remove unwanted structures from datasets |