Advanced usage of ANIDataset#
Example showing more involved conformer and property manipulation.
To begin with, let’s import the modules we will use:
import shutil
from pathlib import Path
import torch
import numpy as np
from torchani.datasets import ANIDataset, concatenate
from torchani.datasets.filters import filter_by_high_force
Again, for the purposes of this example, we will copy and modify two files inside torchani/dataset; these can be downloaded by running the download.sh script.
file1_path = Path.cwd() / "file1.h5"
file2_path = Path.cwd() / "file2.h5"
shutil.copy(Path.cwd() / "../dataset/ani1-up_to_gdb4/ani_gdb_s01.h5", file1_path)
shutil.copy(Path.cwd() / "../dataset/ani1-up_to_gdb4/ani_gdb_s02.h5", file2_path)
ds = ANIDataset(locations=(file1_path, file2_path), names=("file1", "file2"))
Verifying format correctness: 22it [00:00, 2864.43it/s]
/home/ipickering/Repos/ani/torchani/datasets/anidataset.py:351: UserWarning: {'energiesHE', 'smiles', 'coordinatesHE'} found in legacy dataset, this will generate unpredictable issues.
Probably .items() and .values() will work but not much else. It is highly recommended that you backup these properties (if needed) and *delete them* using dataset.delete_properties
  warnings.warn(
Verifying format correctness: 92it [00:00, 2737.93it/s]
Property deletion / renaming#
All of the molecules in the dataset share the same set of properties (energies, coordinates, etc.). You can query which these are:
ds.properties
{'energiesHE', 'energies', 'species', 'coordinates', 'smiles', 'coordinatesHE'}
It is possible to delete unwanted or unneeded properties.
ds.delete_properties(("coordinatesHE", "energiesHE", "smiles"))
ds.properties
Deleting properties:   0%|          | 0/3 [00:00<?, ?it/s]
Verifying format correctness: 13it [00:00, 4782.98it/s]
Deleting properties:   0%|          | 0/13 [00:00<?, ?it/s]
Verifying format correctness: 53it [00:00, 4736.70it/s]
{'energies', 'species', 'coordinates'}
It is also possible to rename properties by passing a dict that maps old names to new ones (the class assumes at least one of “species” or “numbers” is always present, so don’t rename those).
ds.rename_properties({"energies": "molecular_energies", "coordinates": "coord"})
ds.properties
Verifying format correctness: 13it [00:00, 4789.70it/s]
Verifying format correctness: 53it [00:00, 4699.75it/s]
{'species', 'coord', 'molecular_energies'}
Let’s rename them back to their original values:
ds.rename_properties({"molecular_energies": "energies", "coord": "coordinates"})
ds.properties
Verifying format correctness: 13it [00:00, 4866.22it/s]
Verifying format correctness: 53it [00:00, 4762.27it/s]
{'energies', 'species', 'coordinates'}
Grouping#
You can query whether your dataset is in a legacy format by checking the dataset’s grouping attribute:
ds.grouping
'legacy'
The legacy format is used by some old datasets. In the legacy format, groups can be arbitrarily nested in the hierarchical tree inside the h5 files, and the “species”/“numbers” property does not have a batch dimension. This means all properties with an “atomic” dimension must be ordered the same way within a group (don’t worry too much if you don’t understand what this means; it basically means the format is difficult to deal with).
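As a rough, hypothetical illustration (the exact layout varies from file to file), a legacy tree might look like this, with “species” lacking the leading batch dimension that the other properties carry:
/gdb11_s01/mol-0/coordinates, shape (10, 3, 3)
                /species, shape (3,)
                /energies, shape (10,)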
We can convert to a less error-prone and easier-to-parse format by calling “regroup_by_formula” or “regroup_by_num_atoms”.
ds = ds.regroup_by_formula()
ds.grouping
Regrouping by formulas:   0%|          | 0/3 [00:00<?, ?it/s]
Regrouping by formulas:  85%|████████▌ | 11/13 [00:00<00:00, 32.72it/s]
'by_formula'
Another possibility is to group by num atoms
ds = ds.regroup_by_num_atoms()
ds.grouping
Regrouping by number of atoms: 0%| | 0/3 [00:00<?, ?it/s]
Regrouping by number of atoms: 0%| | 0/13 [00:00<?, ?it/s]
'by_num_atoms'
In these formats the first dimension of every property is the same within each group, and groups can only have depth one. In other words, the tree structure for “by_formula” is:
/C10H22/coordinates, shape (10, 32, 3)
       /species, shape (10, 32)
       /energies, shape (10,)
/C8H22N2/coordinates, shape (10, 32, 3)
        /species, shape (10, 32)
        /energies, shape (10,)
/C12H22/coordinates, shape (5, 34, 3)
       /species, shape (5, 34)
       /energies, shape (5,)
and for “by_num_atoms”:
/032/coordinates, shape (20, 32, 3)
    /species, shape (20, 32)
    /energies, shape (20,)
/034/coordinates, shape (5, 34, 3)
    /species, shape (5, 34)
    /energies, shape (5,)
Conformer groups can be iterated over in chunks, up to a specified maximum chunk size. This breaks a conformer group into mini-batches containing multiple inputs, allowing the dataset to be iterated over much more efficiently. As we regrouped the dataset by num_atoms in the previous step, this will iterate over conformer groups containing the same number of atoms.
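The code that performed this chunked iteration is not shown in the rendered output. As a minimal sketch, assuming only the dict-like .items() interface mentioned above and a hypothetical max_size chunk limit (the dataset may expose a dedicated chunked iterator), something like the following produces tuples of (species, coordinates) chunks like those printed below:
max_size = 300  # hypothetical maximum chunk size
for name, group in ds.items():
    # Split each group's batched tensors into chunks of at most max_size conformers
    species_chunks = group["species"].split(max_size)
    coords_chunks = group["coordinates"].split(max_size)
    for chunk in zip(species_chunks, coords_chunks):
        print(chunk)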
(tensor([[8, 1, 1],
[8, 1, 1],
[8, 1, 1],
...,
[8, 1, 1],
[8, 1, 1],
[8, 1, 1]]), tensor([[[ -0.000, -0.006, 0.107],
[ 0.000, 0.777, -0.421],
[ 0.000, -0.678, -0.343]],
[[ 0.000, -0.005, 0.133],
[ -0.000, 0.639, -0.614],
[ 0.000, -0.566, -0.557]],
[[ 0.000, 0.001, 0.123],
[ -0.000, 0.718, -0.504],
[ -0.000, -0.727, -0.511]],
...,
[[ -0.000, -0.005, 0.103],
[ 0.000, 0.838, -0.385],
[ 0.000, -0.756, -0.321]],
[[ -0.000, -0.002, 0.109],
[ 0.000, 0.845, -0.406],
[ 0.000, -0.817, -0.384]],
[[ 0.000, -0.004, 0.132],
[ -0.000, 0.795, -0.606],
[ -0.000, -0.729, -0.554]]]))
(tensor([[8, 1, 1],
        [8, 1, 1],
        [8, 1, 1],
        ...,
        [8, 1, 1],
        [8, 1, 1],
        [8, 1, 1]]), tensor([[[ 0.000, -0.006, 0.125],
[ -0.000, 0.703, -0.560],
[ 0.000, -0.609, -0.487]],
[[ -0.000, 0.005, 0.105],
[ 0.000, 0.871, -0.335],
[ -0.000, -0.946, -0.394]],
[[ 0.000, 0.007, 0.131],
[ -0.000, 0.687, -0.529],
[ -0.000, -0.794, -0.612]],
...,
[[ 0.000, 0.005, 0.124],
[ -0.000, 0.589, -0.487],
[ -0.000, -0.667, -0.548]],
[[ 0.000, 0.007, 0.131],
[ -0.000, 0.666, -0.528],
[ -0.000, -0.771, -0.610]],
[[ 0.000, 0.001, 0.122],
[ -0.000, 0.743, -0.495],
[ -0.000, -0.764, -0.512]]]))
Property creation#
Sometimes it may be useful to create a placeholder property for some purpose. You can make the second dimension equal to the number of atoms in the group by setting is_atomic=True, and you can also add extra dims. For example, the following creates a property with shape (N, A); for more examples see the docstring of the function.
ds = ds.create_full_property(
"new_property", is_atomic=True, fill_value=0.0, dtype=float
)
ds.properties
{'energies', 'species', 'coordinates', 'new_property'}
We now delete the created property to clean up:
ds.delete_properties("new_property", verbose=False)
ds.properties
{'energies', 'species', 'coordinates'}
Manipulating conformers#
Conformers can be appended as tensors by calling append_conformers.
Here we use placeholder species and random coordinates and energies, but you
should of course append data that makes sense. If your dataset has only one
store, you can pass the group name directly, without the store prefix.
conformers = {
"species": torch.tensor([[1, 1, 6, 6], [1, 1, 6, 6]]),
"coordinates": torch.randn(2, 4, 3),
"energies": torch.randn(2),
}
ds.append_conformers("file1/004", conformers)
<torchani.datasets.anidataset.ANIDataset object at 0x748741a51150>
It is also possible to append conformers as numpy arrays, in this case “species” can hold the chemical symbols or atomic numbers. Internally these will be converted to atomic numbers.
numpy_conformers = {
"species": np.array(
[["H", "H", "C", "N"], ["H", "H", "N", "O"], ["H", "H", "H", "H"]]
),
"coordinates": np.random.standard_normal((3, 4, 3)),
"energies": np.random.standard_normal(3),
}
ds.append_conformers("file1/004", numpy_conformers)
<torchani.datasets.anidataset.ANIDataset object at 0x748741a51150>
Conformers can also be deleted from the dataset. Passing indices deletes those specific conformers; passing nothing deletes the whole group.
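The code that produced the following output is omitted in the rendered page; presumably the group was inspected with something like this sketch (get_conformers is used again a few steps below):
# Print every conformer currently in the group, as a dict of batched tensors
print(ds.get_conformers("file1/004"))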
{'energies': tensor([-56.510, -56.502, -56.507, ..., -0.344, -0.234, -1.103], dtype=torch.float64), 'species': tensor([[7, 1, 1, 1],
[7, 1, 1, 1],
[7, 1, 1, 1],
...,
[1, 1, 6, 7],
[1, 1, 7, 8],
[1, 1, 1, 1]]), 'coordinates': tensor([[[ 0.020, 0.006, -0.078],
[ 0.385, -0.882, 0.067],
[ 0.318, 0.931, 0.038],
[-0.979, -0.132, 0.169]],
[[ 0.003, -0.015, -0.143],
[ 0.533, -0.736, 0.341],
[ 0.229, 0.820, 0.439],
[-0.803, 0.118, 0.400]],
[[-0.007, 0.010, -0.095],
[ 0.566, -0.902, 0.221],
[ 0.528, 0.938, 0.149],
[-0.991, -0.180, 0.147]],
...,
[[-0.249, -1.770, -0.611],
[-0.699, 0.256, 0.080],
[-0.676, 0.035, 1.758],
[-1.738, 1.049, -1.507]],
[[-1.254, -0.362, -0.504],
[ 0.921, -1.425, 0.982],
[-1.476, 0.247, 0.733],
[-0.273, -0.026, 0.927]],
[[-1.082, 1.456, 2.530],
[-0.351, 0.657, 1.429],
[ 0.563, 0.227, -0.162],
[-0.424, 1.179, 1.632]]])}
Let’s delete some conformers and try again.
ds.delete_conformers("file1/004", [0, 2])
molecules = ds.get_conformers("file1/004")
The length of the dataset (the number of conformer groups) has not changed:
len(ds)
10
Let’s get rid of the whole group:
ds.delete_conformers("file1/004")
len(ds)
9
Currently, when appending, the class checks:

- that the first dimension of all your properties is the same,
- that you are appending a set of conformers with the correct properties,
- that all your formulas are correct when the grouping type is “by_formula”,
- that your group name does not contain illegal “/” characters,
- that you are only appending one of “species” or “numbers”.

It does NOT check:

- that the number of atoms is the same in all properties that are atomic,
- that the name of the group is consistent with the formula / number of atoms.

It is the responsibility of the user to make sure of those items; the sketch below illustrates the first check in action.
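As a hedged illustration of the first check, this sketch appends properties whose first dimensions disagree and expects the dataset to reject them (the exact exception type raised is an assumption):
mismatched = {
    "species": torch.tensor([[1, 1, 6, 6]] * 3),  # first dimension is 3
    "coordinates": torch.randn(2, 4, 3),  # first dimension is 2
    "energies": torch.randn(2),
}
try:
    ds.append_conformers("file1/004", mismatched)
except Exception as e:  # assumed to be rejected; exception type unspecified
    print("append rejected:", e)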
Utilities#
Multiple datasets can be concatenated into one h5 file, optionally deleting the original h5 files if the concatenation is successful.
concat_path = Path.cwd() / "concat.h5"
ds = concatenate(ds, concat_path, delete_originals=True)
Concatenating datasets: 100%|██████████| 9/9 [00:00<00:00, 213.58it/s]
Deleting original stores: 100%|██████████| 2/2 [00:00<00:00, 42366.71it/s]
Context manager usage#
If you need to perform a lot of read/write operations on the dataset, it can be useful to keep all the underlying stores open; you can do this by using a keep_open context.
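The code for this step is not shown in the rendered output. A minimal sketch, assuming keep_open works as a plain context manager and that get_conformers accepts a conformer index (group “002” holds the N2 conformers after concatenation):
# Read ten individual conformers while the underlying store stays open
with ds.keep_open() as open_ds:  # assumed signature; it may take a mode flag
    for i in range(10):
        print(open_ds.get_conformers("002", i))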
{'energies': tensor(-109.491, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.527],
[ 0.000, 0.000, -0.527]])}
{'energies': tensor(-109.492, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.528],
[ 0.000, 0.000, -0.528]])}
{'energies': tensor(-109.494, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.571],
[ 0.000, 0.000, -0.571]])}
{'energies': tensor(-109.492, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.576],
[ 0.000, 0.000, -0.576]])}
{'energies': tensor(-109.493, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.574],
[ 0.000, 0.000, -0.574]])}
{'energies': tensor(-109.489, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.524],
[ 0.000, 0.000, -0.524]])}
{'energies': tensor(-109.491, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.578],
[ 0.000, 0.000, -0.578]])}
{'energies': tensor(-109.497, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.564],
[ 0.000, 0.000, -0.564]])}
{'energies': tensor(-109.497, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.541],
[ 0.000, 0.000, -0.541]])}
{'energies': tensor(-109.488, dtype=torch.float64), 'species': tensor([7, 7]), 'coordinates': tensor([[ 0.000, 0.000, 0.524],
[ 0.000, 0.000, -0.524]])}
Creating a dataset from scratch#
It is possible to create an ANIDataset from scratch by pointing it to a file that does not exist yet. By default the grouping is “by_num_atoms”. The first set of conformers you append will determine which properties the dataset supports.
new_path = Path.cwd() / "new_ds.h5"
new_ds = ANIDataset(new_path, grouping="by_formula")
numpy_conformers = {
"species": np.array([["H", "H", "C", "C"], ["H", "C", "H", "C"]]),
"coordinates": np.random.standard_normal((2, 4, 3)),
"forces": np.random.normal(size=(2, 4, 3), scale=0.1),
"dipoles": np.random.standard_normal((2, 3)),
"energies": np.random.standard_normal(2),
}
new_ds.append_conformers("C2H2", numpy_conformers)
print(new_ds.properties)
for c in new_ds.iter_conformers():
print(c)
{'energies', 'species', 'dipoles', 'coordinates', 'forces'}
{'coordinates': tensor([[-0.240, 0.366, 1.269],
[-0.020, 0.736, 0.480],
[ 1.023, -0.101, 0.159],
[ 0.813, -2.349, 0.546]], dtype=torch.float64), 'forces': tensor([[ 0.037, 0.015, 0.110],
[ 0.018, 0.030, 0.025],
[-0.060, 0.027, 0.098],
[ 0.194, 0.153, -0.117]], dtype=torch.float64), 'energies': tensor(0.020, dtype=torch.float64), 'species': tensor([1, 1, 6, 6]), 'dipoles': tensor([ 0.685, 0.322, -1.340], dtype=torch.float64)}
{'coordinates': tensor([[-0.785, 2.048, -0.692],
[-0.190, -1.196, 0.397],
[ 0.304, -1.361, -1.391],
[ 1.604, -1.926, 0.719]], dtype=torch.float64), 'forces': tensor([[ 0.210, -0.244, 0.032],
[ 0.125, 0.006, 0.030],
[ 0.121, -0.104, -0.078],
[ 0.258, 0.015, 0.127]], dtype=torch.float64), 'energies': tensor(0.355, dtype=torch.float64), 'species': tensor([1, 6, 1, 6]), 'dipoles': tensor([-0.356, 0.847, -1.759], dtype=torch.float64)}
Another useful feature is deleting, in place, all conformers whose force magnitude exceeds a given threshold. We will demonstrate this by introducing some conformers with extremely large forces:
bad_conformers = {
"species": np.array([["H", "H", "N", "N"], ["H", "H", "N", "N"]]),
"coordinates": np.random.standard_normal((2, 4, 3)),
"forces": np.random.normal(size=(2, 4, 3), scale=100.0),
"dipoles": np.random.standard_normal((2, 3)),
"energies": np.random.standard_normal(2),
}
new_ds.append_conformers("C2H2", bad_conformers)
filtered_conformers_and_ids = filter_by_high_force(new_ds, delete_inplace=True)
filtered_conformers_and_ids
Filtering where any atomic force magnitude > 2.0 Ha / Angstrom: 1it [00:00, 2016.49it/s]
Deleting filtered conformers: 100%|██████████| 1/1 [00:00<00:00, 368.12it/s]
Deleted 2 bad conformations
([{'coordinates': tensor([[[ 2.164, -0.270, 0.224],
[-0.737, -1.179, -0.011],
[ 0.772, -0.575, 0.743],
[-1.377, 0.540, 0.136]],
[[ 0.266, -0.662, 1.464],
[ 0.074, 0.960, -1.011],
[-0.203, 1.394, -0.449],
[-0.222, 2.537, -0.832]]], dtype=torch.float64), 'forces': tensor([[[ -5.237, 50.387, 10.226],
[ 252.357, -101.943, -36.191],
[ -24.163, -34.409, 81.134],
[ 65.504, 7.206, 43.761]],
[[ 40.129, -96.685, -15.076],
[ 209.119, -207.160, -41.174],
[ 12.339, 95.208, 26.201],
[ -7.040, 53.899, -83.423]]], dtype=torch.float64), 'energies': tensor([0.645, 2.477], dtype=torch.float64), 'species': tensor([[1, 1, 7, 7],
[1, 1, 7, 7]]), 'dipoles': tensor([[ 0.573, -1.388, -0.481],
[ 0.847, 0.166, -0.318]], dtype=torch.float64)}], {'C2H2': tensor([2, 3])})
Finally, let’s delete the files we used, to clean up.
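A sketch of the cleanup; note that file1.h5 and file2.h5 were already removed by concatenate(..., delete_originals=True):
concat_path.unlink()  # the concatenated dataset created above
new_path.unlink()  # the dataset created from scratch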