Data Format

Data Shapes

All data passed to a network in Brainstorm by a data iterator must match the template (T, B, ...), where T is the maximum sequence length and B is the number of sequences (in other words, the batch size).

To simplify handling of sequential and non-sequential data alike, the same template is used when the data is not sequential; the shape then simply becomes (1, B, ...). As an example, the MNIST training images for classification with an MLP should be shaped (1, 60000, 784) and the corresponding targets should be shaped (1, 60000, 1).

Image and video data should be stored in TNHWC format, i.e. (time, number of samples, height, width, channels). For example, the CIFAR-10 training images should be shaped (1, 50000, 32, 32, 3) and the targets should be shaped (1, 50000, 1).
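For instance, the extra leading time axis can be added to plain NumPy arrays with a simple reshape. A minimal sketch, where the arrays are hypothetical stand-ins for the MNIST and CIFAR-10 images:

import numpy as np

# Hypothetical stand-ins: 60000 flattened grayscale images, 50000 colour images.
mnist_images = np.random.rand(60000, 784)
cifar_images = np.random.rand(50000, 32, 32, 3)

mlp_inputs = mnist_images.reshape((1, 60000, 784))           # (T, B, F)
conv_inputs = cifar_images.reshape((1, 50000, 32, 32, 3))    # (T, B, H, W, C)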

Example

A network in Brainstorm accepts a dictionary of named data items as input. The keys of this dictionary and the shapes of the data must match those specified when the network was built.

Consider a simple network built as follows:

import numpy as np
from brainstorm import Network, layers

inp = layers.Input({'my_inputs': ('T', 'B', 50),
                    'my_targets': ('T', 'B', 2)})
hid = layers.FullyConnected(100, name='Hidden')
out = layers.SoftmaxCE(name='Output')
loss = layers.Loss()

# Route the named inputs through the network and feed the
# SoftmaxCE 'loss' output into the Loss layer:
inp - 'my_inputs' >> hid >> out
inp - 'my_targets' >> 'targets' - out - 'loss' >> loss

network = Network.from_layer(loss)


Here is how you can provide some data to this network and run a forward pass on it.
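A minimal sketch of what that can look like for the network built above, assuming the weights are initialized first; the data sizes (7 time steps, 10 sequences) are made up for illustration:

from brainstorm import initializers

# Give the weights some values before running the network (Gaussian init as an example).
network.initialize(initializers.Gaussian(0.1))

# Shapes must match the Input layer above: (T, B, 50) inputs and (T, B, 2) targets.
forward_data = {
    'my_inputs': np.random.randn(7, 10, 50),
    'my_targets': np.random.randn(7, 10, 2)
}
network.provide_external_data(forward_data)
network.forward_pass()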

File Format

Brainstorm places no requirements on how data is stored on disk, but we highly recommend the HDF5 format via the h5py library.

Creating HDF5 files is very simple:

import h5py
import numpy as np

with h5py.File('demo.hdf5', 'w') as f:
    f['training/input_data'] = np.random.randn(7, 100, 15)   # (T, B, F): 100 sequences of length 7
    f['training/targets'] = np.random.randn(7, 100, 2)       # matching sequential targets
    f['training/static_data'] = np.random.randn(1, 100, 4)   # non-sequential data, so T = 1

Having such a file available, you can then set up your data iterators like this:

import h5py
import brainstorm as bs

ds = h5py.File('demo.hdf5', 'r')

online_train_iter = bs.Online(**ds['training'])               # one sequence at a time
minibatch_train_iter = bs.Minibatches(100, **ds['training'])  # batches of 100 sequences

These iterators will then provide the network with a dictionary of named data items, using the names 'input_data', 'targets' and 'static_data'.
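Such iterators are usually not consumed by hand but handed straight to a trainer. A rough sketch, assuming a network whose Input layer expects data items named 'input_data', 'targets' and 'static_data', and using the Trainer, stepper and hook classes as they appear in Brainstorm's examples:

trainer = bs.Trainer(bs.training.MomentumStepper(learning_rate=0.01, momentum=0.9))
trainer.add_hook(bs.hooks.StopAfterEpoch(10))

# On every iteration the data iterator supplies its named data items to the network.
trainer.train(network, minibatch_train_iter)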

h5py offers many more features that can improve data storage and access, such as chunking and compression.
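For example, the plain assignments above can be replaced with create_dataset calls to obtain chunked, gzip-compressed datasets; a sketch using standard h5py options:

import h5py
import numpy as np

with h5py.File('demo_compressed.hdf5', 'w') as f:
    f.create_dataset('training/input_data',
                     data=np.random.randn(7, 100, 15),
                     chunks=(7, 10, 15),        # store blocks of 10 sequences together
                     compression='gzip',
                     compression_opts=4)        # gzip level 4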