Homogeneous storage classes

The Array class

class tables.Array(parentnode, name, obj=None, title='', byteorder=None, _log=True, _atom=None, track_times=True)[source]

This class represents homogeneous datasets in an HDF5 file.

This class provides methods to write or read data to or from array objects in the file. It does not allow you to enlarge or compress the datasets on disk; use the EArray class (see The EArray class) if you want enlargeable dataset support or compression features, or CArray (see The CArray class) if you just want compression.

An interesting property of the Array class is that it remembers the flavor of the object that has been saved so that if you saved, for example, a list, you will get a list during readings afterwards; if you saved a NumPy array, you will get a NumPy object, and so forth.
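
The flavor behavior described above can be checked with a short round trip. This is a minimal sketch, assuming a scratch file in a temporary directory; the file name and node names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "flavor_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        # Save the same values once as a list and once as a NumPy array.
        from_list = h5f.create_array(h5f.root, "from_list", [1, 2, 3])
        from_numpy = h5f.create_array(h5f.root, "from_numpy", np.array([1, 2, 3]))

        # Each node remembers the flavor of the object it was created from.
        list_flavor, list_back = from_list.flavor, from_list.read()
        numpy_flavor, numpy_back = from_numpy.flavor, from_numpy.read()

print(list_flavor, type(list_back).__name__)
print(numpy_flavor, type(numpy_back).__name__)
```

The list comes back as a list and the NumPy array as a NumPy array, even though both are stored as plain HDF5 datasets.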

Note that this class inherits all the public attributes and methods that Leaf (see The Leaf class) already provides. However, as Array instances have no internal I/O buffers, it is not necessary to use the flush() method they inherit from Leaf in order to save their internal state to disk. When a writing method call returns, all the data is already on disk.

Parameters:
  • parentnode

    The parent Group object.

    Changed in version 3.0: Renamed from parentNode to parentnode.

  • name (str) – The name of this node in its parent group.

  • obj

    The array or scalar to be saved. Accepted types are NumPy arrays and scalars as well as native Python sequences and scalars, provided that values are regular (i.e. they are not like [[1,2],2]) and homogeneous (i.e. all the elements are of the same type).

    Changed in version 3.0: Renamed from object to obj.

  • title – A description for this node (it sets the TITLE HDF5 attribute on disk).

  • byteorder – The byteorder of the data on disk, specified as ‘little’ or ‘big’. If this is not specified, the byteorder is that of the given object.

  • track_times

    Whether time data associated with the leaf are recorded (object access time, raw data modification time, metadata change time, object birth time); default True. Semantics of these times depend on their implementation in the HDF5 library: refer to documentation of the H5O_info_t data structure. As of HDF5 1.8.15, only ctime (metadata change time) is implemented.

    New in version 3.4.3.

Array instance variables

Array.atom

An Atom (see The Atom class and its descendants) instance representing the type and shape of the atomic objects to be saved.

Array.rowsize

The size of the rows in bytes in dimensions orthogonal to maindim.

Array.nrow

On iterators, this is the index of the current row.

Array.nrows

The number of rows in the array.
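
As an illustration of these instance variables, the sketch below creates a small 4 x 3 array of 32-bit integers and inspects it. The file name and node name are hypothetical, and the sizes quoted in the comments follow from the chosen shape and type.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "array_attrs.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        arr = h5f.create_array(h5f.root, "grid", np.zeros((4, 3), dtype="int32"))

        atom_dtype = arr.atom.dtype   # element type described by the atom
        nrows = int(arr.nrows)        # length along the main dimension
        rowsize = int(arr.rowsize)    # 3 elements per row * 4 bytes each

print(atom_dtype, nrows, rowsize)
```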

Array methods

Array.get_enum()[source]

Get the enumerated type associated with this array.

If this array is of an enumerated type, the corresponding Enum instance (see The Enum class) is returned. If it is not of an enumerated type, a TypeError is raised.
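
A minimal sketch of both outcomes follows. It assumes an EArray built from an EnumAtom (EArray inherits get_enum() from Array) and a plain integer Array for the failing case; the enumeration names and values, the file name and the node names are all illustrative.

```python
import os
import tempfile

import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "enum_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        # Hypothetical enumeration; names and values are illustrative only.
        day_atom = tb.EnumAtom({"Mon": 1, "Tue": 2, "Wed": 3}, "Mon", base="uint16")
        days = h5f.create_earray(h5f.root, "days", day_atom, shape=(0,))

        # An enumerated array hands back its Enum instance.
        wdays = days.get_enum()
        mon_value = wdays["Mon"]

        # A plain integer array has no enumerated type, so TypeError is raised.
        plain = h5f.create_array(h5f.root, "plain", [1, 2, 3])
        try:
            plain.get_enum()
            plain_is_enum = True
        except TypeError:
            plain_is_enum = False

print(mon_value, plain_is_enum)
```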

Array.iterrows(start=None, stop=None, step=None)[source]

Iterate over the rows of the array.

This method returns an iterator yielding an object of the current flavor for each selected row in the array. The returned rows are taken from the main dimension.

If a range is not supplied, all the rows in the array are iterated upon - you can also use the Array.__iter__() special method for that purpose. If you only want to iterate over a given range of rows in the array, you may use the start, stop and step parameters.

Examples

result = [row for row in arrayInstance.iterrows(step=4)]

Changed in version 3.0: If the start parameter is provided and stop is None then the array is iterated from start to the last line. In PyTables < 3.0 only one element was returned.
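
The range parameters can be tried on a small one-dimensional array. This is a sketch with hypothetical file and node names; it also exercises the post-3.0 behavior where giving only start iterates to the last row.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "iterrows_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        arr = h5f.create_array(h5f.root, "numbers", np.arange(10))

        # Rows 2 and 5: same semantics as range(2, 8, 3).
        every_third = [int(row) for row in arr.iterrows(start=2, stop=8, step=3)]

        # Only start given: iterate from row 7 to the end of the array.
        tail = [int(row) for row in arr.iterrows(start=7)]

print(every_third, tail)
```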

Array.__next__()[source]

Get the next element of the array during an iteration.

The element is returned as an object of the current flavor.

Array.read(start=None, stop=None, step=None, out=None)[source]

Get data in the array as an object of the current flavor.

The start, stop and step parameters can be used to select only a range of rows in the array. Their meanings are the same as in the built-in range() Python function, except that negative values of step are not allowed yet. Moreover, if only start is specified, then stop will be set to start + 1. If you specify neither start nor stop, then all the rows in the array are selected.

The out parameter may be used to specify a NumPy array to receive the output data. The array must have the same size as the data selected with the other parameters. Its datatype is not checked and no type casting is performed, so if it does not match the datatype on disk, the output will not be correct. This parameter is only valid when the array’s flavor is set to ‘numpy’; otherwise, a TypeError is raised.

When data is read from disk in NumPy format, the output will be in the current system’s byteorder, regardless of how it is stored on disk. The exception is when an output buffer is supplied, in which case the output will be in the byteorder of that output buffer.

Changed in version 3.0: Added the out parameter.
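
The following sketch reads a range of rows twice, once into a fresh array and once into a caller-supplied buffer. The file and node names are hypothetical; the buffer is deliberately created with the same dtype and size as the selection, since read() does not check or cast it.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "read_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        arr = h5f.create_array(h5f.root, "numbers", np.arange(10, dtype="int32"))

        # Rows 1, 3 and 5, returned as a new NumPy array.
        selected = arr.read(start=1, stop=7, step=2)

        # Same selection, written into a preallocated buffer of matching
        # dtype (int32) and size (3 elements).
        buf = np.empty(3, dtype="int32")
        arr.read(start=1, stop=7, step=2, out=buf)

        # No range at all: every row is returned.
        everything = arr.read()

print(selected, buf, everything)
```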

Array special methods

The following methods automatically trigger actions when an Array instance is accessed in a special way (e.g. array[2:3,...,::2] will be equivalent to a call to array.__getitem__((slice(2, 3, None), Ellipsis, slice(None, None, 2)))).

Array.__getitem__(key)[source]

Get a row, a range of rows or a slice from the array.

The set of tokens allowed for the key is the same as that for extended slicing in Python (including the Ellipsis or … token). The result is an object of the current flavor; its shape depends on the kind of slice used as key and the shape of the array itself.

Furthermore, NumPy-style fancy indexing, where a list of indices in a certain axis is specified, is also supported. Note that only one list per selection is supported right now. Finally, NumPy-style point and boolean selections are supported as well.

Examples

array1 = array[4]                       # simple selection
array2 = array[4:1000:2]                # slice selection
array3 = array[1, ..., ::2, 1:4, 4:]    # general slice selection
array4 = array[1, [1,5,10], ..., -1]    # fancy selection
array5 = array[np.where(array[:] > 4)]  # point selection
array6 = array[array[:] > 4]            # boolean selection
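
To make the resulting shapes concrete, here is a small self-contained sketch on a 4 x 5 array. The file name and node name are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "getitem_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        arr = h5f.create_array(h5f.root, "grid", np.arange(20).reshape(4, 5))

        row = arr[1]              # simple selection, shape (5,)
        block = arr[1:3, ::2]     # slice selection, shape (2, 3)
        column = arr[..., -1]     # last column via Ellipsis, shape (4,)
        large = arr[arr[:] > 17]  # boolean selection, 1-D result

print(row.shape, block.shape, column.shape, large)
```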

Array.__iter__()[source]

Iterate over the rows of the array.

This is equivalent to calling Array.iterrows() with default arguments, i.e. it iterates over all the rows in the array.

Examples

result = [row[2] for row in array]

Which is equivalent to:

result = [row[2] for row in array.iterrows()]

Array.__setitem__(key, value)[source]

Set a row, a range of rows or a slice in the array.

It takes different actions depending on the type of the key parameter: if it is an integer, the corresponding array row is set to value (the value is broadcast when needed). If key is a slice, the row slice determined by it is set to value (as usual, if the slice to be updated exceeds the actual shape of the array, only the values in the existing range are updated).

If value is a multidimensional object, then its shape must be compatible with the shape determined by key, otherwise, a ValueError will be raised.

Furthermore, NumPy-style fancy indexing, where a list of indices in a certain axis is specified, is also supported. Note that only one list per selection is supported right now. Finally, NumPy-style point and boolean selections are supported as well.

Examples

a1[0] = 333        # assign an integer to an integer Array row
a2[0] = 'b'        # assign a string to a string Array row
a3[1:4] = 5        # broadcast 5 to slice 1:4
a4[1:4:2] = 'xXx'  # broadcast 'xXx' to slice 1:4:2

# General slice update (a5.shape = (4,3,2,8,5,10)).
# The integer index drops the first axis, so the selection has shape (3,2,4,3,6).
a5[1, ..., ::2, 1:4, 4:] = np.arange(432).reshape(3, 2, 4, 3, 6)
a6[1, [1,5,10], ..., -1] = arr    # fancy selection
a7[np.where(a6[:] > 4)] = 4       # point selection + broadcast
a8[arr > 4] = arr2                # boolean selection
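
The broadcasting and shape rules can be seen on a small 4 x 3 array. This is a sketch with hypothetical file and node names; the last assignment is intentionally incompatible to show the ValueError.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "setitem_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        arr = h5f.create_array(h5f.root, "grid", np.zeros((4, 3), dtype="int64"))

        arr[0] = 333          # scalar broadcast to the whole row
        arr[1:3] = [1, 2, 3]  # one row broadcast to rows 1 and 2
        arr[3, ::2] = 7       # strided slice inside a single row

        # A value of length 2 cannot be broadcast to a row of length 3.
        try:
            arr[0] = [1, 2]
            mismatch_rejected = False
        except ValueError:
            mismatch_rejected = True

        result = arr.read()

print(result)
print(mismatch_rejected)
```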

The CArray class

class tables.CArray(parentnode, name, atom=None, shape=None, title='', filters=None, chunkshape=None, byteorder=None, _log=True, track_times=True)[source]

This class represents homogeneous datasets in an HDF5 file.

The difference between a CArray and a normal Array (see The Array class), from which it inherits, is that a CArray has a chunked layout and, as a consequence, it supports compression. You can use datasets of this class to easily save or load arrays to or from disk, with compression support included.

CArray includes all the instance variables and methods of Array. Only those with different behavior are mentioned here.

Parameters:
  • parentnode

    The parent Group object.

    Changed in version 3.0: Renamed from parentNode to parentnode.

  • name (str) – The name of this node in its parent group.

  • atom – An Atom instance representing the type and shape of the atomic objects to be saved.

  • shape – The shape of the new array.

  • title – A description for this node (it sets the TITLE HDF5 attribute on disk).

  • filters – An instance of the Filters class that provides information about the desired I/O filters to be applied during the life of this object.

  • chunkshape – The shape of the data chunk to be read or written in a single HDF5 I/O operation. Filters are applied to those chunks of data. The dimensionality of chunkshape must be the same as that of shape. If None, a sensible value is calculated (which is recommended).

  • byteorder – The byteorder of the data on disk, specified as ‘little’ or ‘big’. If this is not specified, the byteorder is that of the platform.

  • track_times

    Whether time data associated with the leaf are recorded (object access time, raw data modification time, metadata change time, object birth time); default True. Semantics of these times depend on their implementation in the HDF5 library: refer to documentation of the H5O_info_t data structure. As of HDF5 1.8.15, only ctime (metadata change time) is implemented.

    New in version 3.4.3.

Examples

See below a small example of the use of the CArray class. The code is available in examples/carray1.py:

import numpy as np
import tables as tb

fileName = 'carray1.h5'
shape = (200, 300)
atom = tb.UInt8Atom()
filters = tb.Filters(complevel=5, complib='zlib')

h5f = tb.open_file(fileName, 'w')
ca = h5f.create_carray(h5f.root, 'carray', atom, shape,
                       filters=filters)

# Fill a hyperslab in ``ca``.
ca[10:60, 20:70] = np.ones((50, 50))
h5f.close()

# Re-open and read another hyperslab
h5f = tb.open_file(fileName)
print(h5f)
print(h5f.root.carray[8:12, 18:22])
h5f.close()

The output for the previous script is something like:

carray1.h5 (File) ''
Last modif.: 'Thu Apr 12 10:15:38 2007'
Object Tree:
/ (RootGroup) ''
/carray (CArray(200, 300), shuffle, zlib(5)) ''

[[0 0 0 0]
 [0 0 0 0]
 [0 0 1 1]
 [0 0 1 1]]

The EArray class

class tables.EArray(parentnode, name, atom=None, shape=None, title='', filters=None, expectedrows=None, chunkshape=None, byteorder=None, _log=True, track_times=True)[source]

This class represents extendable, homogeneous datasets in an HDF5 file.

The main difference between an EArray and a CArray (see The CArray class), from which it inherits, is that the former can be enlarged along one of its dimensions, the enlargeable dimension. That means that the Leaf.extdim attribute (see Leaf) of any EArray instance will always be non-negative. Multiple enlargeable dimensions might be supported in the future.

New rows can be added to the end of an enlargeable array by using the EArray.append() method.

Parameters:
  • parentnode

    The parent Group object.

    Changed in version 3.0: Renamed from parentNode to parentnode.

  • name (str) – The name of this node in its parent group.

  • atom – An Atom instance representing the type and shape of the atomic objects to be saved.

  • shape – The shape of the new array. One (and only one) of the shape dimensions must be 0. The dimension being 0 means that the resulting EArray object can be extended along it. Multiple enlargeable dimensions are not supported right now.

  • title – A description for this node (it sets the TITLE HDF5 attribute on disk).

  • filters – An instance of the Filters class that provides information about the desired I/O filters to be applied during the life of this object.

  • expectedrows – A user estimate about the number of row elements that will be added to the growable dimension in the EArray node. If not provided, the default value is EXPECTED_ROWS_EARRAY (see tables/parameters.py). If you plan to create either a much smaller or a much bigger EArray try providing a guess; this will optimize the HDF5 B-Tree creation and management process time and the amount of memory used.

  • chunkshape – The shape of the data chunk to be read or written in a single HDF5 I/O operation. Filters are applied to those chunks of data. The dimensionality of chunkshape must be the same as that of shape (beware: no dimension should be 0 this time!). If None, a sensible value is calculated based on the expectedrows parameter (which is recommended).

  • byteorder – The byteorder of the data on disk, specified as ‘little’ or ‘big’. If this is not specified, the byteorder is that of the platform.

  • track_times

    Whether time data associated with the leaf are recorded (object access time, raw data modification time, metadata change time, object birth time); default True. Semantics of these times depend on their implementation in the HDF5 library: refer to documentation of the H5O_info_t data structure. As of HDF5 1.8.15, only ctime (metadata change time) is implemented.

    New in version 3.4.3.

Examples

See below a small example of the use of the EArray class. The code is available in examples/earray1.py:

import numpy as np
import tables as tb

fileh = tb.open_file('earray1.h5', mode='w')
a = tb.StringAtom(itemsize=8)

# Use ``a`` as the object type for the enlargeable array.
array_c = fileh.create_earray(fileh.root, 'array_c', a, (0,),
                              "Chars")
array_c.append(np.array(['a'*2, 'b'*4], dtype='S8'))
array_c.append(np.array(['a'*6, 'b'*8, 'c'*10], dtype='S8'))

# Read the string ``EArray`` we have created on disk.
for s in array_c:
    print('array_c[%s] => %r' % (array_c.nrow, s))
# Close the file.
fileh.close()

The output for the previous script is something like:

array_c[0] => 'aa'
array_c[1] => 'bbbb'
array_c[2] => 'aaaaaa'
array_c[3] => 'bbbbbbbb'
array_c[4] => 'cccccccc'

EArray methods

EArray.append(sequence)[source]

Add a sequence of data to the end of the dataset.

The sequence must have the same type as the array; otherwise a TypeError is raised. In the same way, the dimensions of the sequence must conform to the shape of the array, that is, all dimensions must match, with the exception of the enlargeable dimension, which can be of any length (even 0!). If the shape of the sequence is invalid, a ValueError is raised.
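
A short sketch of the shape rules, assuming a two-dimensional EArray that grows along its first dimension. The file and node names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "append_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        # The 0 marks the enlargeable dimension; rows have 3 columns.
        ea = h5f.create_earray(h5f.root, "samples", tb.Float64Atom(), shape=(0, 3))

        ea.append(np.ones((2, 3)))     # two rows at once
        ea.append(np.zeros((0, 3)))    # zero rows is allowed and adds nothing
        ea.append([[1.0, 2.0, 3.0]])   # a nested list with one row

        # Four columns do not conform to the array shape.
        try:
            ea.append(np.ones((2, 4)))
            bad_shape_rejected = False
        except ValueError:
            bad_shape_rejected = True

        final_shape = tuple(int(n) for n in ea.shape)
        extdim = ea.extdim

print(final_shape, extdim, bad_shape_rejected)
```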

The VLArray class

class tables.VLArray(parentnode, name, atom=None, title='', filters=None, expectedrows=None, chunkshape=None, byteorder=None, _log=True, track_times=True)[source]

This class represents variable length (ragged) arrays in an HDF5 file.

Instances of this class represent array objects in the object tree with the property that their rows can have a variable number of homogeneous elements, called atoms. Like Table datasets (see The Table class), variable length arrays can have only one dimension, and the elements (atoms) of their rows can be fully multidimensional.

When reading a range of rows from a VLArray, you will always get a Python list of objects of the current flavor (each of them for a row), which may have different lengths.

This class provides methods to write or read data to or from variable length array objects in the file. Note that it also inherits all the public attributes and methods that Leaf (see The Leaf class) already provides.

Note

VLArray objects also support compression, although compression is only performed on the data structures that HDF5 uses internally to keep references to the location of the variable length data. The raw data itself is not compressed or filtered.

Please refer to the VLTypes Technical Note for more details on the topic.

Parameters:
  • parentnode – The parent Group object.

  • name (str) – The name of this node in its parent group.

  • atom – An Atom instance representing the type and shape of the atomic objects to be saved.

  • title – A description for this node (it sets the TITLE HDF5 attribute on disk).

  • filters – An instance of the Filters class that provides information about the desired I/O filters to be applied during the life of this object.

  • expectedrows

    A user estimate about the number of row elements that will be added to the growable dimension in the VLArray node. If not provided, the default value is EXPECTED_ROWS_VLARRAY (see tables/parameters.py). If you plan to create either a much smaller or a much bigger VLArray try providing a guess; this will optimize the HDF5 B-Tree creation and management process time and the amount of memory used.

    New in version 3.0.

  • chunkshape – The shape of the data chunk to be read or written in a single HDF5 I/O operation. Filters are applied to those chunks of data. The dimensionality of chunkshape must be 1. If None, a sensible value is calculated (which is recommended).

  • byteorder – The byteorder of the data on disk, specified as ‘little’ or ‘big’. If this is not specified, the byteorder is that of the platform.

  • track_times

    Whether time data associated with the leaf are recorded (object access time, raw data modification time, metadata change time, object birth time); default True. Semantics of these times depend on their implementation in the HDF5 library: refer to documentation of the H5O_info_t data structure. As of HDF5 1.8.15, only ctime (metadata change time) is implemented.

    New in version 3.4.3.

Changed in version 3.0: parentNode renamed into parentnode.

Changed in version 3.0: The expectedsizeinMB parameter has been replaced by expectedrows.

Examples

See below a small example of the use of the VLArray class. The code is available in examples/vlarray1.py:

import numpy as np
import tables as tb

# Create a VLArray:
fileh = tb.open_file('vlarray1.h5', mode='w')
vlarray = fileh.create_vlarray(
    fileh.root,
    'vlarray1',
    tb.Int32Atom(shape=()),
    "ragged array of ints",
    filters=tb.Filters(1))

# Append some (variable length) rows:
vlarray.append(np.array([5, 6]))
vlarray.append(np.array([5, 6, 7]))
vlarray.append([5, 6, 9, 8])

# Now, read it through an iterator:
print('-->', vlarray.title)
for x in vlarray:
    print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, x))

# Now, do the same with native Python strings.
vlarray2 = fileh.create_vlarray(
    fileh.root,
    'vlarray2',
    tb.StringAtom(itemsize=2),
    "ragged array of strings",
    filters=tb.Filters(1))
vlarray2.flavor = 'python'

# Append some (variable length) rows:
print('-->', vlarray2.title)
vlarray2.append(['5', '66'])
vlarray2.append(['5', '6', '77'])
vlarray2.append(['5', '6', '9', '88'])

# Now, read it through an iterator:
for x in vlarray2:
    print('%s[%d]--> %s' % (vlarray2.name, vlarray2.nrow, x))

# Close the file.
fileh.close()

The output for the previous script is something like:

--> ragged array of ints
vlarray1[0]--> [5 6]
vlarray1[1]--> [5 6 7]
vlarray1[2]--> [5 6 9 8]
--> ragged array of strings
vlarray2[0]--> ['5', '66']
vlarray2[1]--> ['5', '6', '77']
vlarray2[2]--> ['5', '6', '9', '88']

VLArray attributes

The instance variables below are provided in addition to those in Leaf (see The Leaf class).

atom

An Atom (see The Atom class and its descendants) instance representing the type and shape of the atomic objects to be saved. You may use a pseudo-atom for storing a serialized object or variable length string per row.

flavor

The type of data object read from this leaf.

Please note that when reading several rows of VLArray data, the flavor only applies to the components of the returned Python list, not to the list itself.

nrow

On iterators, this is the index of the current row.

nrows

The current number of rows in the array.

extdim

The index of the enlargeable dimension (always 0 for vlarrays).

VLArray properties

VLArray.size_on_disk

The HDF5 library does not include a function to determine size_on_disk for variable-length arrays. Accessing this attribute will raise a NotImplementedError.

VLArray.size_in_memory

The size of this array’s data in bytes when it is fully loaded into memory.

Note

When data is stored in a VLArray using the ObjectAtom type, it is first serialized using pickle, and then converted to a NumPy array suitable for storage in an HDF5 file. This attribute will return the size of that NumPy representation. If you wish to know the size of the Python objects after they are loaded from disk, you can use this ActiveState recipe.

VLArray methods

VLArray.append(sequence)[source]

Add a sequence of data to the end of the dataset.

This method appends the objects in the sequence to a single row in this array. The type and shape of individual objects must be compliant with the atoms in the array. In the case of serialized objects and variable length strings, the object or string to append is itself the sequence.
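
The difference between regular atoms and pseudo-atoms can be sketched as follows. The example assumes the VLUnicodeAtom and ObjectAtom pseudo-atoms for the string and serialized-object cases; the file name, node names and stored values are illustrative.

```python
import os
import tempfile

import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "vlappend_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        # Regular atom: each call appends one row made of several atoms.
        ints = h5f.create_vlarray(h5f.root, "ints", tb.Int32Atom())
        ints.append([1, 2, 3])
        ints.append([4])

        # Pseudo-atoms: the string or object passed in is the whole row.
        text = h5f.create_vlarray(h5f.root, "text", tb.VLUnicodeAtom())
        text.append("a whole string is one row")

        objs = h5f.create_vlarray(h5f.root, "objs", tb.ObjectAtom())
        objs.append({"answer": 42})

        row_lengths = [len(row) for row in ints]
        text_row = text[0]
        obj_row = objs[0]

print(row_lengths, text_row, obj_row)
```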

VLArray.get_enum()[source]

Get the enumerated type associated with this array.

If this array is of an enumerated type, the corresponding Enum instance (see The Enum class) is returned. If it is not of an enumerated type, a TypeError is raised.

VLArray.iterrows(start=None, stop=None, step=None)[source]

Iterate over the rows of the array.

This method returns an iterator yielding an object of the current flavor for each selected row in the array.

If a range is not supplied, all the rows in the array are iterated upon. You can also use the VLArray.__iter__() special method for that purpose. If you only want to iterate over a given range of rows in the array, you may use the start, stop and step parameters.

Examples

for row in vlarray.iterrows(step=4):
    print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))

Changed in version 3.0: If the start parameter is provided and stop is None then the array is iterated from start to the last line. In PyTables < 3.0 only one element was returned.

VLArray.__next__()[source]

Get the next element of the array during an iteration.

The element is returned as a list of objects of the current flavor.

VLArray.read(start=None, stop=None, step=1)[source]

Get data in the array as a list of objects of the current flavor.

Please note that, as the lengths of the different rows are variable, the returned value is a Python list (not an array of the current flavor), with as many entries as specified rows in the range parameters.

The start, stop and step parameters can be used to select only a range of rows in the array. Their meanings are the same as in the built-in range() Python function, except that negative values of step are not allowed yet. Moreover, if only start is specified, then stop will be set to start + 1. If you specify neither start nor stop, then all the rows in the array are selected.
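
As a sketch of the return type, the example below stores five rows of increasing length and reads them back in full and by range. The file and node names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "vlread_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        vl = h5f.create_vlarray(h5f.root, "ragged", tb.Int32Atom())
        for n in range(1, 6):
            vl.append(np.arange(n))  # row lengths 1, 2, 3, 4, 5

        # Always a Python list, one entry per selected row.
        all_rows = vl.read()
        some_rows = vl.read(start=1, stop=5, step=2)  # rows 1 and 3

        all_lengths = [len(row) for row in all_rows]
        some_lengths = [len(row) for row in some_rows]

print(type(all_rows).__name__, all_lengths, some_lengths)
```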

VLArray.get_row_size(row)

Return the total size in bytes of all the elements contained in a given row.

VLArray special methods

The following methods automatically trigger actions when a VLArray instance is accessed in a special way (e.g., vlarray[2:5] will be equivalent to a call to vlarray.__getitem__(slice(2, 5, None))).

VLArray.__getitem__(key)[source]

Get a row or a range of rows from the array.

If the key argument is an integer, the corresponding array row is returned as an object of the current flavor. If key is a slice, the range of rows determined by it is returned as a list of objects of the current flavor.

In addition, NumPy-style point selections are supported. In particular, if key is a list of row coordinates, the set of rows determined by it is returned. Furthermore, if key is an array of boolean values, only the rows at coordinates where key is True are returned. Note that for the latter to work, key must contain exactly as many entries as the array has rows.

Examples

a_row = vlarray[4]
a_list = vlarray[4:1000:2]
a_list2 = vlarray[[0,2]]   # get list of coords
a_list3 = vlarray[[0,-2]]  # negative values accepted
a_list4 = vlarray[np.array([True,...,False])]  # array of bools
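
A self-contained sketch of these selection kinds on a four-row array follows. The file and node names are hypothetical, and the boolean mask has one entry per row.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "vlgetitem_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        vl = h5f.create_vlarray(h5f.root, "ragged", tb.Int32Atom())
        for n in range(1, 5):
            vl.append(np.arange(n))  # row lengths 1, 2, 3, 4

        one = vl[2]          # a single row, as a NumPy array
        sliced = vl[0:4:2]   # rows 0 and 2, as a list
        picked = vl[[0, 3]]  # rows at the given coordinates

        # Boolean mask with exactly one entry per row.
        mask = np.array([True, False, False, True])
        masked = vl[mask]

print(one, [len(r) for r in sliced], [len(r) for r in picked],
      [len(r) for r in masked])
```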

VLArray.__iter__()[source]

Iterate over the rows of the array.

This is equivalent to calling VLArray.iterrows() with default arguments, i.e. it iterates over all the rows in the array.

Examples

result = [row for row in vlarray]

Which is equivalent to:

result = [row for row in vlarray.iterrows()]

VLArray.__setitem__(key, value)[source]

Set a row, or set of rows, in the array.

It takes different actions depending on the type of the key parameter: if it is an integer, the corresponding array row is set to value (a sequence capable of being converted to the atoms of the array). If key is a slice, the row slice determined by it is set to value (a sequence of rows, each capable of being converted to the atoms of the array).

In addition, NumPy-style point selections are supported. In particular, if key is a list of row coordinates, the set of rows determined by it is set to value. Furthermore, if key is an array of boolean values, only the rows at coordinates where key is True are set to values from value. Note that for the latter to work, key must contain exactly as many entries as the array has rows.

Note

When updating the rows of a VLArray object which uses a pseudo-atom, there is a problem: you can only update values with exactly the same size in bytes as the original row. This is very difficult to meet with object pseudo-atoms, because pickle applied to a Python object is not guaranteed to return the same number of bytes as for another object, even if they are of the same class. This effectively limits the kinds of objects that can be updated in variable-length arrays.

Examples

vlarray[0] = vlarray[0] * 2 + 3
vlarray[99] = np.arange(96) * 2 + 3

# Negative values for the index are supported.
vlarray[-99] = vlarray[5] * 2 + 3
vlarray[1:30:2] = list_of_rows
vlarray[[1,3]] = new_1_and_3_rows
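
The updates above can be tried on a small array. In this sketch every new value has the same length as the row it replaces, which is an assumption made to stay clear of the size restriction; the file and node names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "vlsetitem_demo.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5f:
        vl = h5f.create_vlarray(h5f.root, "ragged", tb.Int32Atom())
        vl.append(np.array([5, 6]))
        vl.append([5, 6, 7])
        vl.append([5, 6, 9, 8])

        # Integer key: replace row 0 with a value of the same length.
        vl[0] = vl[0] * 2 + 3
        after_first = vl[0].tolist()

        # Negative index: replace the last row.
        vl[-1] = [0, 0, 0, 0]

        # List of coordinates: one replacement row per coordinate.
        vl[[0, 1]] = [[1, 2], [3, 4, 5]]

        rows = [row.tolist() for row in vl.read()]

print(after_first, rows)
```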