The NWB Format is a core component of the Neurodata Without Borders: Neurophysiology (NWB:N) project. The NWB format is designed to store general optical and electrical physiology data in a way that is both understandable to humans as well as accessible to programmatic interpretation. The format is designed to be friendly to and usable by software tools and analysis scripts, and to impose few a priori assumptions about data representation and analysis.
The NWB format uses the following main primitives to hierarchically organize neuroscience data :
- A Group is similar to a folder and may contain an arbitrary number of other groups and datasets,
- A Dataset describes an n-dimensional array and provides the primary means for storing data,
- An Attribute* is a small dataset that is attached to a specific group or dataset and is typically used to store metadata specific to the object they are associated with, and
- A Link is a reference to another group or dataset.
The NWB format is formally described via formal specification documents using the NWB specification language . HDF5 currently serves as the main format for storing data in the NWB format (see http://nwb-storage.readthedocs.io/en/latest/ for details). The PyNWB API is available to enable users to efficiently interact with NWB format files.
The NWB format uses a modular design in which all main semantic components of the format have a unique neurodata_type (similar to a Class in object-oriented design)(Section 1.1). This allows for reuse and extension of types through inclusion and inheritance. All datasets and groups in the format can be uniquely identified by either their name and/or neurodata_type.
Two important base types in the NWB format are NWBContainer and TimeSeries. NWBContainer defines a generic container for storing collection of data and is used to define common features and functionality across data containers (see Section 1.2). TimeSeries is a central component in the NWB format for storing complex temporal series (see Section 1.3). In the format, these types are then extended to define more specialized types. To organize and define collections of processed data from common data processing steps, the NWB:N format then defines the concept of ProcessingModule where each processing step is represented by a corresponding NWBDataInterface (an extension of NWBContainer) (see Section 1.4 for details).
At a high level, data is organized into the following main groups:
- acquisition/ : data streams recorded from the system, including ephys, ophys, tracking, etc.,
- intervals/ : experimental intervals,
- stimulus/ : stimulus data,
- general/ : experimental metadata, including protocol, notes and description of hardware device(s).
- processing/ : standardized processing modules, often as part of intermediate analysis of data that is necessary to perform before scientific analysis,
- analysis/ : lab-specific and custom scientific analysis of data.
The high-level data organization within NWB files is described in detail in Section 2.5.2. The top-level datasets and attributes are described in Table 2.22 and the top-level organization of data into groups is described in Table 2.23.
neurodata_type : Assigning types to specifications¶
The concept of a neurodata_type is similar to the concept of a Class in object-oriented programming. In the NWB format, groups or datasets may be given a unique neurodata_type. The neurodata_type allows the unique identification of the type of objects in the format and also enable the reuse of types through the concept of inheritance. A group or dataset may, hence, define a new neurodata_type while extending an existing type. E.g., AbstractFeatureSeries defines a new type that inherits from TimeSeries.
NWBDataInterface: Base neurodata_types for containers and datasets¶
NWBContainer is a specification of a group that defines a generic container for storing collections of data. NWBContainer serves as the base type for all main data containers (including TimeSeries) of the core NWB:N data format and allows us to define and integrate new common functionality in a central place and via common mechanisms (see Section 2.1.3).
NWBDataInterface extends NWBContainer and serves as base type for primary data (e.g., experimental or analysis data) and is used to distinguish in the schema between non-metadata data containers and metadata containers (see Section 2.1.4).
The concept of NWBContainer and NWBData have been introduced in
NWB:N 2. NWBDataInterface (also introduced in NWB:N 2) replaces
from NWB:N 1.x.
Interface was renamed to NWBDataInterface to ease intuition and
the concept was generalized via NWBContainer to provide a common base for
data containers (rather than being specific to ProcessingModules as in NWB:N 1.x).
Time Series : A base neurodata_type for storing time series data¶
The file format is designed around a data structure called a TimeSeries which stores time-varying data. A TimeSeries is a superset of several neurodata_types, including signal events, image stacks and experimental events. To account for different storage requirements and different modalities, a TimeSeries is defined in a minimal form and it can be extended (i.e., subclassed) to account for different modalities and data storage requirements (see Section 1.5).
Each TimeSeries has its own HDF5 group, and all datasets belonging to a TimeSeries are in that group. In particular, a TimeSeries defines components to store data and time.
The data element in the TimeSeries will typically be an array of any valid HDF5 data type (e.g., a multi-dimensional floating point array). The data stored can be in any unit. The attributes of the data field must indicate the SI unit that the data relates to (or appropriate counterpart, such as color-space) and the multiplier necessary to convert stored values to the specified SI unit.
TimeSeries support provides two time objects representations. The first, timestamps, stores time information that is corrected to the experiment’s time base (i.e., aligned to a master clock, with time-zero aligned to the starting time of the experiment). This field is used for data processing and subsequent scientific analysis. The second, sync, is an optional group that can be used to store the sample times as reported by the acquisition/stimulus hardware, before samples are converted to a common time-base and corrected relative to the master clock. This approach allows the NWB format to support streaming of data directly from hardware sources.
In addition to data and time, the TimeSeries group can be used to store additional information beyond what is required by the specification. I.e., an end user is free to add additional key/value pairs as necessary for their needs via the concept of extensions. It should be noted that such lab-specific extensions may not be recognized by analysis tools/scripts existing outside the lab. Extensions are described in section (see Section 1.5).
1.4. Data Processing Modules: Organizing processed data¶
NWB:N uses ProcessingModule to store data for—and represent the results of—common data processing steps, such as spike sorting and image segmentation, that occur before scientific analysis of the data. Processing modules store the data used by software tools to calculate these intermediate results. All processing modules are stored directly in the group /processing. The name of each module is chosen by the data provider (i.e. processing modules have a “variable” name). The particular data within each processing module is specified by one or more NWBDataInterface, which are groups residing directly within a processing module. Each NWBDataInterface has a unique neurodata_type (e.g., ImageSegmentation) that describes and defines the data contained in the NWBDataInterface. For NWBDataInterfaces designed for use with processing modules, a default name (usually the same as the neurodata_type) is commonly specified to further ease identification of the data in a file. However, to support storage of multiple instances of the same subtype in the same processing module, NWB:N allows users to optionally define custom names as well.
1.5. Extending the format¶
The data organization presented in this document constitutes the core NWB format. Extensibility is handled via the concept of extensions, allowing users to extend (i.e., add to) existing and create new neurodata_types definitions for storing custom data. To avoid collisions between extensions, extensions are defined as part of custom namespaces (which typically import the core NWB namespace). Extensions to the format are written using the Specification Language . To ease development of extensions, the PyNWB (and HDMF used by PyNWB) API provides dedicated data structures that support programmatic creation and use of extensions. An example for extending NWB using PyNWB is available at https://pynwb.readthedocs.io/en/stable/extensions.html and additional details are also available as part of the PyNWB tutorials https://pynwb.readthedocs.io/en/stable/tutorials/index.html .
Creating extensions allows adding and documenting new data to NWB, interaction with custom data via the API, validation of custom data contents, sharing and collaboration of extensions and data. Popular extensions may be proposed and added to the official format specification after community discussion and review. To propose a new extensions for the NWB core format you may file an issue at https://github.com/NeurodataWithoutBorders/nwb-schema/issues .
1.5.1. Extending TimeSeries and NWBContainer¶
Like any other neurodata_type, TimeSeries can be extended via extensions by defining corresponding derived neurodata_types. This is typically done to to represent more narrowly focused modalities (e.g., electrical versus optical physiology) as well as new modalities (e.g., video tracking of whisker positions). When a neurodata_type inherits from TimeSeries, new data objects (i.e., datasets, attributes, groups, and links) can be added while all objects of the parent TimeSeries type are inherited and, hence, part of the new neurodata_type. Section Section 2.1.5 includes a list of all TimeSeries types.
Extending NWBContainer works in the same way, e.g., to create more specific types for data processing.
1.6. Common attributes¶
All NWB:N Groups and Datasets with an assigned neurodata_type have three required attributes: neurodata_type, namespace, and object_id.
neurodata_type(variable-length string) is the name of the NWB:N primitive that this group or dataset maps onto
namespace(variable-length string) is the namespace where
neurodata_typeis defined, e.g. “core” or the namespace of an extension
object_id(variable-length string) is a universally unique identifier for this object within its hierarchy. It should be set to the string representation of a random UUID version 4 value (see RFC 4122) upon first creation. It is not a hash of the data. Files that contain the exact same data but were generated in different instances will have different
object_idvalues. Currently, modification of an object does not require its
object_idto be changed.