For a data analyst, large datasets can be a double-edged sword. When used properly, they can yield powerful insights and leave you with a deeper understanding of the world around you. However, in the absence of good data organisation skills, you are left with nothing but wasted effort and confusion. I encountered this challenge firsthand during my PhD.

It took me many years to get a clear, high-level picture of data organisation. If I had had this picture when I started my PhD, I would have made different choices and saved a significant amount of time. Some of the lessons I learned through trial and error might be obvious to people who have been trained in data analysis or computer science, or who have worked on data-related projects in the past. But if you are a novice having to deal with large amounts of data, please read on.

Before going any further, let me briefly explain the setting. I performed experiments with fruit flies that involved introducing a genetic perturbation in a fly and then studying the effects of that perturbation on its behaviour, specifically its ability to walk.

Finally, I used Python for most of my research and still use it for all my personal projects. Moreover, with the widespread adoption of machine learning, Python has become somewhat of a de facto programming language, unless you have very specific reasons for using another one. All this is to make the point that although the reasoning in this post is universal, the recommendations are limited to data formats used in Python.

Combining data

Imagine you have position (x, y) data for 10 animals, collected over 5 sessions per animal, with 5 stimulus conditions per session. There are many ways to organize this data. The simplest approach is to store each dataset in a separate file. This would result in:
2 (x and y) × 10 (animals) × 5 (sessions) × 5 (stimuli) = 500 files.

One might argue that it’s unnecessary to store x and y separately. Combining them would reduce the number of files to 250. What if we take this thought to its logical conclusion and consolidate all the data into a single file? This would be ideal: fewer files to track, lower risk of losing or misplacing data, and a cleaner, more centralized dataset.
However, doing so requires representing the data as a 4D structure: position × animal × session × stimulus. Is that possible? Let’s keep this in mind while we move on to another important factor to consider when selecting a data format.

Data retrieval

The main purpose of storing data is, of course, to be able to retrieve it when needed. And in order to retrieve data, there has to be a way to identify it. There are two ways in which data is identified and retrieved from a dataset:

  • By filtering: Use some logic to reject or whittle down the data until only your desired data remains
  • By indexing: Use an index that uniquely identifies each unit of data in the dataset
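
To make the distinction concrete, here is a minimal sketch in plain Python; the record layout and field names are invented purely for illustration:

```python
# A toy dataset: one record per trial.
records = [
    {"animal": "fly1", "session": 1, "stimulus": "A", "speed": 4.2},
    {"animal": "fly1", "session": 1, "stimulus": "B", "speed": 3.1},
    {"animal": "fly2", "session": 1, "stimulus": "A", "speed": 5.0},
]

# Retrieval by filtering: whittle the records down with a condition.
fly1_records = [r for r in records if r["animal"] == "fly1"]

# Retrieval by indexing: build a unique key for each record up front,
# then look the desired record up directly.
index = {(r["animal"], r["session"], r["stimulus"]): r for r in records}
record = index[("fly2", 1, "A")]
```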

Ideally, we want a data format that makes retrieval simple and intuitive, so that anyone with access can slice and dice the dataset in whatever way they need, without requiring much additional context. The structure should be self-evident and easy to navigate.

Metadata and processing

Metadata, as the name suggests, refers to data about data. In the context of the behavioural experiments I described earlier, the real data were the x, y position and the head direction of the fly, while the metadata were things like fly identity, type of genetic perturbation, features of the visual stimulus, etc.

You want the data and metadata to be linked as tightly as possible. In fact, ideally you want them to be part of the same file and even of the same type. A good data format should maintain the link between data and metadata throughout all stages of analysis. What do I mean by that? As data is processed, it often undergoes changes that require corresponding processing of the metadata:

  • Reduction: Data often gets reduced through operations like averaging. In the reduced data, certain metadata must be discarded while other metadata is retained. For example, if I average an animal’s speed across trials, the trial-level metadata is no longer relevant, but the animal ID must be preserved (see the sketch after this list).
  • Combination: When combining data from two or more sources, their respective metadata must be merged appropriately to maintain context and traceability.
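
As a rough sketch of the reduction case, using pandas (introduced below); the column names are hypothetical:

```python
import pandas as pd

trials = pd.DataFrame({
    "animal": ["fly1", "fly1", "fly2", "fly2"],
    "trial":  [1, 2, 1, 2],
    "speed":  [4.2, 3.8, 5.0, 5.4],
})

# Averaging speed across trials: the trial column is dropped,
# but the animal ID travels with the reduced data.
mean_speed = trials.groupby("animal")["speed"].mean()
# The result is indexed by animal: fly1 -> 4.0, fly2 -> 5.2
```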

A good data format, then, ensures that the real data “remembers” the metadata as it is reduced and combined through successive steps of processing.

Data formats

To summarise—when choosing a data format, we want it to possess three features:

  1. Easy and intuitive retrieval of data
  2. Extension to an arbitrary number of dimensions
  3. Linkage between data and metadata maintained throughout the analysis

What options do we have in Python?

NumPy arrays: Think of a NumPy array as a huge collection of numbers arranged in an n-dimensional cuboid.
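
For instance, the behavioural dataset above could live in a single 4D array, but every axis is addressed by bare integers (the shape here is illustrative):

```python
import numpy as np

# position (x, y) × 10 animals × 5 sessions × 5 stimuli
data = np.random.rand(2, 10, 5, 5)

# x-position of animal 3, session 2, stimulus 4 --
# nothing in the code itself says which number means what.
value = data[0, 3, 2, 4]
```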

Pandas DataFrames: This is the most powerful and widely used format for storing data, and also the easiest to visualize. If you’ve ever seen a table in Excel or any data analysis software, you’ve seen a DataFrame. One major advantage of pandas DataFrames is their ubiquity in the Python ecosystem. They integrate seamlessly with other libraries, such as Seaborn and Plotly—two powerful tools for data visualization.

In NumPy arrays, each data point is accessed using a list of integer indices. Since these indices are just arbitrary numbers, it’s difficult to associate them with physically meaningful features.
DataFrames offer more flexibility in this regard: columns can be labeled with names, and rows can have meaningful indices (either numbers, such as serial IDs, or strings). Additionally, filtering lets you easily select specific subsets of rows.
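
A rough pandas equivalent of the same idea (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "animal":   ["fly1", "fly1", "fly2"],
    "session":  [1, 2, 1],
    "stimulus": ["A", "A", "B"],
    "x":        [0.1, 0.4, 0.9],
    "y":        [1.2, 1.5, 0.3],
})

# Columns are addressed by name, and rows are selected by filtering.
fly1_session2 = df[(df["animal"] == "fly1") & (df["session"] == 2)]
x_values = fly1_session2["x"]
```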

This presents a tradeoff: NumPy arrays support an arbitrary number of dimensions, but data retrieval is difficult because of the numerical indices; DataFrames offer easier retrieval but are restricted to just two dimensions. The ideal data format would be one that combines the strengths of both.
This is where xarray comes in. Developed by data scientists working with geoscience data, xarray is basically an n-dimensional DataFrame. The best way to get started with xarray is of course the tutorial, prepared and maintained by its creators. Nonetheless, I will briefly explain the defining features of xarray and how it fits into the ideas presented above.

Xarray has three components:

  1. Data: the values themselves, shaped like an n-dimensional cuboid
  2. Dimensions: each dimension has a name. In our behavioural dataset, the dimensions would be Position, Animals, Sessions and Stimuli
  3. Coordinates: the values along each dimension. For example: dim Time, coords 0, 1, 2, 3, …; dim Position, coords x, y; dim Animals, coords animal1, animal2, animal3, …; dim Sessions, coords 1, 2, 3, …

Together, these components make xarray a powerful way to work with structured multi-dimensional data. By naming dimensions and assigning meaningful coordinates, you can index, slice, and align data intuitively without having to remember raw array indices. This makes analysis more readable and less error-prone.
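
Here is a minimal sketch of what the behavioural dataset might look like as an xarray DataArray; the dimension names and coordinate labels are illustrative:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.random.rand(2, 10, 5, 5),
    dims=["position", "animal", "session", "stimulus"],
    coords={
        "position": ["x", "y"],
        "animal":   [f"animal{i}" for i in range(1, 11)],
        "session":  [1, 2, 3, 4, 5],
        "stimulus": ["A", "B", "C", "D", "E"],
    },
)

# Select by label rather than by raw index:
x_track = data.sel(position="x", animal="animal3", stimulus="B")
```

The .sel call reads like the question you actually want to ask of the data, which is the whole point of naming dimensions and coordinates.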

Ah, I see, so xarrays are the way to go? Well, yes, but not always. In my own case, it took me some time to realise that xarrays weren’t the answer to all my prayers after all. My dimensionality was low, and the number of unique stimulus features was small enough that a pandas DataFrame might have been sufficient. I still don’t regret using xarray, because it has some very nice features, like built-in methods for plotting data, that made life much easier. But if I had read this blog post before I started my PhD, I might have stuck with pandas.