Excel, CSV to PyTorch Dataset

Shashika Chamod
4 min read · Mar 15, 2020

PyTorch is one of the fastest-growing machine learning frameworks in terms of community, support, and usability. I have been working with TensorFlow for a while, and a few days ago I decided to run quickly through PyTorch to get a basic understanding.

Rather than following the beginner tutorials on the official PyTorch website and many other sites, I prefer to work with my own dataset. The most common guide loads the MNIST dataset through torchvision. My data, however, is an Excel (CSV) file containing bio-electrical signal data from a set of patients (I work in biomedical research). This is the dataset I wanted to use with my model. I searched the internet and could not find an end-to-end answer on how to do this. After some reading, I was able to put the pieces together and get the job done.

Note: This article does not describe how to build and train a PyTorch model; it shows how to load a dataset stored as Excel, CSV, or a similar format into a PyTorch model.

At the time of writing, the environment on my PC is as follows.

Conda with Python 3.7

torch 1.4

pandas 0.25.1

My CSV file contains 216 data entries, each with 11 features and one label.

Dataset CSV

Now you have a basic understanding of my dataset (or any similar dataset). Without further ado, let's go through the code.

For this purpose, I have implemented two classes.

  1. Dataset class
  2. Main class (model class)

Dataset class

This article mainly describes the Dataset class. (This is the trick!)

Don't panic. I will go through it line by line.

pandas — to load data

torch, torch.utils.data — to generate torch tensors and have torch compatible dataset

sklearn (scikit-learn) — data splitting and pre-processing

FeatureDataset is our custom class that prepares data for the torch model (you can choose your own name). It inherits from the abstract parent class Dataset, which is the type that PyTorch's DataLoader accepts. Now you can see the intuition behind our work!

First, we build the constructor. Its argument is file_name (ideally the file path). Pandas reads the Excel/CSV file, and the result is then separated into features and labels. (This requires knowing the structure of the file in advance; alternatively, the column positions can be determined dynamically.)
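As a rough illustration of the reading and separating step, here is a minimal sketch. It assumes the label sits in the last column and that the labels are binary; the column names, the random values, and the in-memory CSV are stand-ins so that the snippet runs without a real file.

```python
import io

import numpy as np
import pandas as pd

# Build a small synthetic CSV in memory so the sketch runs without a real file.
# Assumed layout (hypothetical): 11 feature columns followed by one label column.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(216, 11)),
                     columns=[f"f{i}" for i in range(11)])
frame["label"] = rng.integers(0, 2, size=216)
csv_source = io.StringIO(frame.to_csv(index=False))

# With real data, csv_source would be a file path.
# pd.read_excel is the counterpart for .xlsx files.
data = pd.read_csv(csv_source)

# Positional slicing: every column except the last is a feature,
# the last column is the label.
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

print(x.shape, y.shape)  # (216, 11) (216,)
```

If your file carries an index column or extra header rows, the slice bounds need adjusting accordingly.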

Separating features (x) and labels (y)

Note: We do not split the dataset into train and test sets in this example. Instead, we treat the complete file as the training set. The split can be made later, at training time, for a given train:test ratio (for example with torch.utils.data.random_split).
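If you do want a held-out set, one possible way to split later is sketched below. The 80:20 ratio and the random TensorDataset standing in for the real data are assumptions made only for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)  # makes the random split reproducible

# Stand-in for the real dataset: 216 rows, 11 features, one label each.
full_set = TensorDataset(torch.randn(216, 11), torch.randint(0, 2, (216,)))

# Assumed 80:20 train:test ratio.
n_train = int(0.8 * len(full_set))
n_test = len(full_set) - n_train
train_set, test_set = random_split(full_set, [n_train, n_test])

print(len(train_set), len(test_set))  # 172 44
```

Both halves are Dataset objects themselves, so each can be handed to its own DataLoader.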

Next, we standardize the features using a StandardScaler object. This step may or may not be required, depending on your dataset.
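A minimal sketch of the scaling step, using random numbers in place of the real features (the mean of 5 and spread of 3 are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix: 216 rows, 11 columns, deliberately not centred.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(216, 11))

# fit_transform learns the per-column mean and standard deviation,
# then rescales each column to zero mean and unit variance.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

print(x_scaled.mean(axis=0).round(6))  # all close to 0
print(x_scaled.std(axis=0).round(6))   # all close to 1
print(x_scaled.dtype)                  # float64
```

Note the float64 output; it matters when the array is turned into a tensor.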

Finally, we convert our dataset into torch tensors.

Converting to torch tensors

Passing dtype=torch.float32 to cast the data to float32 may be essential: the array returned by the scaler is float64 (double), while a torch model's parameters are float32 by default, so double inputs cause a dtype mismatch.
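The dtype point can be checked with a small sketch. The arrays are random stand-ins, and the single Linear layer is only a hypothetical placeholder for a model.

```python
import numpy as np
import torch

# Stand-ins for the scaled features (float64) and the labels.
x_scaled = np.random.default_rng(0).normal(size=(216, 11))
y = np.random.default_rng(1).integers(0, 2, size=216)

# Without an explicit dtype the tensor keeps NumPy's float64.
x_as_is = torch.tensor(x_scaled)

# Casting explicitly gives the float32 that layers use by default.
x_train = torch.tensor(x_scaled, dtype=torch.float32)
y_train = torch.tensor(y, dtype=torch.float32)

# A placeholder layer with default (float32) parameters accepts x_train.
layer = torch.nn.Linear(11, 1)
out = layer(x_train)

print(x_as_is.dtype, x_train.dtype, out.shape)
```

Feeding x_as_is to the same layer would typically fail with a dtype mismatch error, which is the problem the cast avoids.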

As Dataset is an abstract class, we must override its unimplemented methods. In this example, __getitem__ and __len__ are overridden. __getitem__ must be overridden, whereas __len__ is optional, although DataLoader's default sampling relies on it. They do what their names suggest: __getitem__ returns a single data point by index, and __len__ returns the dataset length.
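Putting the steps together, the class could look roughly like the sketch below. It is one possible shape, not a definitive implementation: it assumes the label is the last column, and it writes a synthetic CSV to a temporary directory purely so the example runs end to end.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset


class FeatureDataset(Dataset):
    """Map-style dataset sketch; assumes the label is the last CSV column."""

    def __init__(self, file_name):
        data = pd.read_csv(file_name)
        x = data.iloc[:, :-1].values   # features
        y = data.iloc[:, -1].values    # labels
        x = StandardScaler().fit_transform(x)
        self.x_train = torch.tensor(x, dtype=torch.float32)
        self.y_train = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.y_train)

    def __getitem__(self, idx):
        # One (features, label) pair by index.
        return self.x_train[idx], self.y_train[idx]


# Synthetic stand-in file: 216 rows, 11 features, one binary label.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(216, 11)),
                     columns=[f"f{i}" for i in range(11)])
frame["label"] = rng.integers(0, 2, size=216)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "synthetic.csv")
    frame.to_csv(csv_path, index=False)
    feature_set = FeatureDataset(csv_path)

features, label = feature_set[0]
print(len(feature_set), features.shape, features.dtype)
```

Because the tensors are built once in the constructor, indexing stays cheap; for files too large for memory a lazier design would be needed.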

Overridden functions __len__ and __getitem__

And that’s it. Simple, isn’t it?

Now you have completed the essential parts of loading Excel/CSV data into a PyTorch model.

Main class

Let's move to the main class and see how we can use our Dataset class. What is left to do there is very simple: import the Dataset class and pass the CSV/Excel file as an argument.

Create dataset object by passing csv file path as an argument

Now we use DataLoader for the final preparation and batching of the Dataset (feature_set).
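A minimal sketch of the batching step follows. A random TensorDataset stands in for feature_set so the snippet runs on its own, and the batch size of 10 is an assumed value.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for feature_set: 216 samples with 11 features and one label each.
feature_set = TensorDataset(torch.randn(216, 11),
                            torch.randint(0, 2, (216,)).float())

# shuffle=True reorders the samples at the start of every epoch.
train_loader = DataLoader(feature_set, batch_size=10, shuffle=True)

batch_x, batch_y = next(iter(train_loader))
print(batch_x.shape, batch_y.shape)  # torch.Size([10, 11]) torch.Size([10])
print(len(train_loader))             # 22 batches: 21 full ones plus one of 6
```

Passing drop_last=True would discard the final, smaller batch if a fixed batch size is required.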

Training dataset preparation

This DataLoader object (train_loader) can be used with a PyTorch model. For completeness, I include the main file below, which shows the entire training procedure. You can learn more about the training process on the official PyTorch website.
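For orientation, a training loop driven by such a loader might look like the sketch below. Everything about the model is an assumption made for illustration: the two-layer network, the binary labels with BCEWithLogitsLoss, the Adam optimizer, the batch size, and the five epochs. Random tensors again stand in for the real data.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 216 samples, 11 features, binary labels (assumed).
feature_set = TensorDataset(torch.randn(216, 11),
                            torch.randint(0, 2, (216,)).float())
train_loader = DataLoader(feature_set, batch_size=16, shuffle=True)

# Hypothetical small network; the real architecture depends on the task.
model = nn.Sequential(nn.Linear(11, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(5):
    running_loss = 0.0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        logits = model(batch_x).squeeze(1)   # (batch, 1) -> (batch,)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()
        # Weight by batch size so the epoch average is per sample.
        running_loss += loss.item() * batch_x.size(0)
    epoch_losses.append(running_loss / len(feature_set))

print(len(epoch_losses))
```

With random labels the loss will not fall meaningfully; the point is only how the loader feeds batches into the loop.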

Main class

References

  1. https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
  2. https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
