Varos Glossary

Data Split

Data splitting is the practice of dividing a data set into multiple subsets. In the context of machine learning, one subset is typically used to train the model. The others are used to validate the model during development or to test it once it's been fully trained.

How Does Data Splitting Work?

Machine learning models are typically developed and trained from a large core of data. However, the entirety of that data set is rarely used exclusively for training purposes. While the majority is fed into the machine learning model, there's still a fairly large portion that is unused at this stage. How this data is used largely depends on what type of split you've chosen.

The two most common are the train-test split and the train-validation-test split. The only real difference between the two is that the latter includes an additional data set used for validation. Here's how it works:

  • Training: The majority of the data set is fed into the model so it can learn to identify and analyze patterns. 
  • Validation: The validation set is used to test the model over the course of its development. It allows the model's programmers to assess how it performs relative to other models and hyperparameter options, while also helping to identify potential problems. 
  • Testing: The test set is the final piece of data, intended to assess the finished model's accuracy. It simulates the real world in an effort to provide as unbiased a picture of the model's performance as possible. 

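A commonly cited rule of thumb for the ratio of training to testing and validation is roughly 70-20-10. There is no universally optimal ratio, though; the right proportions depend on how much data you have and how complex the model is.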
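As a rough illustration of that ratio, here is a minimal Python sketch using only the standard library. The function name train_val_test_split, its parameters, and the sample data are assumptions made up for this example, and it reads 70-20-10 as training, testing, and validation in that order.

```python
import random

def train_val_test_split(rows, train=0.7, test=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists.

    Whatever is left after the train and test shares goes to validation.
    """
    if not 0 < train < 1 or not 0 <= test < 1 or train + test > 1:
        raise ValueError("train and test must be fractions summing to at most 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    train_set = shuffled[:n_train]
    test_set = shuffled[n_train:n_train + n_test]
    val_set = shuffled[n_train + n_test:]
    return train_set, val_set, test_set

# Hypothetical data set of 100 rows.
train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))  # 70 10 20
```

Shuffling before cutting matters here: if the rows were stored in some meaningful order, slicing without a shuffle would put systematically different data in each subset.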

As for the method by which you split your data, there isn't really anything resembling guidelines or a framework. Instead, it's largely just a matter of preference. With that in mind, there are a few different sampling methods you might use to ensure your training and testing data is split as equitably as possible: 

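  • Random sampling assigns each data point to a subset by chance, which keeps your data sets largely unbiased. However, it can leave categories unevenly distributed across the subsets, especially when the data is imbalanced. 
  • Stratified random sampling is the same as random sampling, except that the data is first grouped by a predefined parameter (such as a class label) and then sampled within each group, so every subset keeps roughly the same proportions as the full data set. 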
  • Nonrandom sampling, as the name suggests, occurs when the data modelers already know what data they wish to use. 
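To make the difference between random and stratified sampling more concrete, here is a small standard-library sketch of a stratified train-test split. The names stratified_split and label_of, the 20% test fraction, and the imbalanced sample data are all hypothetical choices for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows so each label keeps roughly the same share in both subsets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)   # group rows into strata by label

    train_set, test_set = [], []
    for group in by_label.values():
        rng.shuffle(group)                    # random sampling within each stratum
        n_test = round(len(group) * test_fraction)
        test_set.extend(group[:n_test])
        train_set.extend(group[n_test:])
    return train_set, test_set

# Hypothetical imbalanced data: 90 "neg" rows and 10 "pos" rows.
rows = [(i, "neg") for i in range(90)] + [(i, "pos") for i in range(90, 100)]
train_set, test_set = stratified_split(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in test_set))  # 18 "neg" and 2 "pos"
```

With a purely random split of the same data, the test set could by chance end up with very few "pos" rows or none at all, which is the uneven distribution that stratification is meant to prevent.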

Why is Data Splitting Important? 

In machine learning, data splitting helps confirm that your model behaves as intended on data it has not seen before. It also helps you catch two of the most common issues with machine learning models: overfitting and underfitting.

Overfitting occurs when a machine learning model fits its training data closely but is incapable of generalizing to new data. Instead, every prediction it makes is skewed by quirks of its training data. This can occur for a few different reasons, including an insufficient data set, a data set that contains too much noise, or a model that is too complex for the data available.

Underfitting, on the other hand, occurs when a machine learning model cannot give accurate results for either training data or test data. This typically occurs when a machine learning model has not been trained for enough time or has not received enough data points. Fortunately, underfitting is a lot simpler to solve than overfitting. It's more an indication that your machine learning model is slightly undercooked than anything.

Data splitting provides you with validation samples that allow you to test for both underfitting and overfitting, which in turn gives you an opportunity to address both issues.
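One simple way to use those samples is to compare a model's score on the training set with its score on the validation set. The sketch below assumes accuracy scores between 0 and 1. The function name diagnose_fit and both thresholds are illustrative assumptions, not standard values.

```python
def diagnose_fit(train_score, val_score, min_score=0.8, max_gap=0.1):
    """Label a model from its training and validation accuracy (0 to 1).

    min_score and max_gap are illustrative thresholds, not standard values.
    """
    if train_score < min_score:
        return "underfitting"   # weak even on the data it was trained on
    if train_score - val_score > max_gap:
        return "overfitting"    # strong on training data, weak on unseen data
    return "ok"

print(diagnose_fit(0.99, 0.72))  # overfitting
print(diagnose_fit(0.61, 0.58))  # underfitting
print(diagnose_fit(0.91, 0.88))  # ok
```

In practice the thresholds would need tuning for the task at hand, but the pattern is the same: low scores everywhere suggest underfitting, while a large gap between training and validation scores suggests overfitting.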