Data Science 101 – Part 3: What to consider in a modeling dataset
A data model for machine learning predictions is only as good as your data. For starters, your modeling dataset must accurately portray the reality of how your business operates. Second, the model building data must include the known outcome of each case, or row, in the historical data. When these conditions are in place, you can develop models to learn which combinations of preconditions lead to each outcome.
In part 1 of this blog series, we talked about building machine learning models based on hypotheses that data can prove, i.e., business questions you can answer with data. Here we’ll take a look at what makes an effective modeling dataset to fuel AI and machine learning predictions.
Your dataset for modeling needs to represent your business reality. It should contain precursors consisting of descriptions of products, conditions, and actions that precede an event, and the results or outcomes that subsequently happen.
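One way to picture a row in such a dataset is as a record that holds the precursors alongside the outcome. The sketch below is illustrative only: the `Case` class, its field names, and the sample values are assumptions made for this example, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    """One historical case (row). All field names here are hypothetical."""
    case_id: int
    product: str            # precursor: description of the product
    condition: str          # precursor: condition in place before the event
    action: str             # precursor: action taken before the event
    outcome: Optional[str]  # what subsequently happened; None if not yet known


def usable_for_modeling(cases):
    """Keep only the cases whose outcome is known."""
    return [case for case in cases if case.outcome is not None]


# Made-up sample history for illustration.
history = [
    Case(1, "checking", "new customer", "welcome call", "retained"),
    Case(2, "savings", "existing customer", "no contact", "closed"),
    Case(3, "checking", "new customer", "welcome call", None),  # still open
]

modeling_rows = usable_for_modeling(history)
print(len(modeling_rows))
```

Rows without a known outcome are set aside here because a model cannot learn which preconditions lead to a result that has not been recorded yet.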
Next, take a random sample from the model building dataset to split it into two datasets. Both represent reality and contain both precursors and outcomes. You’ll use one sample for model building, or model training. The other sample is a holdout that you use to test how accurate the final model is on data that was not used to train it.
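As a minimal sketch of that random split, the snippet below shuffles the cases and sets a share of them aside as the holdout. The function name `split_holdout`, the 30% holdout fraction, the fixed seed, and the sample rows are all assumptions chosen for illustration; the right proportions depend on how much data you have.

```python
import random


def split_holdout(rows, holdout_fraction=0.3, seed=42):
    """Randomly split rows into a training sample and a holdout sample."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical historical cases: precursors plus a known outcome.
cases = [
    {"case_id": i, "product": "A" if i % 2 else "B", "outcome": i % 4 == 0}
    for i in range(100)
]

training, holdout = split_holdout(cases)
print(len(training), len(holdout))
```

Because the split is random, both samples should reflect the same mix of precursors and outcomes, and no case appears in both.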
What to watch for in your model building data
The first thing a model often reveals is that the data has not captured the true reality. In other words, the model uncovers seemingly interesting patterns that turn out to be artifacts, caused either by an additional factor missing from the data or by mistakes in data preparation.
If this happens, you need to restructure the modeling dataset based on this new understanding, and then make and assess new models. You’ll need to repeat this cycle until your understanding of the data accurately aligns with the behavior being modeled.
Typically, the first time you test a model using split-file validation, it will reveal structural problems in data preparation that you can correct. When the model is both accurate and stable between training and validation datasets, then you’ll test the same model using data from the next month in the sequence to make sure that model accuracy is maintained.
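The checks described above can be sketched as a comparison of accuracy across three samples: training, validation (holdout), and the following month. Everything in the snippet is a stand-in for illustration: the `predict` rule plays the role of a trained model, the `balance` field and the rows are made up, and the 0.05 tolerance is an arbitrary assumption rather than a recommended threshold.

```python
def accuracy(predict, rows):
    """Share of rows where the prediction matches the known outcome."""
    if not rows:
        raise ValueError("rows must not be empty")
    hits = sum(1 for row in rows if predict(row) == row["outcome"])
    return hits / len(rows)


def is_stable(reference, candidate, tolerance=0.05):
    """True when candidate accuracy is within `tolerance` of the reference."""
    return abs(reference - candidate) <= tolerance


def predict(row):
    """Hypothetical stand-in for a trained model."""
    return row["balance"] < 100


# Made-up samples; `balance` is an illustrative precursor.
training = [
    {"balance": 50, "outcome": True},
    {"balance": 500, "outcome": False},
    {"balance": 80, "outcome": True},
    {"balance": 300, "outcome": True},
]
validation = [
    {"balance": 20, "outcome": True},
    {"balance": 400, "outcome": False},
    {"balance": 90, "outcome": False},
    {"balance": 150, "outcome": False},
]
next_month = [
    {"balance": 60, "outcome": False},
    {"balance": 700, "outcome": True},
    {"balance": 30, "outcome": True},
    {"balance": 250, "outcome": False},
]

train_acc = accuracy(predict, training)
valid_acc = accuracy(predict, validation)
next_acc = accuracy(predict, next_month)

print("stable across split:", is_stable(train_acc, valid_acc))
print("holds up next month:", is_stable(train_acc, next_acc))
```

In this made-up data the model is stable across the split but loses accuracy on the following month, which is the situation where you would look at adding predictors or refreshing the model more often.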
If the model accuracy declines, your modeling dataset may need additional factors as predictors. Or you might determine that the real world is also changing rapidly, which means you’ll need to refresh the model more frequently to stay on top of it.
Some companies wonder if they must have a data warehouse before they begin down the path to building a dataset for machine learning. “A central repository is not a prerequisite to data mining,” argues Beyond the Arc Data Scientist Bruce Johnson. “The first predictive analytics project is often a proof of concept, and it does not require a huge investment in upfront infrastructure. Data mining can deliver quick wins. The results of preliminary data mining efforts often bring very tangible value to the firm.”
Improving ML modeling — How Beyond the Arc can help
Our data science team can help you manage the modeling process from the beginning all the way through personalized, automated AI/machine learning solutions.
With 20+ years of experience, our data scientists are passionate about helping businesses use data to make better decisions and take action. They specialize in using machine learning and statistics to deliver actionable business insights.