How Much Data Does AI Really Need?

These days, there’s a lot of hype and uncertainty surrounding AI (Artificial Intelligence) and ML (Machine Learning). It can get pretty difficult to understand what these tools actually accomplish, and to tell when they are appropriate or applicable. It’s complicated stuff, and it’s made worse by the fact that companies love to put the “AI” moniker on anything even remotely automated in order to excite their clients.

Talking to customers, I’ve heard one particular concern come up over and over: don’t you need a lot of data in order to use modern AI/ML approaches? The answer, actually, is both yes and no. Before I elaborate on that, let’s take a brief detour into why this conception of AI exists and why it has been true for much of AI’s history.

What exactly is AI?

Broadly, the terms AI and ML refer to a way of writing computer programs that come up with their own rules of operation. Think of it this way: normally, when you’re writing a computer program, you give the computer some inputs and you tell it what to do with those inputs so that it can produce some output. For example, to write a program to find the sum of three numbers, you would tell the machine to start with the first number, add the second, add the third, and then show you the result.
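To make that concrete, here is a minimal Python sketch of what such a hand-written program might look like. The function name sum_of_three is just an illustrative choice; the point is that we, not the computer, spell out the rule.

```python
def sum_of_three(numbers):
    """Add up the numbers one at a time, following a rule we wrote by hand."""
    total = 0
    for number in numbers:
        total += number  # the "rule of operation" is stated explicitly
    return total


print(sum_of_three([1, 2, 3]))  # 6
```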

Meanwhile, when training an AI/ML model, you instead give the computer a series of inputs and outputs, and the model learns a way of getting from the inputs to the outputs on its own. For example, to accomplish the same task as above, you might provide the following dataset:

Input          Output
1, 2, 3        6
2, 4, 6        12
-4, 8, 20      24
90, 200, 310   600

And, with enough data points, the model learns a way to consistently get from input to output. Of course, with addition, it would be much easier to just write a program than to train a model. However, for harder tasks (like understanding English sentences, for example), writing a series of rules for the computer to follow is almost impossibly difficult. So, instead, we rely on AI/ML models to do the heavy lifting for us, and provide them with the data they need to figure it out. If you’ve heard terms like Deep Learning or Neural Networks, those are just certain (particularly flexible) types of AI/ML models that are good at dealing with particularly complex inputs.
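For the curious, here is a rough Python sketch of what "learning from inputs and outputs" can look like for the addition dataset. It assumes a very simple model (three adjustable weights, one per input number) trained with plain gradient descent. That is a toy stand-in chosen for illustration, not a neural network, and the names predict and train, along with the step count and learning rate, are assumptions made up for this example.

```python
# Toy illustration only: a three-weight linear model, not a neural network.
dataset = [
    ((1, 2, 3), 6),
    ((2, 4, 6), 12),
    ((-4, 8, 20), 24),
    ((90, 200, 310), 600),
]


def predict(weights, inputs):
    """The model's guess: a weighted sum of the three input numbers."""
    return sum(w * x for w, x in zip(weights, inputs))


def train(data, steps=50_000, learning_rate=1e-5):
    """Nudge the weights a little at a time to shrink the squared error."""
    weights = [0.0, 0.0, 0.0]  # the model starts out knowing nothing
    for _ in range(steps):
        gradient = [0.0, 0.0, 0.0]
        for inputs, target in data:
            error = predict(weights, inputs) - target
            for i, x in enumerate(inputs):
                gradient[i] += 2 * error * x / len(data)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


learned_weights = train(dataset)
print([round(w, 3) for w in learned_weights])
print(round(predict(learned_weights, (90, 200, 310)), 2))
```

Nobody told this model to add; it only ever saw examples. Keep in mind that more than one setting of the weights fits these four rows equally well, and which one gradient descent lands on depends on where the weights start.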

The problem, though, is that there are usually a lot of different ways to get from inputs to outputs in a dataset, and most of them probably aren’t what you’re trying to get the model to learn. For example, take another look at that dataset in the table above. Our intention, of course, is for the model to learn to add up the 3 numbers every time. However, notice how the model would also get the right answer for every one of those inputs if it just multiplied the second number by 3!
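A quick way to see the ambiguity is to check both candidate rules against the table. In this sketch, add_all and triple_middle are hypothetical names for the intended rule and the shortcut, and the extra input (1, 1, 10) is made up for illustration.

```python
dataset = [
    ((1, 2, 3), 6),
    ((2, 4, 6), 12),
    ((-4, 8, 20), 24),
    ((90, 200, 310), 600),
]


def add_all(numbers):
    """The rule we want the model to learn."""
    return sum(numbers)


def triple_middle(numbers):
    """The shortcut: ignore the first and last numbers entirely."""
    return 3 * numbers[1]


def fits(rule, data):
    """True if the rule reproduces every output in the dataset."""
    return all(rule(inputs) == output for inputs, output in data)


print(fits(add_all, dataset))        # True
print(fits(triple_middle, dataset))  # True

# On an input unlike the ones in the table, the two rules disagree.
unseen = (1, 1, 10)
print(add_all(unseen))        # 12
print(triple_middle(unseen))  # 3
```

Both rules are perfect on the data we provided, so the data alone gives the model no reason to prefer the one we had in mind.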

Generally, models tend to find the easiest possible solution to the dataset you provide. If there’s only a small amount of data, they’ll probably find some shortcut that fits those examples but doesn’t actually solve the problem you’re looking to solve. This issue is closely related to what’s known as overfitting, and a good way to address it is typically to add more data. With more and more data, it becomes increasingly likely that the only model that can account for all the data is one that solves the desired problem. This is especially true for Neural Network models like the ones we use at Toucan AI, and even more true for hard problems like Language Understanding. Without a ton of data, there’s no way our models can learn to actually understand what the user’s inputs mean instead of just finding some trick that doesn’t generalize.
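Here is a small Python sketch of how extra data squeezes out shortcuts. It treats "learning" as a brute-force search over small whole-number weights, which is an assumption made purely to keep the example tiny and is not how neural networks are actually trained. The fifth row, (1, 1, 10) with output 12, is an invented example, and consistent_rules is a hypothetical helper name.

```python
from itertools import product

small_dataset = [
    ((1, 2, 3), 6),
    ((2, 4, 6), 12),
    ((-4, 8, 20), 24),
    ((90, 200, 310), 600),
]

# One extra (made-up) example whose numbers are not evenly spaced.
larger_dataset = small_dataset + [((1, 1, 10), 12)]


def consistent_rules(data, weight_range=range(-3, 4)):
    """Return every weight triple (a, b, c) where a*x + b*y + c*z matches all rows."""
    rules = []
    for weights in product(weight_range, repeat=3):
        if all(
            sum(w * x for w, x in zip(weights, inputs)) == output
            for inputs, output in data
        ):
            rules.append(weights)
    return rules


print(consistent_rules(small_dataset))
# [(0, 3, 0), (1, 1, 1), (2, -1, 2), (3, -3, 3)]

print(consistent_rules(larger_dataset))
# [(1, 1, 1)]
```

With four rows, four different rules survive, including the "multiply the second number by 3" shortcut (0, 3, 0). A single well-chosen extra row leaves only (1, 1, 1), which is plain addition.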

However, Toucan AI doesn’t require any training data from our customers. How is that possible? It’s because of an awesome technique called Pre-training!

AI without the Data

  Don’t you need a lot of data in order to use modern AI/ML approaches?

Returning to our question from above, so far it sounds like the answer is a resounding “yes, you do.” And, in a sense, it is; there’s no way to train ML models without substantial amounts of data. That’s why AI/ML often seems inaccessible to smaller-scale businesses, and it’s also why most chatbot platforms require an exhausting amount of manual labor coming up with every possible phrasing of every possible question before the bot can actually function. However, an important caveat is that technically, the training data doesn’t need to be about the exact same problem as the one you want to use the model on.

For example, if you want an ML model that can handle sales conversations, one approach is to train it on a dataset of sales conversations regarding your specific products. But the other approach, the one we take at Toucan AI, is to train the models on much larger datasets focused on the general problem of understanding the English language. Then, you can essentially just feed the model information about your products, and have it understand that information and be able to talk about it.

This Pre-training approach requires much more data, since the problem of understanding English is much harder than the problem of responding to a handful of store-specific questions. Luckily, the internet is full of data in English, and almost any English text is valid as training data for learning Language Understanding. In short, we do the hard work of collecting data and training models, so that our customers can just use our models on their own product catalog and have them function right out of the box. Then, over time, we use the conversations that occur on each customer’s website to fine-tune their version of the model, making it more and more effective at those conversations. So, while we need a ton of data for training, our customers don’t really need any at all to get started!

We’re super excited by this concept of pre-training because we think it represents a huge leap towards democratizing AI/ML and making it accessible to businesses of all scales, not just those that have huge datasets on hand. And, perhaps more importantly, we think it’s actually much more effective than the other approach – after all, wouldn’t you rather have your chat agent actually understand what your customer is asking about, rather than just parroting back a few canned responses?

- The Toucan Team