My personal notes for Everything You Need to Know to Start Your First Machine Learning Project

A talk by Kirsten Westeinde I attended at DeveloperWeek 2019

1. Identify the Problem You Want to Solve

Types of machine learning problems:

  • Supervised: You can use past data to derive answers to future problems.
  • Unsupervised: The data set doesn't have any answers and the objective is to find some structure.

You usually start with an unsupervised model to identify the structure, and then use a supervised model in the product.
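To make the unsupervised-then-supervised flow concrete for myself, here is a minimal sketch. The house-size data, the "small"/"large" names, and both helper functions are my own assumptions, not something from the talk.

```python
# Hypothetical toy data: house sizes in square meters, with no answers attached.
sizes = [30, 35, 40, 120, 130, 140]


def split_at_largest_gap(values):
    """Unsupervised step: find structure by splitting the sorted values at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


def nearest_label(labeled, value):
    """Supervised step: reuse the answer attached to the closest past example."""
    _, label = min(labeled, key=lambda pair: abs(pair[0] - value))
    return label


small, large = split_at_largest_gap(sizes)

# Name the discovered groups, then treat them as labeled data for the supervised step.
labeled = [(s, "small") for s in small] + [(s, "large") for s in large]
prediction = nearest_label(labeled, 110)
print(small, large, prediction)
```

A real project would use a proper clustering algorithm and classifier. This only shows how the output of one step becomes the labeled input of the other.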

There are two types of supervised problems:

  • Classification: The answer is a discrete value.
  • Regression: The answer can be any continuous value.

If you can frame your problem as a classification or regression problem, then it's a good candidate for machine learning.
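A small sketch of the difference between the two framings, using the same nearest-neighbor idea for both. The study-hours data and function names are made up for illustration.

```python
from collections import Counter
from statistics import mean


def k_nearest(examples, x, k=3):
    """Return the k examples whose input is closest to x."""
    return sorted(examples, key=lambda ex: abs(ex[0] - x))[:k]


def classify(examples, x, k=3):
    """Classification: the answer is a discrete value (majority label of the neighbors)."""
    labels = [label for _, label in k_nearest(examples, x, k)]
    return Counter(labels).most_common(1)[0][0]


def regress(examples, x, k=3):
    """Regression: the answer is a continuous value (mean target of the neighbors)."""
    return mean(value for _, value in k_nearest(examples, x, k))


# Hypothetical data: hours studied mapped to a pass/fail label and to an exam score.
pass_fail = [(1, "fail"), (2, "fail"), (3, "fail"), (6, "pass"), (7, "pass"), (8, "pass")]
scores = [(1, 40.0), (2, 45.0), (3, 50.0), (6, 70.0), (7, 75.0), (8, 80.0)]

label = classify(pass_fail, 6.5)
score = regress(scores, 6.5)
print(label, score)
```

The inputs are identical in both cases. Only the type of answer changes.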

Machine learning is resource-intensive, so if there's a simple heuristic system that's "good enough," then use that.
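One way to check whether a heuristic is "good enough" is to score it against labeled examples before building anything else. This is only a sketch: the trigger words, the sample messages, and the 0.75 threshold are all assumptions of mine.

```python
def keyword_heuristic(message):
    """Hypothetical rule: flag a message as spam if it contains a trigger word."""
    triggers = ("free", "winner", "prize")
    return any(word in message.lower() for word in triggers)


# Hypothetical labeled examples: (text, is_spam).
labeled_messages = [
    ("You are a WINNER, claim your prize", True),
    ("Free tickets inside", True),
    ("Lunch at noon?", False),
    ("Meeting notes attached", False),
    ("Feel free to call me", False),  # the rule gets this one wrong
]

correct = sum(keyword_heuristic(text) == is_spam for text, is_spam in labeled_messages)
baseline_accuracy = correct / len(labeled_messages)

GOOD_ENOUGH = 0.75  # assumed acceptance threshold
needs_ml = baseline_accuracy < GOOD_ENOUGH
print(baseline_accuracy, needs_ml)
```

Even if the heuristic falls short, its score is a useful baseline for any model to beat.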

2. Frame the Problem as a Machine Learning Model

It's really helpful to include a problem domain expert on the team.

  • Prediction target: The thing your model is trying to predict. Everyone needs to agree on this. This can be harder to determine than it seems. You should codify and unit test this.
  • Labeled data: You need a data set where you know the answers.
  • Data preprocessing: Transform the data into a format that works for your model. As an example, you can use normalization to get all of your data into the same range. This is important when weighting parameters.
  • Features: The parameters you want to include in your algorithm.
  • Acceptance criteria: You need to know when your model will be good enough. This is usually expressed in terms of accuracy.
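Two of the items above can be sketched in code: codifying and unit testing the prediction target, and normalizing data into the same range. The churn definition, the 90-day window, and all names here are hypothetical choices of mine. The point is only that the agreed definition lives in one tested function.

```python
import unittest
from datetime import date, timedelta


def churned(last_purchase, as_of, window_days=90):
    """Hypothetical prediction target: no purchase for more than window_days."""
    return (as_of - last_purchase) > timedelta(days=window_days)


def min_max_normalize(values):
    """Rescale values into the range [0, 1] so features are comparable."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # avoid dividing by zero on constant data
    return [(v - low) / (high - low) for v in values]


class PredictionTargetTest(unittest.TestCase):
    def test_exactly_at_window_is_not_churned(self):
        # 2019-01-01 to 2019-04-01 is exactly 90 days.
        self.assertFalse(churned(date(2019, 1, 1), date(2019, 4, 1)))

    def test_one_day_past_window_is_churned(self):
        self.assertTrue(churned(date(2019, 1, 1), date(2019, 4, 2)))


class NormalizationTest(unittest.TestCase):
    def test_values_land_between_zero_and_one(self):
        self.assertEqual(min_max_normalize([10, 20, 30]), [0.0, 0.5, 1.0])
```

Saved as a file, the tests can be run with python -m unittest. When the team changes its mind about the target, the failing test makes that change explicit.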

Include data points from a wide variety of situations so you don't introduce bias into your model, and be careful not to encode biases that already exist in the system. In some cases you can algorithmically generate more data from a smaller data set (for example, by applying distortion to text input).
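A rough sketch of generating extra data by distorting text input. Swapping adjacent characters is just one distortion I picked to imitate typos. The example texts and labels are invented.

```python
import random


def swap_adjacent_chars(text, rng):
    """Distort text by swapping one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(examples, copies, seed=0):
    """Return the original examples plus `copies` distorted variants of each one."""
    rng = random.Random(seed)  # seeded so the generated data is reproducible
    augmented = list(examples)
    for text, label in examples:
        for _ in range(copies):
            augmented.append((swap_adjacent_chars(text, rng), label))
    return augmented


# Hypothetical labeled support messages.
seed_examples = [("refund please", "billing"), ("app crashes", "bug")]
training_set = augment(seed_examples, copies=3)
print(len(training_set))
```

Each variant keeps the label of the example it came from, so this only helps when the distortion does not change the correct answer.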

When choosing features, mix your own intuition with statistical analysis.

Could a human expert confidently predict the outcome given these pieces of information?
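For the statistical side of choosing features, one simple check is how strongly each candidate correlates with the target. This sketch uses Pearson correlation on invented housing data. The feature names and numbers are my assumptions, and correlation only captures linear relationships.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no signal
    return covariance / (spread_x * spread_y)


# Hypothetical candidate features and target (price in thousands).
candidates = {
    "square_meters": [50, 70, 90, 110, 130],
    "house_number": [12, 3, 45, 7, 28],
}
price = [150, 200, 260, 310, 370]

# Rank features by the strength of their correlation with the target.
ranked = sorted(
    ((name, pearson(values, price)) for name, values in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:.2f}")
```

Intuition says house number should not predict price, and the weak correlation agrees. When intuition and the statistics disagree, that is worth investigating.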

You need to evaluate the quality of the data for long-term consistency. This is an ongoing process.

3. Train the Model

Libraries will often do this for you. The process is also usually well-documented.

4. Productize the Model

PMML (Predictive Model Markup Language) is a standardized, XML-based language for expressing a machine learning model and its results. Most languages have packages that can interpret it.

After training your model, you need to verify that it's accurate. The first step is to measure the accuracy of your model against a test set of data.
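A minimal sketch of measuring accuracy against a held-out test set. The threshold "model," the 25% test fraction, and the data are all assumptions for illustration. What matters is that the test rows are never used for training.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction that the model never trains on."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_threshold(rows):
    """Toy training step: put a cutoff halfway between the two class means."""
    positives = [x for x, label in rows if label]
    negatives = [x for x, label in rows if not label]
    cutoff = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: x > cutoff


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the known answer."""
    return sum(model(x) == label for x, label in rows) / len(rows)


# Hypothetical labeled data: (value, answer).
rows = [(v, False) for v in range(1, 7)] + [(v, True) for v in range(11, 17)]

train_rows, test_rows = train_test_split(rows)
model = fit_threshold(train_rows)
test_accuracy = accuracy(model, test_rows)
print(test_accuracy)
```

The resulting number is what gets compared against the acceptance criteria agreed on earlier.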

You can also release your model in "shadow mode," where it's making predictions but isn't actually sharing those with the user. You need to do this every time you update your model.
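Shadow mode could be wired up roughly like this. It is a sketch under my own assumptions: the function names, the in-memory log, and the two toy models are hypothetical, and a real system would log to durable storage.

```python
def serve(request, live_model, shadow_model, shadow_log):
    """Answer with the live model while recording what the shadow model would have said."""
    live_answer = live_model(request)
    try:
        shadow_answer = shadow_model(request)
        shadow_log.append({"request": request, "live": live_answer, "shadow": shadow_answer})
    except Exception as error:
        # A failing shadow model must never affect what the user sees.
        shadow_log.append({"request": request, "live": live_answer, "error": repr(error)})
    return live_answer  # the user only ever receives the live answer


def agreement_rate(shadow_log):
    """Share of logged requests where the shadow model matched the live answer."""
    compared = [entry for entry in shadow_log if "shadow" in entry]
    if not compared:
        return 0.0
    return sum(entry["live"] == entry["shadow"] for entry in compared) / len(compared)


# Hypothetical models: the current rule and a new candidate with a lower cutoff.
def current_rule(amount):
    return amount >= 10


def candidate_model(amount):
    return amount >= 8


shadow_log = []
answers = [serve(amount, current_rule, candidate_model, shadow_log) for amount in (5, 9, 12)]
print(answers, agreement_rate(shadow_log))
```

Reviewing the logged disagreements before switching over is what makes the rollout safe.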

You'll also want to set up alerts on your algorithm's accuracy and automate the pipeline as much as possible.
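An accuracy alert can be as simple as tracking a rolling window of outcomes and firing when accuracy drops below a threshold. The class name, window size, threshold, and alert callback below are my assumptions. A real setup would hook into whatever monitoring system is already in place.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and alert when it drops."""

    def __init__(self, window, threshold, on_alert):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.threshold = threshold
        self.on_alert = on_alert

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, predicted, actual):
        """Store whether a prediction was right; alert once the window is full and too low."""
        self.outcomes.append(predicted == actual)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.accuracy() < self.threshold:
            self.on_alert(f"accuracy dropped to {self.accuracy():.2f}")


alerts = []
monitor = AccuracyMonitor(window=4, threshold=0.75, on_alert=alerts.append)

# Four correct predictions, then two wrong ones in a row.
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (1, 0)]:
    monitor.record(predicted, actual)

print(alerts)
```

Waiting for a full window avoids noisy alerts from the first few predictions after a deploy.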

More likely than not, your first model will be bad. Treat it as a process and be willing to iterate.

Looking ahead, TensorFlow lets you deploy a model to GKE (Google Kubernetes Engine) and then make predictions in real time. If you're just starting out, TensorFlow has a low barrier to entry.