Considerations in Building Production ML

The terms “machine learning” (ML) and “deep learning” have been buzzwords for a few years now, following the rise of data and computing power. While it seems cool to get on the ML bandwagon, the first step for a company is to assess whether the business can really benefit from it (a separate post in itself). Once your company has decided that ML is a necessary next step, it’s time for you, an ML engineer, to consider what truly goes into building ML processes for production systems. Hopefully, this post can shed light on some of those considerations.

In this post, whenever the term “startup” is mentioned, a software “product” company is implied, unless a software “service” company is specifically mentioned. A software product company is focused on building its own software product or products, while a software service company (like an agency or consulting firm) builds software for clients. While this post was written with ML engineers at early-stage software product startups in mind, some of the considerations might still be applicable to other stages or company types.

Finding the tools that work for you

There are multiple ML software options out there, from open-source tools like PyTorch, TensorFlow, and scikit-learn to managed ML services like Google AI Platform, Amazon SageMaker, and Azure Machine Learning. With PyTorch and TensorFlow alone, there are a host of open-source libraries like Hugging Face’s Transformers that provide ready-to-use models as a starting point. In addition, ML research papers in different domains are published every year, and some of them provide the code as open source. Papers With Code is a fine resource for finding papers with code.

Key factors to consider in choosing the right tools are:
- The quality of documentation
- The development status of the tool (maintained vs. halted or deprecated, severity of issues, etc.)
- The ecosystem of other tools surrounding the tool
- The developer community engaged with the tool
- Your team’s own familiarity with the tool
- The size of your team that will use the tool
- The monetary costs involved with the tool

Personally, I feel that if you’re an early-stage startup, you don’t have to have all of these factors weighed and figured out. You can always begin with a strong candidate tool and go from there. Also, it’s alright to use a managed ML service from the start if you feel that the benefits would outweigh the costs.

When to work on ML vs. the processes that surround ML

When first starting out, it’s good to start with a simple baseline model. Starting with a simpler model will help you debug issues with your pipeline and help you determine if a more time-consuming solution is even worth it. So how do you come up with a simple baseline model?

To start, “simple” is relative. In some cases, a simple baseline really is simple, such as hardcoding a few heuristics. In other cases, the model itself might be complex but easy to apply: some of the most widely used datasets have open-source state-of-the-art models, listed in research papers or on leaderboards; the Stanford Question Answering Dataset (SQuAD) has such a leaderboard. One way to begin is to look at a few of the top solutions and see if you can find code attached to the linked research papers.
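As a concrete (and entirely hypothetical) illustration, a baseline for an intent-classification task could be nothing more than a few hardcoded keyword rules, measured against a handful of labeled examples before any model training begins:

```python
# A minimal heuristic baseline for a hypothetical intent-classification task.
# The keywords, labels, and examples are illustrative, not from any real product.

def baseline_intent(message: str) -> str:
    """Classify a user message with hardcoded keyword rules."""
    text = message.lower()
    if any(word in text for word in ("price", "cost", "how much")):
        return "pricing"
    if any(word in text for word in ("hi", "hello", "hey")):
        return "greeting"
    return "other"

# Measure baseline accuracy on a tiny labeled set before reaching for a model.
examples = [
    ("Hello there!", "greeting"),
    ("How much does the premium plan cost?", "pricing"),
    ("Tell me about your company", "other"),
]
accuracy = sum(baseline_intent(m) == y for m, y in examples) / len(examples)
```

If a heavier model can’t clearly beat a number like this, the extra training and serving cost probably isn’t worth it yet.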

At an early-stage startup, you may not have the time to build out ML processes immediately. Usually, you need to focus on getting something up and running that your investors and customers will readily see. Refined processes are rarely on their minds. So don’t worry about perfection in your initial deployments; there just needs to be a working outcome — a visible end product. You can worry more about the processes surrounding ML after the base product has been built, when there is usually more downtime to apply small, incremental improvements to the ML processes.

Conversely, if you’re an agency, there’s less room for mistakes, since you’re shipping completed products to various clients and need to fix all the bugs beforehand. After one client’s product (or set of products) ships, you’re moving on to the next contract and usually don’t have the bandwidth for further improvements. At the same time, you still have to move fast at an agency, and more refined ML processes help you move faster. Hence, for the agency model, spending more time upfront on refinement and automation can save time in the long run.

Considerations in experiment management

Managing experiments in ML is not a trivial task; it’s easy for your project workspace to get messy when you’re in the midst of running as many experiments as you can. Yet at an early-stage startup, you don’t have months to run hundreds of experiments. You just have to push for something better and then make updates as soon as possible. Regardless, having some sort of experiment management in play is better than having none at all. Here are a number of items to consider when managing ML experiments.

Model versioning

At Toucan AI, we use GitHub to store versions of our code. GitHub is great, but it’s not made for versioning large data files. Although a repository can be as large as 100GB, GitHub recommends keeping the repository’s size under 1GB; furthermore, a single file cannot exceed 100MB.

You can use other cloud storage options like Google Cloud Storage or Amazon S3. Using the cloud provider’s command-line tools or web user interface, simply create a bucket (folder) that allows the versioning of its objects (files or folders). However, additional manual work is required if you wish to sync the files you have in cloud storage with your project repository on GitHub.

We therefore opted for the most natural integration that combines the best of what Git platforms and other cloud storage options have to offer: Data Version Control (DVC), tagged as an “Open-source Version Control System for Machine Learning Projects.” DVC is a command-line tool with subcommands that are very similar to Git subcommands. After Git platform and cloud storage setup, you can run DVC add and push commands to version and store a file or folder in cloud storage. Meanwhile, your large data files are tracked within your Git project repository via DVC file references. One advantage of DVC is that only a few extra Git-like commands are necessary, which is not a huge break from your existing Git workflow.

Documenting experiments

If you’re hyperparameter-tuning, it’s easy to lose sight of the specific settings you ran for a certain model on a set date. You might also have to look back at what you did to prepare or preprocess the dataset required by said model. Your Jupyter Notebooks have descriptive filenames, but it still takes quite a few seconds to process what happened first or if you applied preprocessing A or B to experiment 7.

A solution is to number your notebooks as part of the filename as you create new ones (I like to use a prefix like 01_), and you can always renumber them later. Having a clear naming convention and numbering your notebooks is extremely helpful for your teammates (and future you!) to understand how you went about your experiments. In addition to numbering notebooks, we use the open-source platform MLflow to provide a web interface for viewing experiment hyperparameters and metric results.

In addition, aim for logical structure and conciseness when documenting experiments. Make good use of folder structure and names to organize your notebooks and training scripts for one model over another. Assume that when a reader views your notebook, they’ll read it from top to bottom, so remove any “scratch” cells that you’ve inserted temporarily. As a rule of thumb, limit the trials in a notebook to one model and one dataset, and create a new notebook if the current one is getting too long. Try to keep your final notebook versions free of training or inference code; these should be placed in separate scripts that can be called within notebooks. Lastly, when using software like MLflow that generates experiment records, attempt to automate referencing the notebook that ran an experiment to the generated experiment output files.

Testing framework

A better metric result does not always translate into improved inference performance on your real-world examples. Furthermore, in a production ML system, ML models do not operate in isolation: for instance, you might have heuristics, preprocessing, and caching as part of the pipeline. So when trying to improve the ML models you already have, you realize that coming up with inference examples suitable for the real world takes a substantial amount of time. You need to dig into the larger production code to discover at what points the model you’re trying to improve is actually called. Then you want to check inputs and outputs not only for the model itself but for the entire ML system pipeline as well. How does your “better” model affect the overall system, for better or worse?

In order to focus on model improvements rather than coming up with inference examples or worrying about breaking something in the production pipeline, we need to have an automated system or end-to-end testing framework in place.

At Toucan AI, since our primary product is an AI sales agent, testing sample conversations that cover the main branches of logic is sufficient and also provides a form of regression testing. We’re currently working on a command-line interface (CLI) tool that will run pytest assertions on a series of sample conversations. With one command, all conversations can be tested, and if any test cases break, we can then manually update the tests or deem that our “better” models are not actually better-fit for production.
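As a sketch of what such conversation-level tests might look like (the agent_reply function below is a trivial stand-in for a real pipeline entry point, and the sample cases are invented):

```python
# Conversation-level regression tests, sketched with plain asserts in the
# style pytest collects. `agent_reply` stands in for the full pipeline
# (heuristics, models, caching and all); here it is a stub so the example
# is self-contained.

def agent_reply(message: str) -> str:
    if "price" in message.lower():
        return "Our plans start at $49/month."
    return "Could you tell me more about what you're looking for?"

# Each case pairs a user turn with a check on the agent's reply. Checks are
# predicates rather than exact strings, so minor wording changes don't
# break the suite.
SAMPLE_CONVERSATION = [
    ("What's the price?", lambda reply: "$49" in reply),
    ("I want something for my team", lambda reply: len(reply) > 0),
]

def test_sample_conversation():
    for message, check in SAMPLE_CONVERSATION:
        assert check(agent_reply(message)), f"failed on: {message!r}"
```

Swapping an experimental model into the pipeline and rerunning a suite like this immediately shows whether the “better” model breaks any of the main branches of conversation logic.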

In short, having a testing framework in place is crucial to understanding how your current and experimental models would perform in your production ML system. With a decent testing framework in place, your model improvement pipeline should move more efficiently, allowing you to run a greater number of experiments than before.

Working with rapidly evolving tools

It’s difficult to take code from a rapidly evolving library and write it into your production system that uses a modified older version of the library. How do you modify a rapidly evolving library to fit your needs and apply its latest updates as efficiently as possible?

I feel that there’s no right answer, just many different paths. One path is to combine bits of their code and bits of your code and make it work. Another path is to take their code and upgrade your older version entirely, but this usually ends up taking longer. In short, consider how much time you have for refactoring and whatever the priority is. Focus on the priority and worry about a complete refactoring later, after your own codebase and the rapidly evolving tool are more stable.

Experiment cleanup

It’s often easy to overlook tidiness when your focus is on getting results. You think about the next set of experiments to run, along with its set of hyperparameters. An error occurred? No problem, change the timestamp on the output folder and rerun the experiment. However, what you end up having are extra files or folders that were generated due to incomplete trials. Then future you ends up scratching their head while scrolling through a long list of records in MLflow, looking for experiments that ran to completion.

The solution is to automate the deletion of all trial runs that you don’t want to keep. For instance, it’s best to delete runs that fail before the first training iteration completes. We should all try our best to keep the pool of experiments clean for our teammates and future selves.

A separation of concerns

As you research and try out various ML projects with the hopes of improving your models, you’ll come across conflicting Python package requirements. You might initially have one cloud server shared between two developers, but that quickly becomes inconvenient as your installs might overwrite your teammate’s.

Enter the Docker ecosystem, a lightweight containerization software platform to manage your project environments and dependencies. With separate Docker containers for each ML model and application service, we can proactively reduce the number of “it works on my machine” issues and prevent dependency conflicts between projects. Instead of setting up more development servers, each of your teammates can set up their own Docker containers on a single shared server, if that happens to be more cost-effective.

Furthermore, you might be wondering, why choose Docker over Conda, when Conda also lets you create separate environments with different package versions? We chose Docker because the tools it provides are better for production and for working with the cloud. If you were to use Conda on a remote machine, you’d have to connect with the machine first and deal with file transfers. But with just a few commands in Docker, you can make changes to your local files and have them reflected in the remote machine’s Docker container. Moreover, everything required to run the project is specified in the Dockerfile or Docker Compose file.

With Conda, on the other hand, it’s not as clear whether additional steps are required without referring to a README. Lastly, with the power of Docker Compose, if an ML project needs other services to run, they can run separately in other Docker containers and communicate with each other based on the Docker Compose file settings. Conda offers no comparable mechanism for wiring separate environments together as communicating services.
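As an illustration of that last point, a hypothetical Docker Compose file might wire an API service to a separate embeddings service like so (service names, paths, and ports are made up, not our actual setup):

```yaml
# docker-compose.yml (illustrative; service names and ports are invented)
services:
  api:
    build: ./api            # the application server
    ports:
      - "8000:8000"
    depends_on:
      - embeddings
  embeddings:
    build: ./embeddings     # its own Dockerfile pins this model's dependencies
    # On the default Compose network, containers reach each other by service
    # name, e.g. the api container could call http://embeddings:5000.
```

Each service keeps its own dependency set in its own image, yet a single command brings up the whole stack.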

Preparing to scale if needed

As an early-stage startup, you might not need to scale at this moment, but it’s great to start thinking about technologies that do scale. One of those technologies is Celery, an asynchronous task queue system that can distribute tasks across multiple workers. We currently have a single worker for each type of service (server, client, the embeddings model, etc.), but if necessary, it shouldn’t take much effort to spin up more workers for the same service. Is caching via embeddings the bottleneck? No problem, let’s spin up another embedding Celery worker or increase the current worker’s concurrency count, which allows a number of child processes to run in parallel. In our configuration at Toucan AI, a single Celery worker is run within a Docker container, thereby also following a separation of concerns.

Apart from allowing your production system to scale, Celery is great for executing long-running tasks such as ML model inference. Rather than letting the server response hang, the response (the agent’s reply) can be returned immediately to the end user conversing with the Toucan AI agent, while an asynchronous task, such as the caching mechanism, runs quietly in the background. Moreover, we use Celery beat, which supports cron-style scheduling, to run our analytics worker task daily.
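Celery does this across processes and machines via a message broker; the underlying fire-and-forget pattern can be sketched in-process with the standard library (the function names below are illustrative stand-ins, not our actual code):

```python
# The fire-and-forget pattern Celery provides, sketched in-process with a
# thread pool. In production, Celery sends the task to a broker and a
# separate worker process picks it up.
from concurrent.futures import ThreadPoolExecutor

background = ThreadPoolExecutor(max_workers=2)  # raise max_workers to "scale"

def cache_embeddings(message: str) -> None:
    """Stand-in for a slow background task, e.g. computing and caching
    embeddings for the incoming message."""
    ...

def handle_message(message: str) -> str:
    reply = "Thanks! Let me look into that."      # computed quickly
    background.submit(cache_embeddings, message)  # runs after we return
    return reply
```

The caller gets its reply immediately; the slow work happens off the request path, which is exactly the behavior Celery gives you at production scale.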

Collaboration with teammates and future you

With the frequency of ML research publications, how do you as an ML engineer stay in the loop of what models or techniques your teammates have been working on or already tried? There is no magic spell to transfer all the knowledge, experience, and insight they’ve gained over to you. But what you can do is to communicate. And to communicate often.

Communicate whenever you can, especially in writing. Oftentimes you’re working on your own projects, so what you’re doing right now might not be completely relevant to what your teammate is working on. Nevertheless, there may come a time in the future when they need to review or extend what you’ve implemented. That might even be you months down the line, having to make changes to your own project, the meat of which you’ve already forgotten. So document, document, document. I can’t stress that enough.

On another note, there will certainly be times when the written word is not enough. Being mindful of your teammate’s “in the zone” mental focus, seek a discussion with them regarding the direction of a project, when you have general uncertainty, when you want their honest opinion, and whenever you feel that spoken words are a more effective medium of communication. Clarity from the get-go is extremely important in order to prevent misunderstandings, fruitless work, or feelings of dread and remorse.

Internal struggles as an ML engineer

As an ML engineer, you’ll have to learn to juggle your thoughts about wanting to fix things and make processes better against what needs to get done right now. You’ll have to learn to accept taking the most immediate route to getting things done. For example, I’d have loved to spend time improving a third party’s training/evaluation code, but at the time, I just needed the shortest path to see if our inference results would improve.

Coming from a web development background, I mostly had to write my own code, but in ML engineering, I mostly have to learn how to apply other people’s code. When you’re constantly working with code that is not your own — often code that students and researchers have spent months or years researching — sometimes it’s hard not to feel lost or incompetent, especially if you wander too much into trying to understand aspects of their code that don’t directly have to do with deploying their models into your production system.

At the end of the day, just keep in mind that we’re curious creatures by nature, that it’s alright to want to learn more than you need to. If there are avenues you want to explore, transparency with your teammates is key. A nontoxic work environment would not reprimand you for wanting to learn more if you’re meeting goals in a timely enough manner. As long as your priorities are in sight, try to worry less and have fun.


Building ML processes for production systems is not straightforward. Despite all the considerations mentioned in this post, at times, the best decision you can make is to simply make a decision. And if it doesn’t work out, move on to the next path. Regardless, hopefully this post has helped shed some light on the various considerations.

What are some considerations in building ML processes for production systems that you feel strongly about? I’m sure there are many we’ve missed, including data gathering, cleaning, and preprocessing. Please feel free to send us your thoughts!

- Marifel