Data Collection + Evaluation
Does our training dataset have the features and breadth to ensure our AI meets our users’ needs?
Should we use an existing training dataset or develop our own?
How can we ensure that raters aren’t injecting error or bias into datasets when generating labels?
Want to drive discussions, speed iteration, and avoid pitfalls? Use the worksheet.
In order to make predictions, AI-driven products must teach their underlying machine learning model to recognize patterns and correlations in data. These data are called training data, and can be collections of images, videos, text, audio and more. You can use existing data sources or collect new data expressly to train your system. For example, you might use a database of dog images from a shelter to train an ML model to recognize common dog breeds.
The training data you source or collect, and how those data are labeled, directly determines the output of your system — and the quality of the user experience. Once you’re sure that using AI is indeed the right path for your product (see User Needs + Defining Success) consider the following:
➀ Translate user needs into data needs. Determine the type of data needed to train your model. You’ll need to consider predictive power, relevance, fairness, privacy, and security.
➁ Source your data responsibly. Whether using pre-labeled data or collecting your own, it’s critical to evaluate your data and their collection method to ensure they’re appropriate for your project.
➂ Design for raters & labeling. For supervised learning, having accurate data labels is crucial to getting useful output from your model. Thoughtful design of rater instructions and UI flows will help yield better quality labels and therefore better output.
➃ Tune your model. Once your model is running, interpret the ML output to ensure it’s aligned with product goals and user needs. If it’s not, then troubleshoot: explore potential issues with your data.
Datasets used to train AI models contain examples. Each example contains one or more features and, in the case of supervised learning, a label.
The scope of the features, the quality of the labels, and the representativeness of the examples in your training dataset all affect the quality of your AI system.
Imagine a dataset of runs that an app could use to train an ML model to predict how enjoyable a given run will be. Here’s how examples, features, and labels could affect the quality of that model (a small illustrative sketch follows these points):
If examples used to train the run recommendation algorithm only come from elite runners, then they would likely not be useful in creating an effective model to make predictions for a wider user base. However, they may be useful in creating a model geared towards elite runners.
If the elevation gain feature were missing from the dataset, then the ML model would treat a 3.0 mile uphill run the same as a 3.0 mile downhill run, even though the human experience of these runs is vastly different.
Labels that reveal the subjective experience of the runners are necessary to help the system to identify the features that are most likely to result in a fun run.
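To make these terms concrete, here’s a minimal sketch of what such a training set could look like in Python. The column names and values below are illustrative assumptions, not data from a real product.

```python
import pandas as pd

# Hypothetical training examples for a run-recommendation model.
# Each row is one example; every column except "enjoyed" is a feature,
# and "enjoyed" is the label capturing the runner's subjective experience.
runs = pd.DataFrame({
    "distance_miles":    [3.0, 3.0, 6.2, 1.5],    # feature
    "elevation_gain_ft": [450, -450, 120, 10],    # feature: uphill vs. downhill
    "pace_min_per_mile": [10.5, 8.9, 9.7, 12.0],  # feature
    "avg_heart_rate":    [162, 140, 155, 128],    # feature
    "enjoyed":           [0, 1, 1, 1],            # label: did the runner report a fun run?
})

features = runs.drop(columns=["enjoyed"])
labels = runs["enjoyed"]
print(features.head(), labels.head(), sep="\n")
```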
When deciding which examples, features, and labels are needed to train your ML model, work through your data needs on a conceptual level, as shown in the example below.
The example shows the data needs breakdown for a product that aims to solve the user need of “I want to fit runs into my busy schedule.”
User: Runners
User need: Run more often
User action: Complete runs with or while using the app
ML system output: What routes to suggest and when to suggest them
ML system learning: Patterns of behavior around accepting run prompts, completing runs, and improving consistency
Datasets needed: Running data from the app, demographic data, physiological data, and local geographic data
Key features needed in dataset: Runner demographics, time of day, run completion rate, pace, distance run, elevation gained, heart rate
Key labels needed in dataset: Runner acceptance or rejection of an app suggestion, user-generated feedback on why a suggestion was rejected, and enjoyment of recommended runs
Once you have an idea of the type of data you will need, use Google’s AI Principles and Responsible AI Practices as a framework to work through key considerations, including the ones described below.
As with any product, protecting user privacy and security is essential. Even in the running-related example above, the physiological and demographic data required to train this model could be considered sensitive.
There are a number of important questions that arise due to the unique nature of AI and machine learning. Below are two such questions, but you should discuss these and others with privacy and security experts on your team.
What limits exist around user consent for data use?
When collecting data, a best practice, and a legal requirement in many countries, is to give users as much control as possible over what data the system can use and how data can be used. You may need to provide users the ability to opt out or delete their account. Ensure your system is built to accommodate this.
Is there a risk of inadvertently revealing user data? What would the consequences be?
For example, though an individual’s health data might be private and secure, if an AI assistant reminds the user to take medication through a home smart speaker, this could partially reveal private medical data to others who might be in the room.
Aim for
Take extra steps to protect privacy (anonymize names, for example, even if people agreed to have their name used in community reviews) when personal details (such as where people live) could be exposed as part of AI recommendations or predictions.
Avoid
Don’t assume basic data policies are enough to protect personal privacy. In this case, the runner agreed to expose her name in community reviews, but because she often starts runs from the same spot, another user could infer where she lives.
To build a product that works in a given context, use datasets that reliably reflect that context. For example, for a natural language understanding model meant to work for speech, it wouldn’t be helpful to use words that users type into a search engine as training data, because people don’t type searches the same way they talk.
If your training data aren’t properly suited to the context, you also increase the risk of overfitting or underfitting your training set. Overfitting means the ML model is tailored too specifically to the training data, and it can stem from a variety of causes. A model that has overfit the training data makes great predictions on the training data but performs worse on the test set or on new data.
Models can also make poor predictions due to underfitting, where a model hasn’t properly captured the complexity of the relationships among the training dataset features and therefore can’t make good predictions with training data or with new data.
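One quick diagnostic is to compare how a model performs on its training data versus held-out data. The sketch below does this with a generic scikit-learn classifier on synthetic data; it illustrates the check itself, not a recommended model or dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice this would be your own training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# A model that is near-perfect on training data but much weaker on held-out
# data has likely overfit; poor accuracy on both suggests underfitting.
print(f"train accuracy: {model.score(X_train, y_train):.2f}")
print(f"validation accuracy: {model.score(X_val, y_val):.2f}")
```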
There are many resources, such as these from Google AI, that can help the software engineers and research scientists on your team understand the nuances of training ML models and avoid overfitting and underfitting. But first, involve everyone on your product team in a conceptual discussion about the examples, features, and labels that are likely required for a good training set. Then, talk about which features are likely to be most important based on user needs.
At every stage of development, human bias can be introduced into the ML model. Data is collected in the real world, from humans, and reflects their personal experiences and biases — and these patterns can be implicitly identified and amplified by the ML model.
While the guidebook provides some advice related to ML fairness, it is not an exhaustive resource on the topic. Addressing fairness in AI, and minimizing unfair bias, is an active area of research. See Google’s Responsible AI Practices for recent ML fairness guidance and recommended practices.
Here are some examples of how ML systems can fail users:
Representational harm, when a system amplifies or reflects negative stereotypes about particular groups
Opportunity denial, when systems make predictions and decisions that have real-life consequences and lasting impacts on individuals’ access to opportunities, resources, and overall quality of life
Disproportionate product failure, when a product doesn’t work or gives skewed outputs more frequently for certain groups of users
Harm by disadvantage, when a system infers disadvantageous associations between certain demographic characteristics and user behaviors or interests
While there is no standard definition of fairness, and the fairness of your model may vary based on the situation, there are steps you can take to mitigate problematic biases in your dataset.
Your training data should reflect the diversity and cultural context of the people who will use it. Use tools like Facets to explore your dataset and better understand its biases. In doing so, note that to properly train your model, you might need to collect data in equal proportions from user groups that don’t exist in equal proportions in the real world. For example, for speech recognition software to work equally well for all users in the United States, the training dataset might need to draw 50% of its data from non-native English speakers, even though they are a minority of the population.
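Before or alongside a tool like Facets, a lightweight first check is simply to compare group proportions in your dataset against the population you intend to serve. A minimal sketch, assuming a hypothetical speaker_type column:

```python
import pandas as pd

# Hypothetical speech dataset with one row per recording.
speech_data = pd.DataFrame({
    "speaker_type": ["native"] * 850 + ["non_native"] * 150,
})

# Share of each group in the training data...
dataset_share = speech_data["speaker_type"].value_counts(normalize=True)

# ...compared with the share you want the model to serve well.
target_share = pd.Series({"native": 0.5, "non_native": 0.5})

print(pd.DataFrame({"dataset": dataset_share, "target": target_share}))
# A large gap between the two columns is a signal to collect more data
# from the under-represented group before training.
```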
There is no such thing as truly neutral data. Even in a simple image, the equipment and lighting used shapes the outcome. Moreover, humans are involved with data collection and evaluation, and so, as with any human endeavor, their output will include human bias. See more in the section on labeling below.
For example, say you’re creating a recommendation system to recommend new health and fitness goals to users. If the intent is to set goals that are safely achievable by users with a wide range of baseline fitness levels, it’s important that the training dataset includes data from a variety of user types and not just young, healthy people.
For more on the topic of fairness, see Google’s Machine Learning Fairness Overview and Crash Course.
Once your team has a high-level understanding of the data your product needs, work through your own translation between specific user needs and the data needed to produce them.
Try to be as specific as possible during this step. This will have a direct impact on which user experiences your team decides to spend your resources on going forward.
Apply the concepts from this section using Exercise 1 in the worksheet
Once you’ve identified the type of training data you need, you will figure out how and where to get it. This could mean using an existing dataset, collecting your own data, or a combination of the two. Make sure that whatever you decide, you have permission to use this data and the infrastructure to keep it safe.
It may not be possible to build your dataset from scratch. As an alternative, you may need to use existing data from sources such as Google Cloud AutoML, Google Dataset Search, Google AI datasets, or Kaggle. If you’re considering supervised learning, this data may be pre-labeled or you may need to add labels (see more on labeling, below). Be sure to check the terms of use for the dataset and consider whether it’s appropriate for your use case.
Before using an existing dataset, take the time to thoroughly explore it using tools like Facets to better understand any gaps or biases. Real data is often messy, so you should expect to spend a fair amount of time cleaning it up. During this process, you may detect issues, such as missing values, misspellings, and incorrect formatting. For more information on data preparation techniques, check out the developer guidelines on Data Preparation. They can help you make sure that this data will be able to help you deliver the user experience you identified at the outset.
When creating your own dataset, it’s wise to start by observing someone who is an expert in the domain your product aims to serve — for example, watching an accountant analyze financial data, or a botanist classify plants. If you can interview them as they think or work through the non-ML solution to the problem, you may be able to pick up some insights into which data they look at when making a decision or before taking an action.
You’ll also want to research available datasets that seem relevant and evaluate the signals available in those datasets. You may need to combine data from multiple sources for your model to have enough information to learn.
Once you’ve gathered potential sources for your dataset, spend some time getting to know your data. You’ll need to go through the following steps (a brief code sketch of this inspection follows the list):
Identify your data source(s).
Review how often your data source(s) are refreshed.
Inspect the features’ possible values, units, and data types.
Identify any outliers, and investigate whether they’re actual outliers or due to errors in the data.
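Here’s what that inspection could look like, assuming the data have been loaded into a pandas DataFrame; the file name, column name, and outlier rule are illustrative assumptions.

```python
import pandas as pd

# Hypothetical export of running data; the file and column names are assumptions.
runs = pd.read_csv("runs.csv")

# Inspect feature data types, units, and plausible value ranges.
print(runs.dtypes)
print(runs.describe())

# Flag potential outliers in a numeric feature with the common 1.5 * IQR rule,
# then investigate whether each one is a real extreme value or a data error.
q1, q3 = runs["pace_min_per_mile"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = runs[(runs["pace_min_per_mile"] < q1 - 1.5 * iqr) |
                (runs["pace_min_per_mile"] > q3 + 1.5 * iqr)]
print(outliers)
```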
Understanding where your dataset came from and how it was collected will help you discover potential issues. The following are common dataset issues to look out for.
Real data can be messy! A “zero” value could be an actual measured “0,” or an indicator for a missing measurement. A “country” feature may contain entries in different formats, such as “US,” “USA,” and “United States.”
While a human can spot the meaning just by looking at the data, an ML model learns better from data that is consistently formatted.
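A minimal sketch of that kind of cleanup, using hypothetical column names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with two common problems: a zero that really means
# "not measured" and one country recorded in several formats.
df = pd.DataFrame({
    "resting_heart_rate": [62, 0, 71, 0],
    "country": ["US", "USA", "United States", "us"],
})

# Treat the sentinel zero as a missing value rather than a real measurement.
df["resting_heart_rate"] = df["resting_heart_rate"].replace(0, np.nan)

# Normalize country entries to a single consistent code.
country_map = {"us": "US", "usa": "US", "united states": "US"}
df["country"] = df["country"].str.lower().map(country_map).fillna(df["country"])

print(df)
```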
If you’re using the output of another ML system as an input feature to train your model, keep in mind that this is a risky data source. Errors associated with this feature will be compounded with your system’s overall error, and the further you are from the original training data, the more difficult it will be to identify error sources.
There’s more information on determining error sources in the chapter on Errors + Graceful Failure.
No matter what data you’re using, it’s possible that it could contain personally identifiable information. Some approaches to anonymizing data include aggregation and redaction. However, even these approaches may not be able to completely anonymize your data in all circumstances, so consider consulting an expert.
Aggregation is the process of replacing unique values with a summary value. For example, you might replace a user’s daily maximum heart rate readings (in beats per minute) for a month with a single value: their monthly average, or a categorical high / medium / low label.
Redaction removes some data to create a less complete picture. Such anonymization approaches aim to reduce the number of features available for identifying a single user.
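Here’s a minimal sketch of both approaches on hypothetical heart-rate data; the column names and bucket boundaries are illustrative assumptions, and real anonymization decisions should still involve a privacy expert.

```python
import pandas as pd

# Hypothetical per-day readings; start_location could reveal where a user lives.
daily = pd.DataFrame({
    "user_id":        ["u1", "u1", "u1", "u2", "u2", "u2"],
    "start_location": ["40.71,-74.00"] * 3 + ["37.77,-122.42"] * 3,
    "max_bpm":        [171, 158, 176, 142, 150, 139],
})

# Redaction: drop features that could identify a single user.
redacted = daily.drop(columns=["start_location"])

# Aggregation: replace each user's daily maximums with one summary value,
# then bucket it into a coarser high / medium / low category.
monthly = redacted.groupby("user_id", as_index=False)["max_bpm"].mean()
monthly["intensity"] = pd.cut(monthly["max_bpm"],
                              bins=[0, 145, 165, 220],
                              labels=["low", "medium", "high"])
print(monthly)
```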
And finally, you’ll need to split the data into training and test sets. The model is trained to learn from the training data, and then evaluated with the test data. Test sets are data that your model hasn’t seen before — this is how you’ll find out if, and how well, your model works. The split will depend on factors such as the number of examples in your dataset and the data distribution.
The training set needs to be large enough to successfully teach your model, and your test set should be large enough that you can adequately assess your model’s performance. This is usually the point when developers realize that adequate data can make or break the success of a model, so take the time to determine an appropriate split. A typical split might be 60% for training and 40% for testing. This lab from Google AI offers more details on data splitting.
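A minimal sketch of such a split using scikit-learn, with the 60 / 40 ratio mentioned above and synthetic stand-in data; in practice the right ratio depends on your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your features (X) and labels (y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 40% of examples for testing; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.4,     # 60% training, 40% testing
    random_state=42,   # make the split reproducible
    stratify=y,        # keep label proportions similar in both sets
)

print(len(X_train), len(X_test))  # 600 400
```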
Once your model has been trained and your product is being used in the real world, you can start collecting data in your product to continually improve your ML model. How you collect this data has a direct impact on its quality. Data can be collected implicitly in the background of user activity within your app or explicitly when you ask users directly. There are different design considerations for each, which are covered in depth in the Feedback + Control chapter.
ML-driven products require a lot of data to work, and this is often where product teams falter. Getting enough data to both train and test your ML model is critical to delivering a functional product. To get you started, answer the key questions below:
If you need to create a new dataset, how are you planning to collect the data?
If you have an existing dataset, what alterations or additions, if any, need to be made for your user population?
Apply the concepts from this section using Exercise 2 in the worksheet
For supervised learning, accurate data labels are a crucial ingredient for achieving relevant ML output. Labels can be added through automated processes or by people called raters. “Raters” is a generic term that covers a wide variety of contexts, skillsets, and levels of specialization. Raters could be:
Your users: providing “derived” labels within your product, for example through actions like tagging photos
Generalists: adding labels to a wide variety of data through crowd-sourcing tools
Trained subject matter experts: using specialized tools to label things like medical images
If people understand what you’re asking them to label, and why, and they have the tools to do so effectively, then they’re more likely to label it correctly.
Key considerations when designing for labeling:
Ensure rater pool diversity
Think about the perspectives and potential biases of the people in your pool, how to balance these with diversity, and how those points of view could impact the quality of the labels. In some cases, providing raters with training to make them aware of unconscious bias has been effective in reducing biases.
Investigate rater context and incentives
Think through the rater experience and how and why they are doing this task. There’s always a risk that they might complete the task incorrectly due to issues like boredom, repetition, or poor incentive design.
Evaluate rater tools
Tools for labeling can range from in-product prompts to specialized software. When soliciting labels in-product, make sure to design the UI in a way that makes it easy for users to provide correct information. When building tools for professional raters, the article First: Raters offers some useful recommendations like the ones below:
Use multiple shortcuts to optimize key flows. This helps raters move fast and stay efficient.
Provide easy access to labels. The full set of available labels should be visible and available to raters for each item they are asked to address. It should be fast and easy in the UI to apply the labels.
Let raters change their minds. Labeling can be complicated. Offer a flexible workflow and support for editing and out-of-sequence changes so that raters can seek second opinions and correct errors.
Auto-detect and display errors. Make it easy to avoid accidental errors with checks and flags.
Once you’ve collected data from raters, you’ll need to conduct statistical tests to analyze inter-rater reliability. A lack of reliability could be a sign that you have poorly-designed instructions.
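For two raters, one common reliability statistic is Cohen’s kappa. Here’s a hedged sketch using scikit-learn; the labels are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned to the same ten shoe images by two different raters.
rater_a = ["running", "running", "cleat", "running", "dance",
           "cleat", "running", "cleat", "running", "dance"]
rater_b = ["running", "cleat", "cleat", "running", "dance",
           "running", "running", "cleat", "running", "dance"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Values near 1.0 indicate strong agreement; values near 0 suggest the
# labeling instructions may be ambiguous or the task poorly defined.
```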
Aim for
Make rater instructions as specific and simple as possible. Here, the phrase “running shoes” easily rules out the selection of other athletic shoe types like soccer cleats.
Avoid
Don’t use instructions that can be interpreted multiple ways. Here, someone’s subjective definition of “athletic” might or might not include dance shoes, for example.
Before designing tools for your raters, research their needs the same way you would think about your end-users. Their motivation and ability to do their job well has a direct impact on everything else you build down the line.
Who are your raters?
What is their context and incentive?
What tools are they using?
Apply the concepts from this section in Exercise 3 in the worksheet
Once your model has been trained with your training data, evaluate the output to assess whether it’s addressing your target user need according to the success metrics you defined. If not, you’ll need to tune it accordingly. Tuning can mean adjusting the hyperparameters of your training process, the parameters of the model or your reward function, or troubleshooting your training data.
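For the hyperparameter part of tuning, one common approach is a simple grid search over candidate values. The sketch below shows the idea with a generic scikit-learn model and synthetic data; the model, grid, and scoring here are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a generic model; the grid values are placeholders.
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

# Try each combination of candidate hyperparameters and keep the best one,
# as judged by cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```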
To evaluate your model:
Use tools like the What-If Tool to inspect your model and identify blind spots.
Test, test, test on an ongoing basis.
In early phases of development, get in-depth qualitative feedback with a diverse set of users from your target audience to find any “red flag” issues with your training dataset or your model tuning.
As part of testing, ensure you’ve built appropriate and thoughtful mechanisms for user feedback. See more guidance for this in the Feedback + Control chapter.
You may need to build custom dashboards and data visualizations to monitor user experience with your system.
Be particularly careful to check for secondary effects that you may not have anticipated when determining your reward function, a concept covered in the User Needs + Defining Success chapter.
Try to tie model changes to a clear metric of the subjective user experience, like customer satisfaction or how often users accept a model’s recommendations (see the sketch after this list).
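For example, a metric like suggestion acceptance rate can be computed from product logs and compared across model versions. A minimal sketch, assuming a hypothetical log of suggestions:

```python
import pandas as pd

# Hypothetical log of run suggestions, the model version that produced each
# suggestion, and whether the user accepted it.
suggestions = pd.DataFrame({
    "model_version": ["v1", "v1", "v1", "v2", "v2", "v2"],
    "accepted":      [True, False, False, True, True, False],
})

# Acceptance rate per model version: a simple, user-facing metric to compare
# before and after a tuning change.
acceptance_rate = suggestions.groupby("model_version")["accepted"].mean()
print(acceptance_rate)
```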
Once you’ve identified issues that need to be corrected, you’ll need to map them back to specific data features and labels (or lack thereof), or model parameters. This may not be easy or straightforward. Resolving the problem could involve steps like adjusting the training data distribution, fixing a labeling issue or gathering more relevant data. Here’s a hypothetical example of how a team might tackle tuning:
Let’s say our running app was launching a new feature to calculate calories burned and make recommendations for mid-run changes to help users burn their target number of calories.
During beta testing, the team observed that users receiving these recommendations were far more likely than other users to quit mid-run. Moreover, users who followed the recommendations and completed the run were less likely to return for a second run within the same week. The product manager originally assumed that this feature was a failure, but after user interviews and a deeper look at the data, it turned out that the algorithm wasn’t properly weighting important data, like the outside temperature and a user’s weight, when calculating estimated calorie burn.
After some user research, it became clear that some users were quitting because they didn’t trust the calorie calculation and therefore didn’t see the point in accepting the app’s recommendations for mid-run changes.
The engineering team was able to re-tune the algorithm and launch a successful feature.
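One way a team could catch this kind of under-weighting is to check how much each feature actually contributes to the model’s predictions on held-out data, for example with permutation importance. The sketch below uses synthetic stand-in data and a generic scikit-learn model; it illustrates the diagnostic, not the app’s actual calorie estimator.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for calorie-burn training data; real features would
# include things like outside temperature and a user's weight.
X, y = make_regression(n_samples=1500, n_features=6, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# How much does validation performance drop when each feature is shuffled?
# Features the model barely uses (near-zero importance) may be under-weighted.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")
```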
Tuning is an ongoing process for adjusting your ML model in response to user feedback and issues that arise due to unforeseen circumstances. Tuning never stops, but it is especially important in the early phases of development.
What is our plan for doing early testing of our model?
Is our set of early beta users diverse enough to properly test our model?
What metrics will we use to determine if our tuning is successful?
Apply the concepts from this section on tuning by exploring the What-if tool. Explore more about tuning models in response to user feedback in the Feedback + Control chapter.
Data is the bedrock of any ML system. Responsibly sourced data, drawn from a relevant context and checked for problematic bias, will help you build better systems and therefore more effectively address user needs. Key considerations for data collection and evaluation:
➀ Translate user needs into data needs. Think carefully as a cross-functional team about what features, labels, and examples you will need to train an effective AI model. Work systematically to break down user needs, user actions, and ML predictions into the necessary datasets. As you identify potential datasets, or formulate a plan to collect them, you’ll need to be diligent about inspecting the data, identifying potential bias sources, and designing the data collection methods.
➁ Source your data responsibly. As part of sourcing data, you’ll need to consider relevance, fairness, privacy, and security. You can find more information in Google’s AI Principles and Responsible AI Practices. These apply whether you are using an existing dataset or building a new training dataset.
➂ Design for raters & labeling. Correctly labeled data is a crucial ingredient to an effective supervised ML system. Thoughtful consideration of your raters and the tools they’ll be using will help ensure your labels are accurate.
➃ Tune your model. Once you have a model, you will need to test and tune it rigorously. The tuning phase involves not only adjusting the parameters of your model, but also inspecting your data – in many cases, output errors can be traced to problems in your data.
Want to drive discussions, speed iteration, and avoid pitfalls? Use the worksheet
In addition to the academic and industry references listed below, recommendations, best practices, and examples in the People + AI Guidebook draw from dozens of Google user research studies and design explorations. The details of these are proprietary, so they are not included in this list.
Alexsoft. (2018, March 29). How to Organize Data Labeling for Machine Learning: Approaches and Tools.
Alexsoft. (2017, June 16). Preparing Your Dataset for Machine Learning: 8 Basic Techniques That Make Your Data Better.
Amazon Machine Learning Developer Guide. (2019). Amazon Web Services.
G, Y. (2017, August 31). The 7 Steps of Machine Learning.
Giffin, D., Levy, A., Stefan, D., Terei, D., Mazières, D., Mitchell, J., & Russo, A. (2017). Hails: Protecting data privacy in untrusted web applications. Journal of Computer Security, 25(4-5), 427-461.
Korolov, M. (2018, February 13). AI’s biggest risk factor: Data gone wrong.
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., … Gebru, T. (2019). Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* 19.
Rolfe, R., & May, S. (2018, June 22). First: Raters. Designing for AI’s unseen users can lead to better products in the long run.
Seif, G. (2018, July 6). How to collect your deep learning dataset.
Shapiro, D. (2017, November 6). Artificial Intelligence and Bad Data.
Shapiro, D. (2017, September 19). Artificial Intelligence: Get your users to label your data.
Smith, D. (2019, January 29). What is AI Training Data?
Teltzrow, M., & Kobsa, A. (2004). Impacts of User Privacy Preferences on Personalized Systems. Designing Personalized User Experiences in ECommerce Human-Computer Interaction Series, pp.315-332.
Yang, Q., Scuito, A., Zimmerman, J., Forlizzi, J., & Steinfeld, A. (2018). Investigating How Experienced UX Designers Effectively Work with Machine Learning. Proceedings of the 2018 on Designing Interactive Systems Conference 2018 - DIS 18.