A data strategy will enable your company to acquire, process, govern and gain value from data effectively. Without one, your team will expend more effort than necessary, risks will be magnified and your chances of success will be reduced.
“Data is the lifeblood of any AI system. Without it, nothing happens” (David Benigson, Signal). There are six components of an effective data strategy (Fig. 9):
Define your data strategy at the outset of your AI initiative. Review it quarterly and update it as product requirements change, your company grows or you are impacted by new legislation.
Fig. 9: The six components of an effective data strategy. Source: MMC Ventures
Obtaining data to develop a prototype or train your models can be a lengthy process. Ideally, you will possess all the data you need at the outset and have a data strategy to govern its access and management. In the real world, neither is likely. Working on the project may highlight missing data.
“Build access to data at scale from day one” (David Benigson, Signal). Filling the gaps from your own initiatives can take months, so use multiple approaches to accelerate progress. Developers typically draw on several approaches to source data (Fig. 10) including free resources (such as dataset aggregators), partnerships with third parties and the creation of new, proprietary data.
You will need to de-duplicate and merge your data from multiple sources into a single, consistent store. New data must follow a comparable process so your data remains clean. If you merge fields, or decrease the precision of your data, retain the original data. Being able to analyse gaps in your data will enable you to plan future data acquisition and prioritise addressable business use cases.
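As an illustration of this merging step, the sketch below assumes two hypothetical CSV exports with an overlapping ‘email’ field; the file and column names are ours, not prescribed:

```python
# Minimal sketch of merging and de-duplicating two hypothetical sources.
# File names and column names are illustrative, not from the report.
import pandas as pd

crm = pd.read_csv("customers_crm.csv")        # e.g. columns: email, name, region
survey = pd.read_csv("customers_survey.csv")  # e.g. columns: email, name, score

# Tag each record with its origin so the original data is never lost.
crm["source"] = "crm"
survey["source"] = "survey"

# Stack the sources, then drop duplicates on the shared key, keeping the
# first occurrence. Keep the raw frames for audit and gap analysis.
combined = pd.concat([crm, survey], ignore_index=True)
deduplicated = combined.drop_duplicates(subset=["email"], keep="first")

# Records present in only one source highlight gaps to fill later.
missing_from_crm = set(survey["email"]) - set(crm["email"])
print(f"{len(missing_from_crm)} survey contacts absent from the CRM export")
```

Tagging each record with its origin preserves the original data, and the gap analysis at the end feeds directly into planning future data acquisition.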
Fig. 10: How developers source data. Source: Kaggle
A high-quality data set has characteristics appropriate to your business challenge, minimises bias and offers training data labelled with a high degree of accuracy.
It is important to develop a balanced data set. If you possess significantly more samples of one type of output than another, your AI is likely to exhibit bias. You can decide whether your system’s bias will tend towards false positives or false negatives, but bias will be inevitable. There are three primary forms of bias in AI (Fig. 11):
Fig. 11: The three primary forms of bias in AI. Source: Victor Lavrenko
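To make imbalance visible before training, a simple check is to inspect class counts and derive inverse-frequency class weights. The sketch below assumes a labelled table with a hypothetical ‘label’ column:

```python
# Sketch: quantify class imbalance and derive weights to counteract it.
# The file name and 'label' column are illustrative assumptions.
import pandas as pd

df = pd.read_csv("training_data.csv")
counts = df["label"].value_counts()
print(counts)  # e.g. 9,500 'negative' vs 500 'positive' signals heavy imbalance

# Inverse-frequency weights: rarer classes receive proportionally more weight.
weights = {label: len(df) / (len(counts) * n) for label, n in counts.items()}
print(weights)  # pass these to a classifier that accepts class weights
```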
Be aware of bias in your data and models so you can take appropriate action and minimise its impact. Overfitting and underfitting can be addressed by adjusting data volumes and model structures. Unwanted correlations are frequently more critical to the business; in addition to producing erroneous results, they can lead to negative publicity. Test models thoroughly to ensure that variables that should not affect predictions do not do so. If possible, exclude these ‘protected variables’ from the models completely.
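One way to test that a variable which should not affect predictions genuinely does not is to shuffle that column and measure how often predictions change. The sketch below assumes a fitted scikit-learn-style model and a hypothetical ‘gender’ column:

```python
# Sketch: check whether predictions change when a protected variable is shuffled.
# 'model', the feature frame and the 'gender' column are illustrative assumptions.
import numpy as np

def protected_variable_sensitivity(model, X, column, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.predict(X)
    changed = []
    for _ in range(n_rounds):
        X_shuffled = X.copy()
        X_shuffled[column] = rng.permutation(X_shuffled[column].values)
        changed.append(np.mean(model.predict(X_shuffled) != baseline))
    return float(np.mean(changed))  # close to 0.0 => variable has little influence

# Example (hypothetical): rate = protected_variable_sensitivity(model, X_test, "gender")
```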
If the features you seek are rare, it can be challenging to achieve a balanced data set. You want a model that deals with rare occurrences effectively without being overfit. You may be able to use artificial data, but not when artefacts in the artificial data themselves impact the model. You may also choose to retain some overfitting or underfitting bias and opt for a greater proportion of false positives or false negatives. If you err on the side of false positives, one solution is to have a human check the results. The bias you prefer – false positives or false negatives – is likely to depend on your domain. If your system is designed to recognise company logos, missing some classifications may be less problematic than incorrectly identifying others. If it is identifying cancerous cells in a scan, missing some classifications may be much more problematic than erroneously highlighting areas of concern.
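Where you deliberately choose which error to favour, the usual lever is the decision threshold applied to predicted probabilities. The sketch below uses synthetic scores purely for illustration:

```python
# Sketch: shift the decision threshold to favour false positives or false negatives.
# The probabilities and labels below are synthetic, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                              # hypothetical ground truth
probs = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, 1000), 0, 1)   # imperfect scores

def error_profile(threshold):
    y_pred = (probs >= threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return fp, fn

# A low threshold flags more cases: more false positives, fewer false negatives
# (suited to domains where a human checks every flag, e.g. highlighting scan areas).
# A high threshold is the reverse (e.g. logo spotting, where a miss is cheaper).
for threshold in (0.2, 0.5, 0.8):
    print(threshold, error_profile(threshold))
```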
It is critical to ensure that the results of your internal testing are maintained when your model is applied to real-world data. 99% accuracy on an internal test is of little value if accuracy falls to 20% in production. Test early, and frequently, on real-world data. “If you don’t look at real-world data early then you’ll never get something that works in production” (Dr Janet Bastiman, Chief Science Officer, StoryStream).

Before you build your model, put aside a ‘test set’ of data that you can guarantee has never been included in the training of your AI system. Most training routines randomly select a percentage of your data to set aside for testing but, over multiple iterations, the remaining data can become incorporated into your training set. A test set that you are sure has never been used can be reused for every new candidate release. “When we’re looking at images of vehicles, I get the whole company involved. We all go out and take pictures on our phones and save these as our internal test set – so we can be sure they’ve never been in any of the sources we’ve used for training” (Dr Janet Bastiman, Chief Science Officer, StoryStream).

Ensure, further, that your ‘test set’ data does not become stale. It should always be representative of the real-world data you are analysing. Update it regularly, and every time you see ‘edge cases’ or examples that your system misclassifies, add them to the test set to enable improvement.
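One way to guarantee that test records never leak into training across iterations is to split deterministically, for example by hashing a stable record identifier rather than sampling at random on each run. The sketch below assumes a hypothetical ‘record_id’ column:

```python
# Sketch: a deterministic hold-out so the same records are always reserved for
# testing, no matter how often training is re-run. The file name and 'record_id'
# column are illustrative assumptions.
import hashlib
import pandas as pd

def is_test_record(record_id, test_fraction=0.1):
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    return (int(digest, 16) % 1000) < test_fraction * 1000

df = pd.read_csv("labelled_data.csv")
mask = df["record_id"].apply(is_test_record)
test_set, training_pool = df[mask], df[~mask]
# Newly collected edge cases can be routed into test_set the same way,
# keeping it representative without ever touching the training pool.
```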
Data scientists report that managing ‘dirty data’ is the most significant challenge they face (Kaggle). Smaller volumes of relevant, well-labelled data will typically enable better model accuracy than large volumes of poor-quality data. Ideally, your AI team would be gifted data that is exhaustively labelled with 100% accuracy. In reality, data is typically unlabelled, sparsely labelled or labelled incorrectly. Human-labelled data can still be poorly labelled. Data labelling is frequently crowdsourced and undertaken by non-experts. In some contexts, labelling may also be intrinsically subjective. Further, individuals looking at large volumes of data may experience visual saturation – missing elements that are present or seeing artefacts that are not. To mitigate these challenges, companies frequently have data labelled by multiple individuals and take a consensus or average.
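Taking a consensus can be as simple as a majority vote per item, with contested items routed to an expert. The sketch below assumes a hypothetical table with one row per item–annotator pair:

```python
# Sketch: reduce multiple annotators' labels to a consensus label per item.
# The file and column names are illustrative assumptions.
import pandas as pd

labels = pd.read_csv("raw_labels.csv")          # columns: item_id, annotator, label

def consensus(group, min_agreement=0.6):
    counts = group["label"].value_counts(normalize=True)
    top_label, share = counts.index[0], counts.iloc[0]
    # Items without clear agreement are flagged for expert review instead.
    return top_label if share >= min_agreement else "NEEDS_REVIEW"

consensus_labels = labels.groupby("item_id").apply(consensus)
print(consensus_labels.value_counts())
```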
To label data effectively, consider the problem you are solving. ‘Identify the item of clothing in this image’, ‘identify the item of clothing in this image and locate its position’ and ‘extract the item of clothing described in this text’ each require different labelling tools. Depending upon the expertise of your data labelling team, you may need a supporting system to accelerate data labelling and maximise its accuracy. Do you wish to limit the team’s labelling options or provide a free choice? Will they locate words, numbers or objects and should they have a highlighter tool to do so?
Embrace existing AI and data techniques to ease the data labelling process:
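One such technique is model-assisted pre-labelling, in which an existing model proposes labels and confidence scores so that humans confirm or correct rather than label from scratch. The sketch below assumes a generic pre-trained classifier; all names are illustrative:

```python
# Sketch: model-assisted pre-labelling. A pre-trained model proposes labels and
# a confidence score; humans only review low-confidence or disputed items.
# 'pretrained_model' and the data are illustrative assumptions.
import pandas as pd

def pre_label(model, unlabelled, confidence_threshold=0.9):
    probs = model.predict_proba(unlabelled)           # scikit-learn-style API
    proposals = pd.DataFrame({
        "proposed_label": probs.argmax(axis=1),
        "confidence": probs.max(axis=1),
    })
    # High-confidence proposals are accepted provisionally; the rest are queued
    # for human labelling, focusing annotator time where it adds most value.
    proposals["needs_human_review"] = proposals["confidence"] < confidence_threshold
    return proposals

# Example (hypothetical): proposals = pre_label(pretrained_model, unlabelled_features)
```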
It is critical to understand the data you use. A number simply labelled “score” in your database is impractical to use – and may be impossible to use – if you do not know how it was derived. Ensure you capture the human knowledge of how data was gathered, so you can make sound downstream decisions regarding data use.
Your data strategy should ensure you:
Understanding the context of your data will depend upon process and documentation more than tooling. Without an understanding of the context in which data was collected, you may be missing nuances and introducing unintended bias. If you are predicting sales of a new soft drink, for example, and combine existing customer feedback with data from a survey you commission, you must ensure you understand how the survey was conducted. Does it reflect the views of a random sample, people in the soft drinks aisle, or people selecting similar drinks?
It is important to understand the information not explicitly expressed in the data you use. Documenting this information will improve your understanding of results when you test your models. Investigating data context should prompt your employees to ask questions – and benefit from their differing perspectives. If you lack diversity in your team, you may lack perspectives you need to identify shortcomings in your data collection methodology. Ensure team members deeply understand your company’s domain as well as its data. Without deeper knowledge of your domain, it can be challenging to know what variables to input to your system and results may be impaired. If predicting sales of computer games, for example, it may be important to consider controversy, uniqueness and strength of fan base in addition to conventional variables.
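Capturing that context need not be elaborate. Even a lightweight provenance record stored alongside each dataset helps; the fields in the sketch below are illustrative rather than an exhaustive standard, and the soft-drink example mirrors the survey scenario above:

```python
# Sketch: a lightweight provenance record saved alongside each dataset so the
# human knowledge of how it was gathered is not lost. Fields are illustrative.
from dataclasses import dataclass, asdict
import json

@dataclass
class DatasetProvenance:
    name: str
    collected_by: str
    collection_method: str      # e.g. "in-aisle survey of shoppers, weekends only"
    sample_description: str     # who was sampled, and how they were selected
    known_biases: str           # caveats reviewers should keep in mind
    collected_on: str

record = DatasetProvenance(
    name="soft_drink_survey_2019",
    collected_by="external agency (hypothetical)",
    collection_method="face-to-face survey in the soft drinks aisle",
    sample_description="shoppers already selecting similar drinks, not a random sample",
    known_biases="over-represents existing category buyers",
    collected_on="2019-06-01",
)

with open("soft_drink_survey_2019.provenance.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```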
Your data storage strategy will impact the usability and performance of your data. The nature of your data, its rate of growth and accessibility requirements should inform your approach.
Types of storage include basic file-based storage, relational databases and NoSQL (non-relational) databases:
The store you select will influence the performance and scalability of your system. Consider mixing and matching to meet your needs – for example, a relational database of individuals with sensitive information linking to data stored in a more accessible NoSQL database. The specific configuration you choose should depend upon the data types you will store and how you intend to interrogate your data.
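A minimal sketch of that mix-and-match pattern is shown below, with SQLite standing in for the relational store and a plain JSON document standing in for a NoSQL document store; names and fields are illustrative:

```python
# Sketch: sensitive personal fields in a relational store, bulkier activity data
# in a document store, linked by a shared key. sqlite3 stands in for the
# relational database; a plain JSON document stands in for a NoSQL store here.
import json
import sqlite3

conn = sqlite3.connect("people.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS person (person_id TEXT PRIMARY KEY, full_name TEXT, email TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO person VALUES (?, ?, ?)",
    ("p-001", "Jane Example", "jane@example.com"),   # tightly controlled; encrypted in practice
)
conn.commit()

# Less sensitive, schema-light data lives in the document store, keyed by person_id.
activity_document = {
    "person_id": "p-001",
    "events": [{"type": "page_view", "path": "/pricing"}, {"type": "signup"}],
}
with open("activity_p-001.json", "w") as f:
    json.dump(activity_document, f)

# Analysts join the two only when strictly necessary, via person_id.
```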
One in three data scientists report that access to data is a primary inhibitor of productivity (Kaggle). Data provisioning – making data accessible to employees who need it in an orderly and secure fashion – should be a key component of your data strategy. While best practices vary according to circumstance, consider:
Stale data can be a significant challenge and is a key consideration when planning your provisioning strategy. If you are analysing rapidly changing information, decide how much historical data is relevant. You might include all data, a specific volume of data points, or data from a moving window of time. Select an approach appropriate for the problem you are solving. Your strategy may evolve as your solution matures.
If you are correlating actions to time, consider carefully the window for your time series. If you are predicting stock levels, a few months of data will fail to capture seasonal variation. Conversely, if you are attempting to predict whether an individual’s vital signs are deteriorating, to enable rapid intervention, their blood pressure last month is likely to be less relevant. Understand whether periodic effects can impact your system and ensure that your models and predictions are based on several cycles of the typical period you are modelling. Pragmatically, ensure your access scripts consider the recency of data you require, to minimise ongoing effort.
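A moving window is straightforward to express once timestamps are parsed. The sketch below assumes a hypothetical sales file and keeps several cycles of an annual, seasonal period:

```python
# Sketch: restrict analysis to a moving window covering several cycles of the
# periodicity being modelled. File, column names and the window are assumptions.
import pandas as pd

events = pd.read_csv("sales_events.csv", parse_dates=["timestamp"])

# For seasonal stock-level effects, keep at least a few full years;
# for fast-moving vital-sign data, the window might be hours instead.
window = pd.Timedelta(days=3 * 365)
cutoff = events["timestamp"].max() - window
recent = events[events["timestamp"] >= cutoff]

print(f"Kept {len(recent)} of {len(events)} records from the last {window.days} days")
```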
Data management and security are critical components of a data strategy. Personal data is protected by UK and EU law and you must store it securely.
You may need to encrypt data at rest, as well as when transmitting data between systems. It may be beneficial to separate personal data from your primary data store, so you can apply a higher level of security to it without impacting your team’s access to other data. Note, however, that personal data included in your models, or the inference of protected data through your systems, will fall under data protection legislation.
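As one illustration of encryption at rest for smaller exports, the widely used cryptography package’s Fernet recipe can encrypt a file symmetrically. The sketch below deliberately simplifies key handling; in practice the key belongs in a secrets manager, never beside the data:

```python
# Sketch: symmetric encryption of a data export at rest using the cryptography
# package's Fernet recipe. Key storage is simplified for illustration only; in
# practice the key belongs in a secrets manager or KMS, never beside the data.
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # store this securely, separately from the data
fernet = Fernet(key)

with open("personal_data_export.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("personal_data_export.csv.enc", "wb") as f:
    f.write(ciphertext)

# Later, an authorised process with access to the key can recover the data:
# plaintext = Fernet(key).decrypt(ciphertext)
```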
Establish effective data management by building upon the principles of appropriate storage and minimum required access.
Additionally:
If an individual resigns, or has their employment terminated, immediately revoke access to all sensitive systems including your data. Ensure that employees who leave cannot retain a copy of your data. Data scientists are more likely to try to retain data to finish a problem on which they have been working, or because of their affinity for the data, than for industrial espionage. Neither is an appropriate reason, however, and both contravene data protection law. Ensure your team is aware of the law and that you have appropriate policies in place.