- What bias means in the field of ML
- How it is naturally addressed in the theoretical mathematics that help ensure ML models can *generalize*—learn, rather than simply memorize

*The danger doesn’t lie in the existence of bias. The potential danger lies in hidden bias.*

## Bias is Everywhere. And It’s Not Inherently Negative

The term *bias* carries a rather negative connotation as something that can marginalize certain subpopulations. Retrospectives of various medical studies have revealed many cases of negative bias unfairly affecting patients. As the medical field and our society in general become increasingly aware of these biases, more attention has been given to controlling and accounting for them. From such a mindset, extreme caution regarding the use of any complex system in healthcare is advisable. Sometimes bias is defined as a factor that leads to a conclusion other than the truth, or even as an *error*. But that’s misleading in the context of machine learning applied to healthcare. Bias can just as easily lead an ML solution toward the truth as away from it.

Biases are not inherently bad; in fact, they are impossible to escape. The primary concern is often that hidden biases in data—whether in a scientific experiment, a medical study, or an intelligent system—will disadvantage a portion of a patient population, leading to worse healthcare outcomes for those groups. Every decision tool, and indeed every decision made by a human being, has biases encoded in it. Therefore, rather than eliminate biases, we have to understand them and find ones that are more universally acceptable, almost like agreeing on a set of axioms in a mathematical proof. From that principled perspective, it is important to understand how bias is considered in the development of ML systems and how it is encoded in ML models. Owing to the critical role bias plays in creating a successful model, machine learning is one of the few scientific disciplines most directly focused on understanding bias and controlling for it. Why is this the case? It’s complicated. So, let me walk through a few key points.

#### 1. It is important to define the term *generalization* as it relates to machine learning.

Similar to its everyday definition, in ML generalization occurs when, based on limited knowledge, we infer concepts that apply more broadly, such that we can then understand and predict phenomena where our knowledge is incomplete. For example, if we look at pictures of ants in a textbook, we can learn to correctly identify the insect when we see it in our garden. We correctly classify the insect as an “ant,” even though we’ve never seen that particular ant in that exact context before.

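The ant example can be sketched in a few lines of code. This is a minimal illustration, not a real system: the 2-D points and the nearest-centroid rule are invented here purely to show a model labeling inputs it never saw during training.

```python
import numpy as np

# Toy training data (invented for illustration): two classes of 2-D points.
train_X = np.array([[0.0, 0.2], [0.3, 0.1], [0.1, 0.4],   # class 0 ("ants in the textbook")
                    [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]])  # class 1
train_y = np.array([0, 0, 0, 1, 1, 1])

# "Learning" here is just summarizing each class by its centroid.
centroids = np.array([train_X[train_y == c].mean(axis=0) for c in (0, 1)])

def predict(points):
    """Label each point by its nearest class centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Points the model has never seen; it generalizes from its limited examples.
test_X = np.array([[0.5, 0.5], [4.5, 4.5]])
print(predict(test_X))  # -> [0 1]
```

The model never memorized the test points; it inferred a broader concept (each class occupies a region of space) from a handful of examples, which is generalization in miniature.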
#### 2. Consider that machine learning’s theoretical core principles are focused on optimizing *generalization performance*.

Therefore, simply based on the idea of generalization, we can see that there is going to be bias involved in any complex ML task because we are forced to build a model from incomplete knowledge.
The same is true of any complex mental model that a human being constructs, because we are forced to make inferences in an uncertain world. Racism and other forms of generalization that disenfranchise subpopulations have led to *bias* having a bad connotation in the vernacular. And although it is well understood that generalizations are often based on real data, bias in data and in mathematical models can lead to incorrect results and bad outcomes.

#### 3. It is important to understand the differences between things like pattern recognition and data fitting versus machine learning.

Many scientific disciplines that use statistical or other mathematical forms of data fitting try to find a data model that explains the observed data perfectly or fits the data most tightly. *Machine learning* focuses on explaining the data while simultaneously figuring out ways to ensure this explanation generalizes well to examples that the model has never seen before. One reason that machine learning has exploded in the past couple of decades is that it is good at handling extremely complex problems. It is good at this because it was designed for decision making under uncertainty.

**How do we find a model that fits the data well while also providing value at explaining the real phenomenon, as opposed to simply being able to make correct predictions on data that it was trained on? Well, the short answer is in the *bias* we encode in the model.**

*ML excels in complex problem spaces given finite amounts of data, where it is understood that there are often countless models that could perfectly explain the subset of the data that you actually have access to, or at least explain it well.*

#### 4. It is important to understand that there is no universal learner; at least not in terms of how we define a *learner* as a mathematical construct in ML.

More specifically, if we build a machine learning algorithm that can infer a model of the data from a training set, then that algorithm cannot be used to learn every known phenomenon in the universe perfectly based on finite data. This doesn’t necessarily mean that we can’t build general AI, but it might limit the ways in which general AI can be achieved.
We can’t pick one ML model type, set up its learning parameters, and then give it any kind of data expecting it to be successful. The simple act of choosing a model type and setting some of the specifics of the learning algorithm eliminates the ability to find accurate models of some phenomena in the universe.
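This limitation can be shown on a deliberately simple toy problem. In the sketch below (constructed for illustration), a model family restricted to linear functions of the raw inputs cannot represent XOR no matter how it is trained, while adding a single interaction feature—a different choice of bias—lets the very same fitting procedure succeed.

```python
import numpy as np

# The XOR phenomenon: output is 1 exactly when the two inputs differ.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Model family 1: linear in the raw inputs (intercept + x1 + x2).
A_linear = np.column_stack([np.ones(4), X])
w, *_ = np.linalg.lstsq(A_linear, y, rcond=None)
mse_linear = float(np.mean((A_linear @ w - y) ** 2))

# Model family 2: the same least-squares procedure, but with an x1*x2
# interaction feature added -- a different inductive bias.
A_inter = np.column_stack([A_linear, X[:, 0] * X[:, 1]])
w2, *_ = np.linalg.lstsq(A_inter, y, rcond=None)
mse_inter = float(np.mean((A_inter @ w2 - y) ** 2))

print(mse_linear)  # ~0.25: the best linear model just predicts 0.5 everywhere
print(mse_inter)   # ~0: the enlarged family represents XOR exactly
```

Nothing about the fitting algorithm changed between the two families; choosing the hypothesis space decided in advance which phenomena could be learned at all.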
In designing an ML system you need to have in mind how you want your model to function—the specific problem you want to solve—and in what contexts you expect it to perform.
Every ML designer should try to understand the bias in the data in as much detail as possible, but they must also try to understand the assumptions under which the model is expected to operate. If designers insist on a model form that can learn anything well, they cannot expect the learned model to exhibit good generalization performance.
Rather than eliminate bias, the goal is to select the *right* bias, so that the model that is learned is one that doesn’t simply match the data it is given during training, but also reflects the real phenomena under study. In this way, *bias* actually makes it much more likely that the ML system will learn the truth.

## ML as a Window into Bias

In terms of understanding bias in data or in a study, one would hope that such rigor would be applied in every scientific domain, but the consequences of not doing so when designing and applying an ML algorithm are so dramatic that failure in complex real-world tasks is pretty much guaranteed. The only case where that is not true is when using someone else’s design, where they’ve already essentially solved the problem of encoding expertise (which is also *bias*) into the model so that it is capable of learning a *correct* representation from the data, as opposed to just *fitting* the data. In understanding the critical nature of bias in the design of an ML system, one can hopefully see why routine application of ML methods in healthcare applications and studies could lead to a much better understanding of:

- What biases are being assumed
- What biases are contained in the data
- What unknowns are still not accounted for

