Guide to the HUMAN Protocol SDK's Data Quality Features
In a world increasingly driven by data, the question of reliability often takes center stage. The newest additions to the HUMAN Protocol SDK offer robust metrics for assessing data quality. Let's delve into why this is crucial and how these tools can be your game-changer.
Inter Annotator Agreement (IAA) is a pivotal metric for assessing the consistency of annotated data. It measures the agreement level among different annotators when labeling data points.
High IAA score (above 0.8): Signifies consistent labeling and trustworthy data. Annotators are in sync with their evaluations.
Low IAA score (below 0.4): Indicates potential inconsistencies or discrepancies in the annotation process. Data with such scores demands a thorough review.
In a bustling coffee shop, two new drinks are introduced to the menu: the ‘Mocha Marvel’ and the ‘Latte Love’. Three baristas, Alice, Bob, and Carol, are hired to craft these beverages.
To assess their precision and understanding of the new drinks, a small experiment is conducted. Four drinks, a mix of both the 'Mocha Marvel' and 'Latte Love', are prepared and placed on the counter. The task assigned to the trio is straightforward: Identify the type of each drink.
Each barista labeled all four drinks. Of the twelve answers given, five were 'Mocha Marvel', six were 'Latte Love', and Bob answered 'I don't know' once.
To gauge the consistency and accuracy of the baristas' labeling, a tally is compiled of the instances where they concurred on the drink types.
Upon analyzing the results, it turns out the baristas reached a consensus on the drink labels 70% of the time. While this isn't flawless, it's a commendable beginning for a coffee establishment. An interesting observation is Bob's 'I don't know' answer, which might indicate a lack of engagement, potentially affecting the overall agreement percentage.
Drawing a parallel, a 70% consensus or a score of 0.7 might be satisfactory in a coffee context. However, when this scenario is juxtaposed with more critical domains—such as determining medical dosages, executing surgical procedures, or overseeing nuclear reactor functionalities—a 70% consensus would spell disaster. In these high-stakes environments, striving for a score of 0.8 or even higher becomes imperative to guarantee precision and safety.
Yet, there's a twist in the tale. Even a seemingly high consensus rate can sometimes be deceptive.
Imagine if the baristas, instead of genuinely discerning the flavors, resorted to a coin toss to label the drinks as ‘Mocha Marvel’ or ‘Latte Love’. Statistically, they would agree about 50% of the time. But what if one of the drinks is a seasonal special and much rarer? The problem becomes even more pronounced.
Revisiting the coin toss analogy, the complexity escalates when probabilities are imbalanced.
Consider the realm of email filtering, where a staggering 99% of incoming mails are spam. If the filters indiscriminately label 99% of messages as spam at random, they would still agree on about 0.99² + 0.01² ≈ 98% of them.
In such a context, even a random selector could mirror this high agreement rate. This doesn't necessarily vouch for the efficiency of the filters; it might merely indicate their collective inefficiency.
This is where chance-corrected agreement measures come in. These measures account for the likelihood of random agreement, giving you a more accurate picture of how well your annotators (or, in our example, email filters) are performing. It's like having a built-in lie detector for your data.
But what data does it rely on? Well, let’s find out.
Now that we've established why chance-corrected measures are the gold standard, let's delve into one of the most commonly used ones: Fleiss' Kappa.
Before we break down the score, let's break down the formula used to calculate Fleiss' Kappa:
κ = (Agreement_observed − Agreement_expected) / (1 − Agreement_expected)
Agreement_observed represents the percentage of agreement among the annotators.
Agreement_expected is calculated based on the distribution of the assigned labels.
In simpler terms, Agreement_observed is what actually happened, while Agreement_expected is what we would see if everyone were guessing at random.
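Translated into code, the formula is a one-liner. The helper below is just an illustrative sketch, not the SDK's implementation:

```python
def chance_corrected_agreement(observed: float, expected: float) -> float:
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    return (observed - expected) / (1 - expected)

# If annotators agree 70% of the time but 50% agreement is expected by chance:
print(round(chance_corrected_agreement(0.7, 0.5), 2))  # 0.4
```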
Now that we've demystified the formula, let's dive into what these numbers actually mean for us.
Recall our trio of diligent baristas, Alice, Bob, and Carol? Envision Fleiss' Kappa as the metric gauging their coffee-making prowess.
The score can range from -1 to 1:
A score of 1 is like every cup of coffee being perfect—just the right blend of beans, milk, and love. The customers are happy, and Alice, Bob and Carol are in sync.
A score of -1 is like a disastrous day at the café: orders are wrong, glasses are broken, and the coffee is either too hot or too cold.
A score of 0 is like the baristas randomly pressing buttons on the coffee machine—the outcome is as unpredictable as a roll of dice.
A score that dips below 0.4 is typically an alarm bell. Drawing a parallel to a Yelp review for a café, a score under 40% is akin to a cautionary tale. In such scenarios, it might be prudent to contemplate refining the skills of Alice, Bob, and Carol. The industry standard, as suggested by Krippendorff in 2013, is that reasonably good levels of agreement start at a value of 0.8.
If the owner is hitting those numbers, he can be pretty confident that Alice, Bob, and Carol are not only making great coffee but are also in tune with each other's work.
It is now time to test the baristas and see how they’re doing. Recall the annotation table where Alice, Bob, and Carol labeled four different drinks? If we sum up all their labels (excluding the 'I don’t know'), we get a total of 11 annotations:
5 as ‘Mocha Marvel’
6 as ‘Latte Love’
Given these numbers, the probability of assigning 'Mocha Marvel' is p(Mocha Marvel) = 5/11, or approximately 45.5%. Similarly, the probability for 'Latte Love' is p(Latte Love) = 6/11, or approximately 54.5%.
Now, let's consider the likelihood that two labels would match purely by chance: that is, the probability that two baristas, labeling a drink at random, would happen to pick the same label.
The expected agreement is Agreement_expected = p(Mocha Marvel)² + p(Latte Love)², which in our case is (5/11)² + (6/11)² ≈ 50.4%, or 50% for simplicity.
So, our Fleiss' Kappa score for this annotation would be:
κ = (0.7 − 0.5) / (1 − 0.5) = 0.4
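For readers who want to verify the arithmetic, this short Python sketch reproduces the numbers above; the 70% observed agreement is the consensus rate from the barista tally:

```python
# Label counts from the barista experiment (excluding the single "I don't know").
n_mocha, n_latte = 5, 6
total = n_mocha + n_latte  # 11 annotations

p_mocha = n_mocha / total  # ≈ 0.455
p_latte = n_latte / total  # ≈ 0.545

# Probability that two randomly assigned labels would match by chance.
agreement_expected = p_mocha**2 + p_latte**2  # ≈ 0.504

agreement_observed = 0.7  # the 70% consensus from the tally earlier

kappa = (agreement_observed - agreement_expected) / (1 - agreement_expected)
print(round(agreement_expected, 3), round(kappa, 3))  # 0.504 0.395
```

Rounding the expected agreement to 0.5, as above, gives the reported score of 0.4; keeping the exact value yields roughly 0.395. Either way, the conclusion is the same.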
This value is lower than the agreement percentage and falls well below the acceptable levels. In other words, chance-corrected agreements like Fleiss’ Kappa are more conservative than simple percentage agreements.
Having understood the grounding of Fleiss' Kappa in assessing data quality, one might be curious about forecasting the quality for an ongoing annotation task. This is where the concept of Confidence Intervals steps in!
Confidence Intervals serve as a predictive tool, much akin to a weather forecast. Just as a 90% rain prediction signals the advisability of an umbrella, Confidence Intervals provide insights into the probable performance of baristas as they progress in their roles.
Choose a confidence level and number of bootstrap datasets: The initial step involves determining the desired confidence level. Typically, this percentage ranges from 90% to 99.9%, contingent on the task's criticality.
Generate bootstrapped datasets: The original data undergoes resampling to produce new datasets. Each resampled set then has its agreement score calculated. Increasing the number of iterations improves the accuracy of the confidence interval, but it also requires more processing time. A general guideline suggests that around 1000 iterations strike a good balance between precision and computational efficiency.
Discard outliers: Depending on the predetermined confidence level, the highest and lowest x% of values are discarded. The values that remain constitute the Confidence Interval.
Now, let’s move on to the nitty-gritty details of the algorithm.
A week has elapsed since Alice, Bob, and Carol embarked on their café journey, and a wealth of data regarding their performance has been amassed. To forecast their future performance, the algorithm is employed in the following manner:
Data collection: The algorithm initiates by collating all the coffee orders processed by Alice, Bob, and Carol throughout the week. In this dataset, every individual order corresponds to a single row.
Data resampling: The original dataset undergoes a resampling process. This entails the creation of a new dataset of equivalent size, populated with orders drawn at random, with replacement, from the original set.
Agreement score calculation: Each of the newly formed datasets undergoes an evaluation to determine the agreement score. This score represents the frequency with which Alice, Bob, and Carol reached a consensus on the quality of each coffee order.
By doing this multiple times, the algorithm generates a range of possible agreement scores. This is like simulating different weeks of café operation to see how well Alice, Bob, and Carol might agree in different scenarios.
After sorting these scores, an easy way to construct the confidence interval is to discard a fixed fraction of the highest and lowest values, since those are probably outliers. The fraction to discard follows from the chosen confidence level, α = 1 − ConfidenceLevel, and the cut-off quantiles are:
l = α/2, h = 1 − α/2
The remaining list's extremities, the lowest and highest values, define the Confidence Interval. This interval offers a dependable spectrum indicating the probable future agreement consistency among Alice, Bob, and Carol. Thus, a robust statistical framework is established for the café's operations.
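Put together, the whole procedure fits in a couple of dozen lines. The sketch below is purely illustrative: it uses a simplistic all-annotators-agree score as a stand-in for whatever agreement measure you actually use, and the week of café orders is hypothetical.

```python
import random

def full_agreement_rate(labels_per_item):
    """Fraction of items on which all annotators gave the same label.
    A simplistic stand-in for a real agreement measure such as Fleiss' Kappa."""
    agreed = sum(1 for labels in labels_per_item if len(set(labels)) == 1)
    return agreed / len(labels_per_item)

def bootstrap_confidence_interval(labels_per_item, confidence=0.95, n_iterations=1000):
    """Percentile bootstrap confidence interval for an agreement score."""
    n = len(labels_per_item)
    scores = []
    for _ in range(n_iterations):
        # Resample the orders with replacement to build a dataset of the same size.
        resampled = [random.choice(labels_per_item) for _ in range(n)]
        scores.append(full_agreement_rate(resampled))
    scores.sort()
    alpha = 1 - confidence  # fraction of extreme values to discard
    low = scores[int(alpha / 2 * n_iterations)]
    high = scores[int((1 - alpha / 2) * n_iterations) - 1]
    return low, high

# Hypothetical week of orders: each inner list holds Alice's, Bob's, and Carol's label.
orders = [
    ["Mocha Marvel", "Mocha Marvel", "Mocha Marvel"],
    ["Latte Love",   "Mocha Marvel", "Latte Love"],
    ["Latte Love",   "Latte Love",   "Latte Love"],
    ["Mocha Marvel", "Mocha Marvel", "Latte Love"],
] * 10  # pretend they handled 40 orders this week

print(bootstrap_confidence_interval(orders, confidence=0.9))
```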
With a clear understanding of quality assurance in the café setting—or any annotation endeavor—the stage is set to unveil updates that promise to simplify this procedure for developers.
The latest update to the HUMAN Protocol SDK brings a game-changer for oracle developers: Inter Annotator Agreement (IAA) measures and Bootstrapped Confidence Intervals. With these tools built into the SDK, developers can easily add data quality assessments to their Oracles, ensuring that Job Requesters receive datasets that meet their quality standards.
Let’s see how to achieve this with the SDK.
The agreement module is currently an optional extra of the HUMAN Protocol SDK and can be installed as follows:
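With a standard pip setup, the install typically looks like this (check the SDK documentation for the exact package and extra names for your version):

```
pip install "human-protocol-sdk[agreement]"
```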
After that, we can calculate the agreement using a simple function call:
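Here is a sketch of what that call can look like. The module path and the "percentage" method name are assumptions for illustration, and the annotation matrix is hypothetical (its label counts simply mirror the barista example: five 'Mocha Marvel', six 'Latte Love', and one missing answer); consult the SDK reference for the exact signature.

```python
from human_protocol_sdk.agreement import agreement  # assumed module path

# Rows are items (drinks), columns are annotators (Alice, Bob, Carol).
# None marks a missing answer ("I don't know").
annotations = [
    ["Mocha Marvel", "Mocha Marvel", "Mocha Marvel"],
    ["Latte Love",   None,           "Latte Love"],
    ["Latte Love",   "Latte Love",   "Mocha Marvel"],
    ["Mocha Marvel", "Latte Love",   "Latte Love"],
]

result = agreement(annotations, method="percentage")  # method name assumed
print(result)
```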
The agreement function takes care of all the necessary data transformation and underlying function calls. To switch to Fleiss' Kappa in our code example, all we need to do is change the method parameter.
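Assuming Fleiss' Kappa is selected by a string identifier such as "fleiss_kappa" (check the SDK docs for the exact spelling), the switch is a one-word change:

```python
result = agreement(annotations, method="fleiss_kappa")
```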
Now let’s continue with the confidence interval:
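The parameter names below (bootstrap_method and bootstrap_kwargs) are assumptions for illustration rather than a confirmed signature; the idea is simply that the same call can also return a bootstrapped interval around the score:

```python
result = agreement(
    annotations,
    method="fleiss_kappa",
    bootstrap_method="percentile",  # hypothetical parameter name
    bootstrap_kwargs={"confidence_level": 0.9, "n_iterations": 1000},  # hypothetical
)
print(result)  # expected to include the score plus a confidence interval
```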
And that’s all!
With the latest update to the HUMAN Protocol SDK, assessing the reliability of your annotations has never been easier. These new Data Quality features are now an integral part of the SDK, allowing you to keep a close eye on ongoing annotation projects.
Currently, the SDK offers a limited but powerful set of measures, including Cohen's Kappa, Krippendorff's Alpha, and Sigma, but this is just the beginning. More robust measures are in the pipeline. Stay tuned for future updates, and go ahead and elevate your data game. After all, quality is not just a goal; it's a standard.
Legal Disclaimer
The HUMAN Protocol Foundation makes no representation, warranty, or undertaking, express or implied, as to the accuracy, reliability, completeness, or reasonableness of the information contained here. Any assumptions, opinions, and estimations expressed constitute the HUMAN Protocol Foundation’s judgment as of the time of publishing and are subject to change without notice. Any projection contained within the information presented here is based on a number of assumptions, and there can be no guarantee that any projected outcomes will be achieved.