This tutorial is a basic guide to data labeling: we build simple labeling functions that classify YouTube comments as ham or spam. It also shows how to integrate MLflow to keep track of the models.
By default, you can load the dataset directly by cloning Snorkel Tutorials and writing: from utils import load_spam_dataset
That function downloads the raw CSV files from the internet, divides them into splits, converts them into DataFrames, and shuffles them. The dataset contains comments from 5 of the most popular YouTube videos during a period between 2014 and 2015. In my case this did not work because of the Linux filesystem, so I wrote a function that does the same job. Download the dataset from here
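import glob

import pandas as pd
from sklearn.utils import shuffle  # shuffle() was used below but never imported


def data_loader():
    # Replace with the directory that holds the downloaded CSV files.
    path = r'directory of dataset'
    all_files = glob.glob(path + "/*.csv")
    li = []
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
    # Stack the per-video files into one frame and shuffle the rows.
    frame = pd.concat(li, axis=0, ignore_index=True)
    frame = shuffle(frame)
    # Training split: the last column (the label) is dropped, so it is unlabeled.
    train = frame.iloc[0:1369, :-1]
    # Labeled dev split; these rows overlap the training range.
    dev = frame.iloc[500:900]
    validation = frame.iloc[1370:1663]
    test = frame.iloc[1664:]
    return test, train, dev, validation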
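The splits are as follows: train is the unlabeled training set (its label column is dropped), dev is a labeled split used to develop and inspect labeling functions, validation is used for model comparison and early stopping, and test is held out for the final evaluation.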
#For clarity, we define constants to represent the class labels for spam, ham, and abstaining.
ABSTAIN = -1
HAM = 0
SPAM = 1
Introduction to Labeling Functions

Labeling functions (LFs) help users encode domain knowledge and other supervision sources programmatically. LFs are heuristics that take a data point as input and either assign a label to it (in this case, HAM or SPAM) or abstain (assign no label). Labeling functions can be noisy: they don't have perfect accuracy and don't have to label every data point. Moreover, different labeling functions can overlap (label the same data point) and even conflict (assign different labels to the same data point).

There are many ways to write labeling functions; two are shown here. The first is the @labeling_function decorator, which can be applied to any Python function that returns a label for a single data point. Here is one example:
from snorkel.labeling import labeling_function

@labeling_function()
def check(x):
    # The raw CSV files loaded above store the comment text in the CONTENT column.
    return SPAM if "check" in x.CONTENT.lower() else ABSTAIN
This function checks whether the comment contains that particular word. Usually you define several such keywords. Instead of writing a near-identical function for each one, we build a small factory around the LabelingFunction class that takes a list of keywords, so we don't have to write a function by hand for every keyword.
from snorkel.labeling import LabelingFunction

# Note: this uses the LabelingFunction class, not the @labeling_function decorator.
def keyword_lookup(x, keywords, label):
    if any(word in x.CONTENT.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

# Spam comments talk about 'my channel', 'my video', etc.
keyword_my = make_keyword_lf(keywords=["my"])
# Spam comments ask users to subscribe to their channels.
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])
# Spam comments post links to other channels.
keyword_link = make_keyword_lf(keywords=["http"])
# Spam comments make requests rather than commenting.
keyword_please = make_keyword_lf(keywords=["please", "plz"])
# Ham comments actually talk about the video's content.
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)
Here is a heuristic labeling function:
@labeling_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return HAM if len(x.CONTENT.split()) < 5 else ABSTAIN
We also use MLflow to log these labeling functions as run parameters.
import mlflow

# Calls to mlflow.log_metric made after this block closes start a new run automatically.
with mlflow.start_run():
    mlflow.log_param("keywords LF", "my, subscribe, link, please, song")
    mlflow.log_param("heuristic LF", "short comment")
from snorkel.labeling import PandasLFApplier
lfs = [
    keyword_my,
    keyword_subscribe,
    keyword_link,
    keyword_please,
    keyword_song,
    short_comment,
]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train)
L_dev = applier.apply(df=dev)
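To make the result of applying LFs concrete, here is a minimal plain-Python sketch of the idea, without Snorkel. The two LFs, the sample comments, and the helper apply_lfs are hypothetical and only illustrate the shape of a label matrix: one row per comment, one column per LF.

```python
# Plain-Python sketch of what an LF applier produces; not Snorkel's implementation.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_subscribe(text):
    # Hypothetical keyword LF: vote SPAM when "subscribe" appears.
    return SPAM if "subscribe" in text.lower() else ABSTAIN

def lf_short(text):
    # Hypothetical heuristic LF: vote HAM for comments under five words.
    return HAM if len(text.split()) < 5 else ABSTAIN

def apply_lfs(lfs, texts):
    # One row per comment, one column per LF.
    return [[lf(text) for lf in lfs] for text in texts]

comments = [
    "Please subscribe to my channel for more videos like this",
    "cool video!",
    "subscribe now",
    "I have watched this one more times than I can count",
]
label_matrix = apply_lfs([lf_subscribe, lf_short], comments)
for row in label_matrix:
    print(row)
```

The third comment receives SPAM from one LF and HAM from the other (a conflict), while the last comment is left unlabeled by both.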
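Snorkel provides LFAnalysis, a tool that computes common statistics for these LFs, such as coverage (the fraction of the dataset that an LF labels). Apart from that, the summary reports each LF's overlaps and conflicts with other LFs and, when gold labels are available, the number of correct and incorrect labels and the empirical accuracy. One example is shown here: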
from snorkel.labeling import LFAnalysis
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
| LF | J | COVERAGE | OVERLAPS | CONFLICTS | CORRECT | INCORRECT | EMP. ACC.|
| -------------:|:-:|----------:|----------:|----------:|--------:|----------:|---------:|
| CHECK_OUT | 0 | 0.22 | 0.22 | 0.1 | 22 | 0 |1.000000 |
| CHECK | 1 | 0.30 | 0.22 | 0.1 | 29 | 1 |0.966667 |
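As a rough illustration of what the coverage, overlaps, and conflicts columns measure, the sketch below computes them for one LF from a small hand-written label matrix. The function lf_stats and the matrix are hypothetical; Snorkel computes these statistics itself, and this is only a plain-Python approximation of the definitions.

```python
ABSTAIN = -1

def lf_stats(label_matrix, j):
    """Coverage, overlap and conflict fractions for the LF in column j."""
    n = len(label_matrix)
    covered = overlaps = conflicts = 0
    for row in label_matrix:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Votes that the other LFs cast on the same data point.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n

# Hypothetical label matrix: 4 comments x 2 LFs.
L = [[1, -1], [-1, 0], [1, 0], [-1, -1]]
coverage, overlap, conflict = lf_stats(L, 0)
print(coverage, overlap, conflict)
```

Here the first LF labels two of four comments, and on one of them the second LF also votes, with a different label.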
To further improve our labeling functions, Snorkel provides a helper method, get_label_buckets(...), that groups data points by their predicted label and true label. For example, we can find the indices of data points that the LF labeled SPAM but that actually belong to class HAM. This may give ideas for where the LF could be made more specific. (Note that we can only do this if a gold-labeled dev set is available!)
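from snorkel.analysis import get_label_buckets
# Gold labels of the dev split (the CLASS column of the frame returned by data_loader).
Y_dev = dev.CLASS.values
# Column 1 of L_dev holds the votes of the second LF in lfs.
buckets = get_label_buckets(Y_dev, L_dev[:, 1])
# Rows whose gold label is HAM but which the LF labeled SPAM.
dev.iloc[buckets[(HAM, SPAM)]]
# The Snorkel tutorial performs an additional test here: it builds buckets for the training set
# to check whether the intuition was correct. In the real world, that is not possible most of the time!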
| Author | Date | Text | Label | Video |
| --- | --- | --- | --- | --- |
| Eanna Cusack | 2014-01-20T22:20:59 | Im just to check how much views it has | 0 | 1 |
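The grouping behind label buckets can be sketched in a few lines of plain Python. The helper label_buckets and the sample labels are hypothetical stand-ins, not Snorkel's implementation; the point is only that each (gold, predicted) pair maps to the row indices that share it.

```python
from collections import defaultdict

HAM, SPAM, ABSTAIN = 0, 1, -1

def label_buckets(gold, predicted):
    """Map each (gold, predicted) pair to the row indices that have it."""
    buckets = defaultdict(list)
    for i, pair in enumerate(zip(gold, predicted)):
        buckets[pair].append(i)
    return dict(buckets)

gold = [HAM, SPAM, HAM, SPAM]            # hypothetical dev labels
votes = [SPAM, SPAM, ABSTAIN, ABSTAIN]   # hypothetical votes from one LF
buckets = label_buckets(gold, votes)
# Indices of ham comments that the LF wrongly labeled as spam.
print(buckets[(HAM, SPAM)])
```

Inspecting the comments at those indices is what suggests how to make an LF more specific.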
Other, more complex ways to write labeling functions include (but are not limited to) regular expressions, preprocessors (for example TextBlob), and more complex NLP preprocessors.
Next, we combine the outputs of the labeling functions into a single noise-aware (ideally confidence-weighted) label per data point: we take into account what every labeling function votes and try to reduce the noise as much as possible. A simple baseline is a majority vote per data point: if more LFs voted SPAM than HAM, label it SPAM (and vice versa).
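from snorkel.labeling import MajorityLabelVoter
# Instantiate the voter and apply the LFs to the validation split (both steps were missing).
majority_model = MajorityLabelVoter()
L_valid = applier.apply(df=validation)
majority_acc = majority_model.score(L=L_valid, Y=validation.iloc[:, -1].values)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
mlflow.log_metric("Majority_Vote_Acc", majority_acc * 100)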
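The majority-vote rule described above can be sketched without Snorkel as follows. The function majority_vote and the small label matrix are hypothetical, and abstaining on ties is a choice made for this sketch.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN on a tie or no votes."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# Hypothetical label matrix: 4 comments x 3 LFs.
L = [
    [SPAM, SPAM, HAM],
    [HAM, ABSTAIN, ABSTAIN],
    [SPAM, HAM, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
preds = [majority_vote(row) for row in L]
print(preds)
```

Every LF counts equally in this rule, regardless of how accurate it is.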
Another way to combine the labeling function outputs is the LabelModel. The LabelModel can learn weights for the labeling functions using only the label matrix as input. Note that no gold labels are used during the training process. The only information we need is the label matrix, which contains the output of the LFs on our training set.
The output of the Snorkel LabelModel is just a set of labels which can be used with most popular libraries for performing supervised learning, such as TensorFlow, Keras, PyTorch, Scikit-Learn, Ludwig, and XGBoost. In this tutorial, we demonstrate using classifiers from Keras and Scikit-Learn.
For simplicity and speed, we use a simple "bag of n-grams" feature representation: each data point is represented by a sparse vector recording which words or 2-word combinations are present in the comment text (CountVectorizer stores occurrence counts by default).
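from sklearn.feature_extraction.text import CountVectorizer
# df_train_filtered is assumed to come from the LabelModel step (not shown here):
# the training rows on which at least one LF voted.
vectorizer = CountVectorizer(ngram_range=(1, 2))
# The frames returned by data_loader keep the comment text in the CONTENT column.
X_train = vectorizer.fit_transform(df_train_filtered.CONTENT.tolist())
X_dev = vectorizer.transform(dev.CONTENT.tolist())
X_valid = vectorizer.transform(validation.CONTENT.tolist())
X_test = vectorizer.transform(test.CONTENT.tolist())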
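To show what a bag of unigrams and bigrams looks like, here is a small plain-Python sketch. It is a simplification under stated assumptions: the helpers ngrams, fit_vocabulary, and transform are hypothetical, and the sketch splits on whitespace, whereas CountVectorizer applies its own tokenizer and returns a sparse matrix.

```python
def ngrams(text, n_max=2):
    """Lower-cased n-grams from unigrams up to n_max-grams, split on whitespace."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def fit_vocabulary(texts):
    """Assign a column index to every n-gram seen in the training texts."""
    vocab = sorted({g for t in texts for g in ngrams(t)})
    return {g: i for i, g in enumerate(vocab)}

def transform(text, vocab):
    """Count each known n-gram; n-grams outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for g in ngrams(text):
        if g in vocab:
            vec[vocab[g]] += 1
    return vec

vocab = fit_vocabulary(["check out my channel", "love this song"])
features = transform("love my channel", vocab)
print(len(vocab), features)
```

As in the code above, the vocabulary is fitted on the training texts only, so an unseen bigram such as "love my" contributes nothing to the feature vector.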
We'll use Keras, a popular high-level API for building models in TensorFlow, to build a simple logistic regression classifier. We compile it with a categorical_crossentropy loss so that it can handle probabilistic labels instead of integer labels. We use the common settings of an Adam optimizer and early stopping (evaluating the model on the validation set after each epoch and reloading the weights from when it achieved the best score). For more information on Keras, see the Keras documentation.
from snorkel.analysis import metric_score
from snorkel.utils import preds_to_probs
from utils import get_keras_logreg, get_keras_early_stopping

# Gold labels for validation and test come from the CLASS column of the loader's frames.
Y_valid = validation.CLASS.values
Y_test = test.CLASS.values

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])
# probs_train_filtered: probabilistic training labels, assumed to come from the LabelModel step.
keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
preds_test = keras_model.predict(x=X_test).argmax(axis=1)
test_acc = metric_score(golds=Y_test, preds=preds_test, metric="accuracy")
print(f"Test Accuracy: {test_acc * 100:.1f}%")
mlflow.log_metric("test_accuracy", test_acc * 100)
# Save this Keras model with MLflow
import mlflow.keras
mlflow.keras.save_model(keras_model, path="runs")
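Why a categorical cross-entropy loss fits probabilistic labels can be seen in a small plain-Python sketch. The helper preds_to_probs_sketch is a hypothetical stand-in that one-hot encodes integer labels, and the probabilities are made-up illustration values, not results from this tutorial.

```python
import math

def preds_to_probs_sketch(labels, num_classes):
    """One-hot encode integer labels, e.g. 1 -> [0.0, 1.0] for two classes."""
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]

def cross_entropy(target, predicted):
    """Categorical cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

predicted = [0.2, 0.8]                      # hypothetical model output: P(HAM), P(SPAM)
hard = preds_to_probs_sketch([1], 2)[0]     # gold label SPAM written as a distribution
soft = [0.3, 0.7]                           # hypothetical probabilistic training label

print(cross_entropy(hard, predicted))
print(cross_entropy(soft, predicted))
```

Because the loss is defined on distributions, the same function handles both one-hot gold labels (as used for the validation set) and the soft labels used for training.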
NOTE: This tutorial is adapted from the official Snorkel documentation and tutorials, and parts of it are used directly. See more here