On the evening of November 15, 2021, I eagerly sat at my computer, counting down the seconds until 7pm. The moment the clock struck, I hit refresh, and my heart skipped a beat as I saw my name in 17th place on the leaderboard. I had won a silver medal in a competition with over 900 teams! I was over the moon with excitement, and the result earned me the rank of Kaggle Expert in the competitions category.
I’m sure this is a common experience for Kagglers when they do well in a competition for the first time. This also might be a description of the moment when Kagglers officially become addicted. And as far as addictions go, this one is at least educational, but I’m sure many spouses and partners have complained about how much time Kaggle consumes.
It’s been a couple of months since the competition ended, and I’ve come down from my high enough to reflect on what I learned from the competition. This post is quite detailed so feel free to skip around to what interests you.
Competition Details 🏆
Google Research India hosted the chaii - Hindi and Tamil Question Answering competition to create question answering models for two Indian languages used by millions of people. Hindi is a relatively high resource language, but Tamil is more in the mid-tier category. When there aren’t as many high-quality public datasets or models, it can be hard for the community and industry to make useful AI applications.
Given a passage of text and a question about the passage, a question answering (QA) model should find the span of text that best answers that question. This is an extractive process, so the answer must be one contiguous section of text. There is a bit of ambiguity between QA and machine reading comprehension (MRC), and the differences between the two are worth reading up on if the distinction matters to you.
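To make "extractive" concrete: QA models of this kind typically output a start score and an end score per token, and decoding picks the highest-scoring valid span. A minimal sketch (not the competition code; the toy logits and length cap are made up):

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token pair maximizing start_logit + end_logit,
    subject to start <= end and a maximum answer length."""
    best, best_score = (0, 0), -np.inf
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy logits: token 2 is the likely start, token 4 the likely end.
start = np.array([0.1, 0.2, 3.0, 0.1, 0.0])
end = np.array([0.0, 0.1, 0.2, 0.5, 2.5])
print(best_span(start, end))  # (2, 4)
```

The answer returned to the user is then the substring of the original passage covered by those tokens.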
The power of SQuAD 💪
SQuAD is a popular dataset for training English QA models, but interestingly enough, training on English SQuAD helps models do better in languages that are nothing like English, such as Hindi and Tamil. Training a model on SQuAD before training on the chaii dataset resulted in significant improvements.
Encoder-only models like RoBERTa were heavily used, and while there were a few attempts to use encoder-decoder models like BART or T5, it seemed like people ultimately decided against them. XLM-RoBERTa (XLM-R), MuRIL, and RemBERT showed up in all of the top teams' solutions, and for good reason: they are high-performing multilingual models that can be trained on English QA to get better scores in Hindi and Tamil.
The power of multilingual models 🔥
Early on there was a discussion about whether it would be better to make two monolingual models, one for Hindi and one for Tamil, or just a single multilingual model. Quick experiments showed that the multilingual models were superior, and nearly all of the top scores relied on only multilingual models*. It was somewhat mind-blowing for me to see first-hand how multilingual models were actually learning the similarities between related languages. This was a huge factor in getting good performance, because Bengali and Telugu datasets gave a slight boost to the model's performance in Hindi and Tamil.

My strategy, along with that of many others, was to use SQuAD and TyDiQA as "QA pre-training." This is not typical pre-training using Masked Language Modeling (MLM), but rather pre-training in the sense that the model is gaining a general understanding of the task and more fine-tuning will happen afterwards. It's just the name I'm using for the stage of training that occurs in between MLM pre-training and QA fine-tuning: this stage primes the model with a sense of how the QA task works in general, and the fine-tuning then helps the model understand QA in the context of a specific language or two. Maybe "mid-tuning" or "pre-fine-tuning" would be better names.
*There was at least one team in the gold medal range that used an interesting technique to transfer model performance from English to Hindi and Tamil.
The limit of multilingual models
After seeing that SQuAD could serve as useful QA pre-training, I thought that I should find as many QA datasets in as many languages as possible. I found datasets in Korean, Thai, Bengali, German, French, Spanish, Arabic, Vietnamese, Japanese, Persian, English, and probably some more that I’m not remembering. This was quite a lot of data to churn through, but on TPU it only took an hour or so.
This was not a rigorous examination, but it did show the amazing cross-lingual learning ability of these models. Only after running these tests did I notice that the XLM-R paper mentions that training on too many languages reduces performance, which led me to choose only a small subset of QA datasets for QA pre-training.
Max Sequence Length and Doc Stride
Two of the most important parameters when training a QA model are the max sequence length and the doc stride. Ideally, the model would look at the whole passage in one pass, but passages are usually long enough that they need to be broken into chunks the model can process. When splitting, the doc stride is the amount of overlap between one chunk and the next. There didn't seem to be any well-known method for determining the max sequence length and doc stride, so most people ended up empirically finding what worked best. The standard max sequence length/doc stride of 384/128 is a good default.
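Hugging Face tokenizers handle this splitting via their stride option, but the core windowing logic is roughly the following (an illustrative sketch, with the doc stride interpreted as the overlap between consecutive chunks, as described above):

```python
def chunk_with_stride(tokens, max_len=384, stride=128):
    """Split a long token sequence into overlapping windows.
    Each new window starts (max_len - stride) tokens after the previous
    one, so consecutive windows share `stride` tokens of context."""
    step = max_len - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the passage
    return chunks

tokens = list(range(1000))
chunks = chunk_with_stride(tokens)
print(len(chunks), chunks[1][0])  # 4 windows; the second starts at token 256
```

The overlap matters because an answer span that straddles a chunk boundary would otherwise never appear whole in any single window.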
One trick that set me apart from other competitors was using BigBird to handle long sequence lengths. Because many of the texts had thousands of tokens, I thought it would be advantageous to use a model that does well on long contexts. BigBird is designed so that the attention mechanism scales linearly in memory rather than quadratically, as in the default transformer. Unfortunately, the public pre-trained BigBird models are English-only, so I had to come up with a way to stick MuRIL into BigBird. This is much simpler than it seems: MuRIL is essentially RoBERTa trained on Indian languages, and since the authors of BigBird did a "warm start" by using the pre-trained weights from RoBERTa, nearly all of the parameters lined up one-to-one.
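The warm start amounts to copying every parameter whose name and shape match between the two checkpoints. This is an illustrative sketch, not the actual conversion script; the tiny state dicts and parameter names are made up:

```python
import numpy as np

def warm_start(target_state, source_state):
    """Copy every parameter whose name and shape match from the source
    checkpoint into the target; return the names that were skipped."""
    skipped = []
    for name, param in target_state.items():
        src = source_state.get(name)
        if src is not None and src.shape == param.shape:
            target_state[name] = src
        else:
            skipped.append(name)
    return skipped

# Hypothetical mini state dicts: the attention weights line up one-to-one,
# but the position embeddings do not (512 vs 4096 positions).
muril = {"attn.weight": np.ones((4, 4)), "pos_emb": np.zeros((512, 4))}
bigbird = {"attn.weight": np.zeros((4, 4)), "pos_emb": np.zeros((4096, 4))}
print(warm_start(bigbird, muril))  # ['pos_emb']
```

Everything that matches is transferred directly; the mismatches are exactly the pieces that need special handling, as described next.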
The only component of BigBird that could not directly come from RoBERTa/MuRIL was the position embeddings. These are trained parameters that are of the shape (max sequence length, hidden size) which means RoBERTa/MuRIL only has values up to a max sequence length of 512 and BigBird needs 4096. The solution to this is to tile the RoBERTa/MuRIL embeddings 8 times so it has the right dimensions.
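The tiling is a one-liner. A minimal sketch, assuming a hypothetical hidden size of 768 (so the tables are (512, 768) and (4096, 768)):

```python
import numpy as np

# MuRIL's learned position embeddings cover positions 0..511; BigBird
# expects 4096 positions. Repeating the table 8 times along the position
# axis produces a tensor of the required shape.
hidden = 768
muril_pos = np.random.randn(512, hidden)
bigbird_pos = np.tile(muril_pos, (8, 1))
print(bigbird_pos.shape)  # (4096, 768)
```

The repeated embeddings are only a starting point, of course; the subsequent pre-training on mC4 lets the model adjust them for positions beyond 512.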
I used the Google TPU Research Cloud (TRC) to train on TPU v3-8 accelerators and ran a base and a large model for a couple of days on mC4 data. I used the Flax scripts in the Hugging Face examples directory, and it could not have been simpler.
Even at a batch size of 1 and using bfloat16, I couldn't train the large BigBird model at a sequence length of 4096, or even 2048, but only at a measly 1024. A 6-fold ensemble of these models got my highest score, earning me a silver medal in the competition.
I had decent-scoring models using XLM-R, RemBERT, MuRIL, and BigBird MuRIL, but I ran out of time before I was able to ensemble them together. I took inspiration from a notebook from a previous competition that turned token-level predictions into character-level predictions before combining. Different models have different tokenizers, so it isn't straightforward to combine their token-level outputs. The one thing they do have in common is the original context, hence the need for character-level predictions. To go from token level to character level, each token-level prediction is duplicated across the characters that token covers.
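A sketch of the idea, assuming each tokenizer can report character offset mappings for its tokens (the offsets and scores below are made up for illustration):

```python
import numpy as np

def token_to_char_scores(token_scores, offsets, context_len):
    """Spread each token's score over the characters it covers, so that
    predictions from models with different tokenizers can be averaged
    in a shared character space."""
    char_scores = np.zeros(context_len)
    for score, (start, end) in zip(token_scores, offsets):
        char_scores[start:end] = score
    return char_scores

context = "New Delhi is the capital."
# Two hypothetical tokenizers segment the same context differently.
scores_a = token_to_char_scores([0.9, 0.8, 0.1], [(0, 3), (4, 9), (10, 25)], len(context))
scores_b = token_to_char_scores([0.7, 0.2], [(0, 9), (10, 25)], len(context))
ensemble = (scores_a + scores_b) / 2  # now directly comparable
```

In practice this is done separately for the start and end scores, and the best span is then decoded from the averaged character-level scores.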
I submitted an ensemble for scoring after the competition end, but it did poorly. If I ever get around to seeing why my ensemble failed, I’d be curious to know how well it could have done. Many of the gold medalists did similar ensembling, but theirs got good results, unlike mine.
What might be interesting to explore in the future
Two approaches I tried that didn’t yield any results were Splinter and SAM.
Splinter is a model designed for few-shot QA, which seemed ideal for this competition. Just like BigBird, only an English model was available, so I tried to replicate it in Hindi and Tamil, to no avail. The original work was done in TensorFlow 1, and I attempted to replicate it using Hugging Face and PyTorch. I trained on a TPU v3-8, but I was unable to get the training to converge smoothly. Hugging Face even released official support for Splinter, but I was still unable to get it to work. I have trained Splinter on SQuAD v2, so I can confirm that the English version works.
Sharpness-Aware Minimization (SAM) is an optimization technique that is supposed to lead to better generalization because it seeks flat regions of the loss landscape rather than sharp minima. There are papers indicating that it does well on computer vision tasks and on language tasks as well. A model that generalizes well helps avoid unpleasant surprises on the final leaderboard. When I used SAM, I had no intuition for the hyperparameters, so my training runs either diverged or scored the same as, or worse than, before.
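For reference, a SAM update first climbs to a nearby high-loss point within a radius rho, then applies the gradient computed there to the original weights. A toy sketch on a quadratic loss (not my competition training code; the learning rate and rho are arbitrary):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights in the
    direction of steepest ascent (radius rho), then update the ORIGINAL
    weights using the gradient computed at the perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_sam = grad_fn(w + eps)                      # gradient at the perturbed point
    return w - lr * g_sam

# Toy loss f(w) = ||w||^2, whose gradient is 2w.
grad_fn = lambda w: 2 * w
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w, grad_fn)
# The norm of w shrinks toward 0, the (flat) minimum of the toy loss.
```

The extra gradient computation roughly doubles the cost per step, and rho is exactly the kind of hyperparameter I lacked intuition for.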
Both of these approaches seemed very promising, and I thought that Splinter could have been a gold-medal-worthy approach. I'm still curious to see how well it could have done in this competition.
If you want to add Kaggle badges to your GitHub profile or website, use this: https://github.com/subinium/kaggle-badge