Modeling Small Dataset using LightGBM Regressor

7 min read · Jun 13, 2021

Finding suitable reading material is a common problem for many readers: adults and children, teachers and students. They all face the same challenge: finding books close to their current reading ability, whether for comfortable reading (easy level) or for learning and improvement (harder level), without being overwhelmed by too many difficulties, which often turns reading into a frustrating experience.

In this article, I show how I developed a machine learning regression model that objectively predicts a text's readability score, which in turn indicates the text's reading complexity.

We will work with the CommonLit Readability dataset, which comes from a Kaggle competition.

What exactly does the term “Text Readability Score” or “Text Complexity” mean?

Take a moment to review the following list of elementary texts and put them in order from least complex (1) to most complex (6):

— The starving Caterpillar
— Volcanoes: Nature’s Incredible Fireworks
— Because of Winn-Dixie
— Martin’s Big Words
— Diary of a Wimpy Kid
— Harry Potter and the Sorcerer’s Stone

What criteria did you use in your rankings? Did you think about the content and how accessible it might be to readers? Did you consider the vocabulary used in these texts and their general language style? Maybe you considered the length of the text overall, how many syllables were in the longest words, the length of the sentences.

When ranking the complexity of these texts, you were thinking about quantitative features that can be counted, like the number of syllables, and also qualitative features such as the language used, the complexity of the shared ideas, and other attributes of the text, such as its structure, style, and levels of meaning. If you also thought about how challenging the text would be for a specific reader or group of readers, you were considering a third dimension of text complexity, referred to as text factors. Keep in mind that readability scores are approximations: a text with many grammar problems may receive the same readability score as a text with perfect grammar.

In this article, we will build a machine learning model that learns from historical readability scores and their respective passages in order to predict the readability score of unseen passages. The training dataset contains 2834 passages with their scores, which we analyzed using 9 quantitative features and 9 text complexity algorithms. Many of these features and algorithms turned out to be strongly correlated with each other, positively or negatively, which led us to drop 4 of the 9 quantitative features and 7 of the 9 algorithms. Using the remaining features, we trained a LightGBM regression model to predict the readability scores of unseen passages.
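The correlation-based pruning described above can be sketched roughly as follows. This is an illustration and not the exact procedure used for the article: the helper name `drop_correlated`, the 0.9 threshold, and the toy feature values are all assumptions.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Made-up feature values: n_letters grows almost linearly with n_words.
toy = pd.DataFrame(
    {
        "n_words": [100, 150, 200, 250, 300],
        "n_letters": [450, 680, 900, 1120, 1350],
        "n_sentences": [9, 5, 12, 7, 10],
    }
)
pruned = drop_correlated(toy)
print(pruned.columns.tolist())
```

Taking the absolute value treats strong negative correlation the same as strong positive correlation, which matches the idea of dropping features that carry redundant information in either direction.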

Loading Data

A machine learning problem starts with collecting and importing the data.

1. The dataset provided by CommonLit Readability contains 2834 different passages.

2. The target variable, which represents the readability score of each passage, is present in the dataset. Hence, this is a supervised learning problem.

3. The final shape of the training dataset is,

4. The other variables in the dataset are the id, URL, license, and standard error of each passage. The URL and license columns contain mostly null values.

5. We drop the columns with null values from the training dataset.
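A minimal sketch of the loading and cleanup steps above, using pandas. The column names follow the layout of the competition's `train.csv` as I understand it, and the three rows are invented stand-in data so that the snippet runs without the competition file; treat both as assumptions.

```python
import pandas as pd

# Stand-in for pd.read_csv("train.csv"); the rows are invented for illustration.
train = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3"],
        "url_legal": [None, None, "https://example.org/text"],
        "license": [None, None, "CC BY 4.0"],
        "excerpt": ["A short passage.", "Another passage.", "A third passage."],
        "target": [-0.34, 0.12, -1.05],
        "standard_error": [0.46, 0.48, 0.51],
    }
)

# Share of missing values per column, to see which columns are mostly null.
print(train.isnull().mean())

# Drop every column that contains null values.
train_clean = train.dropna(axis=1)
print(train_clean.shape)
print(train_clean.columns.tolist())
```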

Exploratory Data Analysis

1. In the first step of the analysis, we plotted the density distribution of the target variable. The target variable is approximately normally distributed: most passages were scored as medium in complexity, while comparatively few are very easy or very hard.

2. The plot of the standard error is concentrated around an error value of 0.5, which suggests that most passages readers struggle to understand are misjudged as medium or hard in reading complexity.

Feature Engineering

In the first step of our pipeline, we cleaned the text and tokenized it into sentences and words. Then we represented each text as an array of numbers, including the number of sentences, total words, average word count, number of stop words, number of unique words, and so on.

Each passage is represented by a vector of 9 numbers, each of them being a text feature such as:

Number of sentences per passage,
Number of words per passage,
Mean number of words per sentence,
Count of total letters in the passage,
Count of unique words in the passage,
Count of words that are not stop words in the passage,
Length of the longest and shortest sentences in the passage,
Mean of the number of words considered “difficult” in a sentence.
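Most of the features listed above can be computed with a few lines of plain Python. The sketch below is only an illustration: the function name, the naive sentence splitter, and the tiny stop-word list are assumptions, and the "difficult words" feature is left out because it depends on a reference word list.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "was"}


def quantitative_features(passage: str) -> dict:
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    words_per_sentence = [re.findall(r"[A-Za-z']+", s.lower()) for s in sentences]
    words = [w for sent in words_per_sentence for w in sent]
    lengths = [len(sent) for sent in words_per_sentence]
    return {
        "n_sentences": len(sentences),
        "n_words": len(words),
        "mean_words_per_sentence": len(words) / max(len(sentences), 1),
        "n_letters": sum(len(w) for w in words),
        "n_unique_words": len(set(words)),
        "n_non_stop_words": sum(w not in STOP_WORDS for w in words),
        "max_sentence_length": max(lengths, default=0),
        "min_sentence_length": min(lengths, default=0),
    }


features = quantitative_features("The cat sat on the mat. It was happy!")
print(features)
```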

Additionally, each passage was represented by a vector of 9 readability formula scores, each of them being a text feature such as:

Dale-Chall Readability Score
Coleman-Liau Index
SMOG Index
ARI (Automated Readability Index)
Linsear Write Formula
Gunning Fog
Flesch Reading Ease
Flesch-Kincaid Grade
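To give a feel for what these formulas compute, here is a rough sketch of three of them (Flesch Reading Ease, Flesch-Kincaid Grade, and ARI) using their standard published coefficients. The syllable counter is a crude heuristic of my own and the function names are hypothetical, so the scores will differ somewhat from those produced by a dedicated readability library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability_scores(text: str) -> dict:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    n_sent, n_words = max(len(sentences), 1), max(len(words), 1)
    n_chars = sum(ch.isalnum() for w in words for ch in w)
    n_syll = sum(count_syllables(w) for w in words)
    words_per_sentence = n_words / n_sent
    syllables_per_word = n_syll / n_words
    return {
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "ari": 4.71 * (n_chars / n_words) + 0.5 * words_per_sentence - 21.43,
    }


scores = readability_scores("The cat sat on the mat. It was happy!")
print(scores)
```

Very simple text like this example falls outside the range the formulas were calibrated for, which is why the grade-level scores come out negative.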

Note: I will be posting another article explaining each text readability formula and its tools.

NLP Pipeline

Now that we have built a set of features representing a text, we want to reduce these vectors to the most important features, that is, the ones that carry the most information about the target. Features that contain no information about the target variable (readability score) only add computational cost, because inference is performed with more features than necessary.

To perform this step, we use TruncatedSVD, also known as LSA (latent semantic analysis), together with TfidfVectorizer, which weights each term by how often it appears in a passage and how rare it is across the dataset. We used 10 components and 10 iterations.
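A minimal sketch of this step with scikit-learn, assuming default `TfidfVectorizer` settings. The 10 components and 10 iterations come from the text above; the toy passages and the `random_state` value are placeholders of mine.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy passages standing in for the real excerpts.
passages = [
    "The cat sat on the mat.",
    "Volcanoes erupt with molten lava and ash.",
    "A hungry caterpillar ate a green leaf.",
    "The wizard waved his wand in the castle.",
    "Rain falls from clouds over the hills.",
    "Children read books at the library.",
    "The dog chased a red ball in the park.",
    "Stars shine brightly in the night sky.",
    "A farmer planted corn in the spring.",
    "The teacher wrote numbers on the board.",
    "Birds build nests high in the trees.",
    "The river flows slowly to the sea.",
]

# TF-IDF turns each passage into a sparse term-weight vector;
# TruncatedSVD then projects those vectors onto 10 latent components.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=10, n_iter=10, random_state=0),
)
reduced = lsa.fit_transform(passages)
print(reduced.shape)
```

Each passage ends up as a dense vector of 10 numbers that can be placed alongside the hand-crafted features.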

Implementing LightGBM Model

Our output variable is numerical and continuous which narrows the spectrum of machine learning models applicable to our dataset. To select an appropriate model, there are several indicators that may guide one’s choice, such as the number of features or the number of samples available.
LightGBM is a gradient boosting framework built on tree-based learning algorithms and is known for being computationally efficient. Trees in LightGBM grow leaf-wise (vertically): at each step, the leaf with the largest loss is chosen to be split.

The limitation of LightGBM is that it does not perform well on small datasets (fewer than 10,000 rows) and tends to overfit them. To avoid overfitting on our dataset, we tuned the parameters of the algorithm with GridSearchCV.

The parameters we found to be best for our model are:

LGBMRegressor(learning_rate=0.05, max_depth=2, num_leaves=50)

Predicting readability scores

We now have a predictive model that takes a passage's features (obtained through pre-processing) as input and gives a readability score as output.

Performance of the Model

Overall, our best model achieves an RMSE of around 0.7. RMSE indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model's predicted values. As seen in the EDA above, the target variable ranges between -4 and 2, so an RMSE of around 0.7 suggests that the model's estimates are reasonably good.
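For reference, RMSE is the square root of the mean squared difference between the observed and predicted values. The short sketch below computes it on made-up numbers and then relates the reported RMSE of 0.7 to the target range of -4 to 2 mentioned above; the function and variable names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up scores: every prediction is off by exactly 0.5.
y_true = [-2.0, -1.0, 0.0, 1.0]
y_pred = [-1.5, -1.5, 0.5, 0.5]
score = rmse(y_true, y_pred)
print(score)

# The reported RMSE of about 0.7 as a share of the target range (-4 to 2).
share_of_range = 0.7 / (2 - (-4))
print(round(share_of_range, 3))
```

In other words, the typical error is a little under 12% of the full span of the target variable.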

Statistically, our results seem satisfactory. However, we have room for improvement with this approach and we are going to evaluate the robustness of our model with human experts giving their feedback in the loop.

Note: As a preview of our next article, we are currently working on another approach to evaluating text readability using XGBoost regression.