LANL Earthquake Prediction

Time-series and Regression.

Shivam Baldha
Apr 25, 2022

Table of Contents

  1. Introduction
  2. Business Problem
  3. Source of Data
  4. Business Constraints
  5. Performance Metrics
  6. Use of Machine Learning
  7. Existing Approaches
  8. Improvements
  9. Exploratory Data Analysis and its Observations
  10. My First-Cut Approach to the Problem
  11. Feature Engineering
  12. Modeling and Comparison
  13. Kaggle Submission Score
  14. A Final Pipeline of the Problem
  15. Conclusion
  16. Future Work
  17. References
  18. GitHub Repo

Prerequisites

You should have some knowledge of machine learning algorithms, a little about time series, and feature engineering techniques.

1. Introduction

Forecasting earthquakes is one of the most important problems in Earth science because of their devastating consequences. Current scientific studies on earthquake forecasting focus on three key questions: when the event will occur, where it will occur, and how large it will be. Earthquakes are caused mainly by the rupture of geological faults, but also by other events such as volcanic activity, landslides, mine blasts, and nuclear tests.

If we can predict an earthquake ahead of time, it is very helpful: we can save the lives of both people and animals.

2. Business problem

The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes.

Here we have to predict the remaining time before the next laboratory earthquake based on the seismic data, i.e., the time between the current point in the signal and the occurrence of the next earthquake.

Since the seismic data is continuous and real-valued, and our target is also a real value, this is a regression problem.

3. Source of data

LANL Earthquake Prediction is a competition held by the Department of Physics & Astronomy of Purdue University and hosted on Kaggle. The dataset contains only two columns.

The dataset folder contains the following files:

  • train.csv: this file contains 2 columns and about 629 million rows. The first column is acoustic_data, a seismic signal that looks like a sound wave, recorded as continuous segments of experimental data. The second is time_to_failure, the remaining time before the next laboratory earthquake.
  • test folder: it contains many .csv files, each holding a series of acoustic values; every test segment contains exactly 150,000 rows.
  • sample_submission.csv: a 2624 x 2 file with the columns seg_id and time_to_failure.
  • The dataset overview can be found at this link.

4. Business constraints

  1. The predicted value should be a whole number.
  2. Strict latency constraints.
  3. Incorrect forecasting may lead to loss of human and animal life, and of money.

5. Performance metrics

The performance metric used for evaluation is the Mean Absolute Error (MAE) between the predicted and actual remaining times.

In statistics, the mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon (Wikipedia).

MAE = (1/n) Σ |y_i − ŷ_i|, where y_i is the actual and ŷ_i the predicted remaining time.

We also used the Mean Absolute Percentage Error (MAPE) as an additional evaluation metric.
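
As a quick illustration, both metrics can be computed with scikit-learn; this is a minimal sketch with made-up numbers, not the competition data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical actual and predicted remaining times, in seconds.
y_true = np.array([1.5, 4.2, 9.0, 15.8])
y_pred = np.array([2.0, 3.9, 8.5, 14.9])

mae = mean_absolute_error(y_true, y_pred)              # mean of |y_true - y_pred|
mape = mean_absolute_percentage_error(y_true, y_pred)  # relative version of MAE

print(f"MAE:  {mae:.3f} sec")  # 0.550
print(f"MAPE: {mape:.2%}")
```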

6. Use of machine learning

Given the signal waves, we have to predict the remaining time. The data is generated in a laboratory and the signal is continuous, so this is a time-series-based problem.

Brief intro about the Time Series

  1. Univariate time series: forecasting based on a single variable, e.g., earthquake prediction. To handle univariate time series we have good models like ARIMA, Facebook Prophet, etc.; we discuss these models later in this post.
  2. Multivariate time series: when you have two or more features and predict a variable based on the other features, e.g., predicting temperature from features like date, humidity, and rainfall.

There are different types of models for both kinds of time-series data.

Time-series Models

Our target value is time, which is real-valued, hence this is a regression-based problem.

7. Existing approaches

Solution 1

Observations

  1. They started with how to create new features, plus some EDA.
  2. In the EDA, they found that the maximum remaining time is 16 seconds and that the dataset contains 16 earthquakes.
  3. Modeling: they used the following models:
  • SVR
  • XGBoost
  • Random Forest
  • CatBoost

4. Features: they used rolling-window-based feature engineering.

Solution 2

Observations

  1. This blog was written by the 1st-place holder of the competition.
  2. First, they talked about features and acoustic-signal manipulation.
  • Like the solution above, they created statistical features such as mean, std-dev, max, etc., but the signal had a certain time trend that caused issues, specifically for mean- and quantile-based features.
  • To overcome this, their team added noise values to the acoustic signal to manipulate the data.

3. They applied (i) LGB and (ii) SVR models and achieved a very good (low) MAE.

8. Improvements

The biggest challenge for me in this competition was computational power, because we have 629 million rows and the dataset is roughly 10 GB. To overcome this I used the Dask library to load the data and perform the EDA, feature engineering, and modeling.
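
As an illustration, a minimal Dask sketch for loading the file lazily and computing aggregates out of core might look like this (the dtypes follow the competition data; the file path is an assumption):

```python
import dask.dataframe as dd

# Lazily read the ~10 GB training file; nothing is loaded into memory yet.
train = dd.read_csv(
    "train.csv",
    dtype={"acoustic_data": "int16", "time_to_failure": "float64"},
)

# Aggregations are computed out of core, chunk by chunk.
print(train["acoustic_data"].mean().compute())
print(train["time_to_failure"].max().compute())
```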

I spent more time on the feature engineering part, doing basic statistical feature engineering such as mean, median, max, min, etc., but my MAE and MAPE were not improving much.

Then I came up with one good technique: the spectrogram image. A spectrogram is a visual way of representing the signal strength, or "loudness", of a signal over time at the various frequencies present in a particular waveform.

Spectrograms are basically two-dimensional graphs, with a third dimension represented by color. For more about spectrograms, click here.

Image representation using a spectrogram

Here the X-axis is time and the Y-axis is frequency, so we can use the frequencies as new features. And since a spectrogram is basically an image, we can also use its pixel values as new features; this is the main improvement in this case study.

We also use rolling-window-based features as part of the feature engineering.

9. Exploratory data analysis (EDA) and its observation

EDA helps us understand the data more clearly. With the help of EDA, we can add important features using feature engineering techniques, which helps us decrease the MAE and MAPE scores.

9.1 Load the data and visualize
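
The original snippet is not reproduced here; a minimal sketch for loading a manageable sample with pandas and plotting both columns might look like this (the row count and downsampling step are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read only the first few million rows for a quick look (the full file has ~629M).
sample = pd.read_csv(
    "train.csv",
    nrows=5_000_000,
    dtype={"acoustic_data": "int16", "time_to_failure": "float64"},
)

fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(sample["acoustic_data"].values[::50], color="orange")
ax1.set_ylabel("acoustic_data")
ax2 = ax1.twinx()
ax2.plot(sample["time_to_failure"].values[::50], color="blue")
ax2.set_ylabel("time_to_failure (sec)")
plt.title("Acoustic signal and time to failure")
plt.show()
```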

9.2 Distribution of acoustic data

PDF of acoustic wave

Here we plot the distribution of the acoustic data using only 1% of the data, because the dataset is very large.

Looking at the PDF of the acoustic wave, most points lie in the range -10 to 20, while a few points take very extreme values, roughly -2000 to 4000. We cannot see the distribution clearly at this scale, so let's zoom into the plot.

PDF of acoustic wave

Now we take the acoustic values between -30 and 30, again using only 1% of the total data.

Now we can see the distribution clearly: the acoustic values between -10 and 20 follow a Gaussian/normal distribution. Judging by the tallest histogram bin, the most common acoustic value is about 8.

Q-Q plot of acoustic wave

From the Q-Q plot, we can say that the acoustic values between -2 and 2 follow a Gaussian distribution.
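
A Q-Q plot like this can be produced with scipy's probplot; a minimal sketch, reusing the sample frame from the loading sketch in section 9.1:

```python
import scipy.stats as stats
import matplotlib.pyplot as plt

# Compare a 1% subsample of the acoustic data against a theoretical normal.
subsample = sample["acoustic_data"].sample(frac=0.01, random_state=42)
stats.probplot(subsample, dist="norm", plot=plt)
plt.title("Q-Q plot of acoustic data")
plt.show()
```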

9.3 Distribution of time to failure

PDF and CDF of time to failure

Here we plot the distribution of our target variable, time_to_failure, again using only 1% of the data. From the PDF we can say that very little of the data has a remaining time of more than 15 seconds before the next earthquake.

From the CDF of time_to_failure, we can say that for about 80% of the data, the remaining time before the next earthquake is less than 9 seconds.

9.4 Acoustic data + Time to failure

acoustic data with time to failure

For clear visualization, we use only a small subset of the data in this plot.

Let's see how both variables change over time; the orange line is the acoustic data and the blue one is the time to failure.

We can see a repeating pattern in the acoustic wave: wherever the signal reaches its maximum values, that is an earthquake point, so we have a total of 16 earthquake signals. Alongside the signal wave, we also plot the remaining time before the next earthquake.

Every earthquake is preceded by some remaining time until the next one; the shortest time to failure is 1.5 seconds and the longest is around 16 seconds.

9.5 Test Data

Test data distribution and simple plot

In the test data we have separate CSV files, each containing 150,000 acoustic values (a single column). Here I have plotted one sample file for understanding, along with its distribution plot; the distribution is very peaked. The second plot shows all the acoustic values of that file.

9.6 Dickey-Fuller Tests

Our data is a time series, and to determine whether it is stationary or non-stationary we have to perform a test.

There are many tests for checking whether given data is stationary or non-stationary; one of the most popular is the Dickey-Fuller test, also called the Augmented Dickey-Fuller (ADF) test.

A time series is said to be stationary if its statistical properties do not change over time, i.e., its mean and variance are constant. If its mean and variance change over time, the data is non-stationary.

Given the plot of time_to_failure, we can say it looks stationary, but how can we prove this?

This is where the ADF test comes into the picture. As with any statistical test, we have a null hypothesis H0 and an alternative hypothesis H1.

In the ADF test, the null hypothesis H0 is that the time series is non-stationary, and the alternative hypothesis H1 is that the time series is stationary.

To perform the ADF test we use the adfuller function, which returns three key values: (1) the ADF statistic, (2) the p-value, and (3) the critical values.

If the ADF statistic is less than the critical values, we reject H0, which means the time series is stationary; if the ADF statistic is greater than the critical values, we fail to reject H0, which means the time series is non-stationary.

The same conclusion can be drawn from the p-value: if the p-value is very small, reject H0; if it is high, fail to reject H0.

  • Code snippet:
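
(The original snippet is not reproduced here; below is a minimal sketch using statsmodels' adfuller, reusing the sample frame from section 9.1 and thinning it, since the test is slow on very long series.)

```python
from statsmodels.tsa.stattools import adfuller

# Thin the target series before testing; running on all 629M rows is impractical.
series = sample["time_to_failure"].values[::1000]

adf_stat, p_value, _, _, critical_values, _ = adfuller(series)

print(f"ADF statistic: {adf_stat:.4f}")
print(f"p-value:       {p_value:.4f}")
for level, value in critical_values.items():
    print(f"Critical value ({level}): {value:.4f}")

# Small p-value => reject H0 (non-stationarity) => series looks stationary.
if p_value < 0.05:
    print("Reject H0: the series looks stationary.")
else:
    print("Fail to reject H0: the series looks non-stationary.")
```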

10. My first-cut approach to the problem

I decided to create new statistical features such as mean, median, quantile values, max, min, etc. Our training set contains 629 million rows and each test file contains 150k rows, so I split the training data into 150k-row batches and generated the mean, median, max, etc. from each batch across the whole dataset.

Further, to create more features I used the rolling-window method (see the sketch below). A rolling window means we choose a window size, slide it over the data, and compute features within each window. For example, with a window size of 10 we compute the mean over every 10-row span starting from the first sample; from these rolling values we create the same statistical features as above.
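
A minimal sketch of the rolling-window features, assuming the per-batch layout described above (the window size and feature set here are illustrative):

```python
import pandas as pd

def rolling_features(segment: pd.Series, window: int = 10) -> dict:
    """Statistical features computed over a rolling mean of one 150k-row batch."""
    rolled = segment.rolling(window).mean().dropna()
    return {
        f"roll{window}_mean": rolled.mean(),
        f"roll{window}_std":  rolled.std(),
        f"roll{window}_max":  rolled.max(),
        f"roll{window}_min":  rolled.min(),
        f"roll{window}_q95":  rolled.quantile(0.95),
    }

# Hypothetical usage on one 150,000-row batch of acoustic data:
# feats = rolling_features(batch["acoustic_data"], window=10)
```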

Then I decided to experiment with all of the models below, with hyperparameter tuning (a tuning sketch follows this list):

KNN Regression, Linear Regression, Random Forest regressor, XGBoost regressor.

With simple statistical features,
  • it seems that RF, XGB + RF regressor, and XGB perform well.
  • So now our task is to come up with new, meaningful features using feature engineering techniques that will help us improve our MAE and MAPE scores.
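
A minimal tuning sketch for the XGBoost regressor; the grid values here are illustrative, not the ones actually used in the project:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# X, y: per-batch feature matrix and time_to_failure targets built above.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    scoring="neg_mean_absolute_error",  # optimize MAE directly
    cv=3,
)
# search.fit(X, y)
# print(search.best_params_, -search.best_score_)
```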

11. Feature engineering

Here we create new features while also keeping the old ones, producing three new feature-engineered datasets that help predict the time to failure.

11.1 Spectrogram image-based feature

Using the spectrogram, we convert each 150k-row batch of data into a spectrogram image and take the top 500 pixel values as new features (a sketch follows below).

Now we have the simple statistical features plus 500 pixel values as features; this is our 1st feature-engineered dataset.
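
A minimal sketch of how such pixel features might be extracted with scipy; the sampling rate and spectrogram parameters are assumptions, not the exact values used in the project:

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_features(segment: np.ndarray, n_pixels: int = 500) -> np.ndarray:
    """Return the top-n spectrogram pixel values of one 150k-value batch."""
    # fs ~ 4 MHz approximates the acoustic sampling rate; for pixel-ranking
    # features the exact value does not matter much.
    freqs, times, Sxx = spectrogram(segment, fs=4_000_000)
    pixels = np.sort(Sxx.ravel())[::-1]  # pixel values, largest first
    return pixels[:n_pixels]

# Hypothetical usage on one 150,000-value acoustic batch:
# feats = spectrogram_features(batch["acoustic_data"].values, n_pixels=500)
```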

11.2 Spectrogram image-based feature with 200 pixel

Here we follow the same approach as above, but instead of 500 pixel values I take 200, and I use a richer set of statistical features instead of the simple ones; this is our 2nd dataset.

11.3 Frequencies based feature

As we know, the spectrogram computation returns three arrays, one of which holds the frequencies. We use all the frequencies as new features and combine them with the statistical features; this is our 3rd dataset.

12. Modeling and comparison

  1. Now we have three different datasets, and on each of them I applied different types of models; our goal is to predict the time to failure.
  2. In this section, we also want to understand how our newly created features help improve our MAE and MAPE.
  3. I applied KNN, LR, RF, and XGB models with hyperparameter tuning, and I also tried stacking models.
Comparison of all Models vs All datasets

Observation:

  • Looking at the results, I can say that the 500-pixel image features help a lot in reducing the MAE.
  • XGBoost and Random Forest perform best; of the two, we choose XGBoost as our final model, with statistical + 500-pixel features as the best feature engineering (a stacking sketch follows below).
  • The code is quite long, so click here for the full code.
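
As one illustration of the stacking experiments mentioned above, a sketch using scikit-learn's StackingRegressor (the exact configuration in the project may differ):

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("rf", RandomForestRegressor(n_estimators=300)),
        ("xgb", XGBRegressor(objective="reg:squarederror")),
    ],
    final_estimator=LinearRegression(),  # meta-model over out-of-fold predictions
    cv=3,
)
# stack.fit(X_train, y_train)
# predictions = stack.predict(X_test)
```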

13. Kaggle submission score:

Submission Score

I got an MAE score of 2.489, which is in the top 5% on the leaderboard of the challenge.

14. A final pipeline of the problem

  • Input: a .csv file containing acoustic signal values.
  • Output: the remaining time until the next earthquake, in seconds.
Output

Model deployed on a local box (video).

The model is also deployed on Streamlit:

https://share.streamlit.io/shivambaldha/earthquake-prediction/main/app.py
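
A hypothetical sketch of what the deployed app might look like; featurize and model are stand-ins for the project's feature pipeline and saved final model, not real names from the repo:

```python
import pandas as pd
import streamlit as st

st.title("LANL Earthquake Prediction")

uploaded = st.file_uploader("Upload a 150,000-row acoustic .csv", type="csv")
if uploaded is not None:
    segment = pd.read_csv(uploaded)["acoustic_data"]
    # featurize() and model are assumed to be defined/loaded elsewhere,
    # e.g. the statistical + 500-pixel features and the tuned XGBoost model.
    features = featurize(segment)
    prediction = model.predict([features])[0]
    st.write(f"Remaining time to next earthquake: {prediction:.2f} sec")
```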

15. Conclusion:

  1. As we saw in the first-cut solution, we have only one column, so we have to generate more features to predict the target values.
  2. We saw that new features based on statistical values (e.g., mean, median, percentile values) helped a lot in improving our MAE and MAPE.
  3. To improve the MAE and MAPE further, we came up with new image- and frequency-based features.
  4. Proper hyperparameter tuning also plays an important role in improving the MAE.
  5. We can say that ensemble models, such as gradient-boosted regressors, work best for this problem.

16. Future Work:

  1. Adding new features to the model, and focusing more on the spectrogram image features, could improve the MAE score.
  2. If we could use a univariate time-series model to create more features, it would help improve the score.
  3. Deep learning techniques could also improve the MAE score.

17. References:

  1. https://www.appliedaicourse.com/
  2. https://www.pnas.org/doi/10.1073/pnas.2011362118
  3. https://medium.com/@saivenkat_/a-detailed-case-study-on-lanl-earthquake-prediction-using-machine-learning-algorithms-beginner-to-9b38ef270887
  4. https://medium.com/@ph_singer/1st-place-in-kaggle-lanl-earthquakeprediction-competition-15a1137c2457
  5. https://www.youtube.com/watch?time_continue=228&v=s8Q_orF4tcI&feature=emb_title
  6. https://www.analyticsvidhya.com/blog/2019/12/6-powerful-feature-engineering-techniques-time-series/
  7. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7435601/ — this article explains various types of features, one of which is statistical features.
  8. https://www.kaggle.com/code/thebrownviking20/everything-you-can-do-with-a-time-series/notebook — this notebook explains all types of time-series techniques.
  9. https://share.streamlit.io/ — used to deploy the model.

18. GitHub Repo

Linkedin Profile

That’s all…
