LANL Earthquake Prediction

Time-series and Regression.

Shivam Baldha
Apr 25, 2022

Table of Contents

  1. Introduction
  2. Business Problem
  3. Source of Data
  4. Business Constraints
  5. Performance Metrics
  6. Use of Machine Learning
  7. Existing Approaches
  8. Improvements
  9. Exploratory Data Analysis and its Observations
  10. My First-Cut Approach to the Problem
  11. Feature Engineering
  12. Modeling and Comparison
  13. Kaggle Submission Score
  14. A Final Pipeline of the Problem
  15. Conclusion
  16. Future Work
  17. References
  18. GitHub Repo

Prerequisites

You should have some knowledge of machine learning algorithms, a little about time series, and feature engineering techniques.

1. Introduction

Forecasting earthquakes is one of the most important problems in Earth science because of their devastating consequences. Current scientific studies on earthquake forecasting focus on three key questions: when the event will occur, where it will occur, and how large it will be. Earthquakes are caused mainly by the rupture of geological faults, but also by other events such as volcanic activity, landslides, mine blasts, and nuclear tests.

If we can predict an earthquake ahead of time, it is very helpful: we can save the lives of both people and animals.

2. Business problem

The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes.

Here we have to predict the remaining time before the next laboratory earthquake based on the seismic data, i.e., the time between the current point in the signal and the occurrence of the next earthquake.

Since the seismic data is continuous and real-valued, and our target is also a real value, this is a regression problem.

3. Source of data

LANL Earthquake Prediction is a competition held by the Department of Physics & Astronomy of Purdue University and hosted on Kaggle. The dataset contains only two columns.

The dataset folder contains the following files:

  • train.csv: this file contains 2 columns and about 629 million rows. The first column is acoustic_data, a seismic signal that looks like a sound wave, recorded as continuous segments of experimental data. The second is time_to_failure, the remaining time before the next laboratory earthquake.
  • test folder: it contains many .csv files, each holding a series of acoustic values; every test segment contains exactly 150,000 rows.
  • sample_submission.csv: a 2624 x 2 file with the columns seg_id and time_to_failure.
  • The dataset overview can be found at this link.

4. Business constraints

  1. The predicted value should be a whole number.
  2. Strict latency constraints.
  3. Incorrect forecasting may lead to loss of human and animal life, and of money.

5. Performance metrics

The performance metric used for evaluation is the Mean Absolute Error (MAE) between the predicted and actual remaining times.

In statistics, the mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon (Wikipedia).

MAE = (1/n) Σ |y_i − ŷ_i|, where y_i is the actual and ŷ_i the predicted remaining time.

We also used the Mean Absolute Percentage Error (MAPE) as an additional evaluation metric.
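
As a quick illustration, both metrics can be computed with scikit-learn; this is a minimal sketch with made-up numbers, not the competition data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical actual and predicted remaining times, in seconds.
y_true = np.array([1.5, 4.2, 9.0, 15.8])
y_pred = np.array([2.0, 3.9, 8.5, 14.9])

mae = mean_absolute_error(y_true, y_pred)              # mean of |y_true - y_pred|
mape = mean_absolute_percentage_error(y_true, y_pred)  # relative version of MAE

print(f"MAE:  {mae:.3f} sec")  # 0.550
print(f"MAPE: {mape:.2%}")
```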

6. Use of machine learning

Given the signal waves, we have to predict the remaining time. The data is generated in a laboratory and the signal is continuous, so this is a time-series-based problem.

Brief intro about the Time Series

  1. Univariate time series: forecasting based on a single variable, e.g., earthquake prediction. To handle univariate time series we have good models like ARIMA, Facebook Prophet, etc.; we discuss these models later in this post.
  2. Multivariate time series: when you have two or more features and predict a variable based on the other features, e.g., predicting temperature from features like date, humidity, and rainfall.

There are different types of models for both kinds of time-series data.

Time-series Models

Our target value is time, which is real-valued, hence this is a regression-based problem.

7. Existing approaches

Solution 1

Observations

  1. They started with how to create new features, plus some EDA.
  2. In the EDA, they found that the maximum remaining time is 16 seconds and that the dataset contains 16 earthquakes.
  3. Modeling: they used the following models:
  • SVR
  • XGBoost
  • Random Forest
  • CatBoost

4. Features: they used rolling-window-based feature engineering.

Solution 2

Observations

  1. This blog was written by the 1st-place holder of the competition.
  2. First, they talked about features and acoustic-signal manipulation.
  • Like the solution above, they created statistical features such as mean, std-dev, max, etc., but the signal had a certain time trend that caused issues, specifically for mean- and quantile-based features.
  • To overcome this, their team added noise values to the acoustic signal to manipulate the data.

3. They applied (i) LGB and (ii) SVR models and achieved a very good (low) MAE.

8. Improvements

The biggest challenge for me in this competition was computational power, because we have 629 million rows and the dataset is roughly 10 GB. To overcome this I used the Dask library to load the data and perform the EDA, feature engineering, and modeling.
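
As an illustration, a minimal Dask sketch for loading the file lazily and computing aggregates out of core might look like this (the dtypes follow the competition data; the file path is an assumption):

```python
import dask.dataframe as dd

# Lazily read the ~10 GB training file; nothing is loaded into memory yet.
train = dd.read_csv(
    "train.csv",
    dtype={"acoustic_data": "int16", "time_to_failure": "float64"},
)

# Aggregations are computed out of core, chunk by chunk.
print(train["acoustic_data"].mean().compute())
print(train["time_to_failure"].max().compute())
```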

I spent more time on the feature engineering part, doing basic statistical feature engineering such as mean, median, max, min, etc., but my MAE and MAPE were not improving much.

Then I came up with one good technique: the spectrogram image. A spectrogram is a visual way of representing the signal strength, or "loudness", of a signal over time at the various frequencies present in a particular waveform.

Spectrograms are basically two-dimensional graphs, with a third dimension represented by color. For more about spectrograms, click here.

Image representation using a spectrogram

Here the X-axis is time and the Y-axis is frequency, so we can use the frequencies as new features. And since a spectrogram is basically an image, we can also use its pixel values as new features; this is the main improvement in this case study.

We also use rolling-window-based features as part of the feature engineering.

9. Exploratory data analysis (EDA) and its observation

EDA helps us understand the data more clearly. With the help of EDA, we can add important features using feature engineering techniques, which helps us decrease the MAE and MAPE scores.

9.1 Load the data and visualize
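
The original snippet is not reproduced here; a minimal sketch for loading a manageable sample with pandas and plotting both columns might look like this (the row count and downsampling step are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read only the first few million rows for a quick look (the full file has ~629M).
sample = pd.read_csv(
    "train.csv",
    nrows=5_000_000,
    dtype={"acoustic_data": "int16", "time_to_failure": "float64"},
)

fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(sample["acoustic_data"].values[::50], color="orange")
ax1.set_ylabel("acoustic_data")
ax2 = ax1.twinx()
ax2.plot(sample["time_to_failure"].values[::50], color="blue")
ax2.set_ylabel("time_to_failure (sec)")
plt.title("Acoustic signal and time to failure")
plt.show()
```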

9.2 Distribution of acoustic data

PDF of acoustic wave

Here we plot the distribution of the acoustic data using only 1% of the data, because the dataset is very large.

Looking at the PDF of the acoustic wave, most points lie in the range -10 to 20, while a few points take very extreme values, roughly -2000 to 4000. We cannot see the distribution clearly at this scale, so let's zoom into the plot.

PDF of acoustic wave

Now we take the acoustic values between -30 and 30, again using only 1% of the total data.

Now we can see the distribution clearly: the acoustic values between -10 and 20 follow a Gaussian/normal distribution. Judging by the tallest histogram bin, the most common acoustic value is about 8.

Q-Q plot of acoustic wave

From the Q-Q plot, we can say that the acoustic values between -2 and 2 follow a Gaussian distribution.
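
A Q-Q plot like this can be produced with scipy's probplot; a minimal sketch, reusing the sample frame from the loading sketch in section 9.1:

```python
import scipy.stats as stats
import matplotlib.pyplot as plt

# Compare a 1% subsample of the acoustic data against a theoretical normal.
subsample = sample["acoustic_data"].sample(frac=0.01, random_state=42)
stats.probplot(subsample, dist="norm", plot=plt)
plt.title("Q-Q plot of acoustic data")
plt.show()
```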

9.3 Distribution of time to failure

PDF and CDF of time to failure

Here we plot the distribution of our target variable, time_to_failure, again using only 1% of the data. From the PDF we can say that very little of the data has a remaining time of more than 15 seconds before the next earthquake.

From the CDF of time_to_failure, we can say that for about 80% of the data, the remaining time before the next earthquake is less than 9 seconds.

9.4 Acoustic data + Time to failure

acoustic data with time to failure

For clear visualization, we use only a small subset of the data in this plot.

Let's see how both variables change over time; the orange line is the acoustic data and the blue one is the time to failure.

We can see a repeating pattern in the acoustic wave: wherever the signal reaches its maximum values, that is an earthquake point, so we have a total of 16 earthquake signals. Alongside the signal wave, we also plot the remaining time before the next earthquake.

Every earthquake is preceded by some remaining time until the next one; the shortest time to failure is 1.5 seconds and the longest is around 16 seconds.

9.5 Test Data

Test data distribution and simple plot

In the test data we have separate CSV files, each containing 150,000 acoustic values (a single column). Here I have plotted one sample file for understanding, along with its distribution plot; the distribution is very peaked. The second plot shows all the acoustic values of that file.

9.6 Dickey-Fuller Tests

Our data is a time series, and to determine whether it is stationary or non-stationary we have to perform a test.

There are many tests for checking whether given data is stationary or non-stationary; one of the most popular is the Dickey-Fuller test, also called the Augmented Dickey-Fuller (ADF) test.

A time series is said to be stationary if its statistical properties do not change over time, i.e., its mean and variance are constant. If its mean and variance change over time, the data is non-stationary.

Given the plot of time_to_failure, we can say it looks stationary, but how can we prove this?

This is where the ADF test comes into the picture. As with any statistical test, we have a null hypothesis H0 and an alternative hypothesis H1.

In the ADF test, the null hypothesis H0 is that the time series is non-stationary, and the alternative hypothesis H1 is that the time series is stationary.

To perform the ADF test we use the adfuller function, which returns three key values: (1) the ADF statistic, (2) the p-value, and (3) the critical values.

If the ADF statistic is less than the critical values, we reject H0, which means the time series is stationary; if the ADF statistic is greater than the critical values, we fail to reject H0, which means the time series is non-stationary.

The same conclusion can be drawn from the p-value: if the p-value is very small, reject H0; if it is high, fail to reject H0.

  • Code snippet:
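
(The original snippet is not reproduced here; below is a minimal sketch using statsmodels' adfuller, reusing the sample frame from section 9.1 and thinning it, since the test is slow on very long series.)

```python
from statsmodels.tsa.stattools import adfuller

# Thin the target series before testing; running on all 629M rows is impractical.
series = sample["time_to_failure"].values[::1000]

adf_stat, p_value, _, _, critical_values, _ = adfuller(series)

print(f"ADF statistic: {adf_stat:.4f}")
print(f"p-value:       {p_value:.4f}")
for level, value in critical_values.items():
    print(f"Critical value ({level}): {value:.4f}")

# Small p-value => reject H0 (non-stationarity) => series looks stationary.
if p_value < 0.05:
    print("Reject H0: the series looks stationary.")
else:
    print("Fail to reject H0: the series looks non-stationary.")
```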

10. My first-cut approach to the problem

I decided to create new statistical features such as mean, median, quantile values, max, min, etc. Our training set contains 629 million rows and each test file contains 150k rows, so I split the training data into 150k-row batches and generated the mean, median, max, etc. from each batch across the whole dataset.

Further, to create more features I used the rolling-window method (see the sketch below). A rolling window means we choose a window size, slide it over the data, and compute features within each window. For example, with a window size of 10 we compute the mean over every 10-row span starting from the first sample; from these rolling values we create the same statistical features as above.
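
A minimal sketch of the rolling-window features, assuming the per-batch layout described above (the window size and feature set here are illustrative):

```python
import pandas as pd

def rolling_features(segment: pd.Series, window: int = 10) -> dict:
    """Statistical features computed over a rolling mean of one 150k-row batch."""
    rolled = segment.rolling(window).mean().dropna()
    return {
        f"roll{window}_mean": rolled.mean(),
        f"roll{window}_std":  rolled.std(),
        f"roll{window}_max":  rolled.max(),
        f"roll{window}_min":  rolled.min(),
        f"roll{window}_q95":  rolled.quantile(0.95),
    }

# Hypothetical usage on one 150,000-row batch of acoustic data:
# feats = rolling_features(batch["acoustic_data"], window=10)
```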

Then I decided to experiment with all of the models below, with hyperparameter tuning (a tuning sketch follows this list):

KNN Regression, Linear Regression, Random Forest regressor, XGBoost regressor.

With simple statistical features,
  • it seems that RF, XGB + RF regressor, and XGB perform well.
  • So now our task is to come up with new, meaningful features using feature engineering techniques that will help us improve our MAE and MAPE scores.
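
A minimal tuning sketch for the XGBoost regressor; the grid values here are illustrative, not the ones actually used in the project:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# X, y: per-batch feature matrix and time_to_failure targets built above.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    scoring="neg_mean_absolute_error",  # optimize MAE directly
    cv=3,
)
# search.fit(X, y)
# print(search.best_params_, -search.best_score_)
```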

11. Feature engineering

Here we create new features while also keeping the old ones, producing three new feature-engineered datasets that help predict the time to failure.

11.1 Spectrogram image-based feature

Using the spectrogram, we convert each 150k-row batch of data into a spectrogram image and take the top 500 pixel values as new features (a sketch follows below).

Now we have the simple statistical features plus 500 pixel values as features; this is our 1st feature-engineered dataset.
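
A minimal sketch of how such pixel features might be extracted with scipy; the sampling rate and spectrogram parameters are assumptions, not the exact values used in the project:

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_features(segment: np.ndarray, n_pixels: int = 500) -> np.ndarray:
    """Return the top-n spectrogram pixel values of one 150k-value batch."""
    # fs ~ 4 MHz approximates the acoustic sampling rate; for pixel-ranking
    # features the exact value does not matter much.
    freqs, times, Sxx = spectrogram(segment, fs=4_000_000)
    pixels = np.sort(Sxx.ravel())[::-1]  # pixel values, largest first
    return pixels[:n_pixels]

# Hypothetical usage on one 150,000-value acoustic batch:
# feats = spectrogram_features(batch["acoustic_data"].values, n_pixels=500)
```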

11.2 Spectrogram image-based feature with 200 pixel

Here we follow the same approach as above, but instead of 500 pixel values I take 200, and I use a richer set of statistical features instead of the simple ones; this is our 2nd dataset.

11.3 Frequencies based feature

As we know, the spectrogram computation returns three arrays, one of which holds the frequencies. We use all the frequencies as new features and combine them with the statistical features; this is our 3rd dataset.

12. Modeling and comparison

  1. Now we have three different datasets, and on each of them I applied different types of models; our goal is to predict the time to failure.
  2. In this section, we also want to understand how our newly created features help improve our MAE and MAPE.
  3. I applied KNN, LR, RF, and XGB models with hyperparameter tuning, and I also tried stacking models.
Comparison of all Models vs All datasets

Observation:

  • Looking at the results, I can say that the 500-pixel image features help a lot in reducing the MAE.
  • XGBoost and Random Forest perform best; of the two, we choose XGBoost as our final model, with statistical + 500-pixel features as the best feature engineering (a stacking sketch follows below).
  • The code is quite long, so click here for the full code.
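
As one illustration of the stacking experiments mentioned above, a sketch using scikit-learn's StackingRegressor (the exact configuration in the project may differ):

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("rf", RandomForestRegressor(n_estimators=300)),
        ("xgb", XGBRegressor(objective="reg:squarederror")),
    ],
    final_estimator=LinearRegression(),  # meta-model over out-of-fold predictions
    cv=3,
)
# stack.fit(X_train, y_train)
# predictions = stack.predict(X_test)
```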

13. Kaggle submission score:

Submission Score

I got an MAE score of 2.489, which is in the top 5% on the leaderboard of the challenge.

14. A final pipeline of the problem

  • Input: a .csv file containing acoustic signal values.
  • Output: the remaining time until the next earthquake, in seconds.
Output

Model deployed on a local box (video).

The model is also deployed on Streamlit:

https://share.streamlit.io/shivambaldha/earthquake-prediction/main/app.py
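
A hypothetical sketch of what the deployed app might look like; featurize and model are stand-ins for the project's feature pipeline and saved final model, not real names from the repo:

```python
import pandas as pd
import streamlit as st

st.title("LANL Earthquake Prediction")

uploaded = st.file_uploader("Upload a 150,000-row acoustic .csv", type="csv")
if uploaded is not None:
    segment = pd.read_csv(uploaded)["acoustic_data"]
    # featurize() and model are assumed to be defined/loaded elsewhere,
    # e.g. the statistical + 500-pixel features and the tuned XGBoost model.
    features = featurize(segment)
    prediction = model.predict([features])[0]
    st.write(f"Remaining time to next earthquake: {prediction:.2f} sec")
```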

15. Conclusion:

  1. As we saw in the first-cut solution, we have only one column, so we have to generate more features to predict the target values.
  2. We saw that new features based on statistical values (e.g., mean, median, percentile values) helped a lot in improving our MAE and MAPE.
  3. To improve the MAE and MAPE further, we came up with new image- and frequency-based features.
  4. Proper hyperparameter tuning also plays an important role in improving the MAE.
  5. We can say that ensemble models, such as gradient-boosted regressors, work best for this problem.

16. Future Work:

  1. Adding new features to the model, and focusing more on the spectrogram image features, could improve the MAE score.
  2. If we could use a univariate time-series model to create more features, it would help improve the score.
  3. Deep learning techniques could also improve the MAE score.

17. References:

  1. https://www.appliedaicourse.com/
  2. https://www.pnas.org/doi/10.1073/pnas.2011362118
  3. https://medium.com/@saivenkat_/a-detailed-case-study-on-lanl-earthquake-prediction-using-machine-learning-algorithms-beginner-to-9b38ef270887
  4. https://medium.com/@ph_singer/1st-place-in-kaggle-lanl-earthquakeprediction-competition-15a1137c2457
  5. https://www.youtube.com/watch?time_continue=228&v=s8Q_orF4tcI&feature=emb_title
  6. https://www.analyticsvidhya.com/blog/2019/12/6-powerful-feature-engineering-techniques-time-series/
  7. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7435601/ — this article explains various types of features, one of which is statistical features.
  8. https://www.kaggle.com/code/thebrownviking20/everything-you-can-do-with-a-time-series/notebook — this notebook explains all types of time-series techniques.
  9. https://share.streamlit.io/ — used to deploy the model.

18. GitHub Repo

Linkedin Profile

That’s all…
