Figure 4: Correlation matrix for the spinning band distillation column data. Each entry shows the positive or negative linear correlation between the parameter on the left (row) and the parameter at the top (column).
The correlation matrix indicates a strong linear relationship between the temperature measurements and the heater power, which is expected for a distillation column (1). There is also a strong relationship between the distillate and bottom product mass (2), but these parameters are not considered for the pressure drop forecast because experience shows that they have no significant impact on the pressure drop in the column. The same argument applies to the liquid level in the bottom (3). Of the temperature measurements, the temperature in the head of the column is retained as a feature because it contains information on the boiling point of the volatile component and the current concentration. Pressure drop is kept as a feature because it describes the recent pressure drop trend, which can be useful for the forecast. The remaining parameters show no strong linear relationship with one another. Furthermore, experience and physical relationships show that the liquid hold-up directly influences the pressure drop in the distillation column. Thus, the remaining parameters are selected as features as well. In total, six parameters (pressure drop, column head temperature, band rotation speed, heater power, feed flow, and reflux ratio) are selected and used to model the forecast.
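The screening step described above can be illustrated with a minimal sketch that computes a Pearson correlation matrix and lists strongly correlated parameter pairs. The column names, the synthetic data, and the threshold of 0.8 are assumptions made for illustration only; they are not taken from the recorded runs or from the selection procedure used in this work.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for every column pair with |Pearson r| >= threshold.

    The threshold value is an illustrative assumption.
    """
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Synthetic stand-in data (hypothetical column names and values).
rng = np.random.default_rng(0)
n = 500
heater_power = rng.normal(100.0, 10.0, n)
df = pd.DataFrame({
    "heater_power": heater_power,
    # Temperature constructed to depend linearly on heater power.
    "bottom_temperature": 0.5 * heater_power + rng.normal(0.0, 1.0, n),
    # Independent parameter with no linear relationship to the others.
    "feed_flow": rng.normal(5.0, 1.0, n),
})

pairs = strongly_correlated_pairs(df)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Such a list only flags linear redundancy; the decision to keep or drop a parameter still relies on process knowledge, as argued above.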
The clustering step will be performed on the pressure drop data alone to identify flooding behavior in the distillation column. The pressure drop is preprocessed and transformed as described for the forecast problem in order to maintain the same data structure and facilitate the implementation with live data. Time series data can typically be decomposed into four components: trend, level, seasonality, and noise. To ensure good visualization and interpretability of the resulting clusters, two of these are chosen as features for the clustering process. Because flooding does not occur at regular intervals (seasonality) and noise has been reduced by means of EWMA, trend and level should contain the information needed to identify meaningful clusters and are therefore chosen as features.
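As a minimal sketch of how level and trend could be extracted from the smoothed pressure drop signal, the following example applies a recursive EWMA and then computes, for each sliding window, the mean (level) and the slope of a least-squares line (trend). The smoothing factor, the window length, and the definitions of level and trend as window mean and slope are assumptions for illustration; the exact feature definitions used in this work may differ.

```python
import numpy as np


def ewma(x, alpha=0.3):
    """Exponentially weighted moving average (recursive form).

    alpha is an assumed smoothing factor, not a value from this work.
    """
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out


def level_and_trend(x, window=10):
    """Return an array of shape (n_windows, 2) with columns (level, trend).

    level: mean of the window; trend: slope of a first-order least-squares fit.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < window:
        raise ValueError("series is shorter than the window")
    t = np.arange(window)
    feats = []
    for start in range(len(x) - window + 1):
        seg = x[start:start + window]
        slope, _intercept = np.polyfit(t, seg, 1)
        feats.append((seg.mean(), slope))
    return np.array(feats)


# Synthetic pressure drop signal: steady operation followed by a sharp rise.
rng = np.random.default_rng(1)
signal = np.concatenate([np.full(60, 2.0), 2.0 + 0.2 * np.arange(40)])
signal = signal + rng.normal(0.0, 0.05, signal.size)

features = level_and_trend(ewma(signal), window=10)
print(features[:3])   # low level, near-zero trend
print(features[-3:])  # higher level, clearly positive trend
```

The resulting two-column array can be passed to any clustering algorithm and plotted directly in two dimensions, which is the motivation for restricting the clustering to two features.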

Model training and validation

Data from the spinning band distillation column is acquired at intervals of one second. Because flooding happens abruptly, it is important to maintain this sampling frequency despite the large amount of data that is collected. Therefore, scalable and computationally inexpensive models based on regression trees, which are explained in more detail in section 1.1, are prioritized in the scope of this work. These bagging and boosting methods will be used with regression trees as base estimators for the pressure drop forecast, and their performance will be compared using two metrics: the root mean squared error (RMSE) and the coefficient of determination (R²). Additionally, linear regression will be applied to the pressure drop forecast to serve as a reference model.
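The comparison described above can be sketched as follows. The example uses scikit-learn's `RandomForestRegressor` as a bagging-type model and `GradientBoostingRegressor` as a boosting-type model; these specific estimators, their settings, and the synthetic single-output data are assumptions chosen for illustration and do not reproduce the models or results of this work.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def compare_models(X_train, y_train, X_test, y_test, seed=0):
    """Fit a reference model and two tree ensembles; return RMSE and R² per model."""
    models = {
        "linear_regression": LinearRegression(),  # reference model
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        scores[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}
    return scores


# Synthetic stand-in data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.05, 400)

scores = compare_models(X[:300], y[:300], X[300:], y[300:])
for name, s in scores.items():
    print(f"{name}: RMSE = {s['rmse']:.3f}, R² = {s['r2']:.3f}")
```

Evaluating all models on the same held-out data with the same two metrics keeps the comparison with the linear reference model direct.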
The window and response sizes are determined via a grid search using a representative regression model (random forest regression) with the default settings of the scikit-learn library in Python. The investigated window sizes range from 5 to 20 s and the response sizes from 15 to 30 s. The goal is to use a small window size, which keeps the amount of data produced by the transformation small (Figure 3), and a large response size for a long forecast horizon, while maintaining good prediction accuracy (R² > 0.95). The training data consists of 8 recorded distillation runs and the test data of 2, which corresponds to 54948 and 9884 measurements, respectively. The resulting accuracies for the different window and response sizes are given in Table 3 in the form of RMSE and R².
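A minimal sketch of such a grid search is given below. It assumes that the window is the number of most recent samples used as input and that the response is the number of subsequent samples to be predicted, so that at a sampling interval of one second both sizes correspond to seconds. This interpretation, the univariate synthetic signal, and the helper names `make_windows` and `grid_search` are illustrative assumptions; the actual transformation uses several features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def make_windows(series, window, response):
    """Turn a 1-D series into (X, Y) pairs of past windows and future responses."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window - response + 1
    if n <= 0:
        raise ValueError("series too short for the given window and response")
    X = np.array([series[i:i + window] for i in range(n)])
    Y = np.array([series[i + window:i + window + response] for i in range(n)])
    return X, Y


def grid_search(train, test, window_sizes, response_sizes, seed=0):
    """Return {(window, response): (rmse, r2)} for a default random forest."""
    results = {}
    for w in window_sizes:
        for r in response_sizes:
            X_tr, Y_tr = make_windows(train, w, r)
            X_te, Y_te = make_windows(test, w, r)
            # Default settings apart from a fixed seed for reproducibility.
            model = RandomForestRegressor(random_state=seed)
            model.fit(X_tr, Y_tr)
            pred = model.predict(X_te)
            rmse = float(np.sqrt(mean_squared_error(Y_te, pred)))
            results[(w, r)] = (rmse, float(r2_score(Y_te, pred)))
    return results


# Synthetic stand-in signals for one training run and one test run.
t = np.arange(300)
train_run = np.sin(0.1 * t)
test_run = np.sin(0.1 * t[:150] + 0.5)

results = grid_search(train_run, test_run, window_sizes=[5, 10], response_sizes=[15])
for (w, r), (rmse, r2) in results.items():
    print(f"window = {w} s, response = {r} s: RMSE = {rmse:.3f}, R² = {r2:.3f}")
```

With several recorded runs, the windows would be built per run and then concatenated so that no window spans the boundary between two runs.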
Table 3: Root mean squared error (RMSE) and coefficient of determination (R²) for different window and response sizes using random forest regression.