Pattern Based Multivariable Regression using Deep Learning (PBMR-DP) CVPR Proceedings
Abstract
We propose a deep learning methodology for multivariable regression that is based on pattern recognition that triggers fast learning over sensor data. We used a conversion of sensors-to-image which enables us to take advantage of Computer Vision architectures and training processes. In addition to this data preparation methodology, we explore the use of state-of-the-art architectures to generate regression outputs to predict agricultural crop continuous yield information. Finally, we compare with some of the top models reported in MLCAS2021. We found that using a straightforward training process, we were able to accomplish an MAE of 4.394, RMSE of 5.945, and of 0.861.
1 Introduction
In the recent years, machine learning algorithms have been improving dramatically in different areas. Unsupervised methods have been incorporated in the deep learning field to solve image-based problems, sound, and text. We also notice that neural network architectures have changed and consequently, they have changed the training process. Some works have also tried to make changes into the backbone network [16] to achieve better results. But sometimes, the innovation blinds some improvement in promising ideas that were not developed to a higher potential. Here, we present our work that combines state-of-the-art image architecture and regression.
Inspired by the data provided in [13], a sensor dataset containing information of multiple sensors with time-stamp. We decided to take a different approach and explore the conversion of this dataset into images (Section 3.1). This conversion opens the doors of Computer Vision (CV) models for tabular data. First, we explored the conversion of sensor data into an accurate image-like data, and then make changes in the neural network architecture as common CV architectures do not tend to give regression as output which was the case for our model. This allows us to perform multivariable regression as in [1] which is pattern-driven instead of data-driven.

1.1 Contribution

In this work, we present two major contributions. The first one is constructing sensors-to-image conversion in which tabular data can be represented as an image. This facilitates the use of modern CV architectures. Secondly, using these sensors-to-image samples to predict continuous crop yield values.
2 Related Works
We did not want to base our architecture on long short-term memory (LSTM), which usually takes a lot of resources to perform the training process and hence compelled us towards using images. This led us to do exploration over methods that involved images and regression. To get started, we explored the idea around image age detector, which affirmed our concerns. Work done in [11] deals with the creation of two Convolutional Neural Networks (CNNs), one to predict gender and another for age prediction with a classifier instead of a regressor. In practice, there is not much done in terms of having a regression output from an image-based model.
Finding that many approaches to what, in our knowledge, are regression problems have in common the characteristics of converting it to a classification problem led us to explore other fields. We started by looking at [4], in which they work on a network able to predict the rotation angle of given images. A similar idea can be seen in [9], which shows a CNN regression framework for predicting 3D pose estimation.
In another hand, we explore the conversion of sensor data into images such as [18]. The data was also serialized in such work and represented different factors that we did not deal with. Therefore, their conversion was more complex than in this work, but the idea of generating these images is viable. The melspectogram generates images using the Librosa[10] package, allowing classification of sounds based on patterns. Visualizing sound as an Image [14, 3] with DNNs improves accuracy and reduces computational requirements from classical methods of event or pattern recognition [7]. Proving that the translation from another medium to image has worked in the past.
The use of CNNs in image classification has become the day’s standard. The image classification revolution began with the use of AlexNet [8]. The inception models are carefully customized multi-branch architectures with carefully designed branches. ResNet [5], ResNeXt [17], and EfficientNet [15] are some examples of modern architectures.
Time series data becomes complex when the number of sensors and the frequency of data recording increases. The current solution is regression to find the best fit based on the multivariable data. Early proposed solutions require the conversion and generation of custom CNN like a 2 stage CNN proposed in [2]. The usage of detecting patterns requires much pre-processing with feature engineering. The process is time-consuming and will require extensive study of the correlation of each input date with the training data.
Performance | |||||||||
---|---|---|---|---|---|---|---|---|---|
Models | MAE | RMSE | |||||||
SGD | Adam | LARS | SGD | Adam | LARS | SGD | Adam | LARS | |
ResNet 50 | 4.529 | 5.496 | 4.644 | 5.963 | 7.258 | 6.266 | 0.849 | 0.792 | 0.845 |
EffificientNet B0 | 5.535 | 5.232 | 6.577 | 7.312 | 6.958 | 8.586 | 0.789 | 0.809 | 0.709 |
ResNeXt50 | 4.394 | 5.371 | 5.191 | 5.945 | 7.118 | 6.889 | 0.861 | 0.799 | 0.812 |
Competition Teams | Model approaches | Performance | ||
---|---|---|---|---|
MAE | RMSE | |||
QU(exp006) | Statistical Modelling | 4.41 | 5.89 | 0.87 |
CUFE | ensemble Regression | 4.42 | 5.95 | 0.86 |
Star | M/4* 1D CNN with Ensemble | 4.47 | 5.95 | 0.86 |
Elendil | M/7 * 1D CNN with Ensemble 5 | 4.47 | 5.95 | 0.86 |
AA2 | XgBoost | 4.6 | 6.15 | 0.85 |
PBMR-DP | ResNeXt 50 - SGD | 4.39 | 5.94 | 0.86 |
3 Method
In this section, we will explore the input pipeline, architecture design, and our approach to utilize the feature learning ability of DNNs to solve multivariable regression problems.
3.1 Input Data
Our dataset is based on temporal data, which is computed in real-time. It can be noisy due to the different measuring speeds of the dataloggers [6] or the sensors’ measurement of the values themselves. The initial assumption is that all the data is measured over the same time-space, corrected, or spread to a fixed tabular form. Sensor data, in particular, is considered as the ranges for sensors are absolute, ensuring that on normalization stage in pre-processing values are between 0 and 1.
The Soybean Crop Yield dataset found in the MLCAS2021 challenge is composed of 93000 samples over 214 days (1 crop season) with seven sensor readings, each pointing to a Single Crop Yield (y). There is also some additional information such as genotype ID, location ID, and year for each sample. This additional information is also normalized and treated like a sensor. Therefore, it is used as one of the rows in the input data after pre-processing.
3.2 Pre-processing
Before feeding machine learning models with data, we must pre-process the original data and statistically analyze it extensively before using them as input data. This process is time-consuming and requires human and computer resources to verify the correlation of the data to the output it is being trained with. Our process is different since we convert tabular data into images. The input data is arranged in the sensor data format as rows with time along the y-axis. Unlike most image processing steps in CNNs, we apply a Row Normalization technique. Each row is normalized based on the absolute range of the sensors Eq. 1. This makes sure the final table generated contains values between 0 and 1.
(1) |
where is the normalized data point at positions . The values in represent the original tabular data in which represents the row (our sensor), and the time in our dataset. In addition, and represent absolute minimum and maximum values of sensor where is the set of all the sensors.
Our data preparation method from tabular data explained above allows it to be fed directly to CNNs without major modifications to the architecture. The tabular data must be across a common measurement axis, such as time series or measured at the same interval. If any values are missing in the tabular data, we will use the immediate past data to fill the missing blank in the table. This property of time series data helps ensure noise is reduced to a minimum in the input data. The generated tabular data is normalized row-wise based on the absolute range of the measured variable (sensor). Fig. 1 shows how the data can be visualized with patterns.
Regression Analysis Techniques | Performance | ||
---|---|---|---|
MAE | RMSE | ||
Linear Regression | 6.100 | 8.121 | 0.740 |
Elastic Net | 9.103 | 11.548 | 0.471 |
LASSO | 9.987 | 12.790 | 0.363 |
SVR-RBF | 5.976 | 7.875 | 0.758 |
Stacked-LSTM | 5.484 | 7.276 | 0.792 |
Temporal Attention | 5.441 | 7.239 | 0.795 |
PBMR-DP | 4.394 | 5.945 | 0.861 |
3.3 Model Input
The data generated explain in Sec. 3.2 is similar to how an image is usually fed into a ConvNet as a 3D array. We will use the same ideology to directly generate (in this particular case) a 3D data array in the range 0 and 1. The data is normalized specifically to each row and not batch normalized for the entire slice. Normalization is performed since each row is sensor data over time with absolute ranges. Ex. Sensor A with a range of 0 - 100 and sensor B with a range of -1 to 25, requires different normalization. Row-based normalization will not affect the model or the output in any sense as the model is blind to how the data was generated. On testing using a batch normalization method with unique time-series data, sensors with very small ranges were found to have limited or low impact on the final results.
The generated data (Fig. 1) is fed into the models to look for features and patterns instead of solving for the values. This approach allows us to maximize the learning ability of neural networks instead of trying to solve the best fit method. The slow trial and error of assigning a range of values to a pattern seen or observed by the model instead of solving the best equation for a set of time-based variables.
3.4 Architecture Design
The model relies on the feature learning/pattern recognition of CNNs. This characteristic is heavily used in classification models. The idea is to modify a few layers to convert them into a regression pattern model, which outputs a single regression yield output instead of class probability with softmax. The base architecture can be found in Fig. 2.
Instead of classification, we introduce an Adaptive Concat Pool layer right after the feature learning layers to understand regression data. Adaptive Concat Pool combines the Adaptive Average Pool and Adaptive Max Pooling layers defined in the PyTorch framework. This custom layer allows us to convert the problem into a FCN approach to the regression values. The use of DNNs with different optimizers and fixed hyper tuning allows us to maximize the results. These adjustments that followed the state-of-the-art architectures create a single output for each 3D input.
Bellow we describe the three architectures used in this work. As mentioned before we focused in ResNets, EfficientNets, and ResNeXt.
ResNet: The addition of shortcut connections in each residual block enables gradient flow directly to the bottom layers. ResNet[5] allows for extremely deep structures for state-of-the-art object detection performance, which is used as the baseline model for the entire approach of using 3D data in regression. Initial use case with default parameters from PyTorch models shows comparable performance and results to current solutions in the domain of Yield Estimation. The version ResNet50 was used in our experiments.
EfficientNet: To demonstrate the effectiveness of scaling on both depth and resolution aspects of the existing CovNet model, a new and more mobile size baseline was designed called EfficientNet[15]. The Neural Architecture was focused on optimizing the accuracy and FLOPs required to detect the same images. The base version EfficientNet b0 was used in our experiment.
ResNeXt: In addition to the dimensions of depth and width of ConvNet, the paper introduces ”Cardinality”, a definition for the size of transformations. Allows controlling the ”Network-in-Neuron” to approach optimal results in the process. Makes significant accuracy improvements on Popular ConvNets hence named as ResNeXt [17]. The version ResNeXt50 was used in our experiments.
3.5 Reduced Feature Engineering
As explained in Sec. 3.2, the direct conversion of sensor values to the floating-point between 0 and 1 allows us full data retention. There is no approximation or wrong detection since we have no data loss during translation (normalization). Using the property of Translational invariance and Translational equivariance, we allow the models to learn from the patterns in the feature learning stage of the model. The Auto-learning ability of CNN models allows us to eliminate the need for the entire process of feature engineering, such as correlation analysis and Principal Component Analysis (PCA).

4 Experiment
In the following section, the proposed data usage approach is evaluated with different state-of-the-art machine vision models. An ML tool chain was created to perform continuous tests in similar data settings and hardware setup. We conducted an ablation experiment on Crop Yield Regression Task [13]. It is a multivariable regression problem with 7 daily variables measured over a fixed time period of 214 days. The models where run in a Intel i9-10900k CPU with 128 GB 2666MHz RAM and NVIDIA RTX 3090 with 24 GB VRAM. The data set produced image size of 214x7 which allowed to run multiple models simultaneously to produce maximum results.
Throughout the experiments, the learning rate is set to with a batch size of 128, 1,000 epochs and the loss after trial and error was fixed to MSEloss or L1loss. The modeling was programmed in python 3.8 using the PyTorch framework [12]. We follow [17, 5, 15] to construct the Feature learning stage of the models (depth). The pooling layer is modified to a custom Adaptive Concat Layer with Fully connected layers pointed to a single output.
4.1 Experiments on Crop Yield Dataset
The extensive samples of the crop yield with 93,000 samples allow the model to learn behaviors very well. The data consists of 7 weather variables, namely Average Direct Normal Irradiance (ADNI), Average Precipitation (AP), Average Relative Humidity (ARH) Maximum Direct Normal Irradiance (MDNI), Maximum Surface Temperature (MaxSur), Minimum Surface Temperature (MinSur) and Average Surface Temperature (AvgSur). The secondary inputs are also provided for each data point: Maturity group (MG), Genotype ID, State, Year, and Location. Each data frame points to a ground truth which is the yield.

4.2 Performance Metrics
Unlike the accuracy metrics, which are usually associated with classification problems, to define the regression, we used the standard metrics such as Mean Average Error (MAE), Root Mean Square Error (RMSE), and to evaluate the performance. The loss function used in the model is MSEloss or L1loss in the PyTorch framework. k-cross-validation is performed to overcome over-fitting of data. Significant improvements are noted in validation datasets. Significant improvements are noted in validation datasets. The data was tested and compared with the same test dataset as the MLCAS2021 competition to keep the results and metrics constant and form a common comparison baseline.
Figures 3-5 show the performance metrics of the top three models conducted on the crop yield data set with the proposed architecture. In Figure 3, we see that Efficient Net b0 as designed learns faster, but as the model is not deep enough, it saturates after 400 epochs. Both ResNet and ResNeXt learn slower but restarts the learning process at each k-fold change.
5 Results and Discussion

Comparison with different models: Table 1 shows the results gathered when comparing the different networks with different optimizers. Here we explore Stochastic Gradient Descent, Adam Optimizer, and LARS with the same parameters and metrics described in 4. We found that ResNeXt50 with SGD optimizer performed the best in the three different metrics used for this experiment. The second and third best models were ResNet50 with SGD and LARS, respectively. This tells us that for this use case, having an SGD is better during the training process of our network.
Comparing Competition approaches: Table 2 shows the performance of different online teams from the MCLAS Challenge. The best models were shown in the online leaderboard and available publicly for the challenge. Some of these works relied upon heavy statistical analysis and feature engineering in multiplying the number of available features to improve learning parameters for the data. Most of the results involved using ensemble techniques to combine weights generated using different models to get the best results. Our approach is simpler with just the modified DNNs to become a regression model with a custom data loader to convert Real-time data into an image type array. This table shows that our model outperforms the methods in the competition except for one method. We are able to outperform QU(exp006) only in MAE but not in the other metrics. It is noteworthy that we have trained our model without optimizing the hyperparameters as we wanted our solution to work as a general method. Fine tuning hyperparameters would help improve our results.
Comparison with state-of-the-art results: Table 3 shows the crop yield prediction dataset results. Our results prove a dramatic increase in prediction performance with a simple change in how data is used. In addition, our model approach allows for faster data to model regression without the need for analysis of the correlation between the inputs and the output. This table shows the different published works that used our same dataset. We can see that our model outperforms these methods in each selected metrics.
6 Conclusion
This work provides a pattern-based approach for multivariable regression. With our sensor-to-image conversion, we are able to bring computer vision and convolutional neural network techniques to regression tasks. Our method of sensor-to-image conversion is completely lossless. Our experiment with multiple models and different optimizers proves the validity of our method. We have outperformed every classical approach and are at par with the best ensemble methods. In addition, we hope to make a significant impact with tabular data and advance the research even further in these areas.
References
- [1] Evangelos Alexopoulos. Introduction to multivariate regression analysis. Hippokratia, 14:23–8, 12 2010.
- [2] Roy Assaf and Anika Schumann. Explainable deep neural networks for multivariate time series predictions. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6488–6490. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
- [3] Sören Becker, Marcel Ackermann, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. Interpreting and explaining deep neural networks for classification of audio signals, 2019.
- [4] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Image orientation estimation with convolutional networks. In GCPR, 2015.
- [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [6] Jiztom Kavalakkatt Francis. Cloud-Based Multi-Sensor Remote Data Acquisition System for Precision Agriculture (CSR-DAQ). PhD thesis, Iowa State University, 2019. Copyright - Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works; Last updated - 2021-05-22.
- [7] Zvi Kons and Orith Toledo-Ronen. Audio event classification using deep neural networks. In INTERSPEECH, 2013.
- [8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
- [9] Siddharth Mahendran, Haider Ali, and Rene Vidal. 3d pose regression using convolutional neural networks, 2017.
- [10] Brian Mcfee, Colin Raffel, Dawen Liang, Daniel P W Ellis, Matt Mcvicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. PROC. OF THE 14th PYTHON IN SCIENCE CONF, 2015.
- [11] Abdullah M. Abu Nada, Eman Alajrami, Ahemd A. Al-Saqqa, and Samy S. Abu-Naser. Age and gender prediction and validation through single user images using cnn. In Semantic Scholar, 2020.
- [12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- [13] Johnathon Shook, Tryambak Gangopadhyay, Linjiang Wu, Baskar Ganapathysubramanian, Soumik Sarkar, and Asheesh K. Singh. Crop yield prediction integrating genotype and weather variables using deep learning. PLOS ONE, 16(6):e0252402, Jun 2021.
- [14] Sugianto Sugianto and Suyanto Suyanto. Voting-based music genre classification using melspectogram and convolutional neural network. In 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), pages 330–333, 2019.
- [15] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks, 2020.
- [16] Subrahmanyam Vaddi, Dongyoun Kim, Chandan Kumar, Shafqat Shad, and Ali Jannesari. Efficient object detection model for real-time uav application, Jan 2021.
- [17] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks, 2017.
- [18] Chao-Lung Yang, Zhi-Xuan Chen, and Chen-Yi Yang. Sensor classification using convolutional neural network by encoding multivariate time series as two-dimensional colored images. Sensors, 20(1), 2020.