Introduction

Real-time monitoring of seismic activity is crucial for early warning systems, helping to safeguard life and property in high-risk areas by detecting hazardous earthquakes1, volcanic eruptions2 and potential breaches of international law3. Seismic signals from natural earthquakes tend to have low frequencies, explosion events exhibit a broader frequency range, and vehicular movements present resonant signatures. These spectral characteristics can usually be analyzed effectively only after sufficiently long seismic signals have been recorded. Identifying these unique sources is crucial in geophysics, not only for earthquake monitoring but also for applications such as detecting illegal blasting or monitoring traffic-induced ground vibrations. With the rise of the Internet of Things (IoT)4, there is an increasing demand to deploy numerous monitoring nodes driven by artificial intelligence (AI) in areas of concern, targeting low cost and low latency.

The detection of seismic events has been extensively studied5,6,7,8,9,10, and many machine learning models perform well for seismic analysis. However, most struggle to meet cost-efficiency demands because they require large computational resources and a significant number of parameters. Popular methods fall into two main categories: classical machine learning and deep learning. Classical machine learning models, such as E3WS1 and 2S-ML-EIOS11, rely heavily on manually defined features, are sensitive to noisy seismic data, and require substantial real-time computing operations. Specifically, XGBoost, a key decision-tree-based component of both E3WS and 2S-ML-EIOS, relies heavily on multi-threaded CPUs for parallel computing. On the other hand, deep learning models based on neural networks offer resilience to noise in seismic waveforms but often have a large number of parameters, requiring significant computational resources. Some deep learning models use high-order features, such as spectral images extracted from waveforms, as input during inference5,12,13,14,15,16, while others use processed multi-component waveforms directly17,18.

One prevalent challenge for these deep learning models is their substantial number of parameters. For example, the Vision Transformer has approximately 24 million parameters19. Another challenge lies in achieving real-time performance, as most open-source solutions require long input waveforms (tens of seconds to minutes) after a seismic event, making them more suitable for reviewing historical data than for real-time monitoring13,16,20. While these models may perform well in situations where all information is known beforehand, real-time monitoring missions face the challenge of an unknowable future. Consequently, making timely judgments based on just a few seconds of data more closely aligns with the demands of such scenarios.

Large deep learning models have attracted significant interest for various tasks, including computer vision21,22 and language understanding23,24. These large models consist of millions to billions of parameters and require high-performance computers equipped with expensive GPUs10,20,25. Many models consume considerable computational resources (tens of gigabytes memory) and electric energy during both training and operation. Consequently, they thrive on high-performance personal computers and cloud-computing platforms that demand kilowatt to megawatt power26.

In contrast, battery-powered devices have received less attention globally, yet billions of low-power (milliwatts to watts) devices such as microcontroller units (MCUs) serve various aspects of our lives, including smart agriculture27, industry28, environment29, healthcare30, and consumer electronics31. Local machine-learning operations on IoT devices offer advantages in latency, power consumption, and privacy compared to big data analysis on cloud platforms. However, many IoT devices, especially microcontrollers, have significantly limited memory (measured in kilobytes) and CPU processing capabilities (in megahertz). As a result, these devices are frequently used as both real-time data collectors and preliminary analysis platforms while also serving as intermediary nodes for transmitting control signals to actuators. The demand for computational resources and the large number of parameters in conventional deep learning models substantially limit AI applications to current IoT devices.

Here, we demonstrate the potential of tiny deep learning on MCUs for real-time seismic source discrimination. Our model discriminates among ambient noise, quarry blasts (i.e., explosions), and natural earthquakes using just the vertical component of seconds-long raw waveforms. We first propose an evolutionary deep learning method that constructs a population of lightweight neural networks balancing prediction error, model size, and inference time. The final evolved model, named MCU-Quake, contains only 2693 parameters, approximately 0.09% of the parameters in the Vision Transformer model19. We evaluate MCU-Quake on various datasets alongside edge-computing-friendly alternatives, including the cutting-edge E3WS1 (a classical machine learning model), LEQNet (a deep learning variant of EQTransformer18), and the traditional short-time average/long-time average (STA/LTA) detection method32. MCU-Quake achieves accuracies comparable to large models while demonstrating significant advantages for real-time monitoring on MCUs; notably, LEQNet's inference energy consumption exceeds that of MCU-Quake by over 2700 times. We propose that the integration of AI into IoT devices (AIoT), leveraging compact AI models, could greatly enhance early warning systems for sensitive events. This approach is particularly promising given its potential for low cost, minimal time delay, and reduced energy consumption.

Results

Evolutionary deep learning progress

Given a period of raw 1-dimensional (1D) signals, our proposed method delivers a population of neural networks that have evolved over eight generations. Although machine learning accuracy was an important focus, multi-objective constraints were also emphasized. The detailed results in Fig. 1 show the neural networks evolved using a 7 s period dataset. These signal-encoding networks evolved through minimizing fitness values (Fig. 1a, b), which consist of three decision factors: model size, prediction error, and inference time (Fig. 1c). As indicated by the red scatter in Fig. 1c, the desired solution, known as the Pareto front, contains a group of neural networks that achieve a balance among the three decision factors.

Fig. 1: Evolutionary deep learning progress.

a, b Fitness statistics of neural network individuals for (a) the entire population and (b) the hall-of-fame population (a group of 16 individuals with the lowest fitness values) in the current generation. c 3D visualization of the individuals in a multi-objective space defined by model inference time, number of parameters (size), and prediction error. A lower value is better. d Architecture diagram of the final model (individual) used in this study.

Initially, there was a population of 256 individual neural networks with unique architectures. Architecture parameters, such as the number of different 1D convolution kernels, feedforward layers, corresponding layer sizes, pooling size, and stride step, were all randomly initialized. Correspondingly, the population fitness distribution shape (generation 0 in Fig. 1a) indicates that individual network performance is mainly concentrated around the median. With evolutionary updates of the selected networks' architectures, new individuals displayed increasingly smaller fitness values in subsequent generations. As shown in Fig. 1a, with increasing generations, the wider bottom section of the violin plot represents a higher proportion of individual networks with better signal discrimination abilities. Upon closer examination (Fig. 1b), the distribution of the 16 individual networks with the lowest fitness values in each generation shows that an increasing number of superior networks are generated during the evolution process. Ultimately, a group of neural networks whose performance was not dominated by any other individual (red scatter and surface in Fig. 1c) was mapped into a multi-objective space defined by the aforementioned three decision factors.

In the Pareto-optimal set, an individual born in the fifth generation with an ID of 20 survived to the end. Unexpectedly, this individual uses only one numerical value to differentiate signals from noise, quarry blasts, and natural earthquakes. We name this network MCU-Quake. The architecture of MCU-Quake is shown in Fig. 1d (open-source code available). MCU-Quake is composed of a 1D-convolution block that transposes raw signals into latent codes, and a feedforward network unit that encodes these latent codes as a critical embedding value for specific events. With a memory footprint of approximately 10 KB for its parameters, MCU-Quake can be deployed on most microcontrollers. For brevity, we discuss only MCU-Quake in detail in this study.

Benchmarks on multiple datasets for edge-computing friendly models

Figure 2 summarizes inference metrics for edge-computing-friendly models across multiple datasets. MCU-Quake demonstrates its capability not only to identify signal sources on the UUSS dataset but also to accurately detect natural earthquake arrivals on both the UUSS and STEAD datasets. In contrast, other models such as E3WS, LEQNet, and the STA/LTA method are limited to detecting seismic arrivals.

Fig. 2: Confusion matrices of edge-computing friendly models/methods on various datasets.

The subtitles denote the model/method name, dataset name, and number of waveform components used. The values adjacent to the rows and columns represent calculated inference metrics. A greater value is better. UUSS: University of Utah Seismograph Stations. STEAD: STanford EArthquake Dataset. The UUSS dataset is composed of events including ambient noise (NO), quarry blasts (QB), and natural local earthquakes (LE). Our proposed model, MCU-Quake, demonstrates the ability to detect seismic arrivals and discriminate signal sources, as illustrated in subplots (a) and (b). In contrast, other models only detect seismic arrivals. Subplots (c)–(f) show MCU-Quake's detection performance in distinguishing between noise and seismic signals. The performance metrics of STA/LTA, E3WS, and LEQNet on the UUSS data are presented in subplots (g)–(i), respectively. Similarly, the metrics for these three models on the STEAD data are displayed in subplots (j)–(l), respectively.

Notably, MCU-Quake is the sole deep learning model specifically designed for inference using single-channel (vertical-component) seismic waveforms. For seismic source discrimination on the UUSS dataset (Fig. 2a), MCU-Quake achieves precision scores of 87.34% and 84.18% for quarry blasts and natural earthquakes, respectively, with recall scores of 87.49% and 84.74%. In terms of seismic detection, MCU-Quake attains precision and recall scores of 99.37% and 98.96% on the UUSS dataset (Fig. 2a, d) and 98.02% and 96.44% on the STEAD dataset (Fig. 2c). Conversely, the classic STA/LTA method is known to be heavily affected by noise and the preset threshold, resulting in best recall scores of 81.14% and 86.85% on the UUSS dataset (Fig. 2g) and the STEAD dataset (Fig. 2j), respectively.

Using three channels of raw seismic waveforms, MCU-Quake exhibits nearly identical performance to that obtained using only the vertical component on the UUSS dataset (Fig. 2a, b). For seismic detection, MCU-Quake's performance on the UUSS dataset (part of which was used for training) remains comparable to that observed on the global STEAD dataset. However, E3WS, originally trained on carefully processed waveform features, achieves a low precision score of 81.8% for ambient noise on the UUSS dataset (Fig. 2h) and 92.5% for natural earthquakes on the STEAD dataset (Fig. 2k). In comparison, LEQNet, a lightweight variant of EQTransformer, shows performance similar to MCU-Quake across both datasets, aside from a slightly lower precision score for ambient noise on the UUSS dataset (95.39%, Fig. 2i).

Overall, MCU-Quake achieves seismic detection metrics akin to those of cutting-edge machine learning models (F1-score and accuracy in Table 1). However, MCU-Quake's computational resource requirements are significantly lower than those of other models. The scrolling input signal for MCU-Quake spans a mere 7 s of single-component raw data, compared with the 60 s of highly filtered three-component waveforms required by LEQNet and the 140 manually defined attributes dynamically calculated from 10 s of three-component waveforms for E3WS. Furthermore, MCU-Quake has only 2693 neural-network parameters, approximately 6.8% of those in LEQNet. Consequently, MCU-Quake's memory usage on a 32-bit system is as low as ~10 KB. This substantially reduces hardware costs for MCU-Quake implementation, typically around US$6 for many microcontrollers compared with the roughly US$600 cost of the GPUs required by traditional neural-network models.

Table 1 A summary of major differences among models

Energy efficiency is evaluated for the aforementioned models and methods. All models can be deployed on single-board computers, which generally possess high-performance CPUs and gigabytes of memory. However, only the classic STA/LTA method and MCU-Quake are compatible with microcontrollers featuring limited CPU clock frequencies (megahertz) and memory (kilobytes). As shown in Fig. 3a, MCU-Quake demonstrates computing efficiency comparable to that of the classic STA/LTA method. On the selected single-board computer (Raspberry Pi 4B, left panel in Fig. 3b), the energy consumption for a single input is 334.702 mJ for LEQNet versus 0.121 mJ for MCU-Quake (based on the Gaussian density function). The corresponding inference times are 350.867 ms and 0.224 ms, respectively. The device's idle power is approximately 2.3 W, more than ten times that of the selected microcontrollers (right panel in Fig. 3b). Despite the low CPU clock frequency of the microcontroller (Arduino Nano 33) at only 64 MHz (Table 2), MCU-Quake executes inference within approximately 128 ms. As illustrated in Fig. 3b, with computing resources limited to the scale of microcontrollers, only the classic STA/LTA method and MCU-Quake can facilitate real-time seismic signal detection.

Fig. 3: Computational resource consumption of different models.

a Time efficiency (Eff.), memory usage (Mem.), and key inference metrics on a single-board computer (Raspberry Pi 4B). ACC: accuracy, F1: F1-score, TPR: true positive rate, PPV: positive predictive value. The values are normalized; greater is better. b Energy consumption of models/methods deployed on the single-board computer (left panel) and two microcontrollers (right panel, ESP32-S3). The shaded area represents the delayed response in a real-time scenario, re-scaled for illustration using the energy consumption of a single task (blue bars) of MCU-Quake deployed on the microcontroller. K: inference method based on the kernel density equation, G: inference method based on the Gaussian density function.

Table 2 Measurement of computational resource requirements for models/methods deployed on edge-computing devices

Learned knowledge for discriminating signal sources

The MCU-Quake model learns three numerical values, −5.01, 1.96, and 1.01, from events monitored in the western USA (the UUSS dataset), representing typical ambient noise, quarry blasts (explosions), and natural earthquakes, respectively (Fig. 4a). MCU-Quake generates an embedding value for each input signal it receives. The three typical values are the modes of the corresponding distributions of the UUSS training dataset's embeddings. The source of a given period of signals can be determined by calculating the likelihood of the input embedding using the density functions shown as solid curves in Fig. 4a. Consequently, MCU-Quake discriminates signals by comparing the distance between the input embedding value and these three typical values.
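This likelihood-based rule is straightforward to reproduce. Below is a minimal sketch assuming Gaussian densities centred on the three learned modes; the standard deviations are illustrative placeholders rather than values fitted to the UUSS embeddings.

```python
from scipy.stats import norm

# Modes taken from the text; the scales are illustrative placeholders.
SOURCES = {
    "ambient noise":      norm(loc=-5.01, scale=1.0),
    "quarry blast":       norm(loc=1.96,  scale=1.0),
    "natural earthquake": norm(loc=1.01,  scale=1.0),
}

def classify(embedding: float) -> str:
    """Assign the source whose density gives the input embedding
    the highest likelihood."""
    return max(SOURCES, key=lambda s: SOURCES[s].pdf(embedding))

print(classify(-4.2))   # -> 'ambient noise'
```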

Fig. 4: Knowledge representation of signal sources.

The knowledge representation of signal sources is displayed in (a). Extracted embeddings of the STEAD dataset using MCU-Quake are displayed in (b). Extracted embeddings of the Russia–Ukraine war dataset using MCU-Quake are displayed in (c). The solid curves in (a)–(c) represent the probability densities of learned knowledge representations for signal sources from the UUSS training dataset. The histogram bars depict the extracted embeddings generated by MCU-Quake using the UUSS training dataset, STEAD test dataset, and Russia–Ukraine war events, respectively. The discriminative capacity of MCU-Quake as a function of the signal reception interval on the UUSS test dataset using (d) only vertical-component and (e) three-component waveforms. f A demonstration of real-time inference on a microcontroller. The solid curves colored green, grey, and orange represent prediction scores for ambient noise, quarry blast, and natural earthquake, respectively.

Interestingly, even though MCU-Quake has not been trained on distinct global earthquake events (the STEAD dataset; also see Fig. 5b), all the embeddings of ambient noise and most earthquake signals from the STEAD dataset fall within the learned western-USA distributions (histogram bars in Fig. 4b). On the other hand, events from the Russia–Ukraine war in Europe split into explosion and earthquake distributions (Fig. 4c), with some believed to be false negatives for explosions (see detailed discussion below). The extracted embeddings demonstrate MCU-Quake's generalization ability and provide evidence for its learned knowledge in discriminating signal sources.

Fig. 5: Review of Russia–Ukraine war events.

Daily count of (a) typical explosions and (b) speculative earthquakes identified by MCU-Quake. The maps in (c)–(f) depict event locations, represented as colored scatter points, plotted against date. White triangles indicate the positions of seismic array sensors. In subplots (e) and (f), the shaded areas delineate regions under occupation by Russian troops.

To investigate MCU-Quake's discriminative capacity on the UUSS test dataset, we display embedding distributions for input signals containing increasing coverage of seismic waveforms (Fig. 4d, e). For signals with 0.5 s of seismic waveforms, MCU-Quake cannot distinguish key signal sources, as shown by the overlapping distributions obtained with vertical-component data (Fig. 4d) and the stacked scatter points obtained with three-component data (Fig. 4e). Signal sources become distinguishable for input signals containing at least 2 s of seismic waveforms. As expected, the best discrimination performance is achieved with input signals containing 7 s of seismic waveforms.

During real-time inference, MCU-Quake can detect seismic arrivals within seconds and estimate signal sources using dynamically scrolling input windows (Fig. 4f). In Fig. 4f, the red shaded areas to the left and right denote noise and the continuation of a weak earthquake signal, respectively. MCU-Quake recognizes the signal source as the ground-truth earthquake with high confidence. This demonstrates that, even some time after the seismic arrival, MCU-Quake is able to accurately identify the signal source as an earthquake rather than ambient noise.

Discussion

Interpretability of the model

Using only seconds of the vertical component of raw signal waveforms is an outstanding feature of MCU-Quake. Though seemingly short, this signal interval holds intrinsic significance. Linville et al. (2019) identified the areas in waveform spectrograms that most influence average predictions (Fig. 4 in Linville et al.5). Their findings reveal that the vertical component alone is adequate for discriminating between quarry blasts and local earthquakes, using a time window of under 10 s. Using models hundreds of times larger than MCU-Quake, they achieved an accuracy of 84.7 ± 0.7% with a time window of 9 s. On the other hand, Lara et al. (2023) analyzed the most important features for E3WS over window durations from 3 to 46 s (Fig. 10 in Lara et al.1). In the initial 3 s post-seismic arrival, four out of five manually defined features were derived from the vertical component. Although these features were evaluated for magnitude estimation, they underscore the importance of the vertical component. The MCU-Quake embeddings extracted from the two horizontal components (E and N) further demonstrate the superiority of vertical-component data (see Supplementary Fig. 1).

The interpretability and robustness of model inference are paramount in safety-critical applications such as real-time event surveying and decision making. Unfortunately, most machine learning models operate as black boxes, merely producing values between 0 and 1 with no discernible meaning. The prevalent approach is to establish empirical thresholds for classification. For example, LEQNet and EQTransformer set their respective thresholds at 0.5, 0.3, and 0.3 to classify seismic events, P-waves, and S-waves. While E3WS gains some interpretability from its 140 waveform-derived input features, its performance depends strongly on the trigger threshold. For seismic detection, E3WS attained peak performance (99.9%) using well-processed waveforms at a trigger threshold of 0.211. In this study, however, E3WS achieves commendable performance (90.61%) with raw waveforms at a threshold of 0.8. On the other hand, MCU-Quake employs a contrastive learning approach33,34 to maximize differences between signal sources. By mapping all input signals into distributions (Fig. 4a), MCU-Quake can estimate the likelihood of a signal source using probability density functions. As seen in Fig. 4a, overlapping distributions between explosion embeddings and earthquake embeddings signify indistinguishable seismic events. Such occurrences are common, particularly for low signal-to-noise-ratio events.

Review of Russia–Ukraine war using 1-component seismic array data

MCU-Quake offers a means to review typical seismic events in the field. In Fig. 5, we review the Russia–Ukraine war using seismic array data published by Dando et al.3. Dando et al. (2023) automatically detected and located 1282 explosions from 24 February to 3 November 2022 in a region of northern Ukraine (Fig. 5c–f). However, as noted by Dando et al. (2023), there are numerous false positives due to the low signal-to-noise ratio of events, and some explosions cannot be observed in the waveforms. The local magnitude of most events is only approximately 0.2. Using MCU-Quake, we extract waveform embeddings and map the results in Fig. 4c. Our model detects 1276 events as seismic occurrences, with 371 events identified as typical explosions in northern Ukraine, despite being trained only on quarry-blast events in the western USA. Dando et al. (2023) reported 64 explosions on 7 March, the day of highest activity, and a final heavy bombardment on 31 March. In comparison, MCU-Quake identified 15 typical explosions on 7 March and one explosion event with a local magnitude of 0.68 on 31 March. Although MCU-Quake appears to miss many of the explosions claimed by Dando et al. (2023), this discrepancy may be attributed to substantial differences in coupling and energy propagation between surface and underground explosions. Despite these data limitations, MCU-Quake provides real-time results aligning with the war events over several months.

Fig. 6: Dataset specification.

a Geographical distribution of seismic events from the UUSS dataset (green), STEAD dataset (gradient color), and Russia–Ukraine war events (cyan). b Statistical analysis of dataset features. c–e display raw waveforms used as inputs for MCU-Quake, E3WS, and other commonly employed deep learning models in seismic signal detection. Time intervals devoid of shading denote segments used as seismic input for these models.

Tiny deep learning for edge computing devices

In traditional deep learning applications, variants of long short-term memory (LSTM) units for recurrent neural networks (RNNs) are commonly used for sequence data such as language, audio35, and seismic signals5. However, LSTMs and RNNs demand substantial computational and memory resources due to their operational complexity and numerous internal variable states36,37. Although some attempts have been made to alleviate computational complexity38, the cost of memory-related operations, including attention mechanisms, increases significantly with model size. Reports indicate that GPT-3 models trained on large graphics-processing-unit (GPU) clusters take several months to complete approximately \(3\times 10^{23}\) floating-point operations39. In contrast, even classical image classification models containing millions of parameters consume hundreds of megabytes of memory40.

Tiny machine learning has emerged as a promising research topic. Our evolutionary deep learning method is a form of neural architecture search (NAS). Although NAS typically causes considerable CO2 emissions from training large numbers of large models41, all the tiny models in this study were trained quickly using only CPUs. Conventional neural networks require high memory bandwidth for large data blocks, necessitating expensive specialized hardware such as GPUs or TPUs for large models. However, our tiny machine learning system's data communication overhead on CPUs was significantly smaller than it would be on GPU devices. In contrast to conventional NAS methods that primarily emphasize accuracy metrics, our approach achieves multiple objectives through evolutionary optimization and the application of interpretable constraints. Despite a vast search space exceeding 10 billion possibilities, CPU-based parallel computing offers an affordable solution for our evolutionary deep-learning mission. Our evolutionary computing was conducted on a cloud platform equipped with multiple CPUs. The mission evaluated approximately 2000 individuals in parallel, with each individual taking an average training time of approximately 1.4 min.

The compact design and minimal hardware requirements of MCU-Quake facilitate efficient deployment on low-power, cost-effective MCUs. In contrast, other research has developed hardware-exclusive models on field-programmable gate array (FPGA) devices42. FPGA-based systems are typically far more expensive than their MCU counterparts, often costing tens of times more. MCU-Quake, which is universally compatible across various hardware systems, can provide real-time similarity scores with uncertainties between input signals and various known sources, as demonstrated in Supplementary Movie 1. As summarized in Table 2, an Arm Cortex-M4 device operating at a CPU frequency of 64 MHz conducts inference within approximately 128 ms. The ESP32-S3 device, priced at around US$6, reduces the inference time to approximately 68 ms. On the single-board computer (Raspberry Pi 4 Model B), MCU-Quake is capable of real-time inference in 0.2 ms. Although the single-board computer exhibits significantly superior computing performance compared to MCUs, these edge computing devices possess considerably fewer resources and lower costs than conventional personal computers. Of particular note, the power consumption measurements outlined in Table 2 range from approximately 0.1 to 2.3 W, indicating that all devices can be solar-powered as artificial-intelligence sensors in field deployments.

The proposed method is inspired by the Fourier transformation of time-series signals. While continuous signals in the time domain can be decomposed into sinusoids of specific frequencies in the frequency domain (Supplementary Fig. 2), the main frequencies of natural signals are unknown and complex (Supplementary Fig. 2b). Some studies have used convolutional neural networks (CNNs) to extract key features from spectrum images of time signals5,12,13,14,15,16. Conventional 2D convolution operations are computationally costly, with a complexity of \(\mathrm{O}(N^{2}n^{2})\) for each kernel per layer, where \(N\times N\) is the input sample size and \(n\times n\) is the convolution kernel size. Researchers have integrated Fourier transformation algorithms into neural networks to process features in the frequency domain with a complexity of \(\mathrm{O}(N^{2}\log_{2}N)\) to \(\mathrm{O}(N^{2}\log_{2}n)\)43,44. However, the efficiency gains are negligible when \(N\gg n\), as is the case in CNNs44. More importantly, the computational resources available on MCUs cannot support such intermediate computations for real-time missions. Consequently, we propose a feature extraction method with a complexity of \(\mathrm{O}(Nn)\) that merges key information along the time and depth directions of the feature space (Fig. 7a). Our method increases inference accuracy by approximately 11% on the test dataset compared with conventional convolution operations on time-series data.
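For a sense of scale (illustrative numbers of our own, not measurements from this study), consider a 7 s trace sampled at 100 Hz, giving \(N=700\), and a kernel width of \(n=9\). A per-kernel 2D convolution over an \(N\times N\) spectrogram then costs on the order of

$$N^{2}n^{2}=700^{2}\times 9^{2}\approx 4\times 10^{7}$$

operations, whereas the proposed time-domain extraction costs only \(Nn=700\times 9=6300\) operations per kernel, a reduction of roughly four orders of magnitude.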

Fig. 7: An evolutionary deep learning solution.

a A scalable model architecture prototype designed for embedding raw signals. b A schematic representation of our evolutionary modelling, grounded in contrastive learning techniques and cloud-edge collaborative computing paradigms.

MCU-Quake only extracts embedding values from a fixed period of data, without considering long-term time-dependent and spatial relations in signals. Although our simple machine learning operations show remarkable discrimination performance, more elaborate methods that incorporate temporal and spatial relations between different sensor stations are expected to contribute significantly. The current version of MCU-Quake can only detect seismic arrivals and discriminate signal sources. However, seismic events follow an extremely skewed distribution: the proportion of time occupied by seismic signals over a long period normally does not exceed ~1%45. Consequently, MCU-Quake is particularly suitable as a long-running trigger model that occasionally invokes large models in a cascading fashion (sketched below), achieving maximum capability at optimum energy efficiency46.
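One possible realization of such a cascade is sketched below; `stream`, `tiny_model`, `large_model`, and the trigger threshold are hypothetical placeholders rather than part of MCU-Quake's released interface.

```python
def monitor(stream, tiny_model, large_model, threshold=0.9):
    """Cascaded inference: the tiny model screens every scrolling window,
    and the expensive large model runs only on likely events (~1% of time)."""
    for window in stream:                  # successive 7 s scrolling windows
        score = tiny_model(window)         # cheap, always-on inference
        if score >= threshold:             # window likely contains an event
            yield large_model(window)      # detailed analysis, rarely invoked
```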

The ability to run MCU-Quake on low-cost hardware is a notable achievement, but we believe its lightweight nature also enables broader applicability. Notably, our model can be effectively deployed on older computational devices still in use in some seismic monitoring setups, thereby expanding its impact and utility. This extended compatibility is particularly significant for existing seismic stations with legacy hardware that cannot support more resource-intensive models. By adapting MCU-Quake to these systems, we can enhance their capabilities without requiring substantial upgrades or replacements. Furthermore, integrating our model with other sensor technologies, such as GPS and environmental sensors, could provide a more comprehensive monitoring system. With its high accuracy, efficiency, and versatility, we believe MCU-Quake is an attractive solution for researchers and practitioners across various fields related to earth and environment monitoring.

Methods

Dataset

Three raw waveform datasets are employed: the localized University of Utah Seismograph Stations (UUSS) dataset for modeling and testing, the global STanford EArthquake Dataset (STEAD) for generalization testing, and the Russia–Ukraine war dataset for a case study. For comparison with other published models requiring three-component seismic waveforms, 13,497, 76,374, and 1335 waveforms meeting the same conditions were randomly selected to generate the UUSS, STEAD, and Russia–Ukraine war datasets, respectively. Figure 6a displays the distributions of all these events on a world map. Figure 6b shows detailed statistics of the three-component datasets used for benchmarking, indicating different characteristics between the UUSS and STEAD datasets. Time intervals specific to each model's input waveforms are used for fair comparison (Fig. 6c–e). Shaded areas in Fig. 6c–e represent signal data unavailable to the specific model. All datasets have balanced quantities of ambient noise and seismic waveforms.

Our study uses only the vertical component of raw waveforms for modeling. We collect 83,442 manually reviewed seismic records labelled as quarry blasts or local earthquakes by UUSS analysts18. The vertical-channel waveforms are downloaded from the Incorporated Research Institutions for Seismology (IRIS) using web services. Our deep-learning dataset is derived from waveform data preserved in their original format, as raw records with no instrument-response correction or frequency filtering applied. The sole preprocessing step is detrending the raw waveforms.
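This download-and-detrend step can be reproduced with standard tools; the sketch below uses ObsPy's FDSN client and built-in detrending (the network/station codes and time window are placeholders for illustration).

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

client = Client("IRIS")                        # IRIS FDSN web services
t0 = UTCDateTime("2020-01-01T00:00:00")        # placeholder event time
st = client.get_waveforms(network="UU", station="CTU", location="*",
                          channel="*HZ",       # vertical channels only
                          starttime=t0, endtime=t0 + 60)
st.detrend("linear")                           # the sole preprocessing step
```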

Our method aims to learn a representation that maximizes differences among various signal types. Therefore, we construct a vertical-component UUSS dataset by randomly selecting waveform combinations from the 83,442 downloaded records. Each sample contains four waveform periods: reference, positive, negative, and noise waveforms (refer to Fig. 7b and Supplementary Fig. 3). The reference waveform is sliced from a randomly chosen record starting at the initial P-wave arrival. The positive waveform is sliced from another randomly selected record sharing the same label as the reference waveform. Conversely, the negative waveform is sourced from a distinct, differently labeled record. The noise waveform consists of the ambient data preceding the first P-wave arrival. Each data point is normalized by the maximum value within a 9 s window following the first P-wave arrival and resampled at 100 Hz. This process yields training and validation datasets containing 65,065 and 7,165 samples, respectively. Note that our training dataset does not overlap with the test data used for the published models.
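A hedged sketch of this sample construction is given below; the slicing details, function names, and edge-case handling are our own illustration, not the exact pipeline (it assumes at least a full window of pre-arrival data).

```python
import numpy as np
from scipy.signal import resample

def make_sample(ref, pos, neg, p_ref, p_pos, p_neg, fs, win_s=7, norm_s=9):
    """Build one (reference, positive, negative, noise) training sample.
    ref/pos/neg are raw vertical-component traces with P arrivals at
    sample indices p_ref/p_pos/p_neg; fs is the native sampling rate."""
    def cut(x, p):
        w = x[p : p + int(win_s * fs)].astype(float)
        peak = np.max(np.abs(x[p : p + int(norm_s * fs)]))   # 9 s normalization
        return resample(w / peak, win_s * 100)               # resample to 100 Hz
    noise = ref[p_ref - int(win_s * fs) : p_ref].astype(float)  # pre-arrival data
    noise = resample(noise / (np.max(np.abs(noise)) + 1e-12), win_s * 100)
    return cut(ref, p_ref), cut(pos, p_pos), cut(neg, p_neg), noise
```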

Evolutionary deep learning solution

We design a lightweight model template applicable to MCUs (Fig. 7a). The template consists of a convolution block that extracts and merges features from a 1D signal, and an embedding unit that transforms those features into a 1D embedding code. The convolution block contains five 1D convolution kernels of variable sizes M, N, O, P, and Q. First, each kernel convolves with the input signal along the time dimension to generate a feature map, as illustrated by the colorful codes in the convolution block in Fig. 7a. Each feature map has a shape determined by the convolution output length (time direction) and the number of kernels (depth direction). Subsequently, the two 1D tensors of mean values of each feature map, taken along the time and depth directions, are concatenated into one 1D tensor. To reduce memory usage and the number of parameters in the following embedding unit, a global max-pooling operation is applied to the list of concatenated 1D tensors, resulting in a 1D array of latent codes. Finally, the latent codes are input into the embedding unit, which consists of layers of fully connected neural networks, to generate a 1D embedding of the input signal.
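The template can be expressed compactly in TensorFlow/Keras. The sketch below follows the description above, but the filter count, pooling size, and layer widths are illustrative stand-ins for the chromosome-determined values.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_template(input_len=700, kernel_sizes=(1, 3, 5, 7, 9),
                   filters=8, stride=2, pool=4, fc=(32, 16, 1)):
    """A sketch of the scalable template in Fig. 7a (hyperparameters assumed)."""
    x_in = layers.Input(shape=(input_len, 1))
    branches = []
    for k in kernel_sizes:
        fmap = layers.Conv1D(filters, k, strides=stride, activation="relu")(x_in)
        t_mean = layers.Lambda(lambda t: tf.reduce_mean(t, axis=2))(fmap)  # over depth
        d_mean = layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(fmap)  # over time
        branches.append(layers.Concatenate(axis=1)([t_mean, d_mean]))
    x = layers.Concatenate(axis=1)(branches)         # merged latent codes
    x = layers.Reshape((-1, 1))(x)
    x = layers.MaxPooling1D(pool_size=pool)(x)       # shrink embedding-unit input
    x = layers.Flatten()(x)
    for units in fc[:-1]:
        x = layers.Dense(units, activation="relu")(x)
    return Model(x_in, layers.Dense(fc[-1])(x))      # 1D embedding output
```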

In summary, 11 variables are determined by the evolutionary optimization algorithm: the convolution stride, the existence of each convolution kernel (\(1\times 1\), \(1\times 3\), \(1\times 5\), \(1\times 7\), \(1\times 9\)), the pooling size, the pooling stride, and the number of parameters in each of the three fully connected layers. If the evolutionary algorithm assigns a negative value to a variable, the element represented by that variable is absent from the model. The 11 variables comprise a chromosome that determines the neural network architecture (an individual embedding model).
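A possible decoding of such a chromosome is shown below; the gene order follows the list above, while the function name and container format are our own illustration.

```python
def decode(chromosome):
    """Map an 11-gene chromosome to architecture settings; genes with
    negative values disable the corresponding element."""
    stride, k1, k3, k5, k7, k9, pool_size, pool_stride, fc1, fc2, fc3 = chromosome
    kernels = [w for w, g in zip((1, 3, 5, 7, 9), (k1, k3, k5, k7, k9)) if g > 0]
    fc = [n for n in (fc1, fc2, fc3) if n > 0]       # absent layers are skipped
    return dict(stride=stride, kernels=kernels,
                pool_size=pool_size, pool_stride=pool_stride, fc=fc)
```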

As displayed in Fig. 7b, the cloud-computing platform first generates a population of individual models using different chromosomes. Each model is trained and evaluated using a contrastive learning method. Specifically, a training sample consisting of four signal periods (reference, positive, negative, and noise) is input into an individual model, yielding four corresponding embeddings. The individual models are optimized to maximize the distance between the embeddings of different events using Eq. (1):

$$\begin{aligned}L_{M}&=L_{1}+L_{2}\\ L_{1}&=\max\left(D_{RP}-D_{RN}+\alpha,0\right)\\ L_{2}&=\max\left(D_{RP}-\beta\cdot D_{RA}+\alpha,0\right)\end{aligned}$$
(1)

where \(L_{M}\) is the model loss for updating network parameters; \(D_{RP}\), \(D_{RN}\), and \(D_{RA}\) are the distances between the embedding tensor of the reference waveform and the embedding tensors of the positive waveform, the negative waveform, and the ambient noise, respectively; \(\alpha\) is a margin in the distance space; and \(\beta\) is a hyperparameter that scales the distance between the two embedding tensors expected to have the maximum distance. We use the 1D Wasserstein distance as the distance metric. Mathematically, the Wasserstein metric measures how the probability distribution \(x \sim P\) (source) moves to \(y \sim Q\) (target). In the 1D case, there is an optimal transport route such that the Wasserstein distance can be written simply as \(W_{p}(P,Q)=\left(\sum_{i=1}^{n} d(x_{i}^{*},y_{i}^{*})^{p}\right)^{1/p}\); we choose \(d(x_{i}^{*},y_{i}^{*})=|x_{i}^{*}-y_{i}^{*}|\) and \(p=1\) in the calculation, where \(x_{i}^{*}\) and \(y_{i}^{*}\) are the sorted embedding codes and \(n\) is the number of codes in an embedding.
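With these definitions, the loss is compact to implement; a NumPy sketch follows (the \(\alpha\) and \(\beta\) values are illustrative, not the trained hyperparameters).

```python
import numpy as np

def wasserstein_1d(x, y):
    """1D Wasserstein distance with p = 1: sum of absolute differences
    between sorted embedding codes, as in the formula above."""
    return np.sum(np.abs(np.sort(x) - np.sort(y)))

def model_loss(e_ref, e_pos, e_neg, e_noise, alpha=1.0, beta=0.5):
    """Eq. (1): a triplet term plus a noise-contrast term."""
    d_rp = wasserstein_1d(e_ref, e_pos)
    d_rn = wasserstein_1d(e_ref, e_neg)
    d_ra = wasserstein_1d(e_ref, e_noise)
    return max(d_rp - d_rn + alpha, 0.0) + max(d_rp - beta * d_ra + alpha, 0.0)
```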

Subsequently, each individual is assigned a fitness value defined as:

$$F_{M}=\left(1-V_{acc}\right)+\gamma\cdot\frac{N}{N_{s}}+\delta\cdot e^{T_{b}/N_{t}}$$
(2)

where \(F_{M}\) is the model fitness, \(V_{acc}\) is the model accuracy on the validation dataset, \(N\) is the number of model parameters, \(N_{s}\) is a norm for \(N\), \(T_{b}\) is the model inference time for a batch of samples, \(N_{t}\) is a norm for \(T_{b}\), and \(\gamma\) and \(\delta\) are weights.
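Eq. (2) translates directly to code; in the sketch below the norms and weights are placeholder values, since the trained settings are not restated here.

```python
import math

def fitness(v_acc, n_params, t_batch, n_s=2693.0, n_t=1.0, gamma=1.0, delta=1.0):
    """Eq. (2): penalize error, parameter count, and batch inference time."""
    return (1.0 - v_acc) + gamma * (n_params / n_s) + delta * math.exp(t_batch / n_t)
```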

The algorithm sorts individuals by fitness value and selects some of them to create a new generation via crossover and mutation operations. Here, we set an initial population of 256 individuals, a maximum of 8 generations, a crossover probability of 90%, and a mutation probability of 80%. For each generation, the 16 individuals with the lowest fitness values (the hall-of-fame population) are retained for the next generation. The evolutionary optimization is executed in parallel on a cloud-computing platform equipped with multiple CPU cores. The pseudo-code of the evolutionary process is summarized in Supplementary Table 1. The evolutionary program is implemented using Python, DEAP47 and TensorFlow48. More information about training and inference is provided in the open-source code.
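A minimal DEAP sketch with these settings is shown below. The gene range, genetic operators, and the hypothetical `train_and_score` evaluator are assumptions; note also that DEAP's `eaSimple` only tracks the hall of fame without re-injecting it, so the elitism described above would require a custom generational loop.

```python
import random
from deap import algorithms, base, creator, tools

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))   # minimize Eq. (2)
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("gene", random.randint, -8, 64)              # assumed gene range
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.gene, n=11)                          # 11-gene chromosome
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (train_and_score(ind),))  # hypothetical
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=-8, up=64, indpb=0.2)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=256)
hof = tools.HallOfFame(16)                                    # 16 best individuals
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.9, mutpb=0.8,
                             ngen=8, halloffame=hof)
```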

Real-time inference on microcontrollers

Two microcontrollers are selected to demonstrate real-time signal source discrimination. As displayed in Fig. 8, each device is connected to a thin-film-transistor (TFT) display. Every second, a signal period of up to a certain number of seconds is input to the model deployed on the microcontroller. The microcontroller first extracts the signal embedding and then calculates the likelihood of the embedding's source using the probability density functions of the known sources. The inference results are displayed synchronously on the TFT screen. The time elapsed for each inference is sent to a monitor through the microcontroller's serial port. A movie is attached as additional supporting information (see Supplementary Movie 1).
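This loop can be mirrored on a host machine for illustration. In the sketch below, `embed` stands in for the deployed model and `classify` for the density-based rule sketched earlier; the buffer mechanics follow the description above, while the names are our own.

```python
import numpy as np

def realtime_loop(sample_stream, embed, classify, fs=100, win_s=7):
    """Every second, fs new samples scroll into a win_s-second buffer,
    which is embedded and scored against the known-source densities."""
    buf = np.zeros(win_s * fs, dtype=np.float32)
    for chunk in sample_stream:                  # one second of new samples
        buf = np.concatenate([buf[fs:], chunk])  # slide the window by 1 s
        yield classify(embed(buf))               # likelihood-based source label
```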

Fig. 8: Wiring diagram for visual demonstration of real-time inference on microcontrollers.

An identically trained deep learning model is deployed on (a) an ESP32-S3 board and (b) an Arduino Nano 33 BLE board. The thin-film-transistor (TFT) screen pins, detailed in two columns, provide wiring guidance.