Introduction

Scientific literature is imbued with a wealth of knowledge content and relationships, including core elements such as the research background, objectives, questions, methods, theoretical advancements, conclusions and future work in specific fields. In the big data era, the volume of scientific literature is expanding exponentially. Therefore, it is of great significance to automatically and effectively extract, integrate, and utilize the deep knowledge contained within a vast amount of scientific literature.

The fine-grained knowledge content extracted from scientific literature is referred to by various names, including knowledge entities [1, 2], knowledge elements [3], key insights [4], and scientific entities [5]. Despite the different names, their common goal is to capture and summarize the core information and fundamental components within scientific literature. In this paper, we refer to them as knowledge entities.

As the smallest independent units that summarize key concepts and core elements in scientific literature, knowledge entities not only help researchers quickly identify and understand the latest advancements in specific fields, but can also be applied to further academic research and trend analysis. Therefore, under the fourth paradigm of data-intensive science [6], more and more researchers are using semantic technologies such as text mining, natural language processing, and deep learning to mine knowledge entities hidden in scientific literature for more fine-grained research.

Named Entity Recognition (NER) is a fundamental task of natural language processing (NLP), designed to identify entities with specific meanings in unstructured text and sort them into predefined categories. The extraction of knowledge entities from scientific literature can therefore be considered a subtask of NER. With the rapid advancement of artificial intelligence, the paradigm of named entity recognition has shifted from early rule-based methods [7, 8] towards more automated extraction techniques, and deep learning-based approaches have demonstrated remarkable performance on the NER task. However, named entity recognition for Chinese scientific literature in the general domain still faces two challenges: the limitations of general-domain corpora and the complexity of entity boundaries.

The first challenge lies in the lack of high-quality annotated corpora. Existing general-domain datasets are often limited in scale and coverage, with many either unreleased [4], confined to a narrow range of disciplines [9], or defined with oversimplified schemas that omit key entity types such as research questions and theoretical principles [10]. Notably, as most existing datasets are predominantly in English, their applicability to scientific literature in other languages, such as Chinese, remains limited. In addition, constructing such datasets typically involves annotators from diverse disciplinary backgrounds, which introduces annotation inconsistencies and span-level ambiguity. These issues not only hinder the accurate training of models but also reduce generalizability across domains, highlighting the need to build a refined Chinese corpus with broader semantic coverage and domain diversity.

The second challenge arises from the linguistic characteristics of Chinese and the structural complexity of scientific texts. Due to the absence of explicit word delimiters [11], word segmentation is ambiguous and error-prone, leading to unreliable entity boundaries [12]. In addition to segmentation ambiguity, nested entities are prevalent in scientific literature, where one entity often contains one or more other entities of the same or different categories [13], as illustrated in Fig. 1. These nested structures significantly increase the difficulty of boundary detection, and the difficulty is further compounded by annotation inconsistencies. Moreover, most NER approaches do not incorporate span-length constraints during the training process, which may lead them to incorrectly predict overly long spans as entities in these complex nested scenarios.

Fig. 1 Examples of nested entities (Categories: Methods and Models, Research Questions, Data). In the first example, the term "Attention Mechanism" is nested within the term "Modality Fusion Model based on Attention Mechanism", and both entities are of the type Methods and Models. In the second example, the term "Named Entity Recognition", categorized as a Research Question, is nested within "Named Entity Recognition Dataset", which falls under Data.

To tackle the challenges outlined above, we constructed a refined corpus for Chinese scientific knowledge entity recognition and developed an end-to-end model termed StructBERT-AT-DMGP, which is designed to alleviate the impact of annotation noise and handle complex entity boundaries. Specifically, we incorporate adversarial training as an implicit regularization method, introducing controlled perturbations to the input representations to improve the model’s generalization in the presence of label noise. For the entity boundary problem, we adopt the Global Pointer framework, which formulates NER as a binary classification task over all possible spans involving word-level information, thereby eliminating the need for Chinese word segmentation and naturally supporting the recognition of nested entities. To further improve boundary recognition precision, we enhance this formulation with a Dual-Masked mechanism that filters out span candidates exceeding the maximum entity length observed across the corpus during the training process, effectively constraining the search space and avoiding overlong predictions.

Our key contributions are outlined as follows:

  • We propose a more comprehensive and semantically enriched schema for Chinese scientific knowledge entity recognition, and construct a high-quality annotated corpus aligned with it. Compared to existing efforts that are often limited in scope or rely on simplified label definitions, our schema defines nine fine-grained entity categories that capture core components of scientific research, and our corpus spans eight representative scientific domains. The resulting corpus offers broader semantic coverage and domain generality, serving as a foundational resource for advancing Chinese scientific literature analysis.

  • We propose a Dual-Masked mechanism that explicitly incorporates entity length information into the training process to constrain the span prediction space. This mechanism mitigates the tendency of models to overpredict in nested or structurally complex cases, thereby improving boundary recognition accuracy without introducing additional model complexity.

  • We strategically incorporate adversarial training to mitigate the adverse effects of label noise caused by unavoidable annotation errors and inconsistencies in general-domain scientific NER. As an implicit regularization method, this mechanism enhances the model’s noise tolerance by encouraging it to correctly classify both clean and adversarially perturbed inputs during training, thereby reducing overfitting to mislabeled data.

Related work

Knowledge entities in scientific literature

Scientific literature, serving as carriers of knowledge, encompasses a rich and diverse array of knowledge entities. Various types of knowledge entities have been categorized and defined by researchers based on the characteristics of their research field and the requirements of their research tasks. Generally, the emphasis of knowledge entities in scientific literature differs between specific and general domains.

In the specific domain, knowledge entities often possess a high degree of specialization, which means they are mostly professional terms or phrases specific to a certain field. For example, Luan et al. [14] introduced six categories for scientific entity labeling in the machine learning domain, including Task, Material, Metric, Method, Other-ScientificTerm, and Generic. In computational chemistry, Pang et al. [15] constructed the ChemBE corpus, which encompasses seven entity types such as compound, method, reaction, chemical bond, solvent, and pKa values. Loukachevitch et al. [16] presented a new biomedical dataset for nested NER, containing 37 types of entities from general and biomedical domains (genes, diseases, chemicals, medical procedures, devices, etc.). These datasets provide fine-grained annotations that align with their respective fields, making them effective for domain-specific tasks. However, their limited scope constrains their usefulness in cross-domain and interdisciplinary research.

To support interdisciplinary research, researchers have developed general-domain datasets that capture knowledge entities with broad applicability across multiple disciplines. These datasets emphasize functional attributes that transcend individual fields, enabling wider scientific applications. For instance, Augenstein et al. [9] built a dataset covering physics, material science, and computer science, comprising three types of entities: Task, Process (including equipment and methods), and Material (such as corpora and physical materials); Nasar et al. [4] proposed nine types of knowledge entities based on insights from scientific literature, including problem, research domain, methodology/algorithms/processes, datasets, tools, evaluation measures, results, limitation, and future extensions; Brack et al. [10] utilized the OA-STM Corpus, which comprises 110 articles across 10 different domains, and derived four core scientific concepts, namely Process, Method, Material, and Data.

Named entity recognition approaches

Named entity recognition tasks are typically addressed through three major approaches: sequence labeling, span classification and machine reading comprehension (MRC) [17, 18]. These approaches have contributed substantially to the progress of NER in both general and domain-specific texts. Nevertheless, challenges such as nested entities and annotation noise remain particularly acute in scientific literature. In the following, we briefly review sequence labeling and MRC-based approaches, and then focus on the methods based on span classification, which serve as the foundation of our proposed framework.

Sequence labeling formulates NER as a token-level classification task, where each token is assigned a label indicating entity boundaries and types [19]. In the context of Chinese NER, due to the absence of explicit word delimiters, this approach is typically applied at the character level [20, 21]. Currently, most sequence labeling models employ Conditional Random Fields (CRF) [22] as the tag decoder. As for the encoder, with advancements in deep learning, it has gradually shifted from conventional neural networks (e.g., CNN [23], LSTM [24], BiLSTM [25]) to pretrained language models (PLMs). Additionally, the application of the attention mechanism in NER tasks has become more prevalent. For instance, Luo et al. [26] implemented the multi-head attention mechanism in Chinese NER, improving the capacity of their model to focus on locally important features. Zhao et al. [27] introduced the RSA-CANER model, whose core architecture is ALBERT-BiLSTM-CRF, incorporating a self-attention mechanism that integrates the latent features of Chinese characters. Although character-level sequence labeling models can avoid segmentation errors, they often overlook word-level semantic information, which is essential for accurate recognition in Chinese NER [28, 29]. Moreover, their token-level predictions make it difficult to handle nested structures. Several studies have extended sequence labeling to support nested entities via multi-layer or cascaded decoding architectures [30, 31], but such approaches often incur high computational costs and introduce training complexity. Consequently, sequence labeling remains less efficient and flexible for tasks involving nested structures, motivating the development of more effective frameworks such as span classification and MRC-based methods.

MRC-based approaches offer a novel way to unify the recognition of nested and flat entities. The idea of MRC is to imitate the pattern of machine reading comprehension by constructing queries for each entity type, enabling models to locate entities in the input text by posing questions. Li et al. [32] pioneered the application of the MRC framework to the NER task and used BERT as the encoder. Tan et al. [33] introduced continuous prompts into the self-attention mechanism to reduce the dependence on handcrafted queries, while Zhang et al. [34] proposed FinBERT–MRC for Chinese financial NER, demonstrating the domain adaptability of the MRC framework. However, most approaches based on the MRC framework overlook the interrelations between entities. To address this, Duan et al. [35] proposed a model named FLR-MRC, which leverages a graph attention network to integrate label relationships and capture dependencies between entity types. Nevertheless, the construction of explicit queries for each entity type introduces additional input overhead, which may pose challenges when processing long or complex texts.

Span classification is another widely adopted approach for nested entity recognition. Most such methods can be roughly considered as two-phase approaches, where the first phase involves entity boundary identification and the second phase entails entity type classification. Early implementations of span classification relied on exhaustive enumeration of candidate spans within a sentence, followed by type classification [36,37,38]. The more prevalent approaches currently are based on pointer networks, which utilize start and end tokens to represent a span. For instance, CASREL [39] employs two independent binary classifiers for detecting the beginning and ending tokens of an entity respectively. Given the discrepancies in training and prediction when using two separate modules, Su et al. [40] proposed Global Pointer, which considers the start and end as an integrated whole. Based on Global Pointer, Li et al. [41] proposed the AT-CBGP model, which uses ChineseBERT [42] to introduce token features of glyph and pinyin, and incorporates adversarial training to improve their model’s ability to identify malicious samples; Zhang et al. [43] introduced the MGBERT-Pointer model, using the Multi-Granularity Adapter fusion technique in the encoding layer and integrating an Efficient Global Pointer in the decoding layer. Although span-based methods are effective in recognizing nested entities, they are still vulnerable to annotation noise, such as span boundary errors and inconsistent labeling, which are commonly observed in the annotation of knowledge entities in the general-domain scientific literature.

To improve the robustness of the model against noisy labels, various strategies have been proposed in the deep learning community. According to the survey by Song et al. [44], recent research efforts can be grouped into five main categories: robust architecture design, regularization techniques, robust loss functions, loss adjustment methods, and sample selection strategies. Among them, adversarial training has emerged as a widely used implicit regularization approach, which enhances model robustness by injecting small adversarial perturbations during training. In the context of NER, such perturbations can be introduced at the input level, through character-level modifications, or at the representation level, by adjusting vectors in the feature space [45,46,47]. As the former often leads to semantic distortion, representation-level perturbation is generally preferred in recent works [48, 49].

Methodology

This paper aims to automatically extract the scientific knowledge entities that researchers are concerned with from Chinese scientific literature in the general domain, in order to support subsequent in-depth and fine-grained studies. However, there are two main challenges in achieving this goal: (1) the lack of high-quality annotated corpora for general-domain Chinese scientific literature, along with the inevitable errors and inconsistencies of manual annotation; and (2) the absence of explicit delimiters and the prevalence of nested entities in Chinese, which, combined with the lack of span-length constraints in most NER methods, hinder accurate boundary identification and may result in overlong entity predictions.

Considering the lack of available corpora, we first define a semantically richer and more fine-grained schema comprising nine categories of scientific knowledge entities, and construct a high-quality annotated corpus spanning eight representative scientific domains. Subsequently, we propose a novel model termed StructBERT-AT-DMGP for the automatic recognition of Chinese scientific knowledge entities. In this model, adversarial training (AT) is incorporated as an implicit regularization method to improve the model’s robustness against noisy data caused by annotation errors and inconsistencies; Dual-Masked Global Pointer (DMGP) is introduced to improve boundary recognition by explicitly incorporating entity length information into the training process to constrain the span prediction space and reduce overlong or implausible span predictions.

Definition of scientific knowledge entities

To effectively capture the types of entities that are commonly emphasized across diverse scientific disciplines, this study defines a set of scientific knowledge entity categories designed for the general domain. While some existing datasets include general-domain entities, they may not fully reflect the types of entities that researchers frequently focus on in scientific research. To address this, we define entity categories that are characterized by clear descriptions and functional attributes to ensure broad applicability. These categories are centered on core elements of scientific inquiry, with particular emphasis on the functional roles these entities play in scientific literature. Based on these considerations, we define nine categories of scientific knowledge entities that align with the characteristics and requirements of general-domain scientific literature. The specific definitions are as follows:

  • Research Questions: Inquiries or problems that the research aims to address or solve, defining the scope and direction of the study.

  • Methods and Models: Methods include specific techniques, strategies, or procedures used in the research, and models refer to theoretical frameworks, mathematical representations or computational algorithms.

  • Theories and Principles: Foundational theories, guiding principles or speculative conjectures that underpin the research.

  • Metrics: Standards or measures used to evaluate, compare, or track the performance, quality, or outcomes of research findings, providing a basis for comparison, analysis, and evaluation of results.

  • Instrumentation: Tools, equipment, or devices employed in the research process to conduct experiments, gather data, or measure variables.

  • Software and Systems: Computer applications, platforms, and infrastructural components utilized in the research process.

  • Data: Information collected or used in research, including experimental results, observations, datasets, and any other form of evidence gathered to support research findings.

  • Scientists: Researchers, scholars, or practitioners who conduct research, contributing to the understanding of natural phenomena, advancement of scientific knowledge, and development of technology.

  • Locations: Geographical places or specific sites related to the research activities, including laboratories, field study sites, universities, or any other relevant locations.

Architecture of StructBERT-AT-DMGP model

The overall architecture of the proposed scientific knowledge entities extraction model StructBERT-AT-DMGP is depicted in Fig. 2. The model consists of four main modules, namely StructBERT as the encoder, adversarial training module, attention mechanism with rotary relative position encoding and Dual-Masked Global Pointer as the decoder.

Fig. 2 The architecture of StructBERT-AT-DMGP

StructBERT

StructBERT [50], developed by Alibaba DAMO Academy, is an extension of BERT [51]. It enhances BERT through structural pre-training, incorporating structural information such as word-level and sentence-level ordering, which enables it to encode dependencies between words and sentences and gives it better generalization and adaptability. In this paper, StructBERT is used to encode the input sequence \(S=\{s_1,...,s_L\}\) of length L into the embedding sequence \(X=\{x_1,...,x_L\}\).

Adversarial training

Adversarial training (AT) is a potent defensive method for training robust models that exhibit resilience against adversarial attacks [52]. As an implicit regularization strategy, it enhances model robustness by training the model on adversarially perturbed inputs, thereby improving its tolerance to annotation noise and reducing the risk of overfitting. In this paper, we employ the PGD-k attack [53] to add perturbations during model training, encouraging the model to learn more stable decision boundaries under noisy supervision. The PGD-k attack iteratively applies projected gradient descent (PGD) to the negative loss function for multiple steps:

$$\begin{aligned} & X^{adv}_{0} = X \end{aligned}$$
(1)
$$\begin{aligned} & X^{adv}_{t+1} = \Pi _{X+S}\left( X^{adv}_t+\alpha \operatorname {sign}\left( \nabla _{X^{adv}_t}{\mathcal {L}}(X^{adv}_t,y;\Phi )\right) \right) \end{aligned}$$
(2)

where X denotes the embedding vector encoded by StructBERT; \(X^{adv}_t\) represents the adversarial sample at the t-th iteration, which is equivalent to adding a perturbation \(\Delta x\) to the embedding vector X, as shown in Fig. 2; \({\mathcal {L}}(\cdot ,y;\Phi )\) denotes the loss function, whose gradient is evaluated at the current adversarial sample, and \(\Phi \) represents the model parameter set; \(\alpha \) represents the step size, while \(\operatorname {sign}(\cdot )\) represents the sign function; \(\Pi \) represents the projection function that projects adversarial samples onto the permissible perturbation space S, where S is often chosen as an \(L_p\)-norm ball centered at X with a perturbation radius of \(\epsilon \).
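To make the update rule concrete, the following minimal sketch runs PGD-k on a toy problem. It is an illustration under stated assumptions, not the training code of this paper: the linear logistic classifier stands in for the NER model, the perturbation space S is taken to be an L-infinity ball, the gradient is evaluated at the current iterate as in standard PGD, and the function names and the values of `epsilon`, `alpha`, and `k` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, y, w):
    """Binary cross-entropy of a toy linear classifier (stand-in for the NER loss)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_x(x, y, w):
    """Analytic gradient of the toy loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_k(x, y, w, epsilon=0.1, alpha=0.03, k=5):
    """Take k signed ascent steps on the loss, projecting onto the ball after each."""
    x_adv = list(x)  # Eq. (1): start from the clean embedding
    for _ in range(k):
        g = grad_x(x_adv, y, w)  # gradient at the current adversarial sample
        stepped = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection onto the L-infinity ball: clip to [x_i - epsilon, x_i + epsilon]
        x_adv = [min(max(s, xi - epsilon), xi + epsilon)
                 for s, xi in zip(stepped, x)]
    return x_adv

# Hypothetical 3-dimensional "embedding" and classifier weights.
x = [0.5, -1.0, 0.25]
w = [1.0, 2.0, -0.5]
x_adv = pgd_k(x, y=1, w=w)
clean_loss, adv_loss = loss(x, 1, w), loss(x_adv, 1, w)
```

In adversarial training, the model parameters would then be updated on the loss of both the clean and the perturbed input; that outer training loop is omitted here.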

Attention mechanism

Besides StructBERT, the remarkable performance of numerous models has also demonstrated the critical significance of word order for natural language understanding. Among them, some models incorporate absolute position encoding into the input embedding through a pre-defined function [54], while others use trainable absolute position embeddings [51]. Still other models use relative position encoding [55], which enables them to better understand the relative positions of words in a sequence and thereby capture contextual information more accurately.

In this paper, we incorporate rotary relative position embedding (RoPE) [56] into the attention mechanism. This strengthens the model’s capacity to grasp relative position information and improves its sensitivity to the length and span of entities, thus enabling more accurate differentiation of true entities. Specifically, RoPE encodes absolute positional information using a rotation matrix and thereby integrates relative position dependencies into the formulation of the self-attention mechanism.

Let the vector after passing through the dense layer be \(H=\{h_1,...,h_i,...,h_L\}\), where \(h_i \in {\mathbb {R}}^{2d}\), 2d is the hidden dimension and L is the sequence length. Then, we split the output of the dense layer along the hidden dimension into two separate vectors, as illustrated in Fig. 2, yielding two sequences of vectors denoted by \(Q=[q_{1,\beta },...,q_{i,\beta },...,q_{L,\beta }]\) (Query) and \(K=[k_{1,\beta },...,k_{j,\beta },...,k_{L,\beta }]\) (Key), whose elements are computed as follows:

$$\begin{aligned} & q_{i,\beta } = W_{q,\beta }\cdot h_i + b_{q,\beta } \end{aligned}$$
(3)
$$\begin{aligned} & k_{j,\beta } = W_{k,\beta } \cdot h_j + b_{k,\beta } \end{aligned}$$
(4)

where \(q_{i,\beta } \in {\mathbb {R}}^{d}\), \(k_{j,\beta } \in {\mathbb {R}}^{d}\), \(i, j \in \{1,2,...,L\}\); \(W_{q,\beta }\) and \(W_{k,\beta }\) are trainable weight matrices, \(b_{q,\beta }\) and \(b_{k,\beta }\) are trainable biases, and \(\beta \) denotes the entity category. We then introduce the rotary matrix [56] with pre-defined parameters \(\Theta = \{ \theta _k = 10000^{-2k/d}, k \in [1,2,...,d/2] \}\), so that \(R_i\) can be calculated as follows:

$$\begin{aligned} R_i = \begin{bmatrix} \cos i\theta _1 & -\sin i\theta _1 & 0 & 0 & \cdots & 0 & 0 \\ \sin i\theta _1 & \cos i\theta _1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos i\theta _2 & -\sin i\theta _2 & \cdots & 0 & 0 \\ 0 & 0 & \sin i\theta _2 & \cos i\theta _2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos i\theta _{d/2} & -\sin i\theta _{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin i\theta _{d/2} & \cos i\theta _{d/2} \end{bmatrix} \end{aligned}$$
(5)

Fig. 2 shows an example in the 2-dimensional case. Finally, we define the attention score of \(q_{i,\beta }\) and \(k_{j,\beta }\) as follows:

$$\begin{aligned} score_{\beta }(i,j)&= (R_iq_{i,\beta })^T(R_jk_{j,\beta }) \nonumber \\ &=q_{i,\beta }^TR_i^TR_jk_{j,\beta } \nonumber \\&= q_{i,\beta }^TR_{j-i}k_{j,\beta } \end{aligned}$$
(6)

where \(R_{j-i}\) represents the rotary relative position encoding matrix and satisfies \(R_i^TR_j=R_{j-i}\), \(j \ge i\).
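The relative-position property of Eq. (6) can be checked numerically. The sketch below applies the block-diagonal rotation to small hand-picked vectors and compares the score at two position pairs with the same offset. The vectors, positions, and function names are illustrative assumptions; only the definition of \(\theta _k\) is taken from the text.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Apply the block-diagonal rotation R_pos to an even-length vector."""
    d = len(vec)
    out = []
    for k in range(1, d // 2 + 1):
        theta = base ** (-2.0 * k / d)  # theta_k as defined in the text
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = vec[2 * k - 2], vec[2 * k - 1]  # the k-th 2-D sub-vector
        out.extend([c * a - s * b, s * a + c * b])
    return out

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def score(query, key, i, j):
    """(R_i q)^T (R_j k), the first line of Eq. (6)."""
    return dot(rotate(query, i), rotate(key, j))

# Hypothetical 4-dimensional query and key vectors.
query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

s_a = score(query, key, 2, 5)      # positions (2, 5), offset 3
s_b = score(query, key, 10, 13)    # positions (10, 13), same offset
s_rel = dot(query, rotate(key, 3)) # q^T R_{j-i} k, the last line of Eq. (6)
```

All three values agree up to floating-point error, which is why the score can be sensitive to span length (the offset j - i) without depending on where the span sits in the sequence.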

Dual-masked global pointer

Global Pointer (GP) is a unified approach to the NER task proposed by Su et al. [40], capable of handling both nested and flat entity recognition. Regarding entity boundaries, Global Pointer treats the start and end indices as an integrated whole, whereas general pointer networks typically employ two separate modules to detect the start and end tokens; it thus provides a more global view and avoids inconsistency between the training and prediction processes. The main idea of GP is to construct an entity matrix of size \(L\times L\) for each entity type, where L denotes the input sequence length.

As illustrated in Fig. 3, the lower triangular part of each entity matrix can be omitted, since the upper triangular part already encapsulates all the entity information. For a given entity type, the rows of the entity matrix represent the starting position of a span (denoted by i), while the columns represent the ending position of the span (denoted by j). Thus, marking the element at (i, j) as 1 indicates that the span from position i to position j of the input sequence is an entity of that type, while marking it as 0 indicates that it is not. Since each element of the entity matrix corresponds to a span, every element encompasses word-level information. This exhaustive treatment of spans eliminates the need for word segmentation and naturally handles nested entities, because every candidate span is classified independently as an entity or a non-entity.
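The entity-matrix representation can be sketched in a few lines. The example below encodes the second nested example of Fig. 1 and reads the spans back. It is a simplified illustration: English word tokens replace Chinese characters for readability, indices are 0-based with an inclusive end position, and the function names are hypothetical.

```python
def build_entity_matrices(length, spans, types):
    """Return {type: L x L 0/1 matrix}; entry (i, j) marks the span i..j inclusive."""
    matrices = {t: [[0] * length for _ in range(length)] for t in types}
    for start, end, etype in spans:
        if not 0 <= start <= end < length:
            raise ValueError(f"invalid span ({start}, {end})")
        matrices[etype][start][end] = 1
    return matrices

def decode(matrices):
    """Read every marked upper-triangular cell back as (start, end, type)."""
    found = []
    for etype, matrix in matrices.items():
        for i, row in enumerate(matrix):
            for j in range(i, len(row)):  # the lower triangle is never inspected
                if row[j] == 1:
                    found.append((i, j, etype))
    return sorted(found)

tokens = ["named", "entity", "recognition", "dataset"]
# The Research Question span is nested inside the Data span.
spans = [(0, 2, "Research Questions"), (0, 3, "Data")]
matrices = build_entity_matrices(len(tokens), spans, ["Research Questions", "Data"])
decoded = decode(matrices)
```

Because the two spans occupy different cells (and here different matrices), the nesting causes no conflict, unlike a single tag per token in sequence labeling.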

Fig. 3 Entity matrices of Global Pointer. The Chinese characters highlighted in yellow are translated into English as "Electronic Medical Record Dataset", and fall under the category of Data; the Chinese characters highlighted in green are translated into English as "F1-score", and fall under the category of Metrics.

Fig. 4 Entity matrices of Dual-Masked Global Pointer with \(\delta =8\). The sentence here is the same as the one in Fig. 3.

Since knowledge entities are the smallest independent units that summarize the core elements of scientific literature, their length is limited, so not every position in the upper triangular part of the entity matrices needs to be considered as a candidate entity. Inspired by this, we propose the Dual-Masked Global Pointer (DMGP), which applies an extra mask to the entity matrix to impose an entity length constraint, thereby narrowing the prediction space and improving boundary recognition. Specifically, we introduce a hyperparameter \(\delta \) to control the maximum length of potential entities and, in each row, mask all elements from the \((\delta +1)\)-th position onwards, counting from the diagonal. As a result, the model attends only to the first \(\delta \) unmasked positions of each row, which means that spans longer than \(\delta \) will not be predicted as entities of any type. Figure 4 shows an example of Dual-Masked Global Pointer entity matrices with \(\delta = 8\).
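The two masks can be combined into a single band over the entity matrix, as the following sketch shows. It assumes 0-based indices with an inclusive end position, so a span (i, j) has length j - i + 1. Replacing masked scores with a large negative constant is a common convention in Global Pointer implementations and is an assumption here, as are the function names.

```python
def dual_mask(length, delta):
    """1 where span (i, j) is kept: upper triangle (j >= i) and length <= delta."""
    return [[1 if i <= j <= i + delta - 1 else 0 for j in range(length)]
            for i in range(length)]

def apply_mask(scores, mask, neg=-1e12):
    """Replace masked scores by a large negative value so they are never predicted."""
    return [[s if m else neg for s, m in zip(score_row, mask_row)]
            for score_row, mask_row in zip(scores, mask)]

L, delta = 5, 2
mask = dual_mask(L, delta)
kept = sum(map(sum, mask))             # candidate spans after both masks
upper_triangle = L * (L + 1) // 2      # candidate spans with the triangular mask only

scores = [[1.0] * L for _ in range(L)]  # dummy scores for illustration
masked_scores = apply_mask(scores, mask)
```

For L = 5 and delta = 2, the band keeps 9 of the 15 upper-triangular candidates; with realistic sequence lengths the reduction is much larger, since the number of kept cells grows linearly in L rather than quadratically.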

As for the loss function, we utilize the Class Imbalance Loss [40]:

$$\begin{aligned}&\log \left( 1 + \sum \limits _{(i,j) \in P_{\beta }} e^{-score_{\beta }(i,j)} \right) \nonumber \\&\quad + \log \left( 1 + \sum \limits _{(i,j) \in N_\beta } e^{score_{\beta }(i,j)} \right) \end{aligned}$$
(7)

where indices i and j mark the beginning and ending indices of a span, \(P_{\beta }\) denotes the set of spans identified as type \(\beta \), while \(N_{\beta }\) refers to the set of spans that are not entities or do not match the entity type \(\beta \).
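A direct transcription of Eq. (7) for a single entity type is sketched below. The dictionary-based interface and the example scores are assumptions for illustration; a practical implementation would operate on tensors and use a numerically stable log-sum-exp.

```python
import math

def class_imbalance_loss(scores, positives):
    """Eq. (7) for one type: scores maps (i, j) -> score, positives is the set P."""
    pos_sum = sum(math.exp(-s) for span, s in scores.items() if span in positives)
    neg_sum = sum(math.exp(s) for span, s in scores.items() if span not in positives)
    return math.log1p(pos_sum) + math.log1p(neg_sum)

positives = {(0, 2)}

# Uninformative scores: every candidate span scores 0.
zero_scores = {(0, 2): 0.0, (0, 3): 0.0, (1, 2): 0.0}
loss_zero = class_imbalance_loss(zero_scores, positives)

# Well-separated scores: the gold span scores high, the others low.
good_scores = {(0, 2): 10.0, (0, 3): -10.0, (1, 2): -10.0}
loss_good = class_imbalance_loss(good_scores, positives)
```

The loss is small only when every gold span scores above zero and every other span scores below zero, so zero acts as a fixed decision threshold at prediction time regardless of how few positive spans there are.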

Experiments

To thoroughly evaluate the performance and generalizability of the proposed model, we conduct experiments on both our newly constructed Chinese scientific knowledge entity corpus and three widely-used public Chinese NER benchmarks: MSRA, Resume, and CLUENER. This section is structured as follows. Section Experimental setup details the datasets, evaluation metrics, and implementation settings. Section Experiments on our dataset presents comparative results, ablation studies, and robustness analysis under noisy supervision on our dataset. Section Experiments on public datasets reports evaluation results and ablation experiments on public benchmarks to further assess the robustness of our approach.

Experimental setup

Dataset and pre-processing

To support scientific knowledge entity recognition in Chinese scientific literature, we constructed a high-quality corpus covering eight major academic disciplines, namely mathematics, physics, chemistry, biology, medicine, computer science, agronomy, and astronomy. The scientific abstracts were collected from two major academic databases: CNKI and Wanfang. Manual annotation was performed by postgraduate students with domain knowledge in the corresponding disciplines. All annotations were conducted in accordance with the entity definitions described in Section Definition of scientific knowledge entities. To ensure annotation quality, we performed manual sampling inspections at a rate of 10%. Annotated samples that did not conform to the guidelines were returned for revision.

The annotated data underwent normalization to ensure consistency in formatting and character representation. Full-width characters were converted to half-width, and punctuation was standardized across all abstracts. After normalization, the data were transformed into a structured span-based JSON format, where each instance comprises the raw abstract and a list of labeled entities defined by their character offsets and types. As depicted in Fig. 5, an annotated datum encompasses the abstract text (denoted by the text field) along with all constituent entities (represented in the entities field). Each entity is characterized by its content (specified in the name field), category (the type field), and positional bounds marked by the start index (the start_idx field) and the end index (the end_idx field).
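A record in this format might look like the sketch below, which also checks that each entity's offsets reproduce its name. The field names follow the description above, while the sentence, the 0-based indexing, and the inclusive `end_idx` are assumptions made for illustration; the actual corpus convention may differ.

```python
import json

# Hypothetical instance; the text means
# "This paper proposes a modality fusion model based on the attention mechanism".
record = {
    "text": "本文提出基于注意力机制的模态融合模型",
    "entities": [
        {"name": "注意力机制", "type": "Methods and Models",
         "start_idx": 6, "end_idx": 10},
        {"name": "基于注意力机制的模态融合模型", "type": "Methods and Models",
         "start_idx": 4, "end_idx": 17},
    ],
}

def offset_errors(rec):
    """Return the names of entities whose offsets do not reproduce their name."""
    text = rec["text"]
    return [e["name"] for e in rec["entities"]
            if text[e["start_idx"]:e["end_idx"] + 1] != e["name"]]  # inclusive end

errors = offset_errors(record)
# ensure_ascii=False keeps the Chinese characters readable in the output file.
serialized = json.dumps(record, ensure_ascii=False)
```

A check of this kind is a cheap way to catch offset drift introduced by normalization steps such as full-width to half-width conversion.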

Fig. 5 Example of annotation data

As a result, we obtained a corpus containing 1,100 annotated abstracts and a total of 22,184 labeled scientific entities. We randomly divided the corpus from each field in an 8:1:1 ratio, thereby obtaining training (880), validation (110), and test (110) sets, each of which encompasses data from all eight fields. Table 1 provides detailed statistics for the datasets.
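The per-field split can be sketched as follows. The `field` key, the fixed seed, and the rule of assigning any rounding remainder to the training set are assumptions; the toy data uses ten placeholder records per field rather than the real corpus.

```python
import random

def split_by_field(records, ratios=(8, 1, 1), seed=42):
    """Shuffle each field separately, then cut it into train/validation/test parts."""
    rng = random.Random(seed)
    by_field = {}
    for rec in records:
        by_field.setdefault(rec["field"], []).append(rec)
    train, val, test = [], [], []
    total = sum(ratios)
    for field in sorted(by_field):
        items = list(by_field[field])
        rng.shuffle(items)
        n_val = len(items) * ratios[1] // total
        n_test = len(items) * ratios[2] // total
        n_train = len(items) - n_val - n_test  # remainder goes to training
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test

fields = ["mathematics", "physics", "chemistry", "biology",
          "medicine", "computer science", "agronomy", "astronomy"]
records = [{"field": f, "id": f"{f}-{n}"} for f in fields for n in range(10)]
train, val, test = split_by_field(records)
```

Splitting within each field, rather than over the pooled corpus, guarantees that every discipline appears in all three sets.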

Table 1 Statistical information of our datasets

To further validate the effectiveness and generalizability of our proposed method, we conducted comparative experiments on three widely used public Chinese NER datasets: MSRA [57], Resume [29], and CLUENER [58]. These datasets are not specifically designed for scientific literature. However, publicly available and widely adopted Chinese NER datasets in the scientific domain remain limited, which is also a key motivation for constructing our own corpus tailored to scientific texts. We therefore adopted these general-domain datasets as practical alternatives for evaluating the robustness and transferability of our approach. To facilitate unified processing and evaluation, all public datasets were converted into the same span-based JSON format as our own corpus.

To inform the configuration of the maximum entity length \(\delta \) during training, we conducted length statistics across all datasets. Table 2 summarizes the total number of entities, average entity length, and the maximum observed span in each dataset.

Table 2 Entity length statistics across datasets
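The statistics summarized in Table 2 can be computed directly from the span-based records. The sketch below assumes an inclusive `end_idx` and uses toy records. The `coverage` helper is a hypothetical addition that shows one way to judge a candidate \(\delta \); the exact rule used to pick \(\delta \) from these statistics is not restated here.

```python
def entity_lengths(records):
    """Character length of every entity (inclusive end index assumed)."""
    return [
        e["end_idx"] - e["start_idx"] + 1
        for r in records
        for e in r["entities"]
    ]

def length_stats(records):
    """Total number of entities, average length, and maximum length."""
    lengths = entity_lengths(records)
    return {
        "count": len(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }

def coverage(records, delta):
    """Fraction of entities whose length does not exceed `delta`."""
    lengths = entity_lengths(records)
    return sum(length <= delta for length in lengths) / len(lengths)

# Toy records with entity lengths 4, 6, and 2.
toy = [
    {"entities": [{"start_idx": 0, "end_idx": 3},
                  {"start_idx": 10, "end_idx": 15}]},
    {"entities": [{"start_idx": 2, "end_idx": 3}]},
]
print(length_stats(toy))
print(coverage(toy, delta=4))
```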

Evaluation metrics

In this paper, we adopt macro-average precision, recall, and F1-score as the primary evaluation metrics on our constructed dataset. From a practical application perspective, we consider each entity category to be equally important, regardless of its frequency in the corpus. According to the dataset statistics, some categories, such as Scientists and Locations, contain significantly fewer instances than others. Macro-averaging helps prevent high-frequency categories from dominating the evaluation, thereby providing a more balanced and fair assessment across all entity types.

A prediction is considered correct only when both the entity boundaries and category labels match exactly. The metrics are formally defined as:

$$\begin{aligned} \text {Precision}&= \frac{1}{C} \sum _{\beta =1}^{C}\frac{TP_{\beta }}{TP_{\beta }+FP_{\beta }} \end{aligned}$$
(8)
$$\begin{aligned} \text {Recall}&= \frac{1}{C}\sum _{\beta =1}^{C}\frac{TP_{\beta }}{TP_{\beta }+FN_{\beta }} \end{aligned}$$
(9)
$$\begin{aligned} \text {F1-score}&= \frac{1}{C}\sum _{\beta =1}^{C}\frac{2\cdot TP_{\beta }}{2\cdot TP_{\beta }+FP_{\beta }+FN_{\beta }} \end{aligned}$$
(10)

where \(TP_{\beta }\), \(FP_{\beta }\), and \(FN_{\beta }\) denote the numbers of true positives, false positives, and false negatives for the \(\beta \)-th category, respectively, and \(C\) is the number of entity types.

For public benchmarks, however, we follow the convention adopted in prior work and report micro-average F1-scores to ensure comparability with previously published results.
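As an illustration of Eqs. (8) to (10) and of the micro-average F1-score, the sketch below scores predictions under exact matching of boundaries and type. The tuple layout `(doc_id, start, end, type)`, the type names, and the convention of scoring 0 when a denominator is zero are assumptions made for this example.

```python
from collections import Counter

def count_matches(gold, pred):
    """Per-type TP/FP/FN under exact match of (doc_id, start, end, type)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in pred:
        if item in gold:
            tp[item[-1]] += 1
        else:
            fp[item[-1]] += 1
    for item in gold:
        if item not in pred:
            fn[item[-1]] += 1
    return tp, fp, fn

def safe_div(num, den):
    """Assumed convention: a ratio with a zero denominator counts as 0."""
    return num / den if den else 0.0

def macro_scores(gold, pred, types):
    """Macro-average precision, recall, and F1 over all entity types."""
    tp, fp, fn = count_matches(gold, pred)
    c = len(types)
    precision = sum(safe_div(tp[t], tp[t] + fp[t]) for t in types) / c
    recall = sum(safe_div(tp[t], tp[t] + fn[t]) for t in types) / c
    f1 = sum(safe_div(2 * tp[t], 2 * tp[t] + fp[t] + fn[t]) for t in types) / c
    return precision, recall, f1

def micro_f1(gold, pred):
    """Micro-average F1: counts are pooled over all types before dividing."""
    tp = len(gold & pred)
    return safe_div(2 * tp, len(gold) + len(pred))

# Toy example: one type-A prediction has a wrong right boundary.
gold = {(0, 0, 1, "A"), (0, 3, 5, "A"), (0, 7, 8, "B")}
pred = {(0, 0, 1, "A"), (0, 3, 4, "A"), (0, 7, 8, "B")}
print(macro_scores(gold, pred, ["A", "B"]))
print(micro_f1(gold, pred))
```

In this toy case type A scores 0.5 on all three metrics and type B scores 1.0, so the macro averages are 0.75, while the micro F1-score is 4/6. The gap shows how macro-averaging weights each category equally regardless of its frequency.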

Implementation details

We implemented our models with the PyTorch framework. Table 3 presents the detailed experimental environment, while Table 4 presents the main model parameters. During training, the model used the AdamW optimizer with a learning rate of 5e-5 and \(nlp\_structbert\_backbone\_base\_std\) (see Footnote 1) as the encoder, which has 12 Transformer layers and a hidden size of 768. The batch size was set to 16, and training was conducted for a maximum of 100 epochs. An early stopping mechanism was applied to prevent overfitting: training was halted if the macro F1-score on the validation set did not improve for five consecutive epochs. The maximum sequence length was 512, and the maximum entity length \(\delta \) for each dataset was selected based on the length statistics reported in Table 2.

Table 3 Experimental environment
Table 4 Model parameters
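The early stopping rule described above (a patience of five epochs and a cap of 100 epochs) can be sketched independently of the model. The function name and the toy sequence of validation scores are hypothetical; in practice each score would come from evaluating the model on the validation set after an epoch.

```python
def early_stopping(val_f1_per_epoch, patience=5, max_epochs=100):
    """Return (best_epoch, best_f1, last_epoch_run).

    Training stops once the validation F1-score has not improved
    for `patience` consecutive epochs, or after `max_epochs`.
    """
    best_f1, best_epoch, stale, last_epoch = float("-inf"), 0, 0, 0
    for epoch, f1 in enumerate(val_f1_per_epoch[:max_epochs], start=1):
        last_epoch = epoch
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_f1, last_epoch

# Hypothetical validation scores: the best value appears at epoch 3.
scores = [0.50, 0.60, 0.62, 0.61, 0.60, 0.62, 0.61, 0.59, 0.70]
print(early_stopping(scores))
```

Note that a tie with the current best score does not reset the counter in this sketch, so the late improvement at epoch 9 is never reached; whether ties count as improvement is a design choice not specified here.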

For adversarial training, we adopted the PGD-k strategy, where the number of attack rounds k, step size \(\alpha \), and perturbation radius \(\epsilon \) were tuned individually for each dataset. To identify the optimal configuration, we performed a grid search over \(k \in \{3, 4, 5\}\), \(\alpha \in \{0.1, 0.2, 0.3\}\), and \(\epsilon \in \{0.5, 0.8, 1.0\}\). For each parameter combination, we trained a separate model and selected the configuration that achieved the highest macro F1-score on the test set. All adversarial perturbations were applied under the \(L_2\) norm. The final adversarial training parameters for each dataset are summarized in Table 5.

Table 5 Dataset-specific adversarial training parameters
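The following sketch illustrates the two ingredients of this setup: the 27-point search grid and a PGD-k perturbation update projected onto an \(L_2\) ball of radius \(\epsilon \). It is a simplified, framework-free illustration, not the training code. It assumes that the perturbation is applied to an embedding vector, that each step follows the normalized gradient, and that the gradient is supplied by a callable `grad_fn`. The toy quadratic loss is hypothetical; in a PyTorch implementation the gradient would come from backpropagation.

```python
import itertools
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def project_l2(delta, eps):
    """Project `delta` back onto the L2 ball of radius `eps`."""
    norm = l2_norm(delta)
    if norm <= eps:
        return delta
    return [x * eps / norm for x in delta]

def pgd_perturbation(grad_fn, emb, k, alpha, eps):
    """Run k ascent steps of size alpha, projecting after each step."""
    delta = [0.0] * len(emb)
    for _ in range(k):
        grad = grad_fn([e + d for e, d in zip(emb, delta)])
        norm = l2_norm(grad)
        if norm == 0.0:
            break  # no ascent direction available
        step = [d + alpha * g / norm for d, g in zip(delta, grad)]
        delta = project_l2(step, eps)
    return delta

# Search grid from the text: 3 x 3 x 3 = 27 configurations.
grid = list(itertools.product([3, 4, 5], [0.1, 0.2, 0.3], [0.5, 0.8, 1.0]))

# Toy loss 0.5 * ||x||^2, whose gradient is x itself (hypothetical).
toy_grad = lambda x: list(x)
delta = pgd_perturbation(toy_grad, emb=[3.0, 4.0], k=3, alpha=0.3, eps=0.5)
print(len(grid), delta, l2_norm(delta))
```

With these toy values the perturbation reaches the boundary of the ball after the second step and stays there, which is the intended effect of the projection: the accumulated perturbation never exceeds \(\epsilon \) no matter how large k is.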

Experiments on our dataset

Baseline models

Since our objective is to train an end-to-end model for recognizing scientific knowledge entities in Chinese scientific literature, we compare the performance of our proposed model with that of baseline models on the dataset we constructed. To this end, we selected several mainstream open-source models that are either SOTA or near-SOTA and retrained them on our dataset. Specifically, we compare the StructBERT-AT-DMGP model with the following models, all of which use StructBERT as the backbone.

StructBERT-CRF: A character-level sequence labeling model that uses StructBERT as the encoder to produce character representations, with a CRF layer serving as the decoder to generate sequence labels.

StructBERT-BiLSTM-CRF: A character-level sequence labeling model that adds a BiLSTM layer to the StructBERT-CRF model to capture more contextual features. BiLSTM is a sequence processing model that combines two LSTM [59] layers operating in opposite directions: one processes the sequence forward, while the other processes it backward.

HGN: A character-level sequence labeling model designed to enhance NER by combining a Transformer-based Hero module for global context with a Gang module for local feature extraction [60]. In our experiments, the BERT/XLNet encoder is substituted with StructBERT.

Biaffine: Yu et al. [61] introduced the Biaffine model, framing the NER task as identifying the beginning and ending indices of entities and assigning types to the corresponding spans. In our experiments, we replace the BERT encoder with StructBERT.

SpanNER: Fu et al. [62] developed SpanNER, which models span representation by utilizing boundary representations and span length embeddings. In our experiments, the encoder is also substituted with StructBERT.

\({{\textbf {W}}}^{\textbf {2}}\)NER: Li et al. [63] modeled NER as a 2D grid of word pairs and utilized a co-predictor to infer word relations. Entities are then decoded by treating the predicted relations as a directed word graph. In our experiments, the BERT encoder is also replaced with StructBERT.

Global Pointer: A unified approach proposed by Su et al. [40] to handle flat and nested entities. Our model adopts the same attention mechanism and loss function as this approach. In our experiments, the BERT encoder is also substituted with StructBERT.

BOPN: Tang et al. [64] proposed the Boundary Offset Prediction Network (BOPN), designed to link entity and non-entity spans while mitigating the problem of an imbalanced sample space. In our experiments, the BERT/BioBERT encoder is also substituted with StructBERT.

StructBERT-MRC: Li et al. [32] pioneered the use of the MRC framework to handle both flat and nested entities. In our experiments, we substitute the BERT encoder with StructBERT and call this baseline model StructBERT-MRC.

PromptNER: Shen et al. [65] formulated the NER task as a blank completion task and consolidated entity identification and classification into a single round of prompt learning. This model can be considered a combination of span-based and MRC-based methods: it requires identifying the exact boundaries (the left and right tokens) of entities, which is a characteristic of span-based methods, and it uses prompts in a manner similar to reading comprehension tasks, where queries are posed and answers are sought. In our experiments, the BERT/RoBERTa encoder is also substituted with StructBERT.

Comparative results

In the experiments, each model was trained five times, and the run with the highest macro-average F1-score on the validation set was used for evaluation on the test set. The results of the comparative experiments are reported in Table 6. Bold values in Tables 6 and 7 indicate the best performance across all compared models.

Table 6 Results of comparative experiment

As shown in Table 6, StructBERT-AT-DMGP achieves the best performance across all three metrics, with a macro-average F1-score of 70.23%, indicating that it identifies the correct entities without many false positives or false negatives. Global Pointer also shows high precision and reasonable recall, and its F1-score (68.10%) is only 2.13 percentage points lower than that of StructBERT-AT-DMGP. This demonstrates the critical importance of global information in identifying Chinese scientific knowledge entities. BOPN ranks third with an F1-score of 63.78%, reflecting a relatively balanced performance in precision and recall. However, the performance of the other span-based methods, namely Biaffine (54.72%), SpanNER (57.32%), and \(\hbox {W}^2\)NER (51.18%), is comparatively weaker. Although these models maintain a balance between precision and recall, both metrics are relatively low, leading to lower F1-scores. This underperformance could be attributed to their limited ability to fully leverage contextual information, which constrains their overall effectiveness.

Among the sequence labeling methods, HGN achieves the highest F1-score of 59.90%, once again demonstrating that global context information is just as important as local feature information in enhancing the model’s performance. StructBERT-BiLSTM-CRF achieves a slightly higher F1-score (55.90%) than StructBERT-CRF (53.73%). This marginal improvement can be attributed to the BiLSTM layer in StructBERT-BiLSTM-CRF, which captures sequential dependencies and contextual information more effectively than the StructBERT encoder alone.

Apart from the above methods, PromptNER behaves differently: it achieves high precision (73.88%), but its significantly lower recall (54.63%) results in an F1-score of 61.55%. This indicates that PromptNER is highly selective in labeling entities, successfully reducing false positives but missing many true entities that are relatively ambiguous or complex. StructBERT-MRC achieves a moderate performance with an F1-score of 57.98%. The relatively good performance of these two models indicates that both the reading comprehension framework and the use of predefined prompt templates can enhance the model’s fine-grained understanding of context.

Ablation study

To verify the effectiveness of the modules integrated into the proposed model, we performed an ablation study. Table 7 presents the results of the ablation experiments based on macro-average metrics, where w/o indicates the removal of specific components, AT refers to adversarial training, and DM denotes the Dual-Masked mechanism incorporated into Global Pointer. Compared to the complete StructBERT-AT-DMGP model, removing either the DM or AT module results in a decline in F1-score. Specifically, excluding adversarial training leads to a drop of 0.47 percentage points, while removing the Dual-Masked mechanism causes a reduction of 0.62 percentage points. The performance degradation becomes more pronounced when both modules are removed (denoted as “w/o All”), with the F1-score decreasing to 68.10%, which is 2.13 percentage points lower than that of the full model.

Table 7 Results of ablation experiment

These results suggest that both components contribute meaningfully to the overall performance of the model. Adversarial training enhances robustness by improving the model’s tolerance to annotation noise, thereby promoting the learning of more stable decision boundaries. In contrast, the Dual-Masked mechanism encourages the model to concentrate on span regions where entities are more likely to occur, effectively suppressing false positives on overly long spans. This targeted focus not only improves boundary localization but also facilitates a better balance between precision and recall. Together, these two components strengthen the model’s generalization ability and its effectiveness in recognizing diverse entity types.

To further illustrate the learning dynamics of different configurations, we present the training loss and validation F1-score curves in Fig. 6. The training curves indicate that models with AT or DM tend to converge faster and more smoothly. In particular, StructBERT-AT-DMGP yields a near-minimal loss and the highest F1-score with earlier convergence (epoch 29), suggesting improved training stability and generalization. In contrast, “w/o All” converges slightly later (epoch 32) but yields the lowest F1-score on the validation set, highlighting the cumulative impact of removing both modules.

Fig. 6 Training loss (left) and validation F1-score (right) for ablation variants. In the left plot, the dotted lines correspond to the final training epochs determined by early stopping. In the right plot, dotted lines indicate the epochs at which each model achieved its highest validation F1-score

Robustness under noisy labels

To further evaluate the effectiveness of adversarial training in improving model robustness under noisy supervision, we designed a series of controlled experiments in which synthetic label noise is injected into the training data to mimic the annotation errors and inconsistencies that typically arise in manually labeled corpora. Specifically, we simulate three representative types of annotation noise observed in real-world NER tasks: boundary errors, label errors, and missing annotations. These types of noise are introduced through the following methods:

  • Boundary Shift: To simulate boundary errors, the start and/or end indices of entity spans are randomly shifted by 1 to 2 characters, resulting in inaccurate boundary annotations.

  • Label Corruption: To simulate label errors, the correct entity type is randomly replaced with an incorrect one sampled from the predefined label set, excluding the original type.

  • Entity Removal: To simulate missing annotations, one or more entities are randomly removed from the original annotation list.

To simulate different levels of annotation noise, we define two control parameters: (1) \(\rho \), the entity-level noise ratio, which specifies the proportion of entities within each selected sample that are perturbed; and (2) \(\eta \), the sample-level noise ratio, which denotes the proportion of training samples that are exposed to noise. In all experiments, we fix \(\rho = 0.1\) to reflect realistic scenarios in which only a small portion of the annotations within a noisy instance are affected. We vary \(\eta \) to control the overall degree of corruption in the dataset.
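The three noise types and the two ratios \(\rho \) and \(\eta \) can be sketched as a single injection routine. This is an illustrative sketch, not the experimental code: the function name, the rounding rules, the choice to perturb at least one entity per selected sample, the clamping of shifted indices to valid positions, and the toy records are all assumptions.

```python
import copy
import random

def inject_noise(records, noise_type, label_set, eta, rho=0.1, seed=0):
    """Return a noisy copy of `records`.

    eta: share of samples that receive noise (sample-level ratio).
    rho: share of entities perturbed inside each selected sample.
    noise_type: "boundary", "label", or "removal".
    """
    rng = random.Random(seed)
    noisy = copy.deepcopy(records)
    for rec in rng.sample(noisy, round(eta * len(noisy))):
        ents = rec["entities"]
        if not ents:
            continue
        # Assumption: perturb at least one entity per selected sample.
        n_ents = max(1, round(rho * len(ents)))
        chosen = set(rng.sample(range(len(ents)), n_ents))
        if noise_type == "removal":
            rec["entities"] = [e for i, e in enumerate(ents) if i not in chosen]
            continue
        last = len(rec["text"]) - 1
        for i in chosen:
            ent = ents[i]
            if noise_type == "label":
                # Replace the type with a different one from the label set.
                ent["type"] = rng.choice([t for t in label_set if t != ent["type"]])
            elif noise_type == "boundary":
                # Shift start and/or end by 1 to 2 characters; clamping keeps
                # the span valid and may cancel a shift at the text edges.
                mode = rng.choice(["start", "end", "both"])
                shifts = [-2, -1, 1, 2]
                if mode in ("start", "both"):
                    ent["start_idx"] = min(max(0, ent["start_idx"] + rng.choice(shifts)), ent["end_idx"])
                if mode in ("end", "both"):
                    ent["end_idx"] = max(min(last, ent["end_idx"] + rng.choice(shifts)), ent["start_idx"])
                ent["name"] = rec["text"][ent["start_idx"]: ent["end_idx"] + 1]
    return noisy

# Toy data: 10 records, each with 10 entities and hypothetical type labels.
labels = ["A", "B", "C"]
text = "abcdefghij" * 5
clean = [
    {"text": text,
     "entities": [{"name": text[5 * i: 5 * i + 3], "type": labels[i % 3],
                   "start_idx": 5 * i, "end_idx": 5 * i + 2} for i in range(10)]}
    for _ in range(10)
]
label_noisy = inject_noise(clean, "label", labels, eta=0.4)
removed = inject_noise(clean, "removal", labels, eta=1.0)
shifted = inject_noise(clean, "boundary", labels, eta=0.6)
```

With \(\rho = 0.1\) and ten entities per toy record, each selected record has exactly one entity perturbed, so \(\eta \) alone controls how much of the training set is corrupted.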

Figure 7 presents the performance trends under different levels of synthetic noise for three types of annotation errors: boundary error, label error, and missing annotation. We compare the proposed StructBERT-AT-DMGP model with its non-adversarial variant (w/o AT) across varying values of sample-level noise ratio \(\eta \in \{0.2, 0.4, 0.6, 0.8, 1.0\}\). Across all noise types, our model consistently outperforms the non-adversarial baseline, demonstrating enhanced robustness against increasing label noise.

Under boundary error noise, the performance gap becomes more pronounced as noise intensity increases. When \(\eta = 1.0\), StructBERT-AT-DMGP retains an F1-score of 66.80%, while the baseline drops to 65.10%. This indicates that adversarial training helps the model learn more noise-invariant decision boundaries, especially in scenarios involving boundary perturbations.

For label error noise, although both models exhibit similar performance under low noise conditions, the advantage of adversarial training emerges as \(\eta \) increases. At \(\eta = 0.8\), our model achieves 66.66%, outperforming the baseline by nearly 1 point. This suggests that adversarial training mitigates label confusion introduced by incorrect entity types.

In the case of missing annotation noise, the proposed model demonstrates stronger resilience. Notably, when \(\eta = 1.0\), our method maintains 65.94% F1, whereas the baseline degrades more sharply to 64.10%. Since missing annotations can distort the entity distribution and increase supervision sparsity, the regularization effect of adversarial training plays a crucial role in stabilizing the learning process.

Fig. 7 Performance comparison under three types of synthetic annotation noise. StructBERT-AT-DMGP consistently demonstrates stronger robustness across increasing noise ratios (\(\eta \))

Overall, these results validate the effectiveness of adversarial training in improving model robustness under different types of annotation noise. The observed improvements are consistent across noise types and intensities, highlighting the value of integrating perturbation-based regularization in real-world NER settings with imperfect annotations.

Experiments on public datasets

We further evaluated StructBERT-AT-DMGP on three widely used Chinese NER benchmarks: MSRA, Resume, and CLUENER. As shown in Table 8, our model achieves strong and consistent performance across all datasets, with F1-scores of 96.25% on MSRA, 96.62% on Resume, and 83.03% on CLUENER. These results are competitive with, or superior to, recent state-of-the-art models, including those specifically tailored for individual datasets. In particular, our model outperforms several recent adversarial or pointer-based approaches on MSRA and Resume, and ranks among the top methods on CLUENER. The consistently high scores across these benchmarks suggest that the proposed method generalizes effectively beyond the scientific literature domain and remains robust under varied annotation schemes and entity characteristics.

We also performed ablation experiments on each dataset to assess the effectiveness of adversarial training and the Dual-Masked mechanism. Results in Table 9 show that removing either module leads to a consistent drop in performance, and that removing both yields the lowest results. This confirms that both AT and DM contribute meaningfully to the model’s generalization and robustness across domains.

Table 8 Precision (P), Recall (R), and F1-score (F1) comparison on three public Chinese NER datasets
Table 9 Ablation study results on MSRA, Resume, and CLUENER

Discussion

In this section, we will discuss the theoretical and practical implications of this study, along with its limitations.

Implications

From a theoretical perspective, we have proposed a novel Chinese scientific knowledge entity recognition model, termed StructBERT-AT-DMGP, which can serve as a methodological reference and guide for future research. Specifically, we incorporate adversarial training into the NER framework to reduce the detrimental effects of noisy supervision arising from annotation mistakes and inconsistencies. In addition, we introduce a Dual-Masked mechanism that incorporates span-length constraints into the training process, guiding the model to avoid predicting implausibly long spans as entities. Together, these two components contribute to more accurate and generalizable scientific knowledge entity recognition, and provide insights into designing noise-tolerant and structure-aware NER systems.

As for the practical implications, the scientific knowledge entities extracted using the model proposed in this paper not only help researchers quickly identify and understand the latest advancements in their fields, but can also be used for subsequent in-depth, fine-grained research. These entities can serve as fundamental units for storage, analysis, or evaluation, thereby playing a pivotal role in knowledge base construction, knowledge evolution analysis, and scientific evaluation. Specifically, the automatic extraction of scientific knowledge entities from scientific literature in the general domain greatly reduces the time and effort needed for data processing. This capability is crucial for developing dynamic knowledge bases that can continuously adapt to new research findings. Moreover, integrating these knowledge entities across different scientific domains supports interdisciplinary research by allowing insights from diverse fields such as biology, chemistry, and physics to be synthesized, which may lead to breakthrough innovations. Additionally, by providing a structured way to assess scientific output, these entities contribute to more objective and transparent scientific evaluations.

Limitations

Although our work covers the entire process of scientific knowledge entity recognition as comprehensively as possible, this study still has certain limitations. First, while the training corpus covers eight academic disciplines, the volume and diversity of the training data in each field remain limited because of the cost of manual annotation. Furthermore, given the highly specialized and terminology-intensive nature of scientific literature, the accuracy cannot yet match that of common entity recognition tasks. Although the scientific knowledge entity recognition model we have developed already significantly surpasses mainstream models in terms of precision, there is still room for improvement. Lastly, this study focuses on extracting knowledge entities from the abstracts of scientific literature rather than the full texts. While abstracts contain critical information from the articles, they may still omit other important knowledge points found in the full text.

Conclusion

Knowledge entities in scientific literature are of great significance for researchers to grasp the core content of a field and conduct further fine-grained studies. In this paper, we propose a more comprehensive and semantically enriched schema for Chinese scientific knowledge entity recognition, defining nine fine-grained categories that capture core components of scientific research across eight representative domains. Based on this schema, we construct a high-quality annotated corpus that offers broader semantic coverage and domain generality, serving as a foundational resource for advancing Chinese scientific literature analysis.

To address the challenges of annotation noise and boundary complexity in Chinese scientific literature, we develop a novel model termed StructBERT-AT-DMGP, which integrates adversarial training and a Dual-Masked mechanism into the span-based NER framework. Adversarial training is employed as an implicit regularization strategy to alleviate the impact of noisy supervision arising from cross-disciplinary annotation inconsistencies and span-level ambiguity. Meanwhile, the Dual-Masked mechanism explicitly incorporates entity length constraints into the span prediction, reducing the likelihood of overlong and implausible entity spans. Together, these components enhance the model’s robustness and improve boundary detection in structurally complex scenarios.

We evaluated the proposed model on both our constructed corpus and three widely used public Chinese NER benchmarks: MSRA, Resume, and CLUENER. On our dataset, the model outperformed several mainstream open-source baselines, and ablation studies confirmed the utility of the proposed components. On public benchmarks, the model achieved competitive or superior performance compared with state-of-the-art methods, further demonstrating its cross-domain generalizability and robustness.

In future work, we will explore recognizing scientific knowledge entities in longer texts, meaning that the input of the model will not be limited to paper abstracts but will be extended to full texts.