Phase 01 · Cleaning

What the cleaner does, on real records.

Five failure modes show up in the raw dataset. Each section below shows a live sample record, with the raw input on the left and the cleaner's output on the right.

Issue 01

Unicode noise

Scraped abstracts arrive with smart quotes, em-dashes, and non-breaking spaces. Downstream tokenizers treat these as tokens distinct from their ASCII equivalents. The cleaner normalizes them to plain ASCII punctuation.

Before

Raw reference

aid
1908.07919
mid
2412782625
  • @cite_107 · mid 2412782625 · invalid · unicode noise
    In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First , we highlight convolution with upsampled filters, or atrous convolution, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second , we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third , we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed DeepLab system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7 percent mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.

What changed

  • Curly quotes ‘ ’ “ ” → straight ' "
  • Em-dashes — → -
  • Non-breaking spaces → regular spaces
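
The three substitutions above can be sketched as a single translation table. This is a minimal illustration, not the cleaner's actual code: the names PUNCT_MAP and normalize_punctuation are hypothetical, and the table covers only the characters listed above.

```python
# Hypothetical translation table covering only the characters named above.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
    "\u2014": "-",   # em-dash
    "\u00a0": " ",   # non-breaking space
})


def normalize_punctuation(text: str) -> str:
    """Replace curly quotes, em-dashes, and non-breaking spaces with ASCII."""
    return text.translate(PUNCT_MAP)
```

Any character outside the table passes through unchanged, so legitimate non-ASCII text such as accented author names is left alone.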

Issue 02

Table-of-contents abstract

Some cited papers have no real abstract on the source page: the field contains a numbered outline (“I. Introduction … II. The model…”) instead of prose. The cleaner detects the pattern, flags the reference with invalid_reason: outline, and drops it.

Before

Raw record

aid
cs9809108
mid
2949225035

Query abstract

We present our approach to the problem of how an agent, within an economic Multi-Agent System, can determine when it should behave strategically (i.e. learn and use models of other agents), and when it should act as a simple price-taker. We provide a framework for the incremental implementation of modeling capabilities in agents, and a description of the forms of knowledge required. The agents were implemented and different populations simulated in order to learn more about their behavior and the merits of using and learning agent models. Our results show, among other lessons, how savvy buyers can avoid being cheated'' by sellers, how price volatility can be used to quantitatively predict the benefits of deeper models, and how specific types of agent populations influence system behavior.

Related work

Within the MAS community, some work @cite_15 has focused on how artificial AI-based learning agents would fare in communities of similar agents. For example, @cite_6 and @cite_8 show how agents can learn the capabilities of others via repeated interactions, but these agents do not learn to predict what actions other might take. Most of the work in MAS also fails to recognize the possible gains from using explicit agent models to predict agent actions. @cite_9 is an exception and gives another approach for using nested agent models. However, they do not go so far as to try to quantify the advantages of their nested models or show how these could be learned via observations. We believe that our research will bring to the foreground some of the common observations seen in these research areas and help to clarify the implications and utility of learning and using nested agent models.

What changed

  • @cite_15 flagged invalid_reason: outline (roman-numeral table of contents, no prose)
  • @cite_8 flagged invalid_reason: empty (source page returned no abstract)
  • Kept 2 of 4 references for the downstream extractive step
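
One way to detect the outline pattern is to count roman-numeral headings. The sketch below is an assumption about how such a check might look: the regex, the threshold of two headings, and the name looks_like_outline are all hypothetical, and the actual rule the cleaner uses is not described here.

```python
import re

# A roman numeral, a period, whitespace, then a capitalized word: "II. The".
# Hypothetical pattern; it only covers numerals built from I, V, and X.
ROMAN_HEADING = re.compile(r"\b[IVX]+\.\s+[A-Z]")


def looks_like_outline(abstract: str, min_headings: int = 2) -> bool:
    """Return True when the text contains at least min_headings headings."""
    return len(ROMAN_HEADING.findall(abstract)) >= min_headings
```

Requiring at least two headings is a guard against ordinary prose that happens to mention a single numbered part; the right threshold would need checking against real records.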

Issue 03

Literal “Abstract:” prefix

Some reference abstracts carry the literal word “Abstract:” over from the source page. That leaks a non-content token into the summary and shifts word counts. The cleaner strips the prefix and keeps the rest of the text.

Before

Raw record

aid
1908.07919
mid
2969825080

Query abstract

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams ; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at this https URL .

Related work

The fully convolutional network is extended, by replacing a few (typically two) strided convolutions and the associated convolutions with dilated convolutions, to the dilation version, leading to medium-resolution representations @cite_81 @cite_107 @cite_10 @cite_46 @cite_64 . The representations are further augmented to multi-scale contextual representations @cite_81 @cite_107 @cite_140 through feature pyramids for segmenting objects at multiple scales.

What changed

  • @cite_46: leading “Abstract:” prefix stripped
  • @cite_64 and @cite_10 flagged invalid_reason: empty (no text)
  • Kept 4 of 6 references
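
Stripping the prefix is a one-line anchored substitution. A minimal sketch follows, assuming the prefix may vary in case and surrounding whitespace (an assumption, since only the literal “Abstract:” form is shown); the name strip_abstract_prefix is hypothetical.

```python
import re

# Anchored at the start so "abstract:" later in the text is never touched.
# Case-insensitive matching is an assumption, not confirmed by the source.
ABSTRACT_PREFIX = re.compile(r"^\s*abstract\s*:\s*", re.IGNORECASE)


def strip_abstract_prefix(text: str) -> str:
    """Remove one leading 'Abstract:' label and keep the remaining text."""
    return ABSTRACT_PREFIX.sub("", text, count=1)
```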

Issue 04

Related-work is only citations

A handful of records have a related-work paragraph that is literally just @cite_x tokens strung together — zero narrative connecting them. There is nothing for the extractive stage to work with, so the cleaner drops the whole record.

Before

Raw record

aid
1908.04464
mid
2967193285

Query abstract

Entity linking is a fundamental database problem with applicationsin data integration, data cleansing, information retrieval, knowledge fusion, and knowledge-base population. It is the task of accurately identifying multiple, differing, and possibly contradictingrepresentations of the same real-world entity in data. In this work,we propose an entity linking system capable of linking entitiesacross different databases and mentioned-entities extracted fromtext data. Our entity linking solution, called Certus, uses a graph model to represent the profiles of entities. The graph model is versatile, thus, it is capable of handling multiple values for an attributeor a relationship, as well as the provenance descriptions of thevalues. Provenance descriptions of a value provide the settings ofthe value, such as validity periods, sources, security requirements,etc. This paper presents the architecture for the entity linking system, the logical, physical, and indexing models used in the system,and the general linking process. Furthermore, we demonstrate theperformance of update operations of the physical storage modelswhen the system is implemented in two state-of-the-art databasemanagement systems, HBase and Postgres.

Related work

@cite_4 @cite_23 @cite_25 @cite_26 @cite_20 @cite_17 @cite_21 @cite_15 @cite_28 @cite_8 @cite_18 @cite_12 @cite_22 @cite_14 @cite_27 @cite_6 @cite_5 @cite_19 @cite_11 @cite_16

What changed

  • Detected: related_work contains only @cite_x tokens plus whitespace
  • Record dropped entirely (aid, related work, all references)
  • Nothing forwarded to the extractive stage
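
The detection rule above maps directly onto a full-string match. The sketch below assumes citation tokens always have the form @cite_ followed by digits, as in the sample record; the name is_citation_only is hypothetical.

```python
import re

# One or more @cite_N tokens separated by whitespace, and nothing else.
CITE_ONLY = re.compile(r"(?:\s*@cite_\d+)+\s*")


def is_citation_only(related_work: str) -> bool:
    """Return True when the paragraph is only @cite_N tokens and whitespace."""
    return CITE_ONLY.fullmatch(related_work) is not None
```

An empty string does not match, because at least one citation token is required. Empty related-work fields would need their own check.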

Issue 05

Empty reference abstract

The source page returns no abstract text for a cited paper. The record itself is fine — just that one reference has nothing to summarize. The cleaner marks the ref invalid_reason: empty and drops it from ref_abstracts.

Before

Raw reference

aid
cs9809108
mid
  • @cite_8 · invalid · empty
    (empty string)

What changed

  • @cite_8 had an empty abstract string in the raw record
  • Marked invalid_reason: empty
  • Removed from ref_abstracts — record kept, other refs unaffected
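
The steps above can be sketched as a filter over ref_abstracts. The record layout here is an assumption: ref_abstracts is taken to be a dict mapping each citation key to a dict with an "abstract" field. The function name drop_empty_refs and the choice to treat whitespace-only strings as empty are also hypothetical.

```python
from typing import Dict, Tuple


def drop_empty_refs(record: dict) -> Tuple[dict, Dict[str, str]]:
    """Split references into kept ones and ones marked invalid_reason: empty.

    Assumes record["ref_abstracts"] maps citation keys to dicts that hold
    an "abstract" string (hypothetical layout).
    """
    kept = {}
    invalid = {}
    for cite, ref in record["ref_abstracts"].items():
        # Missing, None, and whitespace-only abstracts all count as empty.
        if (ref.get("abstract") or "").strip():
            kept[cite] = ref
        else:
            invalid[cite] = "empty"
    # Copy the record so the raw input is left unmodified.
    cleaned = {**record, "ref_abstracts": kept}
    return cleaned, invalid
```

Returning the invalid map alongside the cleaned record keeps the reason for each drop available for reporting, while the record and its other references pass through untouched.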