Annual Meeting Reports

Dealing with Metadata: Content, Distribution, and Availability

Marcia Zeng began the session with an overview of metadata: “structured, encoded data to describe characteristics of information-bearing entities, or things” that aid in the discovery and identification of those things. Metadata can be technical, descriptive, and administrative: for example, metadata for a photograph can include technical information (resolution and file size); descriptive information (what the image is, where it was taken); and administrative data (where the original is located and copyright or license details). Zeng also discussed “learned” metadata, such as Amazon recommendations based on user browsing and buying history. In Amazon’s case, users of metadata also contribute to the metadata and, in doing so, help the system evolve.
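Zeng’s three categories can be sketched as a single structured record. The following is a minimal illustration using a hypothetical photograph; the field names are invented for the example, not drawn from any particular standard.

```python
# A hypothetical photograph record grouping the three metadata categories.
photo_metadata = {
    "technical": {
        "resolution": "4032x3024",
        "file_size_bytes": 2_485_760,
        "format": "JPEG",
    },
    "descriptive": {
        "title": "Harbor at dawn",
        "subject": ["harbor", "sunrise", "boats"],
        "location_taken": "Portland, Maine",
    },
    "administrative": {
        "original_location": "Archive drawer 12, folder 3",
        "license": "CC BY-NC 4.0",
        "rights_holder": "Jane Doe",
    },
}

def matches(record, term):
    """Discovery relies on descriptive fields: match a query term
    against the subject list and title."""
    desc = record["descriptive"]
    return term in desc["subject"] or term in desc["title"].lower()
```

Searching for "harbor" finds this record through its descriptive fields, while the technical and administrative layers serve preservation and rights management rather than discovery.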

Todd Carpenter spoke about standards. Metadata are necessary to navigate a digital environment and must be structured according to regular, defined standards, which, ideally, are governed by standards organizations. Each set of standards is highly specialized according to the purpose and type of data collected and the community and domain using the data. Not all attributes of a thing can or should be identified or described, but useful information should be provided. One can apply one’s own “functional granularity” based on business needs. For example, a magazine publisher does not need to differentiate between copies of the same issue, but libraries do; each institution applies its own tracking data to the existing publication and issue information provided by the publisher. The International Standard Link Identifier (ISLI) has been developed to more effectively link the different standards, structures, and layers of metadata in a clear, machine-readable way. Carpenter stressed the importance of managing identifiers and metadata. Although difficult and expensive, maintenance is worth it; the alternative, poorly managed data, costs more in the long run.
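Carpenter’s magazine example can be sketched as layered records: the publisher’s issue-level record is shared, and each library wraps it with its own copy-level tracking. The identifiers and field names below are placeholders for illustration only.

```python
# The publisher identifies content at the issue level; copies of the
# same issue are interchangeable from its point of view.
publisher_record = {
    "issn": "0000-0000",          # placeholder serial identifier
    "title": "Example Magazine",
    "issue": "2014-05",
}

def make_library_copy(publication, barcode, shelf):
    """A library applies its own functional granularity, layering
    copy-level tracking data on top of the publisher's record."""
    return {
        "publication": publication,   # shared, standard layer
        "barcode": barcode,           # institution-specific layer
        "shelf_location": shelf,
    }

copy_a = make_library_copy(publisher_record, "39000123456789", "Stacks 3B")
copy_b = make_library_copy(publisher_record, "39000123456790", "Stacks 3B")
```

The two copies share one publication record but carry distinct barcodes, which is the layering of institution-specific data over publisher-provided data that Carpenter described.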

Marjorie Hlava offered nine steps to implementing metadata in a workflow. The first five steps focused on setup and basic functionality: construct a taxonomy of subject metadata; apply it to legacy content for semantic enrichment; integrate indexing tools into preexisting systems, including websites and manuscript-submission systems; simplify metadata gathering and indexing and collect most information early in the process; and use indexing in searches for faster browsing and more accurate results. Step 6 involves metadata maintenance. As new concepts are introduced into a field or a field expands, the taxonomy will need to keep pace. The final three steps involve further leveraging metadata: develop add-ons based on the acquired data, such as auto-assignment of reviewers at submission and semantic fingerprinting for disambiguated author pages; enhance search features such as search suggestions and article recommendations; and use metadata-driven analytics to determine trends over time and to better understand both author and user behavior.
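The first and fifth of Hlava’s steps, building a subject taxonomy, indexing content against it, and using the index in search, can be sketched in a few lines. The taxonomy terms and documents below are invented for illustration.

```python
# A tiny subject taxonomy: each subject maps to indicative keywords.
taxonomy = {
    "optics": ["laser", "photon", "lens"],
    "acoustics": ["sound", "ultrasound", "vibration"],
}

def index_document(doc_id, text, index):
    """Assign subject terms whose keywords appear in the text
    (semantic enrichment of legacy content)."""
    words = set(text.lower().split())
    for subject, keywords in taxonomy.items():
        if words & set(keywords):
            index.setdefault(subject, set()).add(doc_id)

def search(index, subject):
    """Use the prebuilt index for fast, accurate subject search."""
    return sorted(index.get(subject, set()))

index = {}
index_document("a1", "A tunable laser lens assembly", index)
index_document("a2", "Ultrasound vibration damping", index)
```

Step 6, maintenance, corresponds here to updating `taxonomy` as new concepts enter the field and re-indexing affected content.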

Matt Stratton offered practical production applications of metadata used at the American Institute of Physics (AIP). In AIP’s submission system, metadata are collected via author-input forms (e.g., open-access choice, funding information, and keywords) or generated by the system itself (such as submission and acceptance dates). The submission system also collects metadata on reviewers, including quality and timeliness metrics and subject matter reviewed. AIP is developing functionality to cross-reference these two sets to automatically suggest reviewers for papers.
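The cross-referencing idea can be sketched as ranking reviewers by the overlap between their subject history and a submission’s keywords. The data structures and scoring below are assumptions for illustration, not AIP’s actual implementation.

```python
# Hypothetical reviewer metadata: subjects reviewed plus a timeliness metric.
reviewers = {
    "r1": {"subjects": {"plasma", "fusion"}, "on_time_rate": 0.90},
    "r2": {"subjects": {"optics", "laser"}, "on_time_rate": 0.70},
    "r3": {"subjects": {"laser", "photonics"}, "on_time_rate": 0.95},
}

def suggest_reviewers(keywords, reviewers, top_n=2):
    """Rank reviewers by subject overlap with the submission's
    keywords, breaking ties by timeliness."""
    scored = []
    for rid, info in reviewers.items():
        overlap = len(info["subjects"] & set(keywords))
        if overlap:
            scored.append((overlap, info["on_time_rate"], rid))
    scored.sort(reverse=True)
    return [rid for _, _, rid in scored[:top_n]]
```

A submission tagged "laser" and "optics" would surface the reviewers whose histories best match, which is the kind of automatic suggestion Stratton described AIP developing.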

AIP’s Scitation platform creates disambiguated author pages based on semantic fingerprints; the fingerprint pulls from common topics in an author’s history across institutions and name presentations and excludes results from other authors with the same name or similar names. In the question-and-answer session, Stratton added that this feature was largely accurate but did require refinement, echoing both Carpenter’s and Hlava’s points about necessary maintenance. The website also collects learned metadata, including reader behavior, imported citation data, and article-level metrics. When users download a PDF, they see a cover page with article recommendations that change as the system acquires new data.
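The fingerprinting approach can be sketched as a topic-frequency profile per author: a new paper is attached to the same-name candidate whose profile shares the most topics, and rejected when no candidate overlaps. The threshold, data, and scoring are hypothetical, not Scitation’s actual algorithm.

```python
from collections import Counter

def fingerprint(papers):
    """Build a topic-frequency profile from an author's paper history."""
    return Counter(topic for topics in papers for topic in topics)

def best_candidate(new_topics, candidates, min_overlap=1):
    """Attach a new paper to the same-name author whose fingerprint
    overlaps its topics most; return None if no one overlaps enough."""
    best, best_score = None, 0
    for name, fp in candidates.items():
        score = sum(fp[t] for t in new_topics)
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= min_overlap else None

# Two authors who share a name but work in different fields.
smith_physicist = fingerprint([["plasma", "fusion"], ["plasma", "tokamak"]])
smith_chemist = fingerprint([["polymers", "catalysis"]])
candidates = {"J. Smith (physics)": smith_physicist,
              "J. Smith (chemistry)": smith_chemist}
```

The refinement Stratton mentioned corresponds here to tuning the threshold and correcting misattributed papers, which is the ongoing maintenance Carpenter and Hlava also emphasized.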

Clean, well-structured metadata provide immediate benefits for content management and open the door to new technologies and features. Maintaining metadata is often expensive and time-consuming, but the investment pays for itself.