May 8, 2024
Stop squandering data: make units of measurement machine-readable

Technicians work on NASA's Mars Climate Orbiter which was launched into space on December 11, 1998.

Technicians work on NASA’s Mars Climate Orbiter. It burned up near the planet because two teams had used different units to calculate thrust. Credit: NASA

In 1999, when NASA’s Mars Climate Orbiter missed its intended orbit and burned up in the Martian atmosphere, the media had a field day over the reason: one team had used metric units in its thrust calculations, another, imperial. The navigation software that exchanged this information lacked a built-in process to check units. So when one team’s software produced data in imperial units rather than the expected metric ones, the spacecraft was set on the wrong trajectory. The result was the loss of five years of effort and hundreds of millions of taxpayers’ dollars.

Two decades on, such problems persist. Researchers across fields often assume that their colleagues understand details without specifying them, and are therefore remiss when documenting units. Sometimes they leave them out entirely, provide ones that have multiple definitions or use units of convenience that have never been formally recognized.

Humans struggle to interpret numbers with sloppy or missing units, and it is much more difficult when computers are involved. Most software packages, data-management tools and programming languages lack built-in support for associating units with numeric data (with the exception of the language F#). This means that information is essentially stored and managed as ‘unitless’ values. Disciplines including bioscience and aerospace engineering have adopted conventions for unit representation, such as the Unified Code for Units of Measure (UCUM) and the Quantities, Units, Dimensions, and Types (QUDT) Ontology. But there are no broadly agreed technical specifications for how to represent quantities and their associated units without confusing machines.
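To make the idea of a value that is not ‘unitless’ concrete, here is a minimal Python sketch. It assumes a hypothetical `Quantity` class and a hypothetical `to_newton_seconds` helper, and the labels `N.s` and `lbf.s` are illustrative strings rather than codes taken from any particular standard. The point is only that a number stored together with its unit can refuse an unsafe operation, which a bare float cannot.

```python
from dataclasses import dataclass

# Exact by definition: 1 lbf = 0.45359237 kg * 9.80665 m/s^2
LBF_IN_NEWTONS = 4.4482216152605


@dataclass(frozen=True)
class Quantity:
    """Hypothetical container: a numeric value that travels with its unit label."""
    value: float
    unit: str

    def __add__(self, other: "Quantity") -> "Quantity":
        # Refuse to combine values whose unit labels differ.
        if self.unit != other.unit:
            raise ValueError(f"unit mismatch: {self.unit!r} vs {other.unit!r}")
        return Quantity(self.value + other.value, self.unit)


def to_newton_seconds(q: Quantity) -> Quantity:
    """Convert an impulse to newton seconds, rejecting unknown unit labels."""
    if q.unit == "N.s":
        return q
    if q.unit == "lbf.s":
        return Quantity(q.value * LBF_IN_NEWTONS, "N.s")
    raise ValueError(f"cannot convert {q.unit!r} to 'N.s'")
```

With this sketch, adding `Quantity(1.0, "N.s")` to `Quantity(1.0, "lbf.s")` raises an error instead of silently returning 2.0; the sum only succeeds after an explicit conversion. Real unit systems handle far more than one conversion, so this shows the principle rather than a usable library.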

There have been many calls in recent years to make data sets FAIR (Findable, Accessible, Interoperable and Reusable), and to ensure that open data abide by the 5-star deployment scheme suggested by World Wide Web inventor Tim Berners-Lee, which aims to make them findable, free and structured. Many researchers are now committed to depositing data in free and open repositories with appropriate metadata.

Chaos around units undermines these efforts. Already, many scientists invest more time in wrangling data than doing research. When data are not interoperable or machine readable, researchers’ individual informatics approaches are thwarted. The benefits of data sharing shrink.

Unless we take steps to ensure that measurement units are routinely documented for easy, unambiguous exchange of data, information will be unusable or, worse, be misinterpreted. All global challenges, from pandemics to climate change, require high-quality data across multidisciplinary, international sources. Mistakes and lost opportunities will cost humanity much more than hundreds of millions of dollars for a single crashed spacecraft.

We are a group of scientists who are tackling this challenge, with backgrounds in chemistry, computer science, metrology and more. In 2018, the global collaboration CODATA (Committee on Data of the International Science Council) formed the Task Group on Digital Representation of Units of Measurement (DRUM). The goal of DRUM is to work with international science unions under the International Science Council to raise awareness of units and quantities in digital formats and to enable their communities to represent them. In 2019, another group — the International Committee for Weights and Measures (CIPM), an intergovernmental association — formed the Digital International System of Units (Digital SI). The Digital SI Expert Group has goals that are complementary to those of DRUM, focusing on worldwide agreed norms for unit representation in the metrology community. All authors of this Comment article are members of one or both of these groups.

Now, a few years into our mission, we need the community’s help. We ask scientists, information technologists and standards organizations to provide us with case studies, problem areas, pain points and solutions (see ‘Call to action’).

Call to action

Here’s how everyone can help to create interoperable data with machine-readable quantities and units of measurement.

Scientists: Pay attention to whether units are present and properly annotated. Demand that your software or analysis tools are able to associate quantities with units. Use symbols that can be widely understood.

Developers: Be aware of the broadly adopted digital representation systems for units. Choose one to incorporate in your systems.

Funders: Support development efforts to build fully interoperable representation platforms and services for units.

Everyone: Share your use cases, pain points and solutions (contact [email protected]). Find out whether your professional society or science union has a designated ambassador and get in touch.

Unitless world

Plenty of measurements are taken and reported without units in the everyday world. The units are often assumed for a particular context. Take temperature — ‘in the 20s’ is bitter cold in the United States, which uses Fahrenheit, but a mild summer day in countries that use Celsius. And cholesterol is measured either in milligrams per decilitre or millimoles per litre, depending on the country. Skilled people can usually infer what is meant by unitless numbers in scientific papers and data sets, but not always. The task of untangling such issues is even harder for computers, which cannot generally draw on context and common sense.

Some units mean different things in different situations. A Calorie with a capital C, used to describe food energy, is equal to 1 kilocalorie — conventionally the amount of energy needed to heat a kilogram of water by 1 °C at standard atmospheric pressure. So, calories and Calories differ by a factor of 1,000, but the term cal (lower-case c) is used extensively for both. Although the intended meaning might be obvious to a person interested in thermodynamics or the nutritional value of a hamburger, it is obscure to a computer. Likewise, the gravitational constant G is often confused with g, the local acceleration due to gravity, yet g is also used for grams. The metre is sometimes written as M, which is also the prefix mega, and the unit for molarity. These conventions and more cause computers to stumble.

Often, the same quantities are represented in different units. Solubility, for example, is legitimately expressed as kilograms per litre (kg l⁻¹) or moles per cubic decimetre (mol dm⁻³). These can be converted easily, but only if units are documented properly. And sometimes the same unit is written in multiple ways. A microgram can be written as mcg, ug or µg. Acceleration in metres per second squared can be represented as m/s², m/s^2, m/s2 or m.s⁻². Typesetting conventions use a range of character sets, italics, bolding, slashes, superscripts and subscripts. These are clear to humans, but too inconsistent to be read reliably by machines. There are too many units and too many variations to automate parsing or to map them all into an unambiguous and interoperable representation.
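The spelling problem can be illustrated with a small Python sketch. It assumes a hypothetical, hand-written alias table (`ALIASES`) and a hypothetical function `canonical_unit`; the canonical labels `ug` and `m/s2` are arbitrary choices for this example. It covers only the handful of spellings mentioned above, which is exactly the limitation: every new variant needs a new entry, and a flat string such as `m/s2` is accepted here only because the table says so.

```python
import unicodedata

# Hypothetical alias table: written variants mapped to one canonical label.
ALIASES = {
    "mcg": "ug",
    "ug": "ug",
    "m/s2": "m/s2",   # treating the flat "2" as an exponent is an assumption
    "m/s^2": "m/s2",
    "m.s-2": "m/s2",
}


def canonical_unit(text: str) -> str:
    """Return the canonical label for a unit string, or raise ValueError."""
    # NFKC folds the micro sign into Greek mu, superscript digits into
    # plain digits, and the superscript minus into the minus sign U+2212.
    cleaned = unicodedata.normalize("NFKC", text.strip())
    # Map the minus sign and Greek mu onto plain ASCII stand-ins.
    cleaned = cleaned.replace("\u2212", "-").replace("\u03bc", "u")
    try:
        return ALIASES[cleaned]
    except KeyError:
        raise ValueError(f"unrecognized unit string: {text!r}") from None
```

Under these assumptions, `mcg`, `ug` and `µg` all resolve to the same label, as do the four acceleration spellings, while anything outside the table is rejected rather than guessed.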

The computer systems used to crunch and share data are not set up to help. Take the simple example of Excel spreadsheets: the only unit that can be included in computable fields is a currency sign. The association of a unit with a quantity value is left to arbitrary, inconsistent practices, such as a unit string given in the header row. That association is easily broken when data are transferred or used in calculations.
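One way to keep a header-row unit from being lost is to bind it to every value at the moment the table is read. The Python sketch below assumes a hypothetical convention in which each column header is written as `name (unit)`, and a hypothetical function `read_with_units`; neither is a spreadsheet or CSV standard. It stores the unit only as an uninterpreted string, so it preserves the association without validating the unit itself.

```python
import csv
import io
import re

# Assumed header convention: "name (unit)", for example "mass (kg)".
HEADER = re.compile(r"^\s*(?P<name>[^()]+?)\s*\((?P<unit>[^()]+)\)\s*$")


def read_with_units(text: str) -> list[dict[str, tuple[float, str]]]:
    """Parse CSV text so that each value is stored as (number, unit)."""
    reader = csv.reader(io.StringIO(text))
    columns = []
    for cell in next(reader):
        match = HEADER.match(cell)
        if match is None:
            # Fail loudly rather than store a number with no unit.
            raise ValueError(f"column {cell!r} has no unit in parentheses")
        columns.append((match["name"], match["unit"]))
    rows = []
    for record in reader:
        rows.append(
            {name: (float(raw), unit) for (name, unit), raw in zip(columns, record)}
        )
    return rows
```

Reading the text `"mass (kg),speed (m/s)\n1.5,3\n"` with this function yields one row in which `mass` is `(1.5, "kg")` and `speed` is `(3.0, "m/s")`, so the unit survives any later step that passes the values along individually.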

Untangling the mess

Much work is under way to solve these problems. Many standards, conventions and best practices around units are readily available. The widely adopted International System of Units (SI units) provides standard names and typographical representations for quantities and their associated units. Other international initiatives have also achieved a great amount of standardization, for example through the International Organization for Standardization (ISO), the International Electrotechnical Commission (IEC) and the United Nations Economic Commission for Europe.

The forum to produce FAIR Digital Objects (FDO Forum) aims to improve the representation and transmission of scientific information, including fully machine-actionable semantics. In principle, FAIR Digital Objects “bind all critical information about an entity in one place and create a new kind of actionable, meaningful, and technology independent object that pervades every aspect of life today”, according to the forum. But there is much more work to do.

Around 20 systems have been put forward to enable machine reading. These include UCUM, the QUDT Ontology, the Ontology of units of Measure (OM), the IEC Common Data Dictionary (IEC CDD) and the Unidata Units (UDUNITS) package. All have shortcomings; each serves the needs of different communities.

Several efforts try to connect conventions to promote interoperability, or allow analyses to combine different data sets. For example, the Units of Measurement web service applies UCUM code to map between definitions in six systems for unit representation, each prepared by a member of our task force. A pilot Units of Measurement Interoperability Service is being developed by another DRUM member that intends to cover more representation systems (see go.nature.com/3vevfdo). Because none has been fully adopted, there is no universal system to bridge them.

Since being launched, DRUM and Digital SI have worked to raise awareness and to support efforts to improve interoperability together with national and international organizations, including the CIPM, the International Science Council, the Research Data Alliance and the GO FAIR Initiative.

As part of this, we want to organize the many legacy solutions that have already been applied to achieve interoperability. One goal is to collect these and build an ‘information layer’ around them, a sort of helpline for computers.

Another, more ambitious goal has been taken up by the higher-level Digital SI Task Group that appointed the Digital SI Expert Group: building a robust, unambiguous data-exchange framework based on the SI units. This would help to resolve long-standing issues in a robust manner. For instance, it could curtail the practice of representing units for particular quantities in multiple ways, to ensure that future systems do not perpetuate the problems that saddle the digital domain today. Ultimately, the project will produce norms for unit representation across the global metrology community, from basic research to industrial and commercial applications, and keep them flexible enough to serve diverse constituents.

So far, DRUM and the Digital SI Expert Group have collected a dozen use cases and curated a list of nearly 50 available unit representation systems to improve understanding of how units are expressed in databases, digital publishing, software, code, scripting and scientific field vocabularies and ontologies (see go.nature.com/38mbpxo).

DRUM has also developed a network of 26 ‘ambassadors’ from 46 international science unions and associations, and the DRUM task group is conducting surveys on how units are used, the results of which will be reported later this year.

Community effort needed

That report is meant to be a stepping stone. The entire scientific community needs to agree on a model to represent quantities and units. These should include formal definitions suitable for humans and for machine processing. Databases that allow access to this knowledge should be established. They should deploy service-oriented infrastructures (such as websites and computer applications) for information and unit conversions. Programming environments, analytical software and data-storage platforms must become ‘unit aware’.

DRUM can seed this work, but it will not succeed without broad collaboration across many scientific and information-technology communities. Funding agencies and private-sector companies should support the effort, which is currently being undertaken by groups of volunteers, such as ourselves. Assigning even a small proportion of current R&D funding to the work would yield large, wide-ranging gains and enable national and international agreements to promote the use of clear, interoperable units.

Everyone agrees that intelligible, useful data are at the heart of good science, and that insights from diverse disciplines are required to understand and ameliorate global problems. Research systems are not meeting those needs. It is time to make data and knowledge readily available to machines and humans.