Data Knowledge Organization Unit

The unit promotes technological development related to the realization of information infrastructure, including the knowledge base necessary for organizing, integrating, and promoting the utilization of different kinds of research data produced at RIKEN.

News

There is currently no news.

Unit Overview

The unit promotes technological development related to the realization of information infrastructure, including the knowledge base necessary for organizing, integrating, and promoting the utilization of different kinds of research data produced at RIKEN. Specifically, we promote research and development on the metadata attached to research data, develop a knowledge base by collecting and organizing metadata, provide and support the associated technology and tools for laboratories, and develop the technology required to realize open science. Furthermore, by practising open science with this knowledge base and the latest AI, we aim to lead the era of data science and to contribute to scientific and technological innovation together with RIKEN researchers in every research field.

Main Research Field

Informatics

Related Research Fields

Engineering, Chemistry, Complex Systems, Interdisciplinary Science & Engineering, Mathematical & Physical Sciences, Biological Sciences, Intelligent informatics, Web informatics, Service informatics, Computational science

Research Keywords

Open science, Semantic web, Ontology, Linked Open Data, Intelligent data platform

Unit Leader Interview

Charting a ‘Map of Scientific Knowledge’

— A Cross-Disciplinary Research Infrastructure Woven by Metadata and AI

From Physics-Based Thinking to ‘Knowledge Organisation’

I majored in physics as an undergraduate. In the natural sciences—not only in physics—the more complex the phenomena we deal with, the more important it becomes at the outset to clarify how the subject is defined, which aspects are isolated as variables, and under what conditions, units, and assumptions the data were obtained, whether by measurement or calculation. If these points are ambiguous, it becomes difficult to compare data, reproduce results, and verify findings. This habit of verifying only after establishing the relevant definitions, conditions, and uncertainties is much like that in programming and information science, both of which deal with data and procedures. It provided an important foundation for my growing interest in information science.

One of the first things I worked on at RIKEN was the construction of databases and search engines for the life sciences. As I pursued this work, I came to realise that life-science data are extremely diverse, both in type and in how they are represented, and that even the same term can differ in meaning or granularity. Accordingly, in a way analogous to aligning coordinate systems or units in physics, I began working with metadata that give data meaning and context, ontologies that align the definitions of concepts and their relationships, and Semantic Web technologies that allow them to be shared and linked in machine-readable form. I did so because I believed that such ‘meaning’ and ‘structure’ are indispensable if data are to be correctly understood, connected, and reused, even across different fields and laboratories.

What we refer to here as ‘knowledge organisation’ is not simply a matter of labelling and organising data. It means using metadata to clarify provenance and conditions, using ontologies to align conceptual variations—such as synonyms, differences in granularity, and classification—and using the Semantic Web framework to represent relationships among data in a form that machines can follow. In this way, data are brought into a state in which they can be treated as a ‘knowledge base’ (or knowledge graph) that can be searched, integrated, and reused across different fields. The name of our unit, the Data Knowledge Organization Unit, reflects this way of thinking at its core.
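As a minimal illustration of this idea, and not the unit's actual implementation, the sketch below stores statements about data as subject-predicate-object triples and aligns synonyms to one canonical concept before they are stored or queried. The names `SYNONYMS`, `canonical`, `TripleStore`, and all identifiers such as `dataset:A` are hypothetical and chosen only for this example.

```python
# Hypothetical synonym table: each variant label maps to one canonical concept ID.
SYNONYMS = {
    "mouse": "taxon:Mus_musculus",
    "mus musculus": "taxon:Mus_musculus",
    "house mouse": "taxon:Mus_musculus",
}


def canonical(label):
    """Return the canonical concept ID for a label, or the label itself if unknown."""
    return SYNONYMS.get(label.strip().lower(), label)


class TripleStore:
    """Minimal in-memory set of (subject, predicate, object) statements."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        # Objects are normalised on the way in, so variant labels collapse
        # onto the same concept.
        self.triples.add((subject, predicate, canonical(obj)))

    def subjects(self, predicate, obj):
        """Find every subject linked to the given concept, whatever label was used."""
        target = canonical(obj)
        return sorted(s for s, p, o in self.triples if p == predicate and o == target)


store = TripleStore()
store.add("dataset:A", "organism", "Mouse")
store.add("dataset:B", "organism", "Mus musculus")
store.add("dataset:C", "organism", "Homo sapiens")

# Both datasets are found even though they were labelled differently.
matches = store.subjects("organism", "house mouse")
```

In a real system the synonym table would be replaced by an ontology and the store by an RDF graph, but the principle is the same: the query follows concepts rather than strings.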

Now that AI has become widespread, the importance of knowledge organisation has, if anything, only increased. Generative AI and LLMs enable flexible, intuitive dialogue and excel at taking researchers’ ambiguous questions and returning well-organised, language-based responses. At the same time, in scientific research, it is crucial to know what data a conclusion is based on, whether data obtained under different conditions have been mixed, and the source and provenance of the data. Answers without a clear evidential basis, as well as plausible-sounding errors (‘hallucinations’), therefore pose major risks. For this reason, our Data Knowledge Organization Unit takes the view that the formal knowledge supported by the Semantic Web (defined concepts and relationships based on human consensus) should be integrated with the linguistic flexibility at which LLMs excel. The ontology-based Semantic Web serves as a shared vocabulary and point of reference for definitions within a research domain, in effect a ‘dictionary’, and provides verifiable evidence to which AI can refer.

Specifically, we place particular emphasis on RAG (Retrieval-Augmented Generation), in which an LLM serves as the front-end interface while, behind the scenes, accurate information is retrieved from the Semantic Web and knowledge bases to generate responses. The LLM receives the user’s ambiguous question, and rigorous metadata and ontologies are then used to clarify exactly what is being referred to and which data, obtained under what conditions, should be consulted before evidence-based information is returned. Through this kind of design, we believe that AI can become more than merely a conversational partner: it can serve as a trustworthy partner in exploration, an agent that helps researchers find the data and knowledge they need and supports comparison, integration, and reuse.
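The flow described above can be sketched roughly as follows. This is an assumption-laden toy, not the unit's system: the knowledge base is a hard-coded list, `VOCABULARY` stands in for an ontology, and `llm` is any caller-supplied function that takes a prompt and returns text. All names and records here are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    source: str  # provenance: where the statement came from


# Hypothetical knowledge base; in practice this would be a knowledge-graph query.
KNOWLEDGE_BASE = [
    Fact("dataset:A", "organism", "Mus musculus", "metadata record A"),
    Fact("dataset:A", "temperature", "25 degC", "metadata record A"),
    Fact("dataset:B", "organism", "Homo sapiens", "metadata record B"),
]

# Hypothetical vocabulary mapping everyday words to defined concepts.
VOCABULARY = {"mouse": "Mus musculus", "human": "Homo sapiens"}


def resolve_terms(question: str) -> List[str]:
    """Map ambiguous words in the question to defined concepts."""
    words = re.findall(r"[a-z]+", question.lower())
    return [VOCABULARY[w] for w in words if w in VOCABULARY]


def retrieve(question: str) -> List[Fact]:
    """Return all facts about every subject that matches a resolved concept."""
    concepts = resolve_terms(question)
    subjects = {f.subject for f in KNOWLEDGE_BASE if f.value in concepts}
    return [f for f in KNOWLEDGE_BASE if f.subject in subjects]


def build_prompt(question: str, facts: List[Fact]) -> str:
    """Put the retrieved evidence, with its provenance, in front of the question."""
    evidence = "\n".join(
        f"- {f.subject} {f.predicate} {f.value} [source: {f.source}]" for f in facts
    )
    return (
        "Answer using only the evidence below and cite the sources.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}"
    )


def answer(question: str, llm: Callable[[str], str]) -> str:
    facts = retrieve(question)
    if not facts:
        # Refuse rather than let the model guess without evidence.
        return "No supporting data found."
    return llm(build_prompt(question, facts))
```

The design choice worth noting is the early return: when retrieval finds nothing, the model is never called, which is one simple way to avoid answers that lack an evidential basis.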

Precisely because we are in the age of AI, the key to accelerating research lies in establishing the underlying ‘meaning’ and ‘structure’ that connect data across fields and make it reusable. We hope to build that foundation together with researchers on the ground by combining metadata, ontologies, and AI implementation.

‘Research Data That Speak for Themselves’ — Creating High-Quality Research Data to Open Up the Future of Science

At present, RIKEN is advancing the cross-disciplinary project ‘Transformative Research Innovation Platform’ (TRIP). Within this initiative, our unit participates as the ‘Data Sharing Platform’. It is working to develop a cross-disciplinary information environment that promotes the aggregation, sharing, and use of high-quality research data. Here, ‘high-quality data’ refers to data to which appropriate metadata have been attached, so that the experimental conditions, generation process, and provenance can be traced. In other words, it refers to a state in which, by examining the data alongside its metadata, one can understand the background and assumptions, and others can reuse the data with confidence.

Research data are often handled on the basis of implicit assumptions shared by the person who created them and by others in the same laboratory. However, as time passes, or when the data are used by other researchers or by researchers in other fields, reuse becomes difficult because the relevant conditions and context are no longer clear. For example, if information such as what was measured, the units used, the instruments and methods, pre-processing, sample conditions, analysis conditions, quality indicators, and licence or usage restrictions is missing, the data’s provenance and the assumptions underlying their interpretation become unclear, and the data can no longer be regarded as trustworthy for research use. Attaching metadata is therefore essential groundwork: it makes such background information explicit, so that others can correctly understand the data’s provenance and preconditions and use the data with confidence in their research. It is indispensable for transforming data from a ‘one-off output’ into a ‘research resource’ that continues to generate value.
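To make the list of required information concrete, here is a small sketch of a completeness check. The field names in `REQUIRED_FIELDS` are hypothetical labels for the items mentioned above, not a metadata standard used at RIKEN, and the example record is invented.

```python
# Hypothetical field names, one per item of background information listed above.
REQUIRED_FIELDS = (
    "measured_quantity",
    "unit",
    "instrument",
    "method",
    "preprocessing",
    "sample_conditions",
    "analysis_conditions",
    "quality_indicators",
    "licence",
)


def is_blank(value):
    """Treat missing values and whitespace-only strings as absent."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_fields(metadata):
    """List the required fields that a metadata record does not fill in."""
    return [name for name in REQUIRED_FIELDS if is_blank(metadata.get(name))]


record = {
    "measured_quantity": "fluorescence intensity",
    "unit": "a.u.",
    "instrument": "confocal microscope",
    "method": "   ",  # present but empty, so it counts as missing
    "licence": "CC BY 4.0",
}

gaps = missing_fields(record)
```

A check of this kind could run when data are deposited, so that gaps are flagged while the person who knows the answers is still available.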

Thus, attaching metadata is not merely a matter of organising and tidying data. It is the work of articulating assumptions and conditions that tend to remain implicit in research and describing knowledge in a form that others can share and reuse. Furthermore, from the perspective of knowledge organisation, it is important to create a state in which identical entities and identical concepts are treated consistently as such. In other words, by using ontologies to align terminology and levels of granularity that tend to vary across fields, and by making explicit the correspondences among data—such as synonymy, hierarchical relationships, and associations—we can treat data not as isolated ‘points’, but as ‘connections’.

To aggregate and integrate metadata, it is important to define standardised metadata elements in advance, along with concepts and vocabularies that can be shared across fields, and then map metadata from each field to these for use. Furthermore, by defining relationships among metadata, it becomes possible to connect metadata from different fields and handle them in an integrated manner. As data from different fields are thus linked through a common vocabulary and format, a ‘map of knowledge’ emerges, enabling a cross-disciplinary overview of findings that had previously existed only as fragments. We aim to realise an infrastructure that links, through consistent metadata, everything from data generated in research settings to publicly available data, so that anyone can access the knowledge they need and use it while verifying the underlying evidence.
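The mapping step can be sketched as follows, under stated assumptions: the shared elements (`sample`, `instrument`, `organism`), the per-field key names in `FIELD_MAPPINGS`, and the example records are all hypothetical. The sketch maps field-specific metadata onto the shared elements and then links records from different fields that share a value.

```python
# Hypothetical mappings from each field's local keys to shared metadata elements.
FIELD_MAPPINGS = {
    "imaging": {"specimen": "sample", "microscope": "instrument", "species": "organism"},
    "genomics": {"sample_id": "sample", "sequencer": "instrument", "organism_name": "organism"},
}


def to_common(field, record):
    """Split a record into shared elements and field-specific leftovers."""
    mapping = FIELD_MAPPINGS[field]
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    unmapped = {key: value for key, value in record.items() if key not in mapping}
    return common, unmapped


def link_by(element, records):
    """Group record IDs that share a value for one shared element."""
    groups = {}
    for record_id, common in records.items():
        if element in common:
            groups.setdefault(common[element], []).append(record_id)
    # Keep only values that actually connect two or more records.
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


imaging_common, imaging_extra = to_common(
    "imaging", {"specimen": "S-001", "microscope": "confocal", "exposure_ms": 20}
)
genomics_common, _ = to_common(
    "genomics", {"sample_id": "S-001", "sequencer": "seq-1"}
)

links = link_by("sample", {"img-1": imaging_common, "seq-1": genomics_common})
```

Keeping the unmapped keys rather than discarding them means field-specific detail survives integration, while the shared elements provide the connections.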

What We Look for in Candidates: A Commitment to Supporting Researchers and Intellectual Diversity

In our unit, what matters more than whether someone possesses specific programming skills is a commitment to supporting researchers in their work, and we welcome people from diverse backgrounds. The main role of our unit is not to generate research data ourselves, but to support researchers who are already conducting research so that they can organise, integrate, and make effective use of their data. Rather than applying methodologies from information science unilaterally, we seek people who, from the perspective of research DX (the digital transformation of research), place priority on understanding what is truly valuable in research settings and what will genuinely support research activities, and who can solve problems through discussion on an equal footing with researchers.

Nor is prior research experience in the natural sciences necessarily a prerequisite. In the past, for example, people engaged in practical work in fields outside the natural sciences have joined our summer school to learn the foundations for applying AI to knowledge from their own domains. We welcome expertise from a wide range of fields, including not only information science, physics, and the life sciences, but also the humanities and museum studies. Together with colleagues who have a broad interest in scholarship and who find meaning in accurately describing the worldviews of their respective fields as data, we hope to build a new information infrastructure for science in the age of AI.

Norio Kobayashi (D.Eng.)

Unit Leader

Specialises in knowledge engineering and distributed Web processing. Received a PhD in Engineering after completing the Doctoral Program in Engineering, Course of Information Sciences and Electronics, University of Tsukuba. After serving in positions including Senior Research Scientist at the Advanced Center for Computing and Communication, RIKEN, he assumed his current position in 2019.

Selected Publications

  1. Kobayashi, N., Kume, S., Lenz, K., and Masuya, H.: "RIKEN MetaDatabase: A Database Platform for Health Care and Life Sciences as a Microcosm of Linked Open Data Cloud" International Journal on Semantic Web and Information Systems 14(1), 140-164 (2018).
  2. Morita, M., Shimokawa, K., Nishimura, M., Nakamura, S., Tsujimura, Y., Takemoto, S., Tawara, T., Yokota, H., Wemler, S., Miyamoto, D., Iken, H., Sato, A., Furuichi, T., Kobayashi, N., Okumura, Y., Yamaguchi, Y., and Okamura-Oho, Y.: "ViBrism DB: an interactive search and viewer platform for 2D/3D anatomical images of gene expression and co-expression networks" Nucleic Acids Research 47(D1), D859-D866 (2019).
  3. Yamaguchi, A., Kozaki, K., Yamamoto, Y., and Kobayashi, N.: "Semantic Graph Analysis for Federated LOD Surfing in the Life Sciences" The 7th Joint International Semantic Technology Conference (JIST). Special Track: Semantic Processing for Knowledge Graphs, LNCS 10675, 268-276 (2017).
  4. Kume, S., Masuya, H., Maeda, M., Suga, M., Kataoka, Y., and Kobayashi, N.: "Development of Semantic Web-based Imaging Database for Biological Morphome" The 7th Joint International Semantic Technology Conference (JIST). Special Track: Semantic Processing for Knowledge Graphs, LNCS 10675, 277-285 (2017).
  5. Dumontier, M., Gray, AJG., Marshall, MS., Alexiev, V., Ansell, P., Bader, G., Baran, J., Bolleman, JT., Callahan, A., Cruz-Toledo, J., Gaudet, P., Gombocz, EA., Gonzalez-Beltran, AN., Groth, P., Haendel, M., Ito, M., Jupp, S., Juty, N., Katayama, T., Kobayashi, N., Krishnaswami, K., Laibe, C., Le Novere, N., Lin, S., Malone, J., Miller, M., Mungall, CJ., Rietveld, L., Wimalaratne, SM., and Yamaguchi, A.: "The health care and life sciences community profile for dataset descriptions" PeerJ 4:e2331 (2016).
  6. Kobayashi, N.: "Bio-medical entity prioritisation based on literature with Semantic Web annotations" BMC Proceedings 9(5):A6 (2015).

Members

Position                    Name
Unit Leader                 Norio Kobayashi
Senior Research Scientist   Hiroshi Masuya
Postdoctoral Researcher     Julio C Rangel
Technical Scientist         Masaki Kato
Technical Scientist         Masako Nomoto
Visiting Scientist          Hideaki Takeda
Visiting Scientist          Yasunori Yamamoto
Visiting Scientist          Atsushi Fukushima

Contact