A web-based system for creating, storing, documenting, and using definitions (electronic phenotypes) used in health research
Project Lead: d.s.thayer@swansea.ac.uk
Lead Developer: Muhammad Elmessary
A significant aspect of research using routinely collected health records is defining how concepts of interest (including conditions, treatments, symptoms, etc.) will be measured. This typically involves identifying sets of clinical codes that map to a variable that the researcher wants to measure, and sometimes a set of rules as well (e.g. a sufferer from a disease may be defined as someone who has a diagnosis code from list A and a medication from list B, but excluding anyone who has a code from list C). A large part of the analysis work may involve consulting clinicians, investigating the data, and creating and testing definitions of clinical concepts to be used.
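The rule-based example above (a code from list A and a code from list B, but none from list C) can be sketched in Python as follows. This is only an illustration: the function name and the placeholder codes are hypothetical, and real definitions usually also involve dates, code systems, and data-source details that are left out here.

```python
from typing import Iterable, Set


def meets_definition(
    patient_codes: Iterable[str],
    list_a: Set[str],
    list_b: Set[str],
    list_c: Set[str],
) -> bool:
    """Return True if the patient has a code from list A and a code
    from list B, and no code from list C."""
    codes = set(patient_codes)
    has_diagnosis = bool(codes & list_a)   # at least one diagnosis code
    has_medication = bool(codes & list_b)  # at least one medication code
    is_excluded = bool(codes & list_c)     # any exclusion code present
    return has_diagnosis and has_medication and not is_excluded


# Placeholder code lists (hypothetical values, not real clinical codes).
LIST_A = {"DX1", "DX2"}
LIST_B = {"RX1"}
LIST_C = {"EX1"}

case = meets_definition(["DX1", "RX1"], LIST_A, LIST_B, LIST_C)
excluded_case = meets_definition(["DX1", "RX1", "EX1"], LIST_A, LIST_B, LIST_C)
no_medication = meets_definition(["DX2"], LIST_A, LIST_B, LIST_C)
```

Even in this small form, the definition depends on three separate code lists plus a rule that combines them, which is the information that tends to get lost when definitions are buried in study-specific scripts.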
Often the definitions that are created are of interest to researchers for many studies, but there are barriers to easily sharing them. The definitions may be embedded within study-specific scripts, such that it is not easy to extract the part that may be of general interest. Also, often researchers do not fully document how a concept was created, its precise meaning, limitations, etc. Crucial information may be lost when passing it to other researchers, resulting in mistakes. Often there simply is no mechanism to discover and share work that has been done previously, leading researchers to waste time and resources reinventing the wheel. In theory, when research is published, information on the precise methods used should be included, but in reality this is often inadequate.
If definitions could be shared more effectively, studies could be carried out faster and to a higher standard, helping to realise the benefit of observational health data.
Our overarching goal is to create a system that describes research study designs in a machine-readable format to facilitate rapid study development; higher quality research; easier replication; and sharing of methods between researchers, institutions, and countries. We envision something akin to the Manitoba Centre for Health Policy's Concept Dictionary, but extending beyond documentation to allow concepts to be used directly in analysis.
The present project is a pilot for this larger work, focusing on the narrower task of storing, managing, sharing, and documenting clinical code lists themselves. The specific goals of this work are:
A web-based interface will be provided for users to interact with the library: search for relevant concepts, add new concepts, edit existing ones, etc. This interface will provide the ability to browse and search clinical code reference tables.
In addition, it will be possible to interact with the tool via an API, so that a concept can be retrieved and used directly within a script for data preparation and/or analysis.
A common use case would be a user working with the web interface to find and/or create the concept definitions necessary for research, then once this is complete, using a library in their statistical language of choice to reference these concepts in their scripts.
The API will also be able to perform the add/edit functionality, so it can be used to synchronise with other sources (for example, import sets of codes defined externally).
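As a rough sketch of what retrieving a concept from a script might look like, the Python below builds a request URL and parses a JSON response into a list of codes. Since the API is still in development, everything specific here is an assumption made for illustration: the base URL, the `concepts/<id>/codes` path, the JSON layout, and the function names are all hypothetical and do not describe the actual interface.

```python
import json
from typing import Callable, List, Optional
from urllib.parse import quote, urljoin
from urllib.request import urlopen

# Hypothetical base URL; the real service address is not given in this document.
BASE_URL = "https://conceptlibrary.example.org/api/"


def concept_url(concept_id: int, base_url: str = BASE_URL) -> str:
    """Build the (assumed) URL for the code list of one concept."""
    return urljoin(base_url, f"concepts/{quote(str(concept_id))}/codes")


def parse_codes(payload: str) -> List[str]:
    """Extract a sorted, de-duplicated code list from an assumed JSON
    layout of the form {"codes": [{"code": "..."}, ...]}."""
    data = json.loads(payload)
    return sorted({row["code"] for row in data["codes"]})


def fetch_concept_codes(
    concept_id: int,
    fetch: Optional[Callable[[str], str]] = None,
    base_url: str = BASE_URL,
) -> List[str]:
    """Retrieve a concept's codes. `fetch` maps a URL to response text and
    can be replaced, for example in offline or secure environments."""
    if fetch is None:
        def fetch(url: str) -> str:
            with urlopen(url) as response:
                return response.read().decode("utf-8")
    return parse_codes(fetch(concept_url(concept_id, base_url)))


def fake_fetch(url: str) -> str:
    """Stand-in for the network call, returning a canned response."""
    return '{"codes": [{"code": "I21"}, {"code": "I22"}, {"code": "I21"}]}'


codes = fetch_concept_codes(42, fetch=fake_fetch)
```

Keeping the transport function replaceable means the same script could read from a live server or from a local mirror, whichever is available in the analyst's environment.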
The library is a standalone web-based tool that stores code lists in its own database. Users interact with it through a Python web application, while an API (in development) will allow programmatic interaction with the tool from a variety of environments.
The basic entity in this system is a “concept”, a group of clinical codes that defines a single meaningful event, condition, etc. within the data.
Each concept is made up of one or more components, of which there are several types. A component can simply be a list of clinical codes. It can also be defined as a regular expression that matches a certain set of codes (for example, a regular expression could match a certain range of ICD-10 chapters). A component can also be another concept. This allows a broad concept to be built from several more specific ones (for example, a concept for heart disease could consist of concepts for coronary artery disease, heart attack, cardiomyopathy, etc.).
Each component is marked as either an inclusion or an exclusion.
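The component model described above can be sketched as a small set of Python classes. This is not the tool's actual data model: the class names are hypothetical, the reference table is a tiny illustrative subset of ICD-10 categories, and the sketch assumes that exclusions are applied after all inclusions and that a nested concept is resolved to its final code set before being combined.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class CodeListComponent:
    """A component given as an explicit list of codes."""
    codes: Set[str]
    exclude: bool = False

    def resolve(self, reference_codes: Iterable[str]) -> Set[str]:
        return set(self.codes)


@dataclass
class RegexComponent:
    """A component given as a regular expression over a reference table."""
    pattern: str
    exclude: bool = False

    def resolve(self, reference_codes: Iterable[str]) -> Set[str]:
        rx = re.compile(self.pattern)
        return {code for code in reference_codes if rx.fullmatch(code)}


@dataclass
class ConceptComponent:
    """A component that is itself another concept."""
    concept: "Concept"
    exclude: bool = False

    def resolve(self, reference_codes: Iterable[str]) -> Set[str]:
        return self.concept.codes(reference_codes)


@dataclass
class Concept:
    name: str
    components: List[object] = field(default_factory=list)

    def codes(self, reference_codes: Iterable[str]) -> Set[str]:
        """Union of all inclusion components minus all exclusion components."""
        reference = list(reference_codes)
        included: Set[str] = set()
        excluded: Set[str] = set()
        for component in self.components:
            target = excluded if component.exclude else included
            target.update(component.resolve(reference))
        return included - excluded


# Tiny illustrative reference table (ICD-10 three-character categories).
REFERENCE = ["I20", "I21", "I25", "I42", "J45"]

coronary = Concept("coronary artery disease", [CodeListComponent({"I20", "I25"})])
heart_attack = Concept("heart attack", [RegexComponent(r"I2[12]")])
heart_disease = Concept(
    "heart disease (excluding angina)",
    [
        ConceptComponent(coronary),
        ConceptComponent(heart_attack),
        CodeListComponent({"I42"}),
        CodeListComponent({"I20"}, exclude=True),
    ],
)

heart_disease_codes = heart_disease.codes(REFERENCE)
```

Because a regular-expression component is resolved against a reference table, its result can change when the reference table is updated, whereas an explicit code list stays fixed.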
In addition, an entity called a "working set" has been created. A working set is a documented collection of concepts, with arbitrary user-defined attributes attached to each concept. This allows the concepts belonging to a project, algorithm, etc. to be stored together, along with relevant information about how each is used. For example, a working set could store the definition of the Charlson Index: a concept for each disease category, with an attribute that stores the weight of each category in the calculation of the score.
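A minimal sketch of the working set idea, using the Charlson example, might look like the following. The class and function names are hypothetical, only three of the Charlson categories are shown, and the code lists are abbreviated for illustration rather than validated; the weights are those of the original Charlson index for these categories.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set


@dataclass
class WorkingSetEntry:
    """One concept within a working set, with user-defined attributes."""
    concept_name: str
    codes: Set[str]
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class WorkingSet:
    """A documented collection of related concepts."""
    name: str
    description: str
    entries: List[WorkingSetEntry] = field(default_factory=list)


def weighted_score(working_set: WorkingSet, patient_codes: Iterable[str]) -> int:
    """Sum the 'weight' attribute of every concept the patient matches.
    Each category counts at most once, however many codes match."""
    patient = set(patient_codes)
    return sum(
        entry.attributes.get("weight", 0)
        for entry in working_set.entries
        if entry.codes & patient
    )


# Partial, illustrative Charlson working set (abbreviated code lists).
charlson = WorkingSet(
    name="Charlson Index (partial)",
    description="Three example categories with their weights.",
    entries=[
        WorkingSetEntry("myocardial infarction", {"I21", "I22"}, {"weight": 1}),
        WorkingSetEntry("hemiplegia", {"G81"}, {"weight": 2}),
        WorkingSetEntry(
            "metastatic solid tumour", {"C77", "C78", "C79", "C80"}, {"weight": 6}
        ),
    ],
)

score = weighted_score(charlson, ["I21", "I22", "C78"])
```

Storing the weight as an attribute of the working set entry, rather than of the concept itself, lets the same concept be reused in another working set with different attributes.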
One challenge of this work is that many researchers working with this type of data are required to work in isolated, secure environments without access to the internet. There may also be requirements that local knowledge be stored locally.
We are exploring several different security models to allow work within secure, restricted environments. One option would be a web-facing, editable version of the tool that is mirrored read-only into secure environments. Alternatively, a server could exist within each secure environment, with a limited release of content to a web-facing server where approved.
We currently have a working beta version that is being used internally by the SAIL Analytical Services team for our own research projects. We plan to roll it out to the wider SAIL community and make it available on the web in the coming months.
We are interested in getting feedback from researchers at other centres as to its usefulness, and also in building potential collaboration for further development. We hope this will be useful to researchers at a wide range of institutions, not just our own, and would welcome members of other research centres to join us and help direct the project as stakeholders. We would also be interested in collaborating on grant applications to fund further development, should appropriate funding sources be available.