Standardising Benchmark Problems for the Assessment of Computerised Medical Guideline Systems
Kirsty Bradbrook1, Graham Winstanley1, Vivek Patkar2, David Glasspool2 and
1 School of Computing, Mathematical and Information Sciences, University of Brighton
2 Advanced Computing Laboratory, Cancer Research UK
Abstract. There is currently a high level of research into the development of systems and strategies to support the effective presentation, development and use of clinical practice guidelines and protocols. Such systems are being evaluated individually and comparatively, but without a standardised framework against which they can be assessed. Taking inspiration from the field of AI planning, we propose that a coherent taxonomy of grouped and ranked benchmark problems, comprising specially developed clinical practice guidelines, patient situations and ideal outcome scenarios, would provide a safe and effective method both to analyse current research and to suggest future directions for the discipline. This paper outlines the context and criteria for such tools and describes our work in developing a benchmark capable of being used in the evaluation of systems.
1. Introduction
Clinical practice guidelines (CPGs) are ideal candidates for support using computer-based tools and techniques, and considerable advances have been made [1]. However, use of such tools in routine clinical practice must be based on a comprehensive and rigorous evaluation cycle that emphasises professionalism and safety. In developing CPGs, health care professionals are expected to predict and represent typical patient profiles, to detail expected journeys through typical health care settings, and to consider possible atypical scenarios. Such considerations are crucial in the conceptualisation, development and use of computer-based CPG tools, and for these tools to find widespread use a comprehensive range of scenarios is necessary. Systemic testing is also very important, and in health care rigorous hypothetical testing is crucial. We therefore propose that a set of graded and ranked artificial scenarios is required which can fully replicate any issues that may arise in the clinical setting without risk to patients. Such a set of scenarios would represent benchmark problems which could be adopted by the research and development community to comparatively evaluate systems, share representations and collectively advance the discipline through standards.
2. Establishing a Repository of Benchmark Problems
Within the AI planning (AIP) community, the concept of using artificial scenarios for system evaluation is well established. The International Planning Competition (IPC) [2] is a biennial event which allows an empirical comparison of AIP systems from around the world. Its committees have developed seven sets of benchmark problems from a wide range of domains, each containing problems of scaled complexity. The domains are taken both from current, real-world applications of AIP and from new fields that have the potential to be effectively supported by the technology. Each domain focuses on different elements of AIP capability. As stated by Edelkamp and Hoffmann [3], the objective of using these problem sets, aside from aiding in the comparative evaluation of various systems, is to “highlight challenges to the community in the form of problems at the edge of current capabilities [and] to propose new directions for research”. These problem sets have been shown to be invaluable for accurately analysing quality and identifying shortcomings of systems, as well as promoting standardisation.
The formation of the IPC benchmark domains has been a gradual process, focused on pushing the boundaries of capability within the planning community. However, within the guideline modelling community there appears to have been no such process. Worldwide there are currently many medical guideline repositories in the public domain, with guidelines of varying quality and supporting numerous areas of medicine; see, for example, [4], [5], [6] and [7]. These guidelines provide clinical best-practice information to healthcare workers and their use is increasing globally. However, very little research has been undertaken into developing problem sets into a coherent taxonomy against which guideline systems (computerised or otherwise) could be evaluated.
Certain existing computerised guideline systems, such as Asbru [8], have provided examples and extracts from guidelines (represented in their own format) that are intended to be used for system evaluation. However, these examples offer no formalised indication of their assessable components and dimensions, or of their medical or technical intricacy. A set of CPGs alone does not provide sufficient information to perform the evaluation process. Benchmark problems, like those in the AIP domain, also need to contain a set of scenarios in which these CPGs can be enacted and a set of desired and expected outcomes with which results can be compared. Our definition of a standard problem comprises three components: a single, paired or grouped set of guidelines; a range of patient situations; and outcome descriptions for each situation. The complexity of benchmark problems ensues from a wide range of variables related not only to the guideline but also to the patient, the health care team and the health care setting. Situations can become even more complex when multiple CPGs are required to run concurrently, which is a common situation. Pairing or grouping of commonly (medically) associated CPGs would therefore be necessary in order to provide for the evaluation of CPG concurrency and interaction. Benchmark problems could also be taken further by relating their complexity to the administrative components of healthcare planning alongside clinical ones. While guidelines and patient scenarios alone may provide enough information to enable the evaluation of clinical complexity, descriptions of healthcare environments would be necessary to allow for the evaluation of systems which provide advice on administrative elements such as resource and time management.
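The three-part definition of a standard problem given above can be sketched as a small data structure with a comparison step. The following Python sketch is illustrative only: the class names, field names, the `toy_system` stand-in and the exact-match comparison are all assumptions made for this example and are not taken from CIG-Plan or from any existing benchmark format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PatientSituation:
    """A hypothetical patient situation in which the guidelines are enacted."""
    name: str
    facts: Dict[str, object]


@dataclass
class BenchmarkProblem:
    """Guidelines, patient situations and one expected outcome per situation."""
    guidelines: List[str]
    situations: List[PatientSituation]
    expected_outcomes: Dict[str, List[str]] = field(default_factory=dict)


def evaluate(problem: BenchmarkProblem,
             system: Callable[[List[str], PatientSituation], List[str]]) -> Dict[str, bool]:
    """Run a system on every situation and compare its plan with the expected outcome."""
    results = {}
    for situation in problem.situations:
        produced = system(problem.guidelines, situation)
        # Assumption: outcomes are compared by exact equality of action lists.
        results[situation.name] = produced == problem.expected_outcomes[situation.name]
    return results


def toy_system(guidelines: List[str], situation: PatientSituation) -> List[str]:
    """Stand-in for a real guideline system (hypothetical logic)."""
    if situation.facts["lump_size_cm"] < 5:
        return ["Lump Removal Surgery", "Pathology", "Radiotherapy Cycle"]
    return ["MRM Surgery", "Pathology"]


problem = BenchmarkProblem(
    guidelines=["BCT"],
    situations=[PatientSituation("simple_bct", {"lump_size_cm": 3.3})],
    expected_outcomes={
        "simple_bct": ["Lump Removal Surgery", "Pathology", "Radiotherapy Cycle"],
    },
)
report = evaluate(problem, toy_system)
```

A real comparison would probably need to be more tolerant than exact equality, for example by accepting any plan that satisfies the stated constraints, but that choice is outside the scope of this sketch.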
2. Coherent Taxonomy of CPG system requirements
There are a number of dimensions over which the evaluation of CPG systems needs to be performed, including a system’s ability to provide safe and logically sound advice, to predict problems and provide possible solution paths, and to establish the limits of a system’s capability. This last dimension is crucial to the further development of the field as it serves to highlight the distance between the ideal and the currently possible functionality. Just as benchmark problems have been instrumental in advancing AIP research, the creation of an appropriate set of benchmark problems for the medical domain, graded in complexity and based on real-world scenarios, would not only allow for the evaluation of systems but also provide an opportunity to collate and categorise more clearly the specific requirements of CPG systems. A preliminary taxonomy of requirements for such a system was defined in [9]. This taxonomy was developed by looking not only at the requirements established by existing guideline systems but also by analysing and developing hypothetical scenarios with clinicians and guideline modelling experts from Cancer Research UK. As guidelines are often written with existing systems, structures or technologies in mind, their content may be limited in scope to what is currently achievable with those technologies, which could lead to fundamental concepts or information being ignored or obscured. We therefore consider that, in addition to the analysis of guidelines from established repositories, the development of new guideline scenarios would also be necessary. The taxonomy of benchmark problems should also be graded in complexity within a range from simple to beyond the capabilities of current technology. The ‘wish list’ aspect of the latter extreme would influence and inform future research.
3. Example Benchmark Problems
As a collaborative venture between computer scientists, clinicians and guideline modelling experts, a benchmark problem has been developed containing two companion knowledge-rich guidelines, a set of patient situations and their expected outcome scenarios. The guidelines are for Breast Cancer Treatment (BCT) and Aortic Stenosis with Transient Ischemic Attacks (AS). This problem has formed the basis of the hypothetical scenarios mentioned above. The BCT guideline demonstrates a common medical scenario taken from the domain of oncology. The AS guideline was chosen as a ‘companion’ in order to demonstrate the potential interaction between two guidelines and to evaluate guideline integration.
Each guideline is made up of a hierarchical set of actions with the highest level plan being known as the ‘generic guideline’. The actions within the guideline are of three types (primitive, set, or cyclic), each having a set of conditions which must be met in order for the action to be performed. Primitive actions are single step actions which cannot be decomposed any further. Set actions contain a number of explicit methods for performing the action, any of which can be applied to the plan. Cyclic actions define repetitive sequences of performing another action (of any type) called the ‘use action’.
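The three action types described above can be sketched as a small class hierarchy with a recursive decomposition into primitive steps. This is a minimal Python sketch and not the CIG-Plan representation: the class and field names, the `primitive_steps` helper and the method keys `lump_removal` and `mrm` are assumptions made for illustration, and timing information is deliberately left out. The action names are borrowed from the BCT guideline given later in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class PrimitiveAction:
    """A single-step action that cannot be decomposed further."""
    name: str
    conditions: List[str] = field(default_factory=list)


@dataclass
class SetAction:
    """An action with several explicit methods, any one of which may be applied."""
    name: str
    methods: Dict[str, List[str]]  # method key -> names of the actions it contains
    conditions: List[str] = field(default_factory=list)


@dataclass
class CyclicAction:
    """A repetitive sequence of another action, the 'use action'."""
    name: str
    use_action: str
    repetitions: int
    conditions: List[str] = field(default_factory=list)


Action = Union[PrimitiveAction, SetAction, CyclicAction]


def primitive_steps(name: str, library: Dict[str, Action],
                    choices: Dict[str, str]) -> List[str]:
    """Decompose an action into primitive steps.

    `choices` maps each set action's name to the key of the method selected for it.
    """
    action = library[name]
    if isinstance(action, PrimitiveAction):
        return [action.name]
    if isinstance(action, CyclicAction):
        return primitive_steps(action.use_action, library, choices) * action.repetitions
    steps: List[str] = []
    for child in action.methods[choices[action.name]]:
        steps.extend(primitive_steps(child, library, choices))
    return steps


library: Dict[str, Action] = {
    "Surgery Treatment": SetAction("Surgery Treatment", {
        "lump_removal": ["Lump Removal Surgery", "Pathology", "Radiotherapy Cycle"],
        "mrm": ["MRM Surgery", "Pathology"],
    }),
    "Lump Removal Surgery": PrimitiveAction("Lump Removal Surgery"),
    "MRM Surgery": PrimitiveAction("MRM Surgery"),
    "Pathology": PrimitiveAction("Pathology"),
    "Radiotherapy Cycle": CyclicAction("Radiotherapy Cycle", "Give Radiotherapy", 30),
    "Give Radiotherapy": PrimitiveAction("Give Radiotherapy"),
}

mrm_steps = primitive_steps("Surgery Treatment", library, {"Surgery Treatment": "mrm"})
lump_steps = primitive_steps("Surgery Treatment", library,
                             {"Surgery Treatment": "lump_removal"})
```

The `conditions` fields are carried as plain strings here; checking them against a patient state is left out of the sketch.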
Each of the expansions within a set action has its own set of conditions (superseding and extending the conditions of the action as a whole) which must be met for that method to be chosen. Generic guidelines and set actions contain ordering constraints which state the minimum, maximum and ideal delays between the tasks within them. Cyclic actions state the number of times the use action is to be repeated, alongside the minimum, maximum and ideal delays both between each repetition and between specific repetitions. The following shows the BCT Guideline with each of its sub-actions and two patient situations in which this guideline could be applied. The first scenario is a simple example of a patient who can directly follow the BCT guideline. The second example is a more complex scenario in which the BCT guideline is applied to a patient who has already commenced the AS guideline. These are, as far as possible, expressed in natural English in this paper but have also been represented logically in our formal CIG-plan format.
Breast Cancer Treatment Generic Guideline: Treat a patient who has been diagnosed with breast cancer.
  Actions: Admit Patient (n1), Surgery Treatment (n2), Adjuvant Therapy (n3).
  Ordering Constraints:
    n2 starts between 0-14 days after n1 ends. Ideal delay = 1 day.
    n3 starts between 0-3 days after n2 ends. Ideal delay = 0 days.
Admit Patient: Admit patient P to Department D.
  Conditions: At the start of the action the patient must be at the hospital. Before the end of the action the patient must be at the correct department and the admission forms must be completed.
  Effect: The patient is classed as an in-patient at the end of the action.
Surgery Treatment: Surgery treatment options for breast cancer.
  Conditions: The patient must be fit for surgery and be in the hospital. The patient must not take Warfarin from 7 days before the start of the action to 3 days after the end of the action or Aspirin from 7 days before the start of the action to 1 day after the end of the action. The treatment should be suspended if anaphylaxis occurs.
  Effect: At the end of the action surgery is completed.
  Expansions:
    1) Lump Removal
       Actions: Lump Removal Surgery (n1), Pathology (n2), Radiotherapy Cycle (n3).
       Ordering Constraints:
         n2 starts 0-2 days after n1 ends. Ideal delay = 15 mins.
         n3 starts between 1-6 days after n1 ends. Ideal delay = 3 days.
       Conditions: Patient cancer size <5cm. Patient does not have existing heart condition (flexible condition).
    2) Actions: MRM Surgery (n1), Pathology (n2).
       Ordering Constraints:
         n2 starts 0-2 days after n1 ends. Ideal delay = 15 mins.
Lump Removal Surgery: Breast cancer surgery - lump removal.
  Conditions: The patient must be fit for surgery and be in the hospital. The patient must not have stage 4 cancer and must not take Warfarin from 7 days before the start of the action to 3 days after the end of the action or Aspirin from 7 days before the start of the action to 1 day after the end of the action. Patient cancer size must be less than 5cm, else treatment should be stopped.
  Effect: At the end of the action there is 60% locoregional control (if the cancer has not spread) and the lump is removed. Patient cosmesis is good. Patient will have moderate discomfort for 14 days after the end of the action.
MRM Surgery: Breast Removal Surgery.
  Conditions: The patient must be fit for surgery and be in the hospital. The patient must not have stage 4 cancer and must not take Warfarin from 7 days before the start of the action to 3 days after the end of the action or Aspirin from 7 days before the start of the action to 1 day after the end of the action.
  Effect: At the end of the action there is an 80% chance of 100% locoregional control (if the cancer has not spread) and the lump is removed. Patient cosmesis is poor. Patient will have severe discomfort for 14 days after the end of the action.
Pathology: Perform pathology tests on sample.
  Conditions: The sample tissue must be less than 2 days old.
  Effect: At the end of the action there are test results available.
Radiotherapy Cycle: 6 week cycle of radiotherapy.
  Conditions: It is preferable for the patient to have had the cancerous lump removed before radiotherapy. The patient can have no more than 30 radiotherapy treatments on the same physiological area. Patients should not be given radiotherapy if they have a severe heart condition. Treatment should be aborted if there is a severe intolerance.
  Effect: At the end of the action there is a 40% chance of 100% locoregional control if the lump has been removed prior to treatment.
  Use Action: Radiotherapy.
  Cycle: 30 repetitions with 1-3 days between each repetition. 5 treatments per week.
Give Radiotherapy: Single radiotherapy treatment.
  Conditions: It is preferable for the patient to have had the cancerous lump removed before radiotherapy. The patient must have had fewer than 30 radiotherapy treatments on the same physiological area. Patients should not be given radiotherapy if they have a severe heart condition. Treatment should be suspended if there is a mild intolerance and aborted if there is a severe intolerance.
  Effect: At the end of the action one radiotherapy session is completed.
Adjuvant Therapy: Adjuvant therapy to be used after or without surgery.
  Conditions: Pathology results for patient must be complete.
  Expansions:
    1) Chemotherapy
       Actions: AI Cycle (n1), Clinical Exam Cycle (n2).
       Ordering Constraints: Start n1 and n2 at the same time.
    2) Actions: Tamoxifen Cycle (n1), Clinical Exam Cycle (n2).
       Ordering Constraints: Start n1 and n2 at the same time.
    3) Actions: Chemotherapy Cycle (n1), AI Cycle (n2), Clinical Exam Cycle (n3).
       Ordering Constraints: n2 starts between 1-3 days after n1 ends. Ideal delay = 1 day.
Chemotherapy Cycle: 6 episodes of chemotherapy treatment.
  Conditions: At the start of the action the patient’s cancer test results must show PR Negative or ER Negative and the tumour must be either less than 10mm in size or the lymph node test results must be positive. The white blood cell count (wbc) must be above 3000. The action should be suspended if chemotherapy toxicity occurs.
  Effects: At the end of the action there is a 40% chance of 100% locoregional control and a 40% chance that the patient will start menopause. From 2 weeks to 6 months after the end of the action there is a 99% chance the patient will have Alopecia. There is a 0.1% chance that the patient will get Leukaemia within 10 years of the action.
  Use Action: Chemotherapy.
  Cycle: 6 repetitions with 3-6 weeks between each repetition (ideally 3 weeks).
Chemotherapy: One dose of chemotherapy.
  Conditions: The white blood cell count (wbc) must be above 3000 at start. If the wbc is between 1500-3000 then wait 5 days and check count. If wbc still between 1500-3000 then do CSF treatment and check count. If wbc<1500 then admit patient & treat until wbc>3000. The action should be suspended if chemotherapy toxicity occurs.
  Effects: From 3 days to 15 days after the end of the action there is a 60% chance that the patient will have Febrile Neutropenia. From 12 hours to 7 days after the end of the action there is a 60% chance that the patient will have Nausea.
AI Cycle: Give AI for 5 years.
  Conditions: At the start of the action the patient’s cancer test results must show PR Positive or ER Positive and the patient should be post menopausal.
  Effects: At the end of the action there is 100% locoregional control. For the first 6 weeks of the action there is a 20% chance of nausea. From the fourth week to 6 months there is a 20% chance that the patient will experience hot flashes. From the sixth month to the end of the action there is a 10% chance of Osteoporosis.
  Use Action: AI.
  Cycle: 1825 repetitions with one repetition per day.
AI: Give one dose of AI.
  Conditions: At the start of the action the patient’s cancer test results must show PR Positive or ER Positive and the patient should be post menopausal.
  Effects: One dose of AI taken.
Tamoxifen Cycle: Give Tamoxifen for 5 years.
  Conditions: At the start of the action the patient’s cancer test results must show PR Positive or ER Positive and it is preferred that the patient has no history of DVT. The action should be aborted if the patient experiences DVT.
  Effects: At the end of the action there is 100% locoregional control. Over the duration of the action the patient has a 3.5% chance of DVT and a 2% chance of stroke. For the first 6 weeks of the action there is a 20% chance of nausea. From the fourth week to 6 months there is a 20% chance of the patient experiencing hot flashes. From the second year there is a 1% chance the patient will get Endometrial Cancer.
  Use Action: Tamoxifen.
  Cycle: 1825 repetitions with one repetition per day.
Tamoxifen: Give one dose of Tamoxifen.
  Conditions: At the start of the action the patient’s cancer test results must show PR Positive or ER Positive and it is preferred that the patient has no history of DVT.
  Effects: One dose of Tamoxifen taken.
Clinical Exam Cycle: Clinical exam to be given every 6 months for 5 years.
  Effects: Exam cycle completed.
  Use Action: Clinical Exam.
  Cycle: 11 repetitions with 5½ - 6½ months between each repetition.
Clinical Exam: Give a clinical exam.
  Effects: Exam completed.
Scenarios
1) Simple BCT scenario: Patient is a woman aged 40, diagnosed with breast cancer (lump size 3.3cm), having no other medical complaints or allergies. Without any unexpected test results or complications we would expect patient A to receive the Lump Removal surgery followed by Radiotherapy and some adjuvant therapy. In this instance the guideline can be followed exactly as is shown.
2) Combining the BCT & AS guidelines: Patient is a woman aged 55, diagnosed with breast cancer (lump size 6 cm). She has previously had surgery for Aortic Stenosis and is currently on a long-term treatment regime of daily Warfarin and Aspirin. On the introduction of the BCT guideline to the patient’s treatment plan, a system would be expected to detect the contraindication between the BCT surgery and the Warfarin treatment. Warfarin should be stopped 9 days before surgery and Heparin should be given (beginning 2 days after the termination of the Warfarin). After surgery, Warfarin should be restarted 2 days before the discontinuation of Heparin. Aspirin is also contraindicated the week before surgery and should be suspended during this time.
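The detection step expected in the second scenario can be sketched as an interval-overlap check between ongoing medications and the exclusion windows stated in the surgery conditions. The Python sketch below is illustrative only: the class names, the `find_conflicts` function and the calendar dates are assumptions made for this example, and it covers detection only, not the generation of the Heparin bridging plan.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class Exclusion:
    """A drug that must not be taken around an action (window in days)."""
    drug: str
    days_before_start: int
    days_after_end: int


@dataclass
class Medication:
    """A medication course; `end` is None for an ongoing long-term regime."""
    drug: str
    start: date
    end: Optional[date] = None


def find_conflicts(action_start: date, action_end: date,
                   exclusions: List[Exclusion],
                   medications: List[Medication]) -> List[Tuple[str, date, date]]:
    """Return (drug, window_start, window_end) for each medication overlapping its window."""
    conflicts = []
    for ex in exclusions:
        window_start = action_start - timedelta(days=ex.days_before_start)
        window_end = action_end + timedelta(days=ex.days_after_end)
        for med in medications:
            if med.drug != ex.drug:
                continue
            med_end = med.end if med.end is not None else date.max
            # Two closed intervals overlap when each starts before the other ends.
            if med.start <= window_end and med_end >= window_start:
                conflicts.append((med.drug, window_start, window_end))
    return conflicts


# Exclusion windows taken from the Surgery Treatment conditions.
surgery_exclusions = [Exclusion("Warfarin", 7, 3), Exclusion("Aspirin", 7, 1)]

# Hypothetical dates for the patient in scenario 2.
surgery_day = date(2007, 3, 20)
current_medications = [
    Medication("Warfarin", date(2005, 1, 1)),
    Medication("Aspirin", date(2005, 1, 1)),
]
conflicts = find_conflicts(surgery_day, surgery_day,
                           surgery_exclusions, current_medications)
```

Each reported window gives the period over which the corresponding drug would have to be suspended or substituted for the surgery to go ahead.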
4. Evaluating with Benchmark Problems
The guideline shown here and its companion guideline (AS) are considered typical and realistic examples from their respective domains and have been used successfully (together with the patient situations and outcome scenarios) during the development and evaluation of our CPG system research vehicle CIG-Plan (the Computerised Integrated Guideline Planner). The use of benchmark problems has proven valuable for evaluating both the system functionality and the representational requirements of CIG-Plan, without imposing the constraints or limitations of existing formats. They are technically intricate enough to exemplify a number of the previously outlined requirements of computerised guideline systems [9], including the support of complex temporal representations, multi-dimensional constraints, guideline adaptation to external requirements and guideline integration. Alongside their ability to provide for the evaluation of different aspects of the system, the application of the guidelines to the graded range of patient situations has varied the difficulty and complexity of the benchmark problems. The table in Figure 1 states explicitly which areas of the pre-established taxonomy have been evaluated by the problems.
Requirement    BCT    AS    BCT & AS
1 Action Representation
1.1 Library of actions: Eligibility criteria (contextual differences, local preferences), Low level detailed specifications, Chosen by
1.2 Defining primitive & context sensitive actions
1.3 Action states (Relevant, requested, established, accepted, etc.)
2 Decision Representation & Analysis
2.2 Flexibility for patient / clinician preferences
2.3 Modelling multiple decision alternatives (Choosing optimal sequences, evaluating all possibilities)
2.5 Use of resource / finance constraints to refine decision options
3 Goal / Intention Modelling
3.1 Accuracy of representation & language
3.3 Types of goal (avoid, maintain, achieve)
4 Guideline Flow Representation – Scheduling & Sequencing
4.3 Constraint generation & maintenance
4.4 Sequencing (sequential, parallel, cyclical)
5 Guideline Flow Representation – Temporal
5.1 Action Types: Instantaneous, durative with continuous effects, durative with discretised effects, start and end time ranges
5.2 Events: At specific times, in relation to other events, open world
6 Critiquing, Validation, Maintenance & Guideline Development
6.2 Maintaining version control & history
6.4 Factoring finance / resource constraints at development
Fig. 1. Requirements from the taxonomy in [9] which can be evaluated with the Breast Cancer Treatment and Aortic Stenosis guidelines.
5. Conclusions
The need for standardised CPGs and complexity-scaled problems is clear. This paper has shown that, with the availability of knowledge-rich guidelines, a set of scenarios on which to test both representational adequacy and functional effectiveness, and a system able to represent and reason with such information, CPG system research and development can be facilitated. Benchmark problems may be defined as a taxonomy of graded hypothetical scenarios and ideal outcomes. While computerised CPG systems are commonly required to base their reasoning on a chosen (non-empty) set of standard CPGs that may be considered ‘ideal’ or (in our case at least) ‘skeletal’, in routine clinical practice it is common for patients to present with complex individual problems that do not exactly correspond to these pre-defined scenarios. There are often involved, and in some cases unpredictable, complications related to the environment or the health-care team. In the development of our system, CIG-Plan (designed to provide support for such non-standard patient instances), it was considered vital that realistic problems could be solved in an effective and efficient manner. A set of realistic benchmark problems was therefore essential to the system development and evaluation process. Our approach to this requirement was heavily influenced by similar efforts in the AIP community, which have been successful in providing the means to share research and to advance the discipline, not least in demonstrating contemporary technological limits.
References
1. Kaiser, K., et al., ed. Computer-based Support for Clinical Guidelines and Protocols. Proceedings of the Symposium on Computerized Guidelines and Protocols (CGP 2004): Studies in Health Technology and Informatics. Vol. 101. 2004.
2. Gerevini, A. International Planning Competition 2006. [Webpage] 2006 [cited 9/9/06]; Available from: http://zeus.ing.unibs.it/ipc-5/.
3. Edelkamp, S. and J. Hoffmann. International Planning Competition 2004. [Webpage] 2004 [cited 9/9/06]; Available from: http://andorfer.cs.uni-dortmund.de/~edelkamp/ipc-4/.
4. U.S. Department of Health and Human Services and Agency for Healthcare Research and Quality, National Guideline Clearinghouse. 2006: USA.
5. NHS National Library for Health, SEEK: Sheffield Evidence for Effectiveness and
6. FNCLCC, SOR: Recommandations pour la pratique clinique en cancérologie en accès
7. New Zealand Guidelines Group Incorporated, New Zealand Guidelines Group. 2006.
8. Miksch, S. Asgaard - Asbru Information. [Webpage] 2001 12/7/01 [cited 9/9/06]; Available
9. Bradbrook, K., et al. AI Planning Technology as a Component of Computerised Clinical Practice Guidelines in 10th Conference on Artificial Intelligence in Medicine (AIME-05). 2005. Aberdeen, UK: Springer.