DEFT 2020

Défi Fouille de Textes@JEP-TALN 2020

semantic similarity and fine-grained information extraction


Following the 2019 edition of the DEFT challenge, we propose a fine-grained information extraction task on clinical cases. In addition, we propose two tasks on semantic similarity between sentences from Wikipedia.

What are clinical cases?
Clinical cases describe the clinical situations of patients, real or fictitious. The cases are published in various sources (scientific, didactic, associative, legal...). They are de-identified. Their purpose is to present situations that are typical (as in didactic sources) or rare (as in scientific sources).

Global information on the corpus
The corpus used in this challenge is part of a larger corpus of clinical cases, which has more complete annotations and associated information [1]. For DEFT 2019, the organizers focused on clinical cases associated with keywords and discussions. These clinical cases relate to various medical specialties (cardiology, urology, oncology, obstetrics, pulmonology, gastroenterology...). They were published in different French-speaking countries (France, Belgium, Switzerland, Canada, African countries, tropical countries...).
The reference data were obtained by consensus from two independent annotations.

[1] N Grabar, V Claveau, C Dalloux. CAS: French Corpus with Clinical Cases. LOUHI 2018, p. 1-7

Access to data
Access to the data is granted only after all team members have signed the user agreement. Participants may take part in one or more tasks. By obtaining the data, participants commit to submitting results for at least one task.

Task Descriptions

The proposed tasks are:

  1. Task 1: identifying the level of semantic similarity between pairs of parallel and non-parallel sentences
    • Purpose: to rate the similarity of sentence pairs on a scale from 0 to 5
    • Input: a pair of sentences
    • Output: the similarity level, from 0 to 5
    • Evaluation: the difference between the reference value and the predicted value

  2. Task 2: identifying the parallel sentence for a given sentence
    • Purpose: given a source sentence and several target sentences, to identify which target sentence is parallel to the source
    • Input: a source sentence and several target sentences
    • Output: the corresponding parallel sentence
    • Evaluation: boolean (the selected sentence is either correct or incorrect)

  3. Task 3: fine-grained information extraction
    • Purpose: to identify fine-grained information from 12 categories in a corpus of clinical cases written in French
      Four domains are covered:
      • patients: anatomy
      • clinical practice: exam, pathology, sign or symptom
      • medical and surgical treatments: substance, dose, duration, frequency, mode of administration, treatment, value
      • time: date, moment
      The annotation guidelines are available in French.
    • Input: a set of clinical cases
    • Output: all related information
    • Evaluation: exact match (type and boundaries) of each extracted item against the gold-standard annotations
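
The exact-match criterion of Task 3 can be illustrated with a minimal sketch. It assumes that each annotation is stored as a character span with a category label, and that the matches are summarized as precision, recall and F1. The `Entity` type, the offset convention and the choice of summary measures are assumptions made for this illustration; the description above states only that type and boundaries must both match.

```python
from typing import NamedTuple, Set


class Entity(NamedTuple):
    start: int   # character offset where the span begins (assumed convention)
    end: int     # character offset where the span ends, exclusive (assumed)
    label: str   # category name, e.g. "anatomy" or "dose"


def exact_match_scores(gold: Set[Entity], predicted: Set[Entity]) -> dict:
    """Credit a prediction only if its type and both boundaries match a gold item."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"tp": true_positives, "precision": precision, "recall": recall, "f1": f1}


# Invented offsets and labels, not taken from the corpus.
gold = {Entity(0, 8, "anatomy"), Entity(15, 21, "dose"), Entity(30, 40, "date")}
predicted = {
    Entity(0, 8, "anatomy"),   # exact match: counted
    Entity(15, 22, "dose"),    # right type, wrong end boundary: not counted
    Entity(30, 40, "moment"),  # right boundaries, wrong type: not counted
}
scores = exact_match_scores(gold, predicted)
print(scores)
```

In this invented example only one of the three predictions is credited, because a boundary that is off by one character or a wrong category both count as errors under exact matching.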

Evaluation scripts
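
The scoring rules stated for Tasks 1 and 2 can be illustrated with the sketch below. It is not the official evaluation script. It assumes that the per-pair differences of Task 1 are averaged as absolute values, and that the boolean outcomes of Task 2 are averaged into an accuracy; both aggregation choices, the function names and the sentence identifiers are assumptions made for this illustration.

```python
from typing import Dict, List


def mean_absolute_difference(reference: List[float], predicted: List[float]) -> float:
    """Task 1 (assumed aggregation): average |reference - predicted| over all pairs."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one sentence pair is required")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def parallel_sentence_accuracy(reference: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Task 2 (assumed aggregation): share of source sentences whose chosen target is correct."""
    if not reference:
        raise ValueError("at least one source sentence is required")
    correct = sum(
        1 for source_id, target_id in reference.items()
        if predicted.get(source_id) == target_id  # boolean outcome per source sentence
    )
    return correct / len(reference)


# Invented values, not taken from the challenge data.
task1_error = mean_absolute_difference([0, 3, 5, 2], [1, 3, 4, 2])
task2_accuracy = parallel_sentence_accuracy(
    {"s1": "t2", "s2": "t1", "s3": "t3", "s4": "t1"},
    {"s1": "t2", "s2": "t3", "s3": "t3", "s4": "t1"},
)
print(task1_error, task2_accuracy)
```

A lower value is better for the Task 1 measure, and a higher value is better for the Task 2 measure. A source sentence with no prediction is counted as incorrect in this sketch.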

Program Committee

Organization Committee

  • Rémi CARDON (STL, CNRS, Université de Lille)
  • Natalia GRABAR (STL, CNRS, Université de Lille)
  • Cyril GROUIN (LIMSI, CNRS, Université Paris-Saclay)
  • Thierry HAMON (LIMSI, CNRS, Université Paris-Saclay; Université Paris XIII)