Evaluation

Metrics

Evaluation will be done according to the following metrics:

Task 1 – Detection:

  • Sensitivity
  • False Positive Count (total per scan)

Task 2 – Segmentation:

  • Dice Similarity Coefficient
  • Hausdorff distance (modified, 95th percentile)
  • Volumetric Similarity  

An indication of how these metrics can be determined can be found here.
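As a rough illustration of what these segmentation metrics measure, the sketch below computes them for a pair of binary 3D masks. This is not the official evaluation code; in particular, the Hausdorff distance here is a simplified mask-to-mask variant (distances taken from all foreground voxels rather than from extracted surfaces), so values may differ slightly from the reference implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice(pred, gt):
    # Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def volumetric_similarity(pred, gt):
    # VS = 1 - |V_pred - V_gt| / (V_pred + V_gt); 1 means identical volumes
    vp, vg = int(pred.sum()), int(gt.sum())
    return 1.0 - abs(vp - vg) / (vp + vg) if (vp + vg) else 1.0

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    # Modified Hausdorff distance: 95th percentile of the symmetric
    # voxel-to-mask distances (a simplification of the surface-based HD95).
    pred, gt = pred.astype(bool), gt.astype(bool)
    d_to_gt = distance_transform_edt(~gt, sampling=spacing)
    d_to_pred = distance_transform_edt(~pred, sampling=spacing)
    dists = np.concatenate([d_to_gt[pred], d_to_pred[gt]])
    return float(np.percentile(dists, 95))
```

All three functions expect the prediction and ground truth as binary volumes of the same shape; `spacing` should be the voxel size so distances come out in millimetres.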

Individual aneurysms are defined as 3D connected components. The full source code that will be used for evaluation for each task can be found here: evaluation.
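Extracting individual aneurysms as 3D connected components could look like the following sketch. The connectivity (26-neighbourhood here) is an assumption; the official evaluation code may use a different structuring element.

```python
import numpy as np
from scipy.ndimage import label

def individual_aneurysms(mask):
    # Split a binary segmentation into individual aneurysms, each being
    # one 3D connected component. 26-connectivity is assumed here.
    structure = np.ones((3, 3, 3), dtype=int)
    labels, num_components = label(mask, structure=structure)
    return labels, num_components
```

`labels` assigns each voxel the integer ID of its component (0 for background), so per-aneurysm metrics can be computed by masking on each ID in turn.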

It is worth noting that the detection/segmentation of the treated (e.g. coiled) aneurysms will not be considered when assessing the performance of untreated aneurysm detection or segmentation. Any false positive detections at the location of treated aneurysms will be ignored during evaluation. The above metrics will only be determined for untreated, unruptured aneurysms.

A candidate is counted as a positive detection when its location coordinate lies within the radius of the aneurysm from the ground-truth centre-of-mass location.
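This criterion amounts to a simple Euclidean distance check. A minimal sketch (the function name is illustrative, not from the challenge code):

```python
import numpy as np

def is_true_positive(candidate, gt_center, radius):
    # A candidate counts as a positive detection when its coordinate lies
    # within `radius` (the aneurysm radius, in the same units as the
    # coordinates) of the ground-truth centre of mass.
    candidate = np.asarray(candidate, dtype=float)
    gt_center = np.asarray(gt_center, dtype=float)
    return bool(np.linalg.norm(candidate - gt_center) <= radius)
```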

Ranking

Each metric is averaged over all test scans. For each metric, the participating teams are sorted from best to worst. The best team receives a rank of 0 and the worst team a rank of 1; every other team receives a rank between 0 and 1, proportional to its performance within the range of that metric. Finally, the per-metric ranks are averaged into the overall rank that is used for the Results.

Task 1

For example: the best team A has a Sensitivity of 80 and the worst team B a Sensitivity of 60. In the ranking: A=0.00 and B=1.00. Another team C has a Sensitivity of 78, which is then ranked at 1.0 - (78 - 60) / (80 - 60) = 0.10. The actual Python code to compute this is:

import pandas

def getRankingHigherIsBetter(df, metric):
  return 1.0 - getRankingLowerIsBetter(df, metric)

def getRankingLowerIsBetter(df, metric):
  # Average the metric over all test scans, per team
  rank = df.groupby('team')[metric].mean()

  # Rescale linearly so the best team gets 0 and the worst gets 1
  lowest  = rank.min()
  highest = rank.max()

  return (rank - lowest) / (highest - lowest)

# Pandas DataFrame containing the results for each team for each test image
df = loadResultData()
rankSens    = getRankingHigherIsBetter(df, 'Sens')
rankFPCount = getRankingLowerIsBetter(df, 'FPCount')

finalRank = (rankSens + rankFPCount) / 2

Task 2

Scans in which no aneurysm is present in the original TOF-MRA cannot be assessed in terms of segmentation metrics, and are therefore not included when computing them.
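One way this exclusion could be implemented before ranking is shown below. The `numAneurysms` column and the helper name are hypothetical, not part of the published evaluation code.

```python
import pandas

def excludeEmptyScans(df):
    # Hypothetical helper: drop rows for test scans that contain no
    # (untreated) aneurysm, assuming a per-scan 'numAneurysms' column,
    # so segmentation metrics are averaged only over assessable scans.
    return df[df['numAneurysms'] > 0]
```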

For example: the best team A has a DSC of 80 and the worst team B a DSC of 60. In the ranking: A=0.00 and B=1.00. Another team C has a DSC of 78, which is then ranked at 1.0 - (78 - 60) / (80 - 60) = 0.10. The actual Python code to compute this is:

import pandas

def getRankingHigherIsBetter(df, metric):
  return 1.0 - getRankingLowerIsBetter(df, metric)

def getRankingLowerIsBetter(df, metric):
  # Average the metric over all test scans, per team
  rank = df.groupby('team')[metric].mean()

  # Rescale linearly so the best team gets 0 and the worst gets 1
  lowest  = rank.min()
  highest = rank.max()

  return (rank - lowest) / (highest - lowest)

# Pandas DataFrame containing the results for each team for each test image
df = loadResultData()
rankDsc     = getRankingHigherIsBetter(df, 'dsc')
rankH95     = getRankingLowerIsBetter(df, 'h95')
rankVS      = getRankingHigherIsBetter(df, 'vs')   # Volumetric Similarity: higher is better
rankSens    = getRankingHigherIsBetter(df, 'Sens')
rankFPCount = getRankingLowerIsBetter(df, 'FPCount')

finalRank = (rankDsc + rankH95 + rankVS + rankSens + rankFPCount) / 5