On the Coherence of Comments and Implementations: A Public Benchmark

This page contains the replication package of the empirical investigation described in the paper "On the Coherence of Comments and Implementations: A Public Benchmark".

This page provides the data of the benchmark and the corresponding documentation.

Abstract

Source code comments provide useful information on the implementation of a software system and on the intent behind design decisions and goals. Writing informative comments is far from a trivial task. Moreover, source code comments tend to remain mostly unchanged during maintenance activities. As a consequence, the information provided in the lead comment of a method and in its corresponding implementation may not be coherent with each other (e.g., the comment does not properly describe the implementation). In this paper, we present the results of a manual assessment of the coherence between comments and implementations of 3636 methods, gathered from 3 Java open source software systems (for one of these systems, we considered 2 different releases). The resulting evaluations have been collected in a benchmark, which we made publicly available on the web. The assessment protocol we defined and used is also described.

Benchmark Report Files

The benchmark data contain the results of the coherence assessment between comments and implementations of methods gathered from 3 open source systems implemented in Java (4 releases in total, since two releases of JFreeChart are included), namely:

  • CoffeeMaker
  • JFreeChart 0.6.0
  • JFreeChart 0.7.1
  • JHotDraw 7.4.1

The data have been extracted and dumped into separate report files to support different kinds of analysis, depending on the goals of the researchers who want to replicate our study.

Report files are organized in two different groups:

  • report files whose name ends with _Coherence_Data.txt, containing information about the Coherence of Methods (i.e., lead comments and their corresponding implementation);

  • report files whose name ends with _Raw_Data.txt, providing all the raw data of corresponding methods (e.g., method name, its class name, (relative) path to the corresponding source code file).

Report files in both groups are further divided into two classes, depending on the granularity of the data provided: the entire benchmark, or a single project.

Replication Package

The Replication Dataset contains the following report files:

The Source Code packages (.zip files) used to create the benchmark are also provided:

Report Files Structure

The structure of all the report files is simple and well documented, in order to make automatic parsing as easy as possible.

Coherence Data Report Structure

Report files providing information about the Coherence of methods are structured according to the CSV (i.e., Comma Separated Values) format. Each line of the file contains the following information:

method_id, coherence
  • method_id: the unique identifier of the corresponding method
  • coherence: the coherence value associated with the comment and the implementation of the referred method. Allowed values are NOT_COHERENT and COHERENT. If needed, these values can be trivially mapped to 0 and 1, respectively.

Report files following this structure are:

  • Benchmark_Coherence_Data.txt
  • CoffeeMaker_Coherence_Data.txt
  • JFreeChart060_Coherence_Data.txt
  • JFreeChart071_Coherence_Data.txt
  • JHotDraw741_Coherence_Data.txt
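
The two-column layout described above can be read with a few lines of Python. What follows is only a minimal sketch under stated assumptions, not a parser shipped with the benchmark: the function name parse_coherence is made up for illustration, method_id is kept as a string because its type is not specified here, and any row whose second field is not one of the two allowed values (for instance a possible header row) is silently skipped.

```python
import csv
import io

# Map the two allowed labels onto the 0/1 encoding mentioned in the field description.
COHERENCE_TO_INT = {"NOT_COHERENT": 0, "COHERENT": 1}


def parse_coherence(lines):
    """Return {method_id: 0 or 1} from an iterable of 'method_id, coherence' lines."""
    labels = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 2:
            continue  # blank or incomplete line
        method_id, coherence = row[0].strip(), row[1].strip()
        if coherence not in COHERENCE_TO_INT:
            continue  # assumption: skip anything that is not a data row
        labels[method_id] = COHERENCE_TO_INT[coherence]
    return labels


# Invented two-line sample, only to show the call; not taken from the benchmark.
sample = io.StringIO("1, COHERENT\n2, NOT_COHERENT\n")
labels = parse_coherence(sample)
print(labels)

# With a real report file (encoding is an assumption to check locally):
# with open("Benchmark_Coherence_Data.txt", newline="") as f:
#     labels = parse_coherence(f)
```

Since the function accepts any iterable of text lines, the same code should work on the whole-benchmark file and on the project-specific files alike.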

Raw Data Report Structure

All the report files containing the raw data of the methods share exactly the same multiline structure. That is (for each method):

method_id, method_name, class_name, software_system
filepath, start_line, end_line,
Length of the Head Comment
Head Comment
Length of the Implementation
Method Implementation
###
  • method_id: the unique identifier of the corresponding method
  • method_name: the name of the target method
  • class_name: the name of the Class containing the target method
  • software_system: the name of the Software System from which the method has been extracted
  • filepath: (Relative) path to the source code file in the Software Package
  • start_line: First line number of the method implementation within the original source file
  • end_line: Last line number of the method implementation within the original source file
  • Length of the Head Comment: the number of lines that follow and make up the Head Comment of the method (this information is reported to aid the parsing of the report file)
  • Head Comment: Head Comment of the Method
  • Length of the Implementation: the number of lines that follow and make up the method implementation (this information is reported to aid the parsing of the report file)
  • Method Implementation: code of the method implementation
  • ###: delimiter that separates the raw data of one method from the next.

NOTE:

  • The first two lines of each record report the data in the CSV format.
  • Even if the Software System name is redundant in project-specific reports, keeping the same structure makes it possible to define a single algorithm able to parse all the report files.

Report files following this structure are:

  • Benchmark_Raw_Data.txt
  • CoffeeMaker_Raw_Data.txt
  • JFreeChart060_Raw_Data.txt
  • JFreeChart071_Raw_Data.txt
  • JHotDraw741_Raw_Data.txt
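
Because each record states how many lines the Head Comment and the Method Implementation occupy, a raw data report can be parsed by counting lines rather than by guessing where a comment ends. The sketch below is one possible way to do it, not the authors' own tooling. It assumes that the two length lines hold bare integers, that the fields on the first two lines contain no embedded commas, and that blank lines between records can be ignored. MethodRecord and parse_raw_data are hypothetical names, and the sample record is invented purely to exercise the code.

```python
from dataclasses import dataclass


@dataclass
class MethodRecord:
    method_id: str
    method_name: str
    class_name: str
    software_system: str
    filepath: str
    start_line: int
    end_line: int
    head_comment: str
    implementation: str


def parse_raw_data(lines):
    """Yield one MethodRecord per method from an iterable of report lines."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header.strip():
            continue  # assumption: tolerate blank lines between records
        # A truncated record makes next(it) fail, which surfaces as RuntimeError.
        method_id, name, cls, system = [f.strip() for f in header.split(",")][:4]
        # The second line may end with a trailing comma; extra fields are ignored.
        location = [f.strip() for f in next(it).split(",")]
        comment = [next(it) for _ in range(int(next(it)))]
        body = [next(it) for _ in range(int(next(it)))]
        if next(it, "###").strip() != "###":
            raise ValueError(f"missing ### delimiter after method {method_id}")
        yield MethodRecord(method_id, name, cls, system, location[0],
                           int(location[1]), int(location[2]),
                           "\n".join(comment), "\n".join(body))


# Invented sample record (not benchmark data) following the documented layout.
sample = """\
1, getName, Recipe, CoffeeMaker
src/Recipe.java, 10, 12,
3
/**
 * Returns the name.
 */
3
public String getName() {
    return name;
}
###
"""
records = list(parse_raw_data(sample.splitlines()))
print(records[0].method_name, records[0].start_line, records[0].end_line)
```

Records parsed this way can be joined with the coherence labels through method_id, which both report structures share.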