Thu 25 Mar 2021 14:00 - 14:30 at Virtual Space A - Session 7 Chair(s): Emma Söderberg
Fri 26 Mar 2021 17:00 - 17:30 at Virtual Space A - Session 18 Chair(s): Jens Lincke

Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they are constructed from a collection of code snippets interleaved with text and visualisation. This allows interactive exploration, and snippets may be executed in different orders, which may give rise to different results due to side effects between snippets. Previous studies have shown the presence of considerable code duplication (code clones) in the sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing 37 million individual snippets and 227 million lines of code. We study clones at the level of individual snippets, and study the extent to which snippets recur across multiple notebooks. We study both identical clones and approximate clones and conduct a small-scale ocular inspection of the most common clones. We find that code cloning is common in Jupyter notebooks: more than 70% of all code snippets are exact copies of other snippets (with possible differences in whitespace), and around 50% of all notebooks do not have any unique snippet, but consist solely of snippets that are also found elsewhere. In notebooks written in Python, at least 80% of all snippets are approximate clones, and the prevalence of code cloning is higher in Python than in other languages. We further find that clones between different repositories are far more common than clones within the same repository. However, the most common individual repository from which a Jupyter notebook contains clones is the repository in which the notebook itself resides.
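To make the two headline measurements concrete (the share of snippets that are exact copies up to whitespace, and the share of notebooks with no unique snippet), here is a minimal sketch. It is not the authors' tooling: the function names `normalize` and `clone_stats`, the in-memory corpus layout, and the choice to strip all whitespace before hashing are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(snippet: str) -> str:
    """Remove all whitespace (an assumed, deliberately coarse normalization)."""
    return "".join(snippet.split())


def clone_stats(notebooks: Dict[str, List[str]]) -> Tuple[float, float]:
    """Return (share of snippets that are clones,
    share of notebooks containing no unique snippet).

    `notebooks` maps a hypothetical notebook name to its code snippets.
    """
    counts = defaultdict(int)  # hash of normalized snippet -> occurrences
    hashed = {}                # notebook name -> list of snippet hashes
    for name, snippets in notebooks.items():
        hashed[name] = [
            hashlib.sha1(normalize(s).encode("utf-8")).hexdigest()
            for s in snippets
            if s.strip()  # skip empty snippets
        ]
        for h in hashed[name]:
            counts[h] += 1

    total = sum(len(hs) for hs in hashed.values())
    if total == 0:
        return 0.0, 0.0

    # A snippet counts as a clone if its normalized form occurs more than once.
    cloned = sum(1 for hs in hashed.values() for h in hs if counts[h] > 1)
    # A notebook has no unique snippet if every one of its snippets is a clone.
    no_unique = sum(
        1 for hs in hashed.values() if hs and all(counts[h] > 1 for h in hs)
    )
    return cloned / total, no_unique / len(hashed)
```

Called on a dictionary such as `{"a.ipynb": ["import numpy as np", "x = 1"], "b.ipynb": ["import  numpy as np"]}`, the function treats the two import lines as one clone class. Note that this sketch also counts copies inside a single notebook, and that stripping all whitespace ignores Python indentation; a real study would need a more careful normalization.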

Conference Day
Thu 25 Mar

Displayed time zone: Belfast

13:00 - 14:30
Session 7, Research Papers, at Virtual Space A
Chair(s): Emma Söderberg (Lund University)
13:00
30m
Live Q&A
Transparent Synchronous Dataflow
Research Papers
Steven Cheung (University of Birmingham, UK), Dan Ghica (University of Birmingham), Koko Muroya (RIMS, Kyoto University, JP)
13:30
30m
Live Q&A
Consistency types for replicated data in a higher-order distributed programming language
Research Papers
Xin Zhao (KTH Royal Institute of Technology), Philipp Haller (KTH)
14:00
30m
Live Q&A
Jupyter Notebooks on GitHub: Characteristics and Code Clones
Research Papers
Malin Källén (Uppsala University), Tobias Wrigstad (Uppsala University, Sweden)

Conference Day
Fri 26 Mar

Displayed time zone: Belfast

17:00 - 17:30
Session 18, Research Papers, at Virtual Space A
Chair(s): Jens Lincke (Hasso Plattner Institute, University of Potsdam, Germany)
17:00
30m
Live Q&A
Jupyter Notebooks on GitHub: Characteristics and Code Clones
Research Papers
Malin Källén (Uppsala University), Tobias Wrigstad (Uppsala University, Sweden)