Publications

Pinning Is Futile: You Need More Than Local Dependency Versioning to Defend Against Supply Chain Attacks

Published in The 2025 ACM International Conference on the Foundations of Software Engineering, 2025

Recent high-profile incidents in open-source software have greatly raised practitioner attention to software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependencies to specific versions rather than letting them float within version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely.

4.5 Million (Suspected) Fake Stars in GitHub: A Growing Spiral of Popularity Contests, Scams, and Malware

Published in arXiv Preprint, 2024

GitHub, the de-facto platform for open-source software development, provides a set of social-media-like features to signal high-quality repositories. Among them, the star count is the most widely used popularity signal, but it is also at risk of being artificially inflated (i.e., faked), decreasing its value as a decision-making signal and posing a security risk to all GitHub users. In this paper, we present a systematic, global, and longitudinal measurement study of fake stars in GitHub. To this end, we build StarScout, a scalable tool able to detect anomalous starring behaviors (i.e., low activity and lockstep) across the entire GitHub metadata. Analyzing the data collected using StarScout, we find that: (1) fake-star-related activities have rapidly surged since 2024; (2) the user profile characteristics of fake stargazers are not distinct from average GitHub users, but many of them have highly abnormal activity patterns; (3) the majority of fake stars are used to promote short-lived malware repositories masquerading as pirating software, game cheats, or cryptocurrency bots; (4) some repositories may have acquired fake stars for growth hacking, but fake stars only have a promotion effect in the short term (i.e., less than two months) and become a burden in the long term. Our study has implications for platform moderators, open-source practitioners, and supply chain security researchers.

How Early Participation Determines Long-Term Sustained Activity in GitHub Projects?

Published in The 2023 ACM 31st Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023

Although the open source model bears many advantages in software development, open source projects are always hard to sustain. Previous research on open source sustainability mainly focuses on projects that have already reached a certain level of maturity (e.g., with communities, releases, and downstream projects). However, limited attention is paid to the development of (sustainable) open source projects in their infancy, and we believe an understanding of early sustainability determinants is crucial for project initiators, incubators, newcomers, and users. In this paper, we aim to explore the relationship between early participation factors and long-term project sustainability. We leverage a novel methodology that measures the early participation of 290,255 GitHub projects during the first three months with reference to the Blumberg model, trains an XGBoost model to predict project’s two-year sustained activity, and interprets the trained model using LIME. We quantitatively show that early participants have a positive effect on project’s future sustained activity if they have prior experience in OSS project incubation and demonstrate concentrated focus and steady commitment. Participation from non-code contributors and detailed contribution documentation also promote project’s sustained activity. Compared with individual projects, building a community that consists of more experienced core developers and more active peripheral developers is important for organizational projects. This study provides unique insights into the incubation and recognition of sustainable open source projects, and our interpretable prediction approach can also offer guidance to open source project initiators and newcomers.

Personalized First Issue Recommender for Newcomers in Open Source Projects

Published in The 38th IEEE/ACM International Conference on Automated Software Engineering, 2023

Many open source projects provide good first issues (GFIs) to attract and retain newcomers. Although several automated GFI recommenders have been proposed, existing recommenders are limited to recommending generic GFIs without considering differences between individual newcomers. However, we observe mismatches between generic GFIs and the diverse background of newcomers, resulting in failed attempts, discouraged onboarding, and delayed issue resolution. To address this problem, we assume that personalized first issues (PFIs) for newcomers could help reduce the mismatches. To justify the assumption, we empirically analyze 37 newcomers and their first issues resolved across multiple projects. We find that the first issues resolved by the same newcomer share similarities in task type, programming language, and project domain. These findings underscore the need for a PFI recommender to improve over state-of-the-art approaches. For that purpose, we identify features that influence newcomers’ personalized selection of first issues by analyzing the relationship between possible features of the newcomers and the characteristics of the newcomers’ chosen first issues. We find that the expertise preference, OSS experience, activeness, and sentiment of newcomers drive their personalized choice of the first issues. Based on these findings, we propose a Personalized First Issue Recommender (PFIRec), which employs LambdaMART to rank candidate issues for a given newcomer by leveraging the identified influential features. We evaluate PFIRec using a dataset of 68,858 issues from 100 GitHub projects. The evaluation results show that PFIRec outperforms existing first issue recommenders, potentially doubling the probability that the top recommended issue is suitable for a specific newcomer and reducing one-third of a newcomer’s unsuccessful attempts to identify suitable first issues, in the median.

Understanding and Remediating Open-Source License Incompatibilities in the PyPI Ecosystem

Published in The 38th IEEE/ACM International Conference on Automated Software Engineering, 2023

The reuse and distribution of open-source software must be in compliance with its accompanying open-source license. In modern packaging ecosystems, maintaining such compliance is challenging because a package may have a complex multi-layered dependency graph with many packages, any of which may have an incompatible license. Although prior research finds that license incompatibilities are prevalent, empirical evidence is still scarce in some modern packaging ecosystems (e.g., PyPI). It also remains unclear how developers remediate the license incompatibilities in the dependency graphs of their packages (including direct and transitive dependencies), let alone any automated approaches. To bridge this gap, we conduct a large-scale empirical study of license incompatibilities and their remediation practices in the PyPI ecosystem. We find that 7.27% of the PyPI package releases have license incompatibilities and 61.3% of them are caused by transitive dependencies, causing challenges in their remediation; for remediation, developers can apply one of the five strategies: migration, removal, pinning versions, changing their own licenses, and negotiation. Inspired by our findings, we propose SILENCE, an SMT-solver-based approach to recommend license incompatibility remediations with minimal costs in package dependency graph. Our evaluation shows that the remediations proposed by SILENCE can match 19 historical real-world cases (except for migrations not covered by an existing knowledge base) and have been accepted by five popular PyPI packages whose developers were previously unaware of their license incompatibilities.

Open Source Software Onboarding as a University Course: An Experience Report

Published in The 2023 IEEE/ACM 45th International Conference on Software Engineering, 2023

Without newcomers, open source software (OSS) projects are hardly sustainable. Yet, newcomers face a steep learning curve during OSS onboarding in which they must overcome a multitude of technical, social, and knowledge barriers. To ease the onboarding process, OSS communities are utilizing mentoring, task recommendation (e.g., “good first issues”), and engagement programs (e.g., Google Summer of Code). However, newcomers must first cultivate their motivation for OSS contribution and learn the necessary preliminaries before they can take advantage of these mechanisms. We believe this gap can be filled by a dedicated, practice-oriented OSS onboarding course. In this paper, we present our experience of teaching an OSS onboarding course at Peking University. The course contains a series of lectures, labs, and invited talks to prepare students with the required skills and motivate them to contribute to OSS. In addition, students are required to complete a semester-long course project in which they plan and make actual contributions to OSS projects. They can either 1) contribute to one of the given OSS projects with dedicated mentoring from the course, or 2) contribute to any OSS project they prefer without such mentoring. Finally, 16 out of 19 students have successfully contributed to open source and five retained. However, the onboarding trajectories and outcomes differ vastly between the two groups of students with different course project choices, yielding lessons for software engineering education.

Automating Dependency Updates in Practice: An Exploratory Study on GitHub Dependabot

Published in IEEE Transactions on Software Engineering, 2023

Dependency update bots automatically open pull requests to update software dependencies on behalf of developers. Early research shows that developers are suspicious of updates performed by bots and feel tired of overwhelming notifications from these bots. Despite this, dependency update bots are becoming increasingly popular. Such contrast motivates us to investigate Dependabot, currently the most visible bot in GitHub, to reveal the effectiveness and limitations of the state-of-the-art dependency update bots. We use exploratory data analysis and a developer survey to evaluate the effectiveness of Dependabot in keeping dependencies up-to-date, reducing update suspicion, and reducing notification fatigue. We obtain mixed findings. On the positive side, Dependabot is effective in reducing technical lag and developers are highly receptive to its pull requests. On the negative side, its compatibility scores are too scarce to be effective in reducing update suspicion; developers tend to configure Dependabot toward reducing the number of notifications; and 11.3% of projects have deprecated Dependabot in favor of other alternatives. Our findings reveal substantial room for improvement in dependency update bots, which calls for effort from both bot designers and software engineering researchers.

Suboptimal Comments in Java Projects: From Independent Comment Changes to Commenting Practices

Published in ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 2, 2023

High-quality source code comments are valuable for software development and maintenance; however, code often contains low-quality comments or lacks them altogether. We name such source code comments as suboptimal comments. Such suboptimal comments create challenges in code comprehension and maintenance. Despite substantial research on low-quality source code comments, empirical knowledge about commenting practices that produce suboptimal comments and reasons that lead to suboptimal comments is lacking. We help bridge this knowledge gap by investigating (1) independent comment changes (ICCs)—comment changes committed independently of code changes—which likely address suboptimal comments, (2) commenting guidelines, and (3) comment-checking tools and comment-generating tools, which are often employed to help commenting practice—especially to prevent suboptimal comments. We collect 24M+ comment changes from 4,392 open-source GitHub Java repositories and find that ICCs widely exist. The ICC ratio—proportion of ICCs among all comment changes—is ~15.5%, with 98.7% of the repositories having ICC. Our thematic analysis of 3,533 randomly sampled ICCs provides a three-dimensional taxonomy for what is changed (four comment categories and 13 subcategories), how it changed (six commenting activity categories), and what factors are associated with the change (three factors). We investigate 600 repositories to understand the prevalence, content, impact, and violations of commenting guidelines. We find that only 15.5% of the 600 sampled repositories have any commenting guidelines. We provide the first taxonomy for elements in commenting guidelines: where and what to comment are particularly important. The repositories without such guidelines have a statistically significantly higher ICC ratio, indicating the negative impact of the lack of commenting guidelines. However, commenting guidelines are not strictly followed: 85.5% of checked repositories have violations.
We also systematically study how developers use two kinds of tools, comment-checking tools and comment-generating tools, in the 4,392 repositories. We find that the use of Javadoc tool is negatively correlated with the ICC ratio, while the use of Checkstyle has no statistically significant correlation; the use of comment-generating tools leads to a higher ICC ratio. To conclude, we reveal issues and challenges in current commenting practice, which help understand how suboptimal comments are introduced. We propose potential research directions on comment location prediction, comment generation, and comment quality assessment; suggest how developers can formulate commenting guidelines and enforce rules with tools; and recommend how to enhance current comment-checking and comment-generating tools.

Self-Admitted Library Migrations in Java, JavaScript, and Python Packaging Ecosystems: A Comparative Study

Published in The 2023 IEEE 30th International Conference on Software Analysis, Evolution and Reengineering, 2023

Reusing open-source software libraries has become the norm in modern software development, but libraries can fail due to various reasons, e.g., security vulnerabilities, lacking features, and end of maintenance. In some cases, developers need to replace a library with another competent library with similar functionalities, i.e., library migration. Previous studies have leveraged library migrations as a unique lens of observation to reveal insights into library selection and dependency management in general. However, they are heavily biased toward Java while the generalizability of their findings remains unknown. In this paper, we present a comparative study on self-admitted library migrations (SALMs) from three packaging ecosystems: Java/Maven, JavaScript/npm, and Python/PyPI. For this study, we design a set of semi-automatic methods that accurately locate SALMs, their domains, and their rationales from git repositories. We reveal that SALMs are prevalent and highly unidirectional in all three ecosystems, and the underlying rationales can be well covered by a previous theoretical framework. Also, SALMs in these ecosystems present domain similarity (testing frameworks, web frameworks, HTTP clients, and serialization). However, we observe differences in the longitudinal trends, the distributions of rationales, the ecosystem-specific domains, and the levels of unidirectionality, all of which indicate that Python/PyPI sees increasingly intense competition between libraries and deserves more research on library recommendation and migration.

GFI-Bot: Automated Good First Issue Recommendation on GitHub

Published in The 2022 ACM 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022

To facilitate newcomer onboarding, GitHub recommends the use of “good first issue” (GFI) labels to signal issues suitable for newcomers to resolve. However, previous research shows that manually labeled GFIs are scarce and inappropriate, calling for automated recommendations. In this paper, we present GFI-Bot (accessible at https://gfibot.io), a proof-of-concept machine learning powered bot for automated GFI recommendation in practice. Project maintainers can configure GFI-Bot to discover and label possible GFIs so that newcomers can easily locate issues for making their first contributions. GFI-Bot also provides a high-quality, up-to-date dataset for advancing GFI recommendation research.

Recommending Good First Issues in GitHub OSS Projects

Published in The 2022 IEEE/ACM 44th International Conference on Software Engineering, 2022

Attracting and retaining newcomers is vital for the sustainability of an open-source software project. However, it is difficult for newcomers to locate suitable development tasks, while existing “Good First Issues” (GFI) on GitHub are often insufficient and inappropriate. In this paper, we propose RecGFI, an effective practical approach for the recommendation of good first issues to newcomers, which can be used to relieve maintainer burden and help newcomers onboard. RecGFI models an issue with features from multiple dimensions (content, background, and dynamics) and uses an XGBoost classifier to generate its probability of being a GFI. To evaluate RecGFI, we collect 53,510 resolved issues among 100 GitHub projects and carefully restore their historical states to build ground truth datasets. Our evaluation shows that RecGFI can achieve up to 0.853 AUC in the ground truth dataset and outperforms alternative models. Our interpretable analysis of the trained model further reveals interesting observations about GFI characteristics. Finally, we report latest open issues (without GFI-signaling labels but recommended as GFI by our approach) to project maintainers among which 16 are confirmed as real GFIs. Among the 16 confirmed GFIs, two issues have attracted newcomer attention and one has already been resolved by a newcomer.

Demystifying Software Release Note Issues on GitHub

Published in The 2022 IEEE/ACM 30th International Conference on Program Comprehension, 2022

Release notes (RNs) summarize main changes between two consecutive software versions and serve as a central source of information when users upgrade software. While producing high quality RNs can be hard and poses a variety of challenges to developers, a comprehensive empirical understanding of these challenges is still lacking. In this paper, we bridge this knowledge gap by manually analyzing 1,731 latest GitHub issues to build a comprehensive taxonomy of RN issues with four dimensions: Content, Presentation, Accessibility, and Production. Among these issues, nearly half (48.47%) of them focus on Production; Content, Accessibility, and Presentation take 25.61%, 17.65%, and 8.27%, respectively. We find that: 1) RN producers are more likely to miss information than to include incorrect information, especially for breaking changes; 2) improper layout may bury important information leading to user confusion; 3) many users find RNs inaccessible due to link deterioration, lack of notification, and obfuscated RN locations; 4) automating and regulating RN production remain challenging despite the great needs of RN producers. Our taxonomy can serve as a roadmap to improve RN production in practice and also reveal interesting future research directions.

Commercial Participation in OpenStack: Two Sides of a Coin

Published in Computer, Volume 55, Issue 2, 2022

This article provides a landscape of commercial participation in OpenStack, a large-scale open source software (OSS) ecosystem. We discuss how to achieve a balance between maximizing business profit and ensuring the long-term sustainability of OSS ecosystems.

A Large-Scale Empirical Study on Java Library Migrations: Prevalence, Trends, and Rationales

Published in The 2021 ACM 29th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021

With the rise of open-source software and package hosting platforms, reusing 3rd-party libraries has become a common practice. Due to risks including security vulnerabilities, lack of maintenance, unexpected failures, and license issues, a project may completely remove a used library and replace it with another library, which we call library migration. Despite substantial research on dependency management, the understanding of how and why library migrations occur is still lacking. Achieving this understanding may help practitioners optimize their library selection criteria, develop automated approaches to monitor dependencies, and provide migration suggestions for their libraries or software projects. In this paper, through a fine-grained commit-level analysis of 19,652 Java GitHub projects, we extract the largest migration dataset to date (1,194 migration rules, 3,163 migration commits). We show that 8,065 projects have at least one library removal and 1,564 (lower-bound) to 5,004 (upper-bound) projects have at least one migration, indicating the prevalence of library migrations. We find that projects with library removals have one removal per 139 commits, and projects with migrations have 2 to 4 migrations in median. We discover that library migrations are dominated by several domains presenting a long tail distribution. Also, migrations are highly unidirectional in that libraries are either mostly abandoned or mostly chosen in our project corpus. A thematic analysis on related commit messages, issues, and pull requests identifies 14 frequently mentioned migration reasons, 7 of which are not discussed in previous work. Our findings can be operationalized into actionable insights for package hosting platforms, project maintainers, and library developers.

MigrationAdvisor: Recommending Library Migrations from Large-Scale Open-Source Data

Published in The 2021 IEEE/ACM 43rd International Conference on Software Engineering, 2021

During software maintenance, developers may need to migrate an already in-use library to another library with similar functionalities. However, it is difficult to make the optimal migration decision with limited information, knowledge, or expertise. In this paper, we present MigrationAdvisor, an evidence-based tool to recommend library migration targets through intelligent analysis upon a large number of GitHub repositories and Java libraries. The migration advisories are provided through a search engine style web service where developers can seek migration suggestions for a specific library. We conduct systematic evaluations on the correctness of results, and evaluate the usefulness of the tool by collecting usage feedback from industry developers. Video: https://youtu.be/4I75W22TqwQ.

A Multi-Metric Ranking Approach for Library Migration Recommendations

Published in The 2021 IEEE 28th International Conference on Software Analysis, Evolution and Reengineering, 2021

The wide adoption of third-party libraries in software projects is beneficial but also risky. An already-adopted third-party library may be abandoned by its maintainers, may have license incompatibilities, or may no longer align with current project requirements. Under such circumstances, developers need to migrate the library to another library with similar functionalities, but the migration decisions are often opinion-based and sub-optimal with limited information at hand. Therefore, several filtering-based approaches have been proposed to mine library migrations from existing software data to leverage “the wisdom of crowd,” but they suffer from either low precision or low recall with different thresholds, which limits their usefulness in supporting migration decisions. In this paper, we present a novel approach that utilizes multiple metrics to rank and therefore recommend library migrations. Given a library to migrate, our approach first generates candidate target libraries from a large corpus of software repositories, and then ranks them by combining the following four metrics to capture different dimensions of evidence from development histories: Rule Support, Message Support, Distance Support, and API Support. We evaluate the performance of our approach with 773 migration rules (190 source libraries) that we borrow from previous work and recover from 21,358 Java GitHub projects. The experiments show that our metrics are effective to help identify real migration targets, and our approach significantly outperforms existing works, with MRR of 0.8566, top-1 precision of 0.7947, top-10 NDCG of 0.7702, and top-20 recall of 0.8939. To demonstrate the generality of our approach, we manually verify the recommendation results of 480 popular libraries not included in prior work, and we confirm 661 new migration rules from 231 of the 480 libraries with comparable performance. 
The source code, data, and supplementary materials are provided at: https://github.com/hehao98/MigrationHelper.

Understanding Source Code Comments at Large-Scale

Published in The 2019 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019

Source code comments are important for any software, but the basic patterns of writing comments across domains and programming languages remain unclear. In this paper, we take a first step toward understanding differences in commenting practices by analyzing the comment density of 150 projects in 5 different programming languages. We have found that there are noticeable differences in comment density, which may be related to the programming language used in the project and the purpose of the project.

Hao He
