Posts by Collection

publications

Understanding Source Code Comments at Large-Scale

Published in The 2019 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019

Source code comments are important for any software, but the basic patterns of writing comments across domains and programming languages remain unclear. In this paper, we take a first step toward understanding differences in commenting practices by analyzing the comment density of 150 projects in 5 different programming languages. We have found that there are noticeable differences in comment density, which may be related to the programming language used in the project and the purpose of the project.

A Multi-Metric Ranking Approach for Library Migration Recommendations

Published in The 2021 IEEE 28th International Conference on Software Analysis, Evolution and Reengineering, 2021

The wide adoption of third-party libraries in software projects is beneficial but also risky. An already-adopted third-party library may be abandoned by its maintainers, may have license incompatibilities, or may no longer align with current project requirements. Under such circumstances, developers need to migrate the library to another library with similar functionalities, but the migration decisions are often opinion-based and sub-optimal with limited information at hand. Therefore, several filtering-based approaches have been proposed to mine library migrations from existing software data to leverage “the wisdom of crowd,” but they suffer from either low precision or low recall with different thresholds, which limits their usefulness in supporting migration decisions. In this paper, we present a novel approach that utilizes multiple metrics to rank and therefore recommend library migrations. Given a library to migrate, our approach first generates candidate target libraries from a large corpus of software repositories, and then ranks them by combining the following four metrics to capture different dimensions of evidence from development histories: Rule Support, Message Support, Distance Support, and API Support. We evaluate the performance of our approach with 773 migration rules (190 source libraries) that we borrow from previous work and recover from 21,358 Java GitHub projects. The experiments show that our metrics are effective to help identify real migration targets, and our approach significantly outperforms existing works, with MRR of 0.8566, top-1 precision of 0.7947, top-10 NDCG of 0.7702, and top-20 recall of 0.8939. To demonstrate the generality of our approach, we manually verify the recommendation results of 480 popular libraries not included in prior work, and we confirm 661 new migration rules from 231 of the 480 libraries with comparable performance. The source code, data, and supplementary materials are provided at: https://github.com/hehao98/MigrationHelper.

MigrationAdvisor: Recommending Library Migrations from Large-Scale Open-Source Data

Published in The 2021 IEEE/ACM 43rd International Conference on Software Engineering, 2021

During software maintenance, developers may need to migrate an already in-use library to another library with similar functionalities. However, it is difficult to make the optimal migration decision with limited information, knowledge, or expertise. In this paper, we present MigrationAdvisor, an evidence-based tool to recommend library migration targets through intelligent analysis upon a large number of GitHub repositories and Java libraries. The migration advisories are provided through a search engine style web service where developers can seek migration suggestions for a specific library. We conduct systematic evaluations on the correctness of results, and evaluate the usefulness of the tool by collecting usage feedback from industry developers. Video: https://youtu.be/4I75W22TqwQ.

A Large-Scale Empirical Study on Java Library Migrations: Prevalence, Trends, and Rationales

Published in The 2021 ACM 29th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021

With the rise of open-source software and package hosting platforms, reusing 3rd-party libraries has become a common practice. Due to risks including security vulnerabilities, lack of maintenance, unexpected failures, and license issues, a project may completely remove a used library and replace it with another library, which we call library migration. Despite substantial research on dependency management, the understanding of how and why library migrations occur is still lacking. Achieving this understanding may help practitioners optimize their library selection criteria, develop automated approaches to monitor dependencies, and provide migration suggestions for their libraries or software projects. In this paper, through a fine-grained commit-level analysis of 19,652 Java GitHub projects, we extract the largest migration dataset to-date (1,194 migration rules, 3,163 migration commits). We show that 8,065 projects having at least one library removal and 1,564 (lower-bound) to 5,004 (upper-bound) projects have at least one migration, indicating the prevalence of library migrations. We find that projects with library removals have one removal per 139 commits, and projects with migrations have 2 to 4 migrations in median. We discover that library migrations are dominated by several domains presenting a long tail distribution. Also, migrations are highly unidirectional in that libraries are either mostly abandoned or mostly chosen in our project corpus. A thematic analysis on related commit messages, issues, and pull requests identifies 14 frequently mentioned migration reasons, 7 of which are not discussed in previous work. Our findings can be operationalized into actionable insights for package hosting platforms, project maintainers, and library developers.

Commercial Participation in OpenStack: Two Sides of a Coin

Published in Computer, Volume 55, Issue 2, 2022

This article provides a landscape of commercial participation in OpenStack, a large-scale open source software (OSS) ecosystem. We discuss how to achieve a balance between maximizing business profit and ensuring the long-term sustainability of OSS ecosystems.

Demystifying Software Release Note Issues on GitHub

Published in The 2022 IEEE/ACM 30th International Conference on Program Comprehension, 2022

Release notes (RNs) summarize main changes between two consecutive software versions and serve as a central source of information when users upgrade software. While producing high quality RNs can be hard and poses a variety of challenges to developers, a comprehensive empirical understanding on these challenges is still lacking. In this paper, we bridge this knowledge gap by manually analyzing 1,731 latest GitHub issues to build a comprehensive taxonomy of RN issues with four dimensions: Content, Presentation, Accessibility, and Production. Among these issues, nearly half (48.47%) of them focus on Production; Content, Accessibility, and Presentation take 25.61%, 17.65%, and 8.27%, respectively. We find that: 1) RN producers are more likely to miss information than to include incorrect information, especially for breaking changes; 2) improper layout may bury important information leading to user confusion; 3) many users find RNs inaccessible due to link deterioration, lack of notification, and obfuscate RN locations; 4) automating and regulating RN production remain challenging despite the great needs of RN producers. Our taxonomy can serve as a roadmap to improve RN production in practice and also reveal interesting future research directions.

Recommending Good First Issues in GitHub OSS Projects

Published in The 2022 IEEE/ACM 44th International Conference on Software Engineering, 2022

Attracting and retaining newcomers is vital for the sustainability of an open-source software project. However, it is difficult for newcomers to locate suitable development tasks, while existing “Good First Issues” (GFI) on GitHub are often insufficient and inappropriate. In this paper, we propose RecGFI, an effective practical approach for the recommendation of good first issues to newcomers, which can be used to relieve maintainer burden and help newcomers onboard. RecGFI models an issue with features from multiple dimensions (content, background, and dynamics) and uses an XGBoost classifier to generate its probability of being a GFI. To evaluate RecGFI, we collect 53,510 resolved issues among 100 GitHub projects and carefully restore their historical states to build ground truth datasets. Our evaluation shows that RecGFI can achieve up to 0.853 AUC in the ground truth dataset and outperforms alternative models. Our interpretable analysis of the trained model further reveals interesting observations about GFI characteristics. Finally, we report latest open issues (without GFI-signaling labels but recommended as GFI by our approach) to project maintainers among which 16 are confirmed as real GFIs. Among the 16 confirmed GFIs, two issues have attracted newcomer attention and one has already been resolved by a newcomer.

GFI-Bot: Automated Good First Issue Recommendation on GitHub

Published in The 2022 ACM 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022

To facilitate newcomer onboarding, GitHub recommends the use of “good first issue” (GFI) labels to signal issues suitable for newcomers to resolve. However, previous research shows that manually labeled GFIs are scarce and inappropriate, calling the need for automated recommendations. In this paper, we present GFI-Bot (accessible at https://gfibot.io), a proof-of-concept machine learning powered bot for automated GFI recommendation in practice. Project maintainers can configure GFI-Bot to discover and label possible GFIs so that newcomers can easily locate issues for making their first contributions. GFI-Bot also provides a high-quality, up-to-date dataset for advancing GFI recommendation research.

Self-Admitted Library Migrations in Java, JavaScript, and Python Packaging Ecosystems: A Comparative Study

Published in The 2023 IEEE 30th International Conference on Software Analysis, Evolution and Reengineering, 2023

Reusing open-source software libraries has become the norm in modern software development, but libraries can fail due to various reasons, e.g., security vulnerabilities, lacking features, and end of maintenance. In some cases, developers need to replace a library with another competent library with similar functionalities, i.e., library migration. Previous studies have leveraged library migrations as a unique lens of observation to reveal insights into library selection and dependency management in general. However, they are heavily biased toward Java while the generalizability of their findings remains unknown. In this paper, we present a comparative study on self-admitted library migrations (SALMs) from three packaging ecosystems: Java/Maven, JavaScript/npm, and Python/PyPI. For this study, we design a set of semi-automatic methods that accurately locate SALMs, their domains, and their rationales from git repositories. We reveal that SALMs are prevalent and highly unidirectional in all three ecosystems, and the underlying rationales can be well covered by a previous theoretical framework. Also, SALMs in these ecosystems present domain similarity (testing frameworks, web frameworks, HTTP clients, and serialization). However, we observe differences in the longitudinal trends, the distributions of rationales, the ecosystem-specific domains, and the levels of unidirectionality, all of which indicate that Python/PyPI sees increasingly intense competition between libraries and deserves more research on library recommendation and migration

Suboptimal Comments in Java Projects: From Independent Comment Changes to Commenting Practices

Published in ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 2, 2023

High-quality source code comments are valuable for software development and maintenance, however, code often contains low-quality comments or lacks them altogether. We name such source code comments as suboptimal comments. Such suboptimal comments create challenges in code comprehension and maintenance. Despite substantial research on low-quality source code comments, empirical knowledge about commenting practices that produce suboptimal comments and reasons that lead to suboptimal comments are lacking. We help bridge this knowledge gap by investigating (1) independent comment changes (ICCs)—comment changes committed independently of code changes—which likely address suboptimal comments, (2) commenting guidelines, and (3) comment-checking tools and comment-generating tools, which are often employed to help commenting practice—especially to prevent suboptimal comments. We collect 24M+ comment changes from 4,392 open-source GitHub Java repositories and find that ICCs widely exist. The ICC ratio—proportion of ICCs among all comment changes—is ~15.5%, with 98.7% of the repositories having ICC. Our thematic analysis of 3,533 randomly sampled ICCs provides a three-dimensional taxonomy for what is changed (four comment categories and 13 subcategories), how it changed (six commenting activity categories), and what factors are associated with the change (three factors). We investigate 600 repositories to understand the prevalence, content, impact, and violations of commenting guidelines. We find that only 15.5% of the 600 sampled repositories have any commenting guidelines. We provide the first taxonomy for elements in commenting guidelines: where and what to comment are particularly important. The repositories without such guidelines have a statistically significantly higher ICC ratio, indicating the negative impact of the lack of commenting guidelines. However, commenting guidelines are not strictly followed: 85.5% of checked repositories have violations. We also systematically study how developers use two kinds of tools, comment-checking tools and comment-generating tools, in the 4,392 repositories. We find that the use of Javadoc tool is negatively correlated with the ICC ratio, while the use of Checkstyle has no statistically significant correlation; the use of comment-generating tools leads to a higher ICC ratio. To conclude, we reveal issues and challenges in current commenting practice, which help understand how suboptimal comments are introduced. We propose potential research directions on comment location prediction, comment generation, and comment quality assessment; suggest how developers can formulate commenting guidelines and enforce rules with tools; and recommend how to enhance current comment-checking and comment-generating tools.

Automating Dependency Updates in Practice: An Exploratory Study on GitHub Dependabot

Published in IEEE Transactions on Software Engineering, 2023

Dependency update bots automatically open pull requests to update software dependencies on behalf of developers. Early research shows that developers are suspicious of updates performed by bots and feel tired of overwhelming notifications from these bots. Despite this, dependency update bots are becoming increasingly popular. Such contrast motivates us to investigate Dependabot, currently the most visible bot in GitHub, to reveal the effectiveness and limitations of the state-of-art dependency update bots. We use exploratory data analysis and developer survey to evaluate the effectiveness of Dependabot in keeping dependencies up-to-date, reducing update suspicion, and reducing notification fatigue. We obtain mixed findings. On the positive side, Dependabot is effective in reducing technical lag and developers are highly receptive to its pull requests. On the negative side, its compatibility scores are too scarce to be effective in reducing update suspicion; developers tend to configure Dependabot toward reducing the number of notifications; and 11.3% of projects have deprecated Dependabot in favor of other alternatives. Our findings reveal a large room for improvement in dependency update bots which calls for effort from both bot designers and software engineering researchers.

Open Source Software Onboarding as a University Course: An Experience Report

Published in The 2023 IEEE/ACM 45th International Conference on Software Engineering, 2023

Without newcomers, open source software (OSS) projects are hardly sustainable. Yet, newcomers face a steep learning curve during OSS onboarding in which they must overcome a multitude of technical, social, and knowledge barriers. To ease the onboarding process, OSS communities are utilizing mentoring, task recommendation (e.g., “good first issues”), and engagement programs (e.g., Google Summer of Code). However, newcomers must first cultivate their motivation for OSS contribution and learn the necessary preliminaries before they can take advantage of these mechanisms. We believe this gap can be filled by a dedicated, practice-oriented OSS onboarding course. In this paper, we present our experience of teaching an OSS onboarding course at Peking University. The course contains a series of lectures, labs, and invited talks to prepare students with the required skills and motivate them to contribute to OSS. In addition, students are required to complete a semester-long course project in which they plan and make actual contributions to OSS projects. They can either 1) contribute to one of the given OSS projects with dedicated mentoring from the course, or 2) contribute to any OSS project they prefer without such mentoring. Finally, 16 out of 19 students have successfully contributed to open source and five retained. However, the onboarding trajectories and outcomes differ vastly between the two groups of students with different course project choices, yielding lessons for software engineering education.

Understanding and Remediating Open-Source License Incompatibilities in the PyPI Ecosystem

Published in The 38th IEEE/ACM International Conference on Automated Software Engineering , 2023

The reuse and distribution of open-source software must be in compliance with its accompanying open-source license. In modern packaging ecosystems, maintaining such compliance is challenging because a package may have a complex multi-layered dependency graph with many packages, any of which may have an incompatible license. Although prior research finds that license incompatibilities are prevalent, empirical evidence is still scarce in some modern packaging ecosystems (e.g., PyPI). It also remains unclear how developers remediate the license incompatibilities in the dependency graphs of their packages (including direct and transitive dependencies), let alone any automated approaches. To bridge this gap, we conduct a large-scale empirical study of license incompatibilities and their remediation practices in the PyPI ecosystem. We find that 7.27% of the PyPI package releases have license incompatibilities and 61.3% of them are caused by transitive dependencies, causing challenges in their remediation; for remediation, developers can apply one of the five strategies: migration, removal, pinning versions, changing their own licenses, and negotiation. Inspired by our findings, we propose SILENCE, an SMT-solver-based approach to recommend license incompatibility remediations with minimal costs in package dependency graph. Our evaluation shows that the remediations proposed by SILENCE can match 19 historical real-world cases (except for migrations not covered by an existing knowledge base) and have been accepted by five popular PyPI packages whose developers were previously unaware of their license incompatibilities.

Personalized First Issue Recommender for Newcomers in Open Source Projects

Published in The 38th IEEE/ACM International Conference on Automated Software Engineering , 2023

Many open source projects provide good first issues (GFIs) to attract and retain newcomers. Although several automated GFI recommenders have been proposed, existing recommenders are limited to recommending generic GFIs without considering differences between individual newcomers. However, we observe mismatches between generic GFIs and the diverse background of newcomers, resulting in failed attempts, discouraged onboarding, and delayed issue resolution. To address this problem, we assume that personalized first issues (PFIs) for newcomers could help reduce the mismatches. To justify the assumption, we empirically analyze 37 newcomers and their first issues resolved across multiple projects. We find that the first issues resolved by the same newcomer share similarities in task type, programming language, and project domain. These findings underscore the need for a PFI recommender to improve over state-of-the-art approaches. For that purpose, we identify features that influence newcomers’ personalized selection of first issues by analyzing the relationship between possible features of the newcomers and the characteristics of the newcomers’ chosen first issues. We find that the expertise preference, OSS experience, activeness, and sentiment of newcomers drive their personalized choice of the first issues. Based on these findings, we propose a Personalized First Issue Recommender (PFIRec), which employs LamdaMART to rank candidate issues for a given newcomer by leveraging the identified influential features. We evaluate PFIRec using a dataset of 68,858 issues from 100 GitHub projects. The evaluation results show that PFIRec outperforms existing first issue recommenders, potentially doubling the probability that the top recommended issue is suitable for a specific newcomer and reducing one-third of a newcomer’s unsuccessful attempts to identify suitable first issues, in the median.

How Early Participation Determines Long-Term Sustained Activity in GitHub Projects?

Published in The 2023 ACM 31th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023

Although the open source model bears many advantages in software development, open source projects are always hard to sustain. Previous research on open source sustainability mainly focuses on projects that have already reached a certain level of maturity (e.g., with communities, releases, and downstream projects). However, limited attention is paid to the development of (sustainable) open source projects in their infancy, and we believe an understanding of early sustainability determinants is crucial for project initiators, incubators, newcomers, and users. In this paper, we aim to explore the relationship between early participation factors and long-term project sustainability. We leverage a novel methodology that measures the early participation of 290,255 GitHub projects during the first three months with reference to the Blumberg model, trains an XGBoost model to predict project’s two-year sustained activity, and interprets the trained model using LIME. We quantitatively show that early participants have a positive effect on project’s future sustained activity if they have prior experience in OSS project incubation and demonstrate concentrated focus and steady commitment. Participation from non-code contributors and detailed contribution documentation also promote project’s sustained activity. Compared with individual projects, building a community that consists of more experienced core developers and more active peripheral developers is important for organizational projects. This study provides unique insights into the incubation and recognition of sustainable open source projects, and our interpretable prediction approach can also offer guidance to open source project initiators and newcomers.

A Comprehensive Analysis of Challenges and Strategies for Software Release Notes on GitHub

Published in Empirical Software Engineering, Volume 29, Article Number 104, 2024

Release notes (RNs) refer to the technical documentation that offers users, developers, and other stakeholders comprehensive information about the changes and updates of a new software version. Producing high-quality RNs can be challenging, and it remains unknown what issues developers commonly encounter and what effective strategies can be adopted to mitigate them. To bridge this knowledge gap, we conduct a manual analysis of 1,529 latest RN-related issues in the GitHub issue tracker by using multiple rounds of open coding and construct 1) a comprehensive taxonomy of RN-related issues with four dimensions validated through three semi-structured interviews; 2) an effective framework with eight categories of strategies to overcome these challenges. The four dimensions of RN-related issues revealed by the taxonomy and the corresponding strategies from the framework include: 1) Content (419, 25.47%): RN producers tend to overlook information rather than include inaccurate details, especially for breaking changes. To address this, effective completeness validations are recommended, such as managing Pull Requests, issues, and commits related to RNs; 2) Presentation (150, 9.12%): inadequate layout may bury important information and lead to end users’ confusion, which can be mitigated by employing a hierarchical structure, standardized format, rendering RNs, and folding techniques; 3) Accessibility (303, 18.42%): many users find RNs inaccessible due to link deterioration, insufficient notification, and obfuscated RN locations. This can be alleviated by adopting appropriate locations and channels (such as project websites) and standardizing link management.; 4) Production (773, 46.99%): despite the high demand from RN producers, automating and standardizing the RN production process remains challenging. Developers resolve this problem by using some mature tools on GitHub (like Release Drafter). Additionally, offering guidance, clarifying responsibilities, and distributing workloads are effective in improving collaboration within the team. Mechanisms for distributing and verifying RNs are also selected to enhance synchronization management. Our taxonomy provides a comprehensive blueprint to improve RN production in practice and also reveals interesting future research directions.

LiCoEval: Evaluating LLMs on License Compliance in Code Generation

Published in The 2025 47th IEEE/ACM International Conference on Software Engineering, 2025

Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for striking similarity that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval, to evaluate the license compliance capabilities of LLMs, i.e., the ability to provide accurate license or copyright information when they generate code with striking similarity to already existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundWation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

Pinning Is Futile: You Need More Than Local Dependency Versioning to Defend Against Supply Chain Attacks

Published in The 2025 ACM International Conference on the Foundations of Software Engineering, 2025

Recent high-profile incidents in open-source software have greatly raised practitioner attention on software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependency to specific versions rather than floating in version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely.

The Structure of Cross-National Collaboration in Open-Source Software Development

Published in The 2025 34th ACM International Conference on Information and Knowledge Management , 2025

Open-source software (OSS) development platforms, such as GitHub, expand the potential for cross-national collaboration among developers by lowering the geographic, temporal, and coordination barriers that limited software innovation in the past. However, as shown in previous research, these technological affordances that lower barriers for cross-national collaboration do not uniformly benefit all countries. Using the GitHub Innovation Graph dataset, which aggregates the complete cross-country collaborations among the entire population of GitHub developers, we show the overwhelming dominance of the U.S. in global OSS collaboration and, in stark contrast, the paucity of collaborations among the majority of non-Western country pairs in the data. However, by removing the U.S. and its ties from the collaboration network, we find quantitative evidence of deep-seated religious and cultural affinities, shared colonial histories, and geopolitical factors structuring the collaborations between non-U.S. country pairs. This study highlights the opportunities to develop decentralizing strategies to facilitate new collaborations between developers in non-U.S. countries, thereby fostering the development of novel, innovative solutions. More generally, this study also underscores the importance of contextualizing user behavior and knowledge management in information systems with long-term, macro-social conditions in which these systems are inextricably embedded.

Designing Abandabot: When Does Open Source Dependency Abandonment Matter?

Published in The 2026 48th IEEE/ACM International Conference on Software Engineering, 2026

Despite the inevitable risk that depending on abandoned open source dependencies poses, many developers feel a lack of resources and guidance on how to deal with this. Automated detection of abandonment is feasible, but not all abandoned dependencies impact a downstream project equally. In this paper, we perform a need-finding interview study with 22 open source maintainers to explore what makes the abandonment of certain dependencies impactful to their project, as well as their information needs and design requirements for such an automated notification tool. We find four main factors, the depth of integration, the availability of alternatives, the importance of the functionality, and external environmental pressures. Using this emerging theory, we then build an LLM-based classifier to predict the impact of a dependency’s abandonment in a given context, and evaluate it with an independent user study with 124 open source maintainers. Our results show that the classifier is effective at predicting whether a dependency’s abandonment would be impactful to a project, and that theory-based explanations given by the LLM are useful to developers when making judgments about the potential impactfulness of a given dependency’s abandonment.

Six Million (Suspected) Fake Stars in GitHub: A Growing Spiral of Popularity Contests, Spams, and Malware

Published in The 2026 48th IEEE/ACM International Conference on Software Engineering, 2026

GitHub, the de facto platform for open-source software development, provides a set of social-media-like features to signal high-quality repositories.b Among them, the star count is the most widely used popularity signal, but it is also at risk of being artificially inflated (i.e., mph{faked}), decreasing its value as a decision-making signal and posing a security risk to all GitHub users. In this paper, we present a systematic, global, and longitudinal measurement study of fake stars in GitHub. To this end, we build StarScout, a scalable tool able to detect anomalous starring behaviors across the entire GitHub metadata in the last five years. Analyzing the data collected using StarScout, we find that: (1) fake-star-related activities have rapidly surged since 2024; (2) the accounts and repositories in fake star campaigns have highly trivial activity patterns; (3) the majority of fake stars are used to promote short-lived phishing malware repositories; the remaining ones are mostly used to promote AI/LLM, blockchain, tool/application, and tutorial/demo repositories; (4) while repositories may have acquired fake stars for growth hacking, fake stars only have a promotion effect in the short term (i.e., less than two months) and become a liability in the long term. Our study has implications for platform moderators, open-source practitioners, and supply chain security researchers.

Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects

Published in The 2026 23rd International Conference on Mining Software Repositories, 2026

Large language models (LLMs) have demonstrated the promise to revolutionize the field of software engineering. Among other things, LLM agents are rapidly gaining momentum in software development, with practitioners reporting a multifold increase in productivity after adoption. Yet, empirical evidence is lacking around these claims. In this paper, we estimate the causal effect of adopting a widely popular LLM agent assistant, namely Cursor, on development velocity and software quality. The estimation is enabled by a state-of-the-art difference-in-differences design comparing Cursor-adopting GitHub projects with a matched control group of similar GitHub projects that do not use Cursor. We find that the adoption of Cursor leads to a statistically significant, large, but transient increase in project-level development velocity, along with a substantial and persistent increase in static analysis warnings and code complexity. Further panel generalized-method-of-moments estimation reveals that increases in static analysis warnings and code complexity are major factors driving long-term velocity slowdown. Our study identifies quality assurance as a major bottleneck for early Cursor adopters and calls for it to be a first-class citizen in the design of agentic AI coding tools and AI-driven workflows.

talks

开源软件供应链管理

Published: July 21, 2022

teaching

Introduction to Computer Systems, Teaching Assistant, Fall 2018

Undergraduate Course, School of EECS, Peking University, 2018

Introducton to Computer Systems is an undergraduate course at Peking University. This course originates from the famous CMU 15-213 course. It includes a wide range of selected topics from system programming, computer organization, operating systems and networks. Up to 400 perspective students in computer science will take this course each year.

Introduction to Computation (C), Teaching Assistant, Fall 2020

Undergraduate Course, School of EECS, Peking University, 2020

Introducton to Computation (C) is an undergraduate course at Peking University. It is an introductory course to programming for students majoring in literal arts (literature, foreign language, history, etc).

Open Source Software Development and Technology, Teaching Assistant, Fall 2022

Undergraduate Course, School of EECS, Peking University, 2022

I am responsible for the design of all six labs and the course project and I am proud of my design! See our course GitHub repository for more details.

Hao He

Posts by Collection

publications

talks

teaching