For undergrads only
Opportunities for Undergraduate Research Experiences
The OURE program was established to expand opportunities for a more active form of learning and to encourage interaction between undergraduate students and faculty. It has helped expand the level of research activity on campus and has demonstrated that teaching and research are compatible and mutually reinforcing. The program is also intended to help recruit superior students into our graduate programs.
Students participating in the OURE program should gain a foundational understanding of how research is conducted in their disciplines, as well as a greater understanding of the information resources available and how to use them. They should also learn how to interpret research outcomes and the fundamentals of experimental design.
For more in-depth OURE information, please visit the OURE website here.
Read Instructions Below Carefully - Closes May 1st - Click Here For Application
*Projects start at the beginning of Fall Semester and end at the conclusion of Spring semester each Academic Year
Each participant and faculty sponsor is to:
*The faculty sponsor agrees to supervise the work and certifies the educational value of the proposed project
Choose one of our esteemed faculty members below and get in contact with them!
DR. ARIF ZAMAN
Developing Data Transfer Solutions for Next-Generation High-Performance Networks
Please send your inquiries for a project along with a resume/CV to marifuzzaman@mst.edu
Introduction:
Scientific applications and experiments in all areas of science are becoming increasingly network intensive, as generated data needs to be moved to remote locations for processing, collaboration, and archival purposes. Despite the existence of high-speed research networks of up to 400 Gbps, users often experience performance issues, mainly due to the limited scalability of existing data transfer applications. Besides performance, existing transfer applications also suffer from reliability issues that can cause corrupt data to be accepted as authentic. This motivated us to study the limitations of state-of-the-art methods and develop transformative solutions. Specifically, we developed several transfer optimization algorithms [1, 2, 3] that outperform existing solutions by up to an order of magnitude. Concurrently, to complement the data transfer optimization process, we developed network measurement [4, 5] and network performance anomaly detection algorithms [6, 7]. Moreover, we identified reliability issues in existing data integrity verification methods during transfer and proposed improvements to overcome them [8].
Modeling and Optimization of High-Speed Data Transfers:
We develop performance anomaly detection and optimization solutions for data transfers in high-speed research networks. To identify performance anomalies in real time, we collect performance metrics from file systems, data transfer nodes, and network devices and process them using machine learning models to diagnose the underlying reasons for low transfer performance [6, 7]. We also develop optimization solutions to scale data transfers to high speeds. In previous studies, heuristic and supervised learning models were proposed to tackle this problem. Despite yielding better performance than state-of-the-art heuristic solutions (e.g., Globus), supervised machine learning models require extensive labeled data collection to perform well. This, however, is not feasible in all networks, especially in production environments, where data collection can adversely affect the production workload. Therefore, we proposed online optimization algorithms (e.g., Gradient Descent and Bayesian optimization) to tune transfer settings, such as the number of parallel TCP connections, in real time, thereby increasing network utilization [1]. Next, we extended this work to develop a centralized data transfer optimizer for more efficient and stable concurrency allocation [3]. However, we found that the monolithic design of existing transfer applications (including our own) requires the same level of parallelism to be used for read, write, and network operations during file transfers. This, in turn, overburdens system resources, since setting the parallelism level for the slowest component results in unnecessarily high parallelism for the other components. Using more parallelism than necessary leads to increased overhead on system resources and unfair resource allocation among competing transfers. We therefore introduced a modular file transfer architecture [2] that separates I/O and network operations so that parallelism can be adjusted independently for each component.
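The online tuning idea described above can be illustrated with a minimal Python sketch. It is not the algorithm from [1]: it is a simple finite-difference hill climb on a utility function (ascent on utility, which is the same as descent on its negative). The throughput model, the per-connection penalty, and all function names are hypothetical placeholders; in a real transfer the utility would be measured over a sampling interval rather than computed from a formula.

```python
def simulated_throughput(concurrency: int) -> float:
    """Hypothetical throughput model in Gbps: each connection adds 1 Gbps
    until an assumed 10 Gbps link is saturated."""
    per_stream_gbps = 1.0
    link_capacity_gbps = 10.0
    return min(concurrency * per_stream_gbps, link_capacity_gbps)


def utility(concurrency: int, penalty: float = 0.25) -> float:
    """Throughput minus an assumed linear overhead cost per connection."""
    return simulated_throughput(concurrency) - penalty * concurrency


def tune_concurrency(start: int = 1, max_concurrency: int = 32,
                     max_steps: int = 50) -> int:
    """Move the concurrency level one step at a time in the direction
    that improves utility, and stop when neither direction helps."""
    current = max(1, min(start, max_concurrency))
    for _ in range(max_steps):
        here = utility(current)
        # Finite-difference slope estimates on each side of the current setting.
        forward = utility(current + 1) - here if current < max_concurrency else 0.0
        backward = here - utility(current - 1) if current > 1 else 0.0
        if forward > 0:
            current += 1      # adding a connection still pays off
        elif backward < 0:
            current -= 1      # the last connection costs more than it gains
        else:
            break             # local optimum reached
    return current


best = tune_concurrency()
```

Under these assumed numbers the search settles at the point where the link saturates, because any further connection only adds overhead. Penalizing each extra connection is one simple way to discourage the unnecessarily high parallelism discussed above.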
Ongoing Research
Our previously proposed solutions used black-box optimization at the application level, where the optimizer operates without access to, or assumptions about, system utilization statistics. While this approach offers high portability, allowing deployment on any system regardless of administrative privileges, it also necessitates continuous optimization of application parameters. We are now designing an optimization algorithm that incorporates real-time performance metrics from end hosts and networks. Lightweight monitoring agents will gather a wide range of system metrics, such as I/O and network throughput/availability, background traffic, I/O contention, and CPU and memory usage. These metrics will then be used to estimate the transfer parallelism that best balances performance and overhead. This data-driven approach aims to skip the time-consuming search for the optimal concurrency level by estimating it from the availability of system resources. Collecting real-time I/O and network usage metrics presents significant research and development challenges, yet it holds the promise of substantially accelerating the optimization process, which in turn would increase network utilization.
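To make the metric-driven idea concrete, here is a minimal Python sketch that estimates a parallelism level for each component directly from collected metrics instead of searching for it. The `ComponentMetrics` fields, the example numbers, and the rule itself (target the rate of the most constrained component, then give each component just enough threads to reach it) are illustrative assumptions, not the algorithm under development.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class ComponentMetrics:
    """Hypothetical metrics a monitoring agent might report for one component."""
    available_gbps: float   # capacity currently available to this transfer
    per_thread_gbps: float  # throughput a single thread or stream achieves


def estimate_parallelism(metrics: Dict[str, ComponentMetrics]) -> Dict[str, int]:
    """Estimate per-component parallelism without a search phase."""
    # The end-to-end rate cannot exceed the most constrained component.
    target_gbps = min(m.available_gbps for m in metrics.values())
    # Give each component just enough threads to sustain the target rate.
    return {
        name: max(1, math.ceil(target_gbps / m.per_thread_gbps))
        for name, m in metrics.items()
    }


# Assumed example values; real values would come from live measurements.
metrics = {
    "read": ComponentMetrics(available_gbps=16.0, per_thread_gbps=2.0),
    "network": ComponentMetrics(available_gbps=8.0, per_thread_gbps=0.5),
    "write": ComponentMetrics(available_gbps=12.0, per_thread_gbps=1.0),
}
plan = estimate_parallelism(metrics)
```

With these assumed values the network is the bottleneck at 8 Gbps, so the sketch assigns 4 read threads, 16 network streams, and 8 write threads. Each component gets only the parallelism it needs, rather than all three sharing the highest value.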
DR. SUMAN MAITY
Please send inquiries for these projects and a resume/CV to smaity@mst.edu
Ideal candidates are proficient in Python/R, have an interest in data science, and are eager to learn new skills.
You will gain experience with methods in data science and meta-science, help advance our understanding of these topics, and have the opportunity to contribute to academic publications and open-source software.
Project Option 1: Understanding Biases in Large Language Models (LLMs)
Description-
Large Language Models (LLMs) such as GPT-3.5 and GPT-4 have demonstrated remarkable capabilities in natural language processing tasks and have gained significant attention in recent times. For instance, ChatGPT amassed over 100 million monthly active users within a mere two months of its launch, making it one of the fastest-growing consumer internet applications in history. While these models hold great promise for enhancing productivity and stimulating creativity, they are not without pitfalls. One glaring concern is their propensity to manifest biases ingrained in the training data, potentially resulting in harmful and inequitable outputs. We plan to conduct an in-depth analysis of the biases inherent in LLMs. This involves identifying and categorizing biases, understanding their origins, and assessing their impact on various demographic groups. Prior research has already addressed biases in LLMs along various dimensions. This project will conduct a comprehensive survey of the existing literature and provide a comparative analysis of the biases among various publicly accessible models (ChatGPT, Bard, etc.). Additionally, we plan to examine how the temporal evolution of these models contributes to the mitigation or exacerbation of these biases.
Project Option 2: Building Comprehensive Dataset for Mentor-Mentee Relationship and Career Trajectory of Computer Scientists
Description-
Mentorship plays a vital role in science, influencing critical aspects such as the selection of research topics, career trajectory decisions, and the overall achievements of both mentees and mentors. Conventionally, scholars investigating mentorship rely on datasets derived from article co-authorship and doctoral dissertations. However, existing datasets of this nature tend to focus narrowly on specific fields, omitting interactions in the early stages of careers and those unrelated to formal publications. In this project, we want to develop a comprehensive mentorship dataset tailored to the domain of computer science. To achieve this, we plan to use diverse, publicly accessible datasets, including the Microsoft Academic Graph (now succeeded by OpenAlex) for publication records, ProQuest for doctoral dissertations, and the Academic Family Tree. Further, we want to curate publicly available information from various sources (CVs, LinkedIn profiles, etc.). To enhance the richness of our dataset, we plan to incorporate semantic representations of research by leveraging state-of-the-art representations provided by LLMs. Given the increasing significance of gender and race in examining disparities within science, we also aim to provide estimates of these attributes. Our strategy includes validating the accuracy of matching profiles to publications, ensuring the fidelity of semantic content representation, and confirming the precision of demographic inferences. In essence, we are driven by curiosity about the intricate role that mentorship plays in shaping scientists' career trajectories.
Project Option 3: Understanding Citation Imbalances and Gendered Citation Practices
Description-
Science has long experienced vast gender imbalances in academic participation. Such inequalities have also been found in compensation, grant funding, hiring and promotions, authorship, and citations. Despite recent progress in these areas, disparities in scholarly engagement may result in long-term inequities in other areas. This imbalance can be attributed to the ‘Matilda effect’, in which men’s contributions are seen as more central and valued, whereas women’s contributions are underappreciated and under-discussed. The study of citation dynamics is an important endeavor for understanding and addressing biases in science because of the potential downstream effects of inequitable engagement with women-led and men-led work. In this project, we are interested in studying how citation imbalances might be amplified or reduced by online visibility. We plan to leverage the Altmetric dataset to measure online visibility. Our long-term goal is to understand the role social media (Twitter) plays in citation dynamics.
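One simple way to quantify a citation imbalance is to compare each group's observed share of citations with the share expected from some baseline. The Python sketch below shows that calculation only. The group labels, the toy reference list, and the 50/50 expected shares are hypothetical, and the project may well adopt a different measure.

```python
from collections import Counter
from typing import Dict, List


def citation_gap(cited_labels: List[str],
                 expected_share: Dict[str, float]) -> Dict[str, float]:
    """Relative over- or under-citation per group:
    (observed share - expected share) / expected share."""
    if not cited_labels:
        raise ValueError("cited_labels must not be empty")
    counts = Counter(cited_labels)
    total = len(cited_labels)
    return {
        group: (counts[group] / total - share) / share
        for group, share in expected_share.items()
    }


# Toy data (assumed): a reference list with 6 men-led and 2 women-led papers,
# against a baseline in which both groups make up half of the citable pool.
cited = ["men-led"] * 6 + ["women-led"] * 2
expected = {"men-led": 0.5, "women-led": 0.5}
gaps = citation_gap(cited, expected)
```

Here `gaps["men-led"]` is 0.5 (cited 50% more than expected) and `gaps["women-led"]` is -0.5 (cited 50% less than expected). Running the same calculation separately on papers with high and low online visibility would be one way to see whether visibility widens or narrows the gap.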