NSF REU Summer Opportunity

For undergrads only

OURE Opportunity

Opportunities for Undergraduate Research Experiences

OURE Program Overview:

The OURE program was established to expand opportunities for a more active form of learning by students and to encourage interaction between undergraduate students and faculty. It has helped expand the level of research activity on campus and has demonstrated that teaching and research are compatible and mutually reinforcing. The program also aims to help recruit superior students into our graduate programs.

Students participating in the OURE program should gain a foundational understanding of how research is conducted in their disciplines and a greater understanding of the information resources available and how to use them. They should also become familiar with interpreting research outcomes and learn the fundamentals of experimental design.

For more in-depth OURE information, please visit the OURE website here.

OURE Application Process

Read Instructions Below Carefully  - Closes May 1st - Click Here For Application

  • Previous OURE participants are eligible
  • Interested students should contact an OURE Departmental Coordinator
  • Late applications will not be accepted

*Projects start at the beginning of the fall semester and end at the conclusion of the spring semester each academic year

Each participant and faculty sponsor is to:

  • Discuss their project
  • Come to an agreement
  • A cumulative grade point average of 2.50 is required to participate in this program
  • The student must be a full-time student with at least 12 credit hours per semester
  • Submit the completed application

*The faculty sponsor agrees to supervise the work and certifies the educational value of the proposed project

OURE: CS Department

Choose one of our esteemed faculty members to contact!

DR. ARIF ZAMAN

Developing Data Transfer Solutions for Next-Generation High-Performance Networks

Please send your inquiries for a project along with a resume/CV to marifuzzaman@mst.edu

Introduction: 

Scientific applications and experiments in all areas of science are becoming increasingly network intensive, as generated data needs to be moved to remote locations for processing, collaboration, and archival purposes. Despite the existence of high-speed research networks of up to 400 Gbps, users often experience poor performance, mainly because existing data transfer applications do not scale well. Beyond performance, existing transfer applications also suffer from reliability issues that can cause corrupt data to be accepted as authentic. This motivated us to study the limitations of state-of-the-art methods and develop transformative solutions. Specifically, we developed several transfer optimization algorithms [1, 2, 3] that outperform existing solutions by up to an order of magnitude. Concurrently, to complement the data transfer optimization process, we developed network measurement [4, 5] and network performance anomaly detection algorithms [6, 7]. Moreover, we identified reliability issues in existing data integrity verification methods during transfer and proposed improvements to overcome them [8].

Modeling and Optimization of High-Speed Data Transfers:

We develop performance anomaly detection and optimization solutions for data transfers in high-speed research networks. To identify performance anomalies in real time, we collect performance metrics from file systems, data transfer nodes, and network devices and process them using machine learning models to diagnose the underlying reasons for low transfer performance [6, 7]. We also develop optimization solutions to scale data transfers to high speeds. In previous studies, heuristic and supervised learning models were proposed to tackle this problem. Despite yielding better performance than state-of-the-art heuristic solutions (e.g., Globus), supervised machine learning models require extensive labeled data collection to perform well. This, however, is not feasible in all networks, especially in production environments, as it can adversely affect production workloads. Therefore, we proposed online optimization algorithms (e.g., gradient descent and Bayesian optimization) to tune transfer settings, such as the number of parallel TCP connections, in real time, thereby increasing network utilization [1]. Next, we extended this work to develop a centralized data transfer optimizer for more efficient and stable concurrency allocation [3]. However, we found that the monolithic design of existing transfer applications (including our own proposed work) requires the same level of parallelism to be used for read, write, and network operations during file transfers. This, in turn, overburdens system resources, since setting the parallelism level for the slowest component results in unnecessarily high parallelism for the other components. Using more parallelism than necessary leads to increased overhead on system resources and unfair resource allocation among competing transfers. We therefore introduced a modular file transfer architecture [2] that separates I/O and network operations for file transfers so that parallelism can be adjusted independently for each component.
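
As a rough illustration of the online tuning idea described above, the following Python sketch adjusts the number of parallel connections using a finite-difference gradient step. It is not the algorithm from [1]: the throughput model, its constants, and the names simulated_throughput and tune_concurrency are all assumptions made for this example, and a real transfer would measure noisy throughput samples rather than evaluate a formula.

```python
def simulated_throughput(concurrency: int) -> float:
    """Hypothetical throughput model in Gbps (assumed, not measured).

    Each connection adds bandwidth until the link saturates, and every
    connection also costs a small amount of contention overhead.
    """
    per_connection_gbps = 1.2   # assumed gain per connection
    link_capacity_gbps = 10.0   # assumed link capacity
    overhead_gbps = 0.05        # assumed per-connection overhead
    usable = min(per_connection_gbps * concurrency, link_capacity_gbps)
    return usable - overhead_gbps * concurrency


def tune_concurrency(measure, start=1, low=1, high=64,
                     step_scale=2.0, max_rounds=30):
    """Search for a good concurrency level with finite-difference steps.

    `measure` is any callable that returns observed throughput for a
    given concurrency level.
    """
    current = start
    for _ in range(max_rounds):
        # Estimate the local slope by probing one extra connection.
        probe = min(current + 1, high)
        gradient = measure(probe) - measure(current)

        # Scale the slope into an integer step; fall back to a unit step
        # when the slope is small but nonzero.
        step = round(step_scale * gradient)
        if step == 0 and gradient != 0:
            step = 1 if gradient > 0 else -1

        candidate = max(low, min(high, current + step))
        # Stop when the move no longer improves throughput.
        if candidate == current or measure(candidate) <= measure(current):
            break
        current = candidate
    return current


if __name__ == "__main__":
    best = tune_concurrency(simulated_throughput)
    print("chosen concurrency:", best)
    print("throughput (Gbps):", simulated_throughput(best))
```

With the assumed constants, the search climbs in steps of two and settles at 9 connections, the point where the simulated link saturates and extra connections only add overhead. Because no labeled training data is needed, this style of search can run during the transfer itself, which is the property the paragraph above highlights.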

Ongoing Research

Our previously proposed solutions used black-box optimization at the application level, where the optimizer operates without access to or assumptions about system utilization statistics. While this approach offers high portability, allowing deployment on any system regardless of administrative privileges, it also requires continuous optimization of application parameters. We are now designing an optimization algorithm that incorporates real-time performance metrics from end hosts and networks. Lightweight monitoring agents will gather a wide range of system metrics, such as I/O and network throughput and availability, background traffic, I/O contention, and CPU and memory usage. These metrics will then be used to estimate the transfer parallelism that best balances performance and overhead. This data-driven approach aims to skip the time-consuming search for the optimal concurrency level by estimating it from the availability of system resources. Collecting real-time I/O and network usage metrics presents significant research and development challenges, yet it holds the promise of substantially accelerating the optimization process, which in turn would increase network utilization.
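
To make the metric-driven idea concrete, here is a minimal Python sketch that estimates a parallelism level per component directly from collected metrics instead of searching for it. This is only one possible reading of the approach, not the algorithm under development: the ComponentMetrics fields, the estimate_parallelism function, and the example numbers are all hypothetical, and the estimate assumes throughput scales linearly with thread count.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class ComponentMetrics:
    """Hypothetical metrics a monitoring agent might report for one component."""
    available_gbps: float    # capacity currently free for this transfer
    per_thread_gbps: float   # throughput a single thread achieves


def estimate_parallelism(metrics: Dict[str, ComponentMetrics]) -> Dict[str, int]:
    """Estimate the thread count each component needs to reach the bottleneck rate.

    The achievable end-to-end rate is limited by the component with the
    least available capacity, so no component needs more threads than it
    takes to sustain that rate.
    """
    target_gbps = min(m.available_gbps for m in metrics.values())
    return {
        name: max(1, math.ceil(target_gbps / m.per_thread_gbps))
        for name, m in metrics.items()
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    observed = {
        "read": ComponentMetrics(available_gbps=20.0, per_thread_gbps=4.0),
        "network": ComponentMetrics(available_gbps=10.0, per_thread_gbps=1.0),
        "write": ComponentMetrics(available_gbps=8.0, per_thread_gbps=2.0),
    }
    print(estimate_parallelism(observed))
```

In this made-up scenario the write side is the bottleneck at 8 Gbps, so the sketch assigns 2 read threads, 8 network connections, and 4 write threads rather than a single shared level for all three.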

[1] Md Arifuzzaman and Engin Arslan. Online optimization of file transfers in high-speed networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–13, 2021.
[2] Md Arifuzzaman and Engin Arslan. Use only what you need: Judicious parallelism for file transfers in high performance networks. In Proceedings of the 37th ACM International Conference on Supercomputing, 2023.
[3] Md Arifuzzaman, Brian Bockelman, James Basney, and Engin Arslan. Falcon: Fair and efficient online file transfer optimization. IEEE Transactions on Parallel and Distributed Systems, 2023.
[4] Md Arifuzzaman and Engin Arslan. Swift and accurate end-to-end throughput measurements for high speed networks. In The Network Traffic Measurement and Analysis Conference, 2022.
[5] Hemanta Sapkota, Md Arifuzzaman, and Engin Arslan. Sample transfer optimization with adaptive deep neural network. In 2019 IEEE/ACM Innovating the Network for Data-Intensive Science (INDIS), pages 69–76. IEEE, 2019.
[6] Md Arifuzzaman, Shafkat Islam, and Engin Arslan. Towards generalizable network anomaly detection models. In 2021 IEEE 46th Conference on Local Computer Networks (LCN), pages 375–378, 2021.
[7] Md Arifuzzaman and Engin Arslan. Learning transfers via transfer learning. In 2021 IEEE Workshop on Innovating the Network for Data-Intensive Science (INDIS), pages 34–43, 2021.
[8] Md Arifuzzaman, Masudul Bhuiyan, Mehmet Gumus, and Engin Arslan. Be smart, save I/O: Probabilistic approach to avoid uncorrectable errors in storage systems. In IEEE International Conference on Cluster Computing, 2022.

DR. SUMAN MAITY

Please send inquiries for these projects and a resume/CV to smaity@mst.edu 

Ideal candidates are proficient in Python/R, have an interest in data science, and are eager to learn new skills.
You will gain experience with methods in data science and meta-science, help advance our understanding of these areas, and have the opportunity to contribute to academic publications and open-source software.

Project Option 1: Understanding Biases in Large Language Models (LLMs)

Description-

Large Language Models (LLMs) such as GPT-3.5 and GPT-4 have demonstrated remarkable capabilities in natural language processing tasks and have gained significant attention in recent times. For instance, ChatGPT amassed over 100 million monthly active users within a mere two months of its launch, making it one of the fastest-growing consumer internet applications in history. While these formidable models hold great promise in enhancing productivity and stimulating creativity, they are not without their pitfalls. One glaring concern is their propensity to manifest biases ingrained in the training data, potentially resulting in harmful and inequitable outputs. We plan to conduct an in-depth analysis of the biases inherent in LLMs. This involves identifying and categorizing biases, understanding their origins, and assessing their impact on various demographic groups. It is important to note that some prior research has already addressed biases in LLMs along various dimensions. This project will conduct a comprehensive survey of the existing literature and provide a comparative analysis of the biases among various publicly accessible models (ChatGPT, Bard, etc.). Additionally, we plan to discern how the temporal evolution of these models contributes to the mitigation or exacerbation of these biases.

Project Option 2: Building a Comprehensive Dataset of Mentor-Mentee Relationships and Career Trajectories of Computer Scientists

Description-

Mentorship plays a vital role in the realm of scientific pursuits, influencing critical aspects such as the selection of research topics, career trajectory decisions, and the overall achievements of both mentees and mentors. Conventionally, scholars investigating mentorship lean on datasets derived from article co-authorship and doctoral dissertations. However, existing datasets of this nature tend to focus narrowly on specific fields, omitting interactions in the early stages of careers and those unrelated to formal publications. In this project, we want to develop a comprehensive mentorship dataset tailored to the domain of computer science. To achieve this, we plan to use diverse, publicly accessible datasets, including the Microsoft Academic Graph (now succeeded by OpenAlex) for publication records, ProQuest for doctoral dissertations, and the Academic Family Tree. Further, we want to curate publicly available information from various sources (CVs, LinkedIn profiles, etc.). To enhance the richness of our dataset, we plan to incorporate semantic representations of research by leveraging state-of-the-art representations provided by LLMs. Given the increasing significance of gender and race dimensions in scrutinizing disparities within the scientific realm, we also aim to provide estimates concerning these aspects. Our strategy encompasses validating the accuracy of matching profiles to publications, ensuring the fidelity of semantic content representation, and confirming the precision of demographic inferences. In essence, we are driven by the curiosity to unravel the intricate role that mentorship plays in shaping the trajectories of scientists' careers.

Project Option 3: Understanding Citation Imbalances and Gendered Citation Practices

Description-

Science has long experienced vast gender imbalances in academic participation. Such inequalities have also been found in compensation, grant funding, hiring and promotions, authorship, and citations. Despite recent progress in these areas, disparities in scholarship engagement may result in long-term inequities in other areas. This imbalance can be attributed to the ‘Matilda effect’, in which men’s contributions are seen as more central and valued, whereas women's contributions are underappreciated and under-discussed. The study of citation dynamics is an important endeavor for understanding and addressing biases in science because of the potential downstream effects of inequitable engagement with women-led and men-led work. In this project, we are interested in studying how citation imbalances might be amplified or reduced by online visibility. We plan to leverage the Altmetric dataset to measure online visibility. Our long-term goal is to understand the role that social media (Twitter) plays in citation dynamics.
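
One simple way to quantify a citation imbalance is to compare each group's observed share of citations with the share expected under some baseline. The Python sketch below illustrates that calculation only: the function name relative_citation_gap, the group labels, the choice of baseline, and all numbers are hypothetical and are not taken from the project or from the Altmetric dataset.

```python
from typing import Dict


def relative_citation_gap(observed_counts: Dict[str, int],
                          expected_shares: Dict[str, float]) -> Dict[str, float]:
    """Return (observed share - expected share) / expected share per group.

    A negative value means the group is cited less often than the
    baseline predicts; a positive value means it is cited more often.
    """
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("no citations observed")
    gaps = {}
    for group, expected in expected_shares.items():
        observed_share = observed_counts.get(group, 0) / total
        gaps[group] = (observed_share - expected) / expected
    return gaps


if __name__ == "__main__":
    # Made-up baseline: share of papers in the field led by each group.
    baseline = {"women_led": 0.4, "men_led": 0.6}

    # Made-up citation counts, split by the online visibility of the cited paper.
    low_visibility = {"women_led": 30, "men_led": 70}
    high_visibility = {"women_led": 38, "men_led": 62}

    for label, counts in [("low", low_visibility), ("high", high_visibility)]:
        gaps = relative_citation_gap(counts, baseline)
        print(label, {group: round(gap, 3) for group, gap in gaps.items()})
```

Comparing the gaps between the low-visibility and high-visibility groups is one way to ask whether online attention narrows or widens an imbalance, though any real analysis would need a carefully justified baseline and statistical testing.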

Suman Kalyan Maity joined the Department of Computer Science at Missouri University of Science and Technology as an Assistant Professor in Fall 2023. Prior to this, he was a postdoctoral research associate at MIT Brain and Cognitive Sciences (BCS) and the MIT Center for Research on Equitable and Open Scholarship (CREOS), hosted by Prof. Roger Levy. He received his PhD in Computer Science and Engineering from the Indian Institute of Technology Kharagpur. He was also a recipient of the IBM PhD Fellowship and the Microsoft Research India PhD Fellowship Award. His research interests lie in the interdisciplinary area of Social Data Science, where he investigates the mechanisms and dynamics of complex social systems. His research on social media focuses on understanding how various sociolinguistic phenomena emerge and are adopted, used, and misused. He has contributed to the detection of hate content and misinformation and to the adoption of mitigation strategies to stop the spread of negativity on social media platforms. He is also interested in the broad area of Science of Science, where he studies the global landscape of scientific research funding, publications, and their interplay; risk-taking behavior; the impact of success or failure on professional careers; and various aspects of Open Science Communication, including citation bias and inclusivity and open peer review.