About
Munif is a software engineer and data scientist with 6 years of
experience in building solutions for data analytics projects. His expertise
is in developing large-scale data engineering systems with cloud
orchestration, interactive visualizations in web dashboards, monitoring
and alerting systems for complex computer hardware, highly automated
data pipelines for processing hardware testing datasets, analytics
workflows for machine-generated logs, and investigation of noisy time-
series datasets from hardware telemetry. His educational training
includes a PhD in data science where he studied building complex data
products which combine effective ETL techniques with modern machine
learning and network science applications. He also has a background in
Electrical Engineering, where his areas of focus were digital sensors
and signal processing. He possesses in-depth familiarity with
data analytics and engineering tools such as Python (Pandas, NumPy,
SciPy, Scikit-learn, Plotly, Matplotlib, Seaborn), shell scripting, SQL,
Airflow, and Elasticsearch, as well as cloud deployment tools (AWS and
GCP).
Skills
- Data Analysis and Scripting: Python (Pandas, NumPy,
Airflow, PySpark, NetworkX, SciPy, Scikit-learn, TensorFlow, PyTorch),
Unix Shell (bash/zsh).
- Automation and DevOps: Airflow, tmux, cron, Git,
AWS, GCP, Jira.
- ETL: PostgreSQL, TimescaleDB, InfluxDB, MongoDB,
Elasticsearch, Logstash.
- Visualization: Plotly/Dash, Matplotlib/Seaborn,
Vega.js, d3.js.
- Dashboarding, Reporting, and Documentation:
Jupyter, Grafana, Redash, Kibana, rST, Markdown.
- Other: JavaScript, MATLAB, C, C++, C#
Experience
Cerebras Systems
Sunnyvale, CA
Senior Data Scientist (Data Analytics) Apr 2022 –
Present
- Architected and developed ETL data pipelines and automations
centered on real-time telemetry and testing data collection,
preprocessing, and analysis using Python, Pandas, Airflow,
PostgreSQL, TimescaleDB, InfluxDB, Git, and
AWS infrastructure-as-code mechanisms.
- Developed design documents, implemented agile project management on
Jira, and documented pipelines on internal Sphinx
webserver with rST.
- Executed exploratory and investigative data analysis on noisy
telemetry and hardware testing datasets to identify anomalous datapoints
and variation patterns with Pandas, SQL, NumPy, and
SciPy. Communicated findings visually with
Jupyter and Plotly to stakeholders and
provided recommendations for adjustment of testing and operation
parameters.
- Redesigned deployed data pipelines, cutting processing and
reporting times by 97-99%. Monitored the execution of pipelines
with Airflow, AWS CloudWatch, and Slack
alerts as designated owner and performed debugging
analysis.
- Deployed statistical outlier detection methods with
Pandas and SQL for complex hardware
testing datasets and provided data consumers with interactive interfaces
for adjustability of detection thresholds.
- Deployed the open-source Elasticsearch-Logstash-Kibana
(ELK) stack to provide tools and Kibana Query Language (KQL)
workflows for collecting, organizing, analyzing, and reporting on
machine-generated logs from Linux systems and unstructured text data
from tests.
- Developed scalable user-friendly dashboards, interactive
visualizations, and reporting utilities for hardware testing and
monitoring datasets using Plotly Dash web apps on
AWS servers.
- Negotiated stakeholder relationships across the organization to
understand data producer and consumer requirements and inculcate a
data-driven culture across engineering teams. Led training and
communication efforts to improve engineering workflows. Mentored junior
professionals through close collaboration.
Drexel University
Philadelphia, PA
Data Scientist, Doctoral Researcher (CODED Lab) Sep 2017 – Mar
2022
- Built a prototype news aggregator that combined graph partitioning
algorithms, transformer neural language models (MiniLM/DistilBERT), and
clustering techniques (HDBSCAN) to group, summarize, and connect news
stories and social conversations with PyTorch, PySpark,
and scikit-learn. Built a terabyte-scale real-time ETL
system on top of public APIs and constructed a graph dataset of 3M news
articles and 1.5B social posts with MongoDB and
NetworkX. Published as a PhD dissertation and multiple
papers; presented at the HICSS conference.
- Developed continuous activity tracking analytics tools to monitor
malicious coordinated campaigns and botnets on 3M social accounts with
scikit-learn, SciPy, and Plotly.
Presented at IC2S2.
- Developed a supervised machine learning classifier with
scikit-learn and NumPy for bot
detection on social platforms using a veracity-annotated news discussion
dataset consisting of 1.6M user-generated text records. Published and
presented at ICWSM.
- Designed a prototype conversational agent incorporating a neural NLP
pipeline (GPT-2 based) for mental health support using
PyTorch and DeepSpeech as well as a
Raspberry Pi client. Published in JMIR Human Factors
and presented at EAI PervasiveHealth.
- Developed a low-cost frequency-based NLP algorithm for automatic
expansion of a relational dataset of medical terms using user-generated
text corpora with NumPy and SpaCy.
Published at ICHI.
Semion Inc.
Dhaka, Bangladesh / Berkeley, CA
Machine Learning Engineer, NLP Jul 2016 – Jul
2017
Built business applications with state-of-the-art deep learning
technologies. Developed a sequential CNN-based system for legal document
discovery. Prototyped named entity recognition for financial documents
utilizing LSTM and CRF models. Developed NLP and speech recognition
components of voice-activated electronic health records software for
medical professionals.
Education
Drexel University, Philadelphia, PA
Ph.D., Data Science 2022
Combining software engineering, statistical analysis, and computational
modeling with sociotechnical systems theory from complex systems,
computational social science, network science, and communication
literature to study the design and application of social data products
leveraging large datasets, machine learning, and network analysis. Focus
on data science for social good: analysis of malicious activity in
social networks such as bots, misinformation, and coordinated campaigns
as well as development of software tools for user empowerment.
Bangladesh University of Engineering and
Technology, Dhaka, Bangladesh
B.Sc., Electrical and Electronic Engineering 2016
Developed a support vector machine (SVM) based machine learning system
for forensic location classification from electrical network frequency
data that won the IEEE Signal Processing Cup at ICASSP 2016. Developed
an automatic household electrical load monitoring system utilizing power
signature classification techniques with real-time user notifications
and analytics.
Publications
- Mujib, M. I., Zelenkauskaite, A., & Williams,
J. R. (2023). Which tweets ‘deserve’ to be included in news stories?
Chronemics of tweet embedding. In Proceedings of the 56th Annual
Hawaii International Conference on System Sciences (HICSS56).
- Smriti, D., Kao, T. S., Rathod, R., Shin, J. Y., Peng, W.,
Williams, J. R., Mujib, M. I., Colosimo, M., &
Huh-Yoo, J. (2022). MICA: Motivational Interviewing Conversational Agent
for Parents as Proxies for Their Children in Healthy Eating. JMIR
Human Factors, 06/08/2022:38908.
- Mujib, M. I. (2022). Modeling Emerging News Stories
Across Digital Publications and Social Media. Drexel University,
Philadelphia. Doctoral dissertation.
- Wang, L., Mujib, M. I., Williams, J., Demiris, G.,
& Huh-Yoo, J. (2021). An Evaluation of Generative Pre-Training
Model-based Therapy Chatbot for Caregivers. arXiv preprint,
arXiv:2107.13115.
- Mujib, M. I., Heidenreich, H. S., Murphy, C. J.,
Santia, G. C., Zelenkauskaite, A., & Williams, J. R. (2020, August).
NewsTweet: A Dataset of Social Media Embedding in Online Journalism.
arXiv preprint, arXiv:2008.02870.
- Heidenreich, H. S., Mujib, M. I., & Williams,
J. R. (2020, July). Investigating Coordinated ‘Social’ Targeting of
High-Profile Twitter Accounts. arXiv preprint,
arXiv:2008.02874. Presented at the 6th International Conference on
Computational Social Science (IC2S2).
- Smriti, D., Shin, J. Y., Mujib, M. I., Colosimo,
M., Kao, T. S., Williams, J., & Huh-Yoo, J. (2020, May). TAMICA:
Tailorable Autonomous Motivational Interviewing Conversational Agent. In
Proceedings of the 14th EAI International Conference on Pervasive
Computing Technologies for Healthcare (pp. 411-414).
- Santia, G. C., Mujib, M. I., & Williams, J. R.
(2019, June). Detecting Social Bots on Facebook in an Information
Veracity Context. In Proceedings of the International AAAI
Conference on Web and Social Media (ICWSM) Vol. 13 (pp. 463-472).
AAAI.
- Mujib, M. I., Yang, C. C., Zhao, M., &
Williams, J. R. (2018, June). Expanding Consumer Health Vocabularies
with Frequency-Conserving Internal Context Models. In 2018 IEEE
International Conference on Healthcare Informatics (ICHI)
(pp. 241-246). IEEE.