
Published:  31 May, 2024 Contributors: Ivan Belcic, Cole Stryker

Data engineering is the practice of designing and building systems for the aggregation, storage and analysis of data at scale. Data engineers empower organizations to get insights in real time from large datasets.

From social media and marketing metrics to employee performance statistics and trend forecasts, enterprises have all the data they need to compile a holistic view of their operations. Data engineers transform massive quantities of data into valuable strategic findings.

With proper data engineering, stakeholders across an organization—executives, developers, data scientists and business intelligence (BI) analysts—can access the datasets they need at any time in a manner that is reliable, convenient and secure.

Organizations have access to more data—and more data types—than ever before. Every bit of data can potentially inform a crucial business decision. Data engineers govern data management for downstream use including analysis, forecasting or machine learning.

As specialized computer scientists, data engineers excel at creating and deploying algorithms, data pipelines and workflows that sort raw data into ready-to-use datasets. Data engineering is an integral component of the modern data platform and makes it possible for businesses to analyze and apply the data they receive, regardless of the data source or format.

Even under a decentralized  data mesh  management system, a core team of data engineers is still responsible for overall infrastructure health. 



Data engineers have a range of day-to-day responsibilities. Here are several key use cases for data engineering:

Data engineers streamline data intake and storage across an organization for convenient access and analysis. This facilitates scalability by storing data efficiently and establishing processes to manage it in a way that is easy to maintain as a business grows. The field of DataOps automates data management and is made possible by the work of data engineers.

With the right data pipelines in place, businesses can automate the processes of collecting, cleaning and formatting data for use in data analytics. When vast quantities of usable data are accessible from one location, data analysts can easily find the information they need to help business leaders learn and make key strategic decisions.

The solutions that data engineers create set the stage for real-time learning as data flows into data models that serve as living representations of an organization's status at any given moment.

Machine learning (ML) uses vast reams of data to train artificial intelligence (AI) models and improve their accuracy. From the product recommendation services seen in many e-commerce platforms to the fast-growing field of generative AI  (gen AI), ML algorithms are in widespread use. Machine learning engineers rely on data pipelines to transport data from the point at which it is collected to the models that consume it for training.

Data engineers build systems that convert mass quantities of raw data into usable core data sets containing the essential data their colleagues need. Otherwise, it would be extremely difficult for end users to access and interpret the data spread across an enterprise's operational systems.

Core data sets are tailored to a specific downstream use case and designed to convey all the required data in a usable format with no superfluous information. The three pillars of a strong core data set are:

The data as a product (DaaP) method of data management emphasizes serving end users with accessible, reliable data. Analysts, scientists, managers and other business leaders should encounter as few obstacles as possible when accessing and interpreting data.

Good data isn't just a snapshot of the present—it provides context by conveying change over time. Strong core data sets will showcase historical trends and give perspective to inform more strategic decision-making.

Data integration is the practice of aggregating data from across an enterprise into a unified dataset and is one of the primary responsibilities of the data engineering role. Data engineers make it possible for end users to combine data from disparate sources as required by their work.

Data engineering governs the design and creation of the data pipelines that convert raw, unstructured data into unified datasets that preserve data quality and reliability. 

Data pipelines form the backbone of a well-functioning data infrastructure and are informed by the data architecture requirements of the business they serve. Data observability is the practice by which data engineers monitor their pipelines to ensure that end users receive reliable data. 

The data integration pipeline contains three key phases: 

Data ingestion is the movement of data from various sources into a single ecosystem. These sources can include databases, cloud computing platforms such as Amazon Web Services (AWS), IoT devices, data lakes and warehouses, websites and other customer touchpoints. Data engineers use APIs to connect many of these data points into their pipelines.

Each data source stores and formats data in a specific way, which may be structured or unstructured. While structured data is already formatted for efficient access, unstructured data is not. Through data ingestion, the data is unified into an organized data system ready for further refinement.
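As a minimal, hedged illustration of the ingestion step, the Python sketch below pulls JSON records from a hypothetical REST endpoint and lands them untouched in a raw staging area. The endpoint URL, field names and file layout are assumptions for the example, not part of any specific platform.

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    import requests  # third-party HTTP client

    API_URL = "https://example.com/api/v1/orders"  # hypothetical source endpoint

    def ingest(raw_dir: str = "raw/orders") -> Path:
        """Pull one batch of records and store it unmodified for later processing."""
        response = requests.get(API_URL, timeout=30)
        response.raise_for_status()
        records = response.json()

        # Land the raw payload under a timestamped name so every batch is kept.
        target = Path(raw_dir)
        target.mkdir(parents=True, exist_ok=True)
        out_file = target / f"orders_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
        out_file.write_text(json.dumps(records))
        return out_file

    if __name__ == "__main__":
        print(f"Ingested batch written to {ingest()}")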

Data transformation prepares the ingested data for end users such as executives or machine learning engineers. It is a hygiene exercise that finds and corrects errors, removes duplicate entries and normalizes data for greater data reliability. Then, the data is converted into the format required by the end user.
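A small pandas sketch of such a transformation step, continuing the hypothetical "orders" example above (all column names are invented): duplicates are dropped, obvious errors removed and formats normalised before the data is handed on.

    import pandas as pd

    def transform(raw_file: str) -> pd.DataFrame:
        """Clean and normalise one raw batch (hypothetical 'orders' schema)."""
        df = pd.read_json(raw_file)

        # Remove exact duplicate records.
        df = df.drop_duplicates()

        # Correct obvious errors: negative amounts are treated as data-entry mistakes.
        df = df[df["amount"] >= 0]

        # Normalise formats so downstream users see consistent values.
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
        df["country"] = df["country"].str.strip().str.upper()

        # Drop rows where mandatory fields could not be parsed.
        return df.dropna(subset=["order_id", "order_date"])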

Once the data has been collected and processed, it’s delivered to the end user. Real-time data modeling and visualization, machine learning datasets and automated reporting systems are all examples of common data serving methods.

Data engineering, data science, and data analytics are closely related fields. However, each is a focused discipline filling a unique role within a larger enterprise. These three roles work together to ensure that organizations can make the most of their data.

  • Data scientists draw on machine learning, data exploration and other disciplines to predict future outcomes. Data science is an interdisciplinary field focused on making accurate predictions through algorithms and statistical models. Like data engineering, data science is a code-heavy role requiring an extensive programming background.

  • Data analysts examine large datasets to identify trends and extract insights that help organizations make data-driven decisions today. While data scientists apply advanced computational techniques to manipulate data, data analysts work with predefined datasets to uncover critical information and draw meaningful conclusions.

  • Data engineers are software engineers who build and maintain an enterprise’s data infrastructure—automating data integration, creating efficient data storage models and enhancing data quality via pipeline observability. Data scientists and analysts rely on data engineers to provide them with the reliable, high-quality data they need for their work.

The data engineering role is defined by its specialized skill set. Data engineers must be proficient with numerous tools and technologies to optimize the flow, storage, management and quality of data across an organization.

When building a pipeline, a data engineer automates the data integration process with scripts—lines of code that perform repetitive tasks. Depending on their organization's needs, data engineers construct pipelines in one of two formats: ETL or ELT.

ETL: extract, transform, load

ETL pipelines automate the retrieval and storage of data in a database. The raw data is extracted from the source, transformed into a standardized format by scripts and then loaded into a storage destination. ETL is the most commonly used data integration method, especially when combining data from multiple sources into a unified format.

ELT: extract, load, transform

ELT pipelines extract raw data and import it into a centralized repository before standardizing it through transformation. The collected data can later be formatted as needed on a per-use basis, offering a higher degree of flexibility than ETL pipelines.
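To make the ordering concrete, here is a minimal ELT sketch that uses SQLite as a stand-in for the central repository: raw rows are loaded first, and standardisation happens afterwards inside the database with SQL. The table and column names are invented for the example.

    import sqlite3

    raw_rows = [
        ("A-1", " us ", "19.99"),
        ("A-2", "DE", "5.00"),
        ("A-2", "DE", "5.00"),  # a duplicate arrives in the raw feed
    ]

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, country TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

    # Transformation happens after loading, inside the repository.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS clean_orders AS
        SELECT DISTINCT
            order_id,
            UPPER(TRIM(country)) AS country,
            CAST(amount AS REAL)  AS amount
        FROM raw_orders
    """)
    conn.commit()
    print(conn.execute("SELECT * FROM clean_orders").fetchall())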

The systems that data engineers create often begin and end with data storage solutions: harvesting data from one location, processing it and then depositing it elsewhere at the end of the pipeline. 

Cloud computing services

Proficiency with cloud computing platforms is essential for a successful career in data engineering. Microsoft Azure Data Lake Storage, Amazon S3 and other AWS solutions, Google Cloud and IBM Cloud ® are all popular platforms.

Relational databases

A relational database organizes data according to a system of predefined relationships. The data is arranged into rows and columns that form a table conveying the relationships between the data points. This structure allows even complex queries to be performed efficiently.

Analysts and engineers maintain these databases with relational database management systems (RDBMS). Most RDBMS solutions use SQL for handling queries, with MySQL and PostgreSQL as two of the leading open source RDBMS options.
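The sketch below uses Python's built-in sqlite3 module and two invented tables to show how the row-and-column structure lets a relational database answer a question spanning both tables (revenue per customer) with one declarative SQL query.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders    VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
    """)

    # One declarative query expresses both the join and the aggregation.
    query = """
        SELECT c.name, SUM(o.amount) AS total_spent
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total_spent DESC
    """
    for name, total in conn.execute(query):
        print(name, total)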

NoSQL databases

SQL isn’t the only option for database management. NoSQL databases enable data engineers to build data storage solutions without relying on traditional models. Since NoSQL databases don’t store data in predefined tables, they allow users to work more intuitively without as much advance planning. NoSQL offers more flexibility along with easier horizontal scalability when compared to SQL-based relational databases.
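A hedged sketch with the pymongo driver (assuming a MongoDB instance is reachable on localhost) illustrates that flexibility: documents with different shapes share one collection, and no table definition is required up front.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
    events = client["analytics"]["events"]             # database and collection are created lazily

    # Documents with different shapes can live in the same collection.
    events.insert_many([
        {"type": "click", "page": "/pricing", "user": "u1"},
        {"type": "purchase", "user": "u2", "amount": 49.0, "items": ["sku-1", "sku-2"]},
    ])

    # Query by field without any predefined schema.
    for doc in events.find({"type": "purchase"}):
        print(doc["user"], doc["amount"])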

Data warehouses

Data warehouses collect and standardize data from across an enterprise to establish a single source of truth. Most data warehouses consist of a three-tiered structure: a bottom tier storing the data, a middle tier enabling fast queries and a user-facing top tier. While traditional data warehousing models only support structured data, modern solutions can store unstructured data. 

By aggregating data and powering fast queries in real-time, data warehouses enhance data quality, provide quicker business insights and enable strategic data-driven decisions. Data analysts can access all the data they need from a single interface and benefit from real-time data modeling and visualization.

While a data warehouse emphasizes structure, a data lake is more of a freeform data management solution that stores large quantities of both structured and unstructured data. Lakes are more flexible in use and more affordable to build than data warehouses as they lack the requirement for predefined schema.

Data lakes house new, raw data, especially the unstructured big data ideal for training machine learning systems. But without sufficient management, data lakes can easily become data swamps: messy hoards of data too convoluted to navigate.

Many data lakes are built on the Hadoop ecosystem, often alongside real-time data processing frameworks such as Apache Spark and Apache Kafka.

Data lakehouses

Data lakehouses are the next stage in data management. They mitigate the weaknesses of both the warehouse and lake models. Lakehouses blend the cost optimization of lakes with the structure and superior management of the warehouse to meet the demands of machine learning, data science and BI applications.

As a computer science discipline, data engineering requires an in-depth knowledge of various programming languages. Data engineers use programming languages to construct their data pipelines.

SQL, or Structured Query Language, is the predominant programming language for creating and manipulating databases. It forms the basis for all relational databases, and SQL-style query interfaces are available for many NoSQL systems as well.

Python offers a wide range of prebuilt modules to speed up many aspects of the data engineering process, from building complex pipelines with Luigi to managing workflows with Apache Airflow. Many user-facing software applications use Python as their foundation.
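For illustration, here is a minimal Apache Airflow DAG (assuming Airflow 2.x); the extract and transform callables are placeholders, and the DAG simply wires them into a daily schedule.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull raw data from the source")   # placeholder

    def transform():
        print("clean and normalise the batch")   # placeholder

    with DAG(
        dag_id="daily_orders_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",      # Airflow 2.4+; older versions use schedule_interval
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        extract_task >> transform_task   # transform runs only after extract succeeds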

Scala is a good choice for working with big data because it meshes well with Apache Spark. Unlike Python, Scala offers native concurrency primitives that let developers execute several tasks simultaneously. This parallel processing ability makes Scala a popular choice for pipeline construction.

Java is a popular choice for the backend of many data engineering pipelines. When organizations opt to build their own in-house data processing solutions, Java is often the programming language of choice. It also underpins Apache Hive, an analytics-focused warehouse tool.



Data Science and Engineering: Research Areas

Author: guest contributor.

Data science emerged as an independent domain in the decade starting 2010, with the explosive growth in big data analytics, cloud and IoT technology capabilities. A data scientist requires fundamental knowledge in computer science, statistics and machine learning, which they may use to solve problems in a variety of domains. We may define data science as the study of scientific principles that describe data and their inter-relationships. Some of the current areas of research in Data Science and Engineering are categorized and enumerated below:

Data Science and Engineering – Research Areas © Springernature 2023

1. Artificial Intelligence / Machine Learning:

While human beings learn from experience, machines learn from data and improve their accuracy over time. AI applications attempt to mimic human intelligence with a computer, robot or other machine. AI/ML has brought disruptive innovations to business and social life. One of the emerging areas in AI is generative AI: algorithms, often refined with reinforcement learning from human feedback, that create content such as text, code, audio, images and video. The AI-based chatbot ‘ChatGPT’ from OpenAI is a product in this line. ChatGPT can code computer programs, compose music, write short stories and essays, and much more!

2. Automation: 

Some of the research areas in automation include public ride-sharing services (e.g., the Uber platform), self-driving vehicles and the automation of manufacturing. AI/ML techniques are widely used in industry to identify unusual patterns in sensor readings from machinery and equipment so that malfunctions can be detected or prevented.

3. Business:

As we know, social media provide opportunities for people to interact, share and participate in numerous activities at massive scale. A marketing researcher may analyze this data to gain an understanding of human sentiment and behavior unobtrusively, at a scale unheard of in traditional marketing. We come across personalized product recommender systems almost every day. Content-based recommender systems infer a user’s intentions from the history of their previous activities. Collaborative recommender systems use data mining techniques to make personalized product recommendations during live customer transactions, based on the opinions of customers with similar profiles.
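A toy user-based collaborative filtering sketch in NumPy shows the core idea behind such recommenders: an unseen item is scored for a user by weighting other users' ratings with profile similarity. The rating matrix is invented for illustration.

    import numpy as np

    # Rows = users, columns = products; 0 means "not rated yet".
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
    ], dtype=float)

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

    def predict(user, item):
        """Similarity-weighted average of other users' ratings for the item."""
        weights, scores = [], []
        for other in range(ratings.shape[0]):
            if other != user and ratings[other, item] > 0:
                weights.append(cosine(ratings[user], ratings[other]))
                scores.append(ratings[other, item])
        return np.average(scores, weights=weights) if weights else 0.0

    print(predict(user=0, item=2))  # estimate user 0's interest in product 2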

Data science finds numerous applications in finance, such as stock market analysis, targeted marketing and the detection of unusual transaction patterns, fraudulent credit card transactions and money laundering. Financial markets are complex and chaotic, but AI technologies make it possible to process massive amounts of real-time data, leading to more accurate forecasting and trading. Stock Hero, Scanz, Tickeron, Imperative Execution and Algoriz are some of the AI-based products for stock market prediction.

4. Computer Vision and NLP:

AI/ML models are extensively used in digital image processing, computer vision, speech recognition and natural language processing (NLP). In image processing, we use mathematical transformations to enhance an image; these typically include smoothing, sharpening, contrast adjustment and stretching. From the transformed images we can extract various types of features: edges, corners, ridges and blobs/regions. The objective of computer vision is to identify objects in images. To achieve this, the input image is processed, features are extracted and, using those features, the object is classified (or identified).
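A small NumPy sketch of one such feature extractor, using a synthetic image: finite differences approximate the image gradient, and pixels with a large gradient magnitude are marked as edge candidates.

    import numpy as np

    # Synthetic grey-scale "image": a bright square on a dark background.
    img = np.zeros((8, 8))
    img[2:6, 2:6] = 1.0

    # Finite differences approximate the horizontal and vertical gradients.
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:] = img[:, 1:] - img[:, :-1]
    gy[1:, :] = img[1:, :] - img[:-1, :]

    # Large gradient magnitude marks an edge candidate.
    magnitude = np.hypot(gx, gy)
    edges = magnitude > 0.5
    print(edges.astype(int))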

Natural language processing techniques are used to understand human language in written or spoken form, translate it to another language or respond to commands. Voice-operated GPS systems, translation tools, speech-to-text dictation and customer service chatbots are all applications of NLP. Siri and Alexa are popular NLP products.

5. Data Mining:

Data mining is the process of cleaning and analyzing data to identify hidden patterns and trends that are not readily discernible from a conventional spreadsheet. Building models for classification and clustering in high-dimensional, streaming and/or big data spaces is an area that receives much attention from researchers. Network-graph-based algorithms are being developed for representing and analyzing interactions in social media such as Facebook, Twitter, LinkedIn and Instagram, and on websites.
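As a minimal example of one such technique, the sketch below clusters invented two-dimensional points with scikit-learn's KMeans; real data mining work would add feature engineering and validation on top of this.

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented 2-D data: two loose groups of points.
    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
        rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    ])

    model = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = model.fit_predict(points)

    print("cluster sizes:", np.bincount(labels))
    print("cluster centres:", model.cluster_centers_)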

6. Data Management:

Information storage and retrieval is the area concerned with the effective and efficient storage and retrieval of digital documents in multiple data formats, using their semantic content. Government regulations and individual privacy concerns necessitate cryptographic methods for storing and sharing data, such as secure multi-party computation, homomorphic encryption and differential privacy.

Data-stream processing needs specialized algorithms and techniques for performing computations on huge volumes of data that arrive fast and require immediate processing, e.g., satellite images, sensor data, internet traffic and web searches. Other areas of research in data management include big data databases, cloud computing architectures, crowdsourcing, human-machine interaction and data governance.

7. Data Visualization:

Visualizing complex, big and/or streaming data, such as the onset of a storm or a cosmic event, demands advanced techniques. In data visualization, the user usually follows a three-step process: get an overview of the data, identify interesting patterns and drill down for final details. In most cases, the input data is subjected to mathematical transformations and statistical summarizations. The visualization of the real physical world may be further enhanced using audio-visual techniques or other sensory stimuli delivered by technology; this technique is called augmented reality (AR). Virtual reality (VR) provides a computer-generated virtual environment, giving users an immersive experience. For example, ‘Pokémon GO’, which lets you play the game Pokémon, is an AR product released in 2016; Google Earth VR is a VR product that ‘puts the whole world within your reach’.

8. Genetic Studies:

Genetic studies are path-breaking investigations of the biological basis of inherited and acquired genetic variation using advanced statistical methods. The Human Genome Project (1990–2003) produced a genome sequence that accounted for over 90% of the human genome, at a project cost of about USD 3 billion. The data underlying a single human genome sequence is about 200 gigabytes. The digital revolution has made it possible to trace human genetic variation and evolution with remarkable accuracy. Note that the cost of sequencing the entire genome of a human cell has fallen from about USD 100,000,000 in the year 2000 to around USD 800 in 2020!

9. Government:

Governments need smart and effective platforms for interacting with citizens and for data collection, validation and analysis. Data-driven tools and AI/ML techniques are used for fighting terrorism, intervening in street crimes and tackling cyber-attacks. Data science also provides support in rendering public services, national and social security, and emergency response.

10. Healthcare:

The most important contribution of data science in the pharmaceutical industry is to provide computational support for cost-effective drug discovery using AI/ML techniques. AI/ML supports medical diagnosis, preventive care and the prediction of failures based on historical data. The study of genetic data helps in the identification of anomalies, the prediction of possible failures and personalized drug suggestions, e.g., in cancer treatment. Medical image processing uses data science techniques to visualize, interrogate, identify and treat deformities in internal organs and systems.

Electronic health records (EHR) are concerned with the storage of data arriving in multiple formats, data privacy (e.g., conformance with HIPAA privacy regulations), and data sharing between stakeholders. Wearable technology provides electronic devices and platforms for collecting and analyzing data related to personal health and exercise – for example, Fitbit and smartwatches. The Covid-19 pandemic demonstrated the power of data science in monitoring and controlling an epidemic as well as developing drugs in record time. 

11. Responsible AI:

AI systems support complex decision-making in various domains such as autonomous vehicles, healthcare, public safety and HR practices. For AI systems to be trusted, their decisions must be reliable, explainable, accountable and ethical. There is ongoing research on how these facets can be built into AI algorithms.

The author's book described below appears in the book series Transactions on Computer Systems and Networks.


Srikrishnan Sundararajan, PhD in Computer Applications, is a retired senior professor of business analytics at the Loyola Institute of Business Administration, Chennai, India. He has held various tenured and visiting professorships in business analytics and computer science for over 10 years, and has 25 years of experience as a consultant in the information technology industry in India and the USA, in information systems development and technology support.

He is the author of the forthcoming book ‘Multivariate Analysis and Machine Learning Techniques - Feature Analysis in Data Science using Python’, published by Springer Nature (ISBN 9789819903528). The book offers a comprehensive first-level introduction to data science, including Python programming, probability and statistics, multivariate analysis, survival analysis, AI/ML and other computational techniques.



Four Generations in Data Engineering for Data Science

The Past, Present and Future of a Field of Science

  • Fachbeitrag (technical contribution)
  • Open access
  • Published: 22 December 2021
  • Volume 22, pages 59–66 (2022)


  • Meike Klettke (ORCID: orcid.org/0000-0003-0551-8389)
  • Uta Störl


Data-driven methods and data science are important scientific methods in many research fields. All data science approaches require professional data engineering components. At the moment, computer science experts are needed to solve these data engineering tasks. Simultaneously, scientists from many fields (such as the natural sciences, medicine, environmental sciences and engineering) want to analyse their data autonomously. The arising task for data engineering is the development of tools that can support automated data curation and are usable by domain experts. In this article, we introduce four generations of data engineering approaches, classifying the data engineering technologies of the past and present. We show which data engineering tools are needed for the scientific landscape of the next decade.


1 Introduction

“Drowning in Data, Dying of Thirst for Knowledge”: this often-used quote captures the main problem of data science, the necessity to draw useful knowledge from data, and simultaneously the main aim of the data engineering field, providing data for analysis. In dedicated application fields, different kinds of data are collected and generated that shall be analysed with data mining methods. In this article, we use the term data mining in its broad interpretation, synonymous with knowledge discovery in databases, which is “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or relationships within a dataset in order to make important decisions” (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Even though in recent times the focus has been on artificial neural network algorithms, the entire range of data mining methods also includes clustering, classification, regression, association rules and so on.

This article, however, will mainly focus on the data preprocessing part of data science. Data engineering components have to read data from very large data sources in different heterogeneous formats and integrate it into the target data format. In this process, data are validated, cleaned, completed, aggregated, transformed and integrated. The tools for data engineering tasks have a long tradition in the classical database research field: for more than 50 years, database management systems have been used to store large amounts of structured data. Over time, these systems have been extended and redeveloped along different dimensions:

to handle increasing volumes of data,

to store data in different data models (besides the relational model, also graph data models, streaming data and JSON) and to transform data between these models,

to consider the heterogeneity of data, and

to treat the incompleteness and vagueness of datasets.

In data science applications, an additional requirement arises: the wish that domain experts be able to analyse their own data. Under the term democratising machine learning, the requirement of lowering entry barriers for domain experts who analyse their own data has been articulated [ 37 ].

All of the dimensions enumerated above have shaped the data engineering research landscape. This article introduces a systematic classification of the field.

The rest of the article is structured as follows. In Sect.  2 , a classification of data engineering methods will be introduced and four generations will be defined. Each of these generations represents a very active body of research. Thus, in Sect.  3 , a comprehensive outlook on open research questions in all generations is given.

2 Classification of Data Engineering Methods

In this article, available data engineering methods for data science applications will be classified. The main contribution of the article is a systematic overview of achievements in this research field till now (First, Second, and Third Generation), the open research questions in the present (mainly in the Third Generation) and the requirements that will have to be met for the future development of the area (Fourth Generation).

The term generation does not mean that one generation replaces another, but that each generation builds on the previous ones. No valuation is implied; rather, the generations reflect the temporal order in which the developments began. In the following, this classification of data engineering approaches is introduced.

2.1 First Generation: Data Preprocessing

Database technology and tools for providing structured data have been available for more than 50 years. The term “data engineering” came up later. It summarises methods to provide data for business intelligence, data science analysis, and machine learning algorithms – the so-called data preprocessing.

Fig. 1: First Generation: Data Engineering as part of Data Science

Fig. 1 visualises data engineering as part of the data science process. This follows the observation that “data preprocessing is an often neglected but major step in the data mining process” [ 11 ]. In all real data science applications, data engineering has proven to be the most time-consuming subtask; estimates put it at 60–80% of the total effort (Footnote 1). The reasons are that data preprocessing starts from scratch with each new application and requires a high manual effort, which makes it time-consuming, expensive and error-prone. In all real applications, data preprocessing is much more complicated than expected, and numerous data quality problems, exceptions and outliers can often be found in the datasets.

Because of the high effort required, data preprocessing has been established as a research field of its own, and the term data engineering is used for all of its subtasks. The high manual effort of data engineering tasks leads to the need for tool support. The First Generation of data engineering tools was developed to solve individual parts of the problem, either to increase data quality or to transform the data into a required target format. Some of the data engineering subtasks are listed below (a minimal pandas sketch illustrating a few of the profiling-oriented subtasks follows the list):

Data Understanding and Data Profiling

Data Exploration

Schema Extraction

Column Type Inference

Inference of Integrity Constraints/Pattern

Cleaning and Data Correction

Outlier Detection and Correction

Duplicate Elimination

Missing Value Imputation

Data Transformation

Matching and Mapping

Datatype Transformation

Transformation between different Data Models

Data Integration
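As announced above, here is a minimal pandas sketch of the profiling-oriented subtasks (data understanding, column type inference, null and duplicate counts); the input file and its columns are assumptions for the example.

    import pandas as pd

    df = pd.read_csv("measurements.csv")  # hypothetical input dataset

    # Data understanding / profiling: shape, inferred column types, summaries.
    print(df.shape)
    print(df.dtypes)                      # simple column type inference
    print(df.describe(include="all"))     # per-column summary statistics

    # Data quality indicators used by later cleaning steps.
    print("null values per column:")
    print(df.isna().sum())
    print("exact duplicate rows:", df.duplicated().sum())
    print("distinct values per column:")
    print(df.nunique())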

Solutions for many data models are either based on “classic approaches” or apply machine learning algorithms to solve preprocessing tasks. In this section, an overview of some of these available approaches will be given.

There are several tutorials and textbooks that present the current state-of-the-art in the dedicated subtasks, e.g. [ 5 , 11 , 19 , 29 ] to mention only some of these.

Data engineering of unstructured or (partially) unknown data sources often starts with data profiling  [ 1 ]. The aim is to explore and understand the data and to derive data characteristics. Tools for data exploration give an overview of data structures, attributes, domains, regularity of data, null values, and so on, e.g. [ 27 ] for NoSQL data, in [ 7 ] a query-based approach has been suggested and in [ 17 ] an overview of available methods is given.

Schema extraction is a reverse-engineering process that extracts the implicit structural information and generates an explicit schema for a given dataset. Several algorithms that deliver a schema overview have been suggested for the different data formats XML [ 26 ] and JSON [ 3 , 23 ], in [ 34 ] different schema modifications for JSON data are derived (like clusters) and in [ 22 ] the complete schema history is constructed.

The reverse engineering of column types and the inference of integrity constraints like functional dependencies [ 4 , 21 ] and foreign keys/inclusion dependencies [ 22 , 24 ] are further subtasks in the field of data profiling.

For handling problems of low data quality, several classes of data cleaning methods have been developed. Outlier detection checks datasets against rules, patterns or similarity comparisons and flags violations as potential data errors [ 6 , 16 , 39 ].
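A rule-based variant of outlier detection can be sketched in a few lines: the classical interquartile-range criterion flags values far outside the bulk of a numeric column as potential errors (the factor 1.5 is the usual convention, not a universal constant, and the values are invented).

    import pandas as pd

    values = pd.Series([10.1, 9.8, 10.3, 10.0, 97.0, 9.9, 10.2])  # invented measurements

    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = values[(values < lower) | (values > upper)]
    print(outliers)   # flags 97.0 as a potential data error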

Duplicate elimination has to be applied to single data sources and also after the integration of datasets from different data sources. Duplicate detection and the merging of duplicate candidates are based on distance functions between tuples, and several methods have been developed to execute these tasks efficiently [ 18 , 30 , 31 ].
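The idea of distance-based duplicate detection can be illustrated with the Python standard library: a string similarity above a chosen threshold marks two records as merge candidates. The records and the threshold are invented for the example.

    from difflib import SequenceMatcher
    from itertools import combinations

    records = ["Acme Corp., Berlin", "ACME Corp, Berlin", "Umbrella Ltd., London"]

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Pairs above the threshold are candidates for merging.
    for left, right in combinations(records, 2):
        score = similarity(left, right)
        if score > 0.85:
            print(f"duplicate candidates ({score:.2f}): {left!r} ~ {right!r}")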

The imputation of missing values in datasets can be done with the following methods: substituting mean values or medians, estimating values based on clustering, applying block-wise iteration, or using artificial neural network and deep learning methods to predict the missing values.
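Two of the simpler strategies from this list, expressed as a hedged pandas sketch with invented columns: numeric gaps are filled with the column median, categorical gaps with the most frequent value.

    import pandas as pd

    df = pd.DataFrame({
        "temperature": [21.5, None, 22.0, 21.8],
        "sensor_type": ["A", "B", None, "B"],
    })

    # Median imputation for numeric columns, mode imputation for categorical ones.
    df["temperature"] = df["temperature"].fillna(df["temperature"].median())
    df["sensor_type"] = df["sensor_type"].fillna(df["sensor_type"].mode().iloc[0])

    print(df)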

Data transformation is another subtask of data preprocessing and realises the transformation between a source and a target structure. Each data transformation algorithm consists of matching the source and target structures and mapping the data into the target structure [ 8 , 14 , 25 ]. In this process, datatype transformations can be realised. In some applications the data has to be transformed between different data models (e.g. NoSQL data or graph structures into relational data), and data integration that unifies data from different data sources in one database has to be executed. The well-studied data conflicts that have to be solved in these processes were originally introduced in [ 20 ] and extended in [ 33 ]. Further research develops scalable data integration approaches [ 9 ].

The development of methods and implementations for the different data engineering subtasks is an ongoing task with a very active research community. Open research tasks are the adaptations of the available preprocessing methods onto new data formats, to enhance their applicability to heterogeneous data and to increase the scalability of all algorithms.

2.2 Second Generation: Data Engineering Pipelines

In the next generation of tools, the need for the professionalisation of data engineering led to toolboxes that enable the definition of data engineering pipelines which are executed repeatedly. This pipelining idea for combining data cleaning algorithms has been suggested in several publications [ 5 , 10 , 12 , 13 , 38 ]. In most tools implementing data engineering pipelines, the algorithms are applicable to different data formats and to heterogeneous and distributed datasets; thereby the diversity of input data is taken into account.

The toolboxes provide different algorithms for solving the dedicated data engineering subtasks and users have the opportunity to define processes which sequentially combine the different preprocessing algorithms. Some of these available toolsets are:

ETL tools for Data Warehouses and BI tools, e.g. Talend Footnote 2 , Tableau Prep Footnote 3 , Qlik Footnote 4

Python and data science libraries, e.g. NumPy Footnote 5 , pandas Footnote 6 , SciPy Footnote 7 , scikit-learn Footnote 8 , feature-engineering Footnote 9

Data preparation parts in data mining tools, e.g. Weka Footnote 10 , RapidMiner Footnote 11

Data wrangling/ Data Lake processing, e.g. Snowflake Footnote 12 , IBM InfoSphere DataStage Footnote 13

In these toolboxes, processes can be defined by composing the available algorithms for continuous execution. In several tools, syntactical checks concerning the applicability of certain algorithms to certain datasets are made (e.g. pre-tests of data types and other data characteristics).

Fig. 2: Second Generation: Data Engineering/Analytics Pipelines

Fig. 2 visualises such toolboxes and the definition of processes (like pipelines) based on the available algorithms. It shows that for each data engineering subtask several algorithms are available; their selection and combination defines the workflow for a concrete preprocessing task.

We define toolsets as the Second Generation of data engineering tools if they provide numerous different methods for each preprocessing subtask and for all data models, and offer the opportunity to define processes. In these toolsets the composition of the pipelines is still a manual task which is up to the user.
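A compact example of such manual pipeline composition, using the scikit-learn library mentioned above: the user picks one algorithm per subtask (imputation, scaling) and chains them into a reusable pipeline that can be executed repeatedly. The data is invented.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # One algorithm per preprocessing subtask, combined by the user into a pipeline.
    preprocessing = Pipeline(steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])

    raw = np.array([[1.0, 200.0],
                    [np.nan, 220.0],
                    [3.0, np.nan]])

    prepared = preprocessing.fit_transform(raw)
    print(prepared)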

2.3 Third Generation: From Pipelines to Intelligent Adaptation of Data Engineering Workflows

Sect.  2.1 showed that nowadays numerous algorithms are available and ready to be used for each data engineering subtask. Each data engineering algorithm newly developed is, at the time of its publication, compared with other algorithms that exist for the same task. This is usually done on one or more datasets and should include qualitative features (like precision) and quantitative features (like efficiency).

Despite these existing comparisons, it is not easy for users of the tools to decide which algorithms in which combination are most suitable for a specific task. This requires experiential knowledge and a deep understanding of all available methods and insights into the data characteristics.

This leads to an open research task: The choice of the most suitable algorithms for all subtasks and their composition has to be supported by the toolsets. Such user guidance could be provided in such a way that even if the composition task itself is up to the user, the toolset recommends applicable algorithms for each data engineering subtask, can predict expected results and can evaluate the data engineering process thus created.

Fig. 3: Third Generation: Intelligent Advisers for Data Engineering Workflows

The current state of the art is a bit behind this ambitious vision. Currently, toolsets provide various implementations for all data engineering subtasks. Often they also provide the information that certain algorithms cannot be executed on a certain dataset, e.g. because they are not applicable to certain data formats (relational, CSV, NoSQL, streaming data) or because data types (numerical values, strings, enumerations, coordinates, timestamps) do not match. The choice of the algorithms and their combination is in most cases still up to the user. As the tools claim to be usable and operable by domain experts too, intelligent guidance of the user, evaluation of the results and simulation of the effects of applying different algorithms are the next functionalities that the data engineering field should develop and provide.

To achieve such user guidance in workflow compositions as sketched in Fig.  3 , the following building blocks are necessary:

Formal specification of the requirements

Algorithms for deriving formal metrics (e.g. schema, datatypes, pattern, constraints, data quality measures) from the datasets

Provision of the formal characteristics for each preprocessing algorithm in the repository of the toolset

Formal contracts on the pre- and postconditions for each algorithm

Development of a method that matches defined requirements and algorithm characteristics

Implementation of sample-based approaches for communication with the domain experts to explain preprocessing results

Evaluation of the results

This long enumeration shows that further developments are needed in this field, at present and in the future, and that the data engineering research community is in demand here.
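As a toy illustration of the matching building block in the enumeration above (not an existing tool): each algorithm in a hypothetical repository declares a precondition on the columns it can handle, and the adviser recommends only those whose contract a given column satisfies.

    import pandas as pd

    # Hypothetical repository: each cleaning algorithm declares its precondition.
    REPOSITORY = {
        "median_imputation":        lambda col: pd.api.types.is_numeric_dtype(col),
        "most_frequent_imputation": lambda col: col.dtype == object,
        "iqr_outlier_detection":    lambda col: pd.api.types.is_numeric_dtype(col),
    }

    def recommend(column: pd.Series) -> list[str]:
        """Return the algorithms whose declared precondition the column fulfils."""
        return [name for name, precondition in REPOSITORY.items() if precondition(column)]

    df = pd.DataFrame({"amount": [1.0, 2.5, None], "country": ["DE", None, "US"]})
    for name in df.columns:
        print(name, "->", recommend(df[name]))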

One very promising approach that could open an additional research direction in data engineering is currently under development in machine learning: care labels or consumer labels for machine learning algorithms [ 28 , 36 ]. They are comparable to care labels for textiles or descriptions of technical devices, which provide instructions on how to care for or clean textiles (or, here, on how to use machine learning algorithms). The basic idea is to add metadata that rates the characteristics of certain ML algorithms. These labels would, for instance, provide information on robustness, generalisation, fairness, accuracy and privacy sensitivity. Currently, their focus is on the analysis algorithms; their extension to data engineering algorithms would help to support user guidance in the orchestration of the complete data science process and would be a building block for fulfilling requirement 3 in the above enumeration. Another technology that could be adapted for these tasks is the formal description method for web services, which pursues a similar aim.

2.4 Fourth Generation: Automatic Data Curation

After this already highly ambitious Third Generation, the question arises as to which further future challenges exist in data engineering research.

Currently, the available data engineering tools simplify many routine tasks and avoid programming effort for the preprocessing tasks. Thus, these tools deliver comfortable support for computer science experts. But in many application fields, domain experts have to solve the data engineering tasks themselves, and for them the same tools are not that easy to use. There are different approaches to overcoming this problem:

Interdisciplinary teams in Data Science projects

Professionals who are trained in certain application fields and computer science (the development of data science master courses has this aim)

Educational tasks for universities, teaching computer science in all university programs (e.g. natural sciences, engineering, humanities, environmental sciences, medicine)

Development of tools for automatic data curation

Whereas the first three solutions generate requirements to be met by university teaching programs, we now concentrate on the last one: automatic data curation. We want to define what is necessary to allow domain experts to use data curation tools and to enable them to solve data persistence and usage tasks.

To approach this, let us first look at the tasks performed by a human computer science specialist in charge of data engineering in any scientific field. To define this, we first look at the tasks of curation in other fields such as art, where curation is defined as “the action or process of selecting, organising, and looking after the items in a collection or exhibition” (Oxford dictionary).

If we try to adapt this concept to data curation, we define it as follows: “Data curation is the task of controlling which data is collected, generated, captured or selected; how it is completed, corrected and cleaned; in which schema, data format and system it is stored; and how it is made available for evaluations and analytics in the long term.”

Automatic data curation describes the aim of automating parts of the data curation process and of developing tools that either execute a certain subtask fully automatically or generate recommendations and guide domain experts’ decisions (a semi-automatic approach).

Fig. 4: Fourth Generation: Automatic Data Curation

The following vision has to be realised: the input data are datasets from a certain application that the domain experts either have created themselves or obtained as the result of scientific experiments. An intelligent data curation toolset solves the following subtasks:

Analyses the entire dataset

Provides information about available standard formats and standard metadata formats in the specific field of science and, based on this, suggests a target data format for storing or archiving the data

Checks the data quality

Provides intelligent guidance for cleaning the data

Suggests additional data sources to complete the data

Transforms the data into the target format, and

Extracts the metadata for catalogues

The main difference from the Third Generation is that users need not define the target data structure in advance, as this guidance is also part of the data curation tool. The process is shown in Fig. 4. Input information is a dataset (on the left-hand side) and information about available schemas/standards in an application domain (on the right-hand side of Fig. 4). Based on this, the selection of the target format and guidance for the data engineering subtasks (cleaning and transformation) are provided. The choice of the target format can be based on calculated distances between the input datasets and the set of available standards in the dedicated science field. For this, matching algorithms [ 14 , 33 ] from data integration can be applied.
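A toy version of that distance-based choice, with invented standards and field names: the candidate standard whose field set overlaps most with the incoming dataset (highest Jaccard similarity) is suggested as the target format.

    # Hypothetical domain standards, each described by its expected field names.
    STANDARDS = {
        "climate_obs_v2": {"station_id", "timestamp", "temperature", "humidity"},
        "air_quality_v1": {"station_id", "timestamp", "pm25", "no2"},
    }

    def suggest_target_format(dataset_fields: set[str]) -> str:
        """Pick the standard with the largest field overlap (Jaccard similarity)."""
        def jaccard(a, b):
            return len(a & b) / len(a | b)
        return max(STANDARDS, key=lambda name: jaccard(dataset_fields, STANDARDS[name]))

    incoming = {"station_id", "timestamp", "temperature", "wind_speed"}
    print(suggest_target_format(incoming))   # -> "climate_obs_v2"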

The aim is to provide as much guidance as possible, supporting the choice of the target format and each data preprocessing step with recommender functions. The communication with the domain experts has to take place at each point in time through a sample-based approach, an intuitive visualisation or a (pseudo-)natural language dialogue.

The development of such tools for automatic data curation is an ongoing, demanding task and future work for our community. The aim is to develop data engineering tools for domain scientists that are as easy to use and as intuitive as the apps used to publish content in social networks or as WYSIWYG website editors.

3 Conclusion and Future Tasks

With this bold attempt to classify an entire field of science, we want to make the current and future development goals clear. The different generations of methods neither represent a chronological classification nor a valuation of the quality of the individual works. For example, there are currently high-quality works that focus on the solution of a single subtask in data engineering which achieve excellent results. In this classification, these research results would be assigned to the First Generation because they make significant scientific contributions with the development of a dedicated algorithm. Fig.  5 gives a very abstract visualisation on the relationships between the different generations.

Fig. 5: Interconnection between the Four Generations of Data Engineering Approaches

The First Generation includes all approaches that develop a solution to a concrete data engineering task (these are several independent fields with a partial overlap, e.g. the calculation of distance functions is part of several approaches). The Second Generation represents the sequential connection of these algorithms into pipelines. In the Third Generation user guidance to compose workflows from the individual algorithms is added and in the Fourth Generation we have presented the notion of extensive support in data curation.

In each of these classes there are many open questions that represent the research tasks of the future. The main directions of this further research are:

Optimisation of each algorithm for a dedicated data engineering subtask

Providing implementations that are applicable for non computer-scientists out-of-the-box

Evaluating the results of the data engineering processes (including data lineage approaches)

Tight coupling between data engineering algorithms, machine learning implementations and result visualisation methods and the joint development of cross-cutting techniques

Development of toolsets that can provide several available data engineering algorithms and that can also be used by application experts

By-example approaches for communication with domain experts, comparable to query-by-example approaches for relational databases [ 40 ]

All four generations face significant technical challenges in maintaining and evolving systems [ 35 ] and in managing evolving data [ 15 ], which are also tasks for future developments.

In summary, the field of data engineering has ambitious goals for the development of further methods and tools that require a sound theoretical basis in computer science. Future development should also be increasingly interdisciplinary so that the results can be applied to all data-driven sciences.

At the same time, there is the major task of teaching computer science topics like data engineering, data literacy, machine learning, and data analytics in university education to reach future application experts in these application domains.

Footnote 1: “… most data scientists spend at least 80 percent of their time in data prep.” [ 2 ] and “Data preparation accounts for about 80% of the work of data scientists” [ 32 ].

Footnote 2: http://www.talend.com
Footnote 3: http://www.tableau.com/products/prep
Footnote 4: http://www.qlik.com
Footnote 5: http://www.numpy.org
Footnote 6: http://www.pandas.pydata.org
Footnote 7: http://www.scipy.org
Footnote 8: http://www.scikit-learn.org
Footnote 9: http://www.pypi.org/project/feature-engine
Footnote 10: http://www.cs.waikato.ac.nz/ml/weka/
Footnote 11: http://www.rapidminer.com
Footnote 12: http://www.snowflake.com
Footnote 13: http://www.ibm.com/it-infrastructure

Abedjan Z, Golab L, Naumann F, Papenbrock T (2018) Data profiling. Synthesis lectures on data management. Morgan & Claypool Publishers,


Analytics India Magazine (2017) Interview with Michael Stonebraker. https://analyticsindiamag.com/interview-michael-stonebraker-distinguished-scientist-recipient-2014-acm-turing-award . Accessed: 18 Dec 2021

Baazizi MA, Colazzo D, Ghelli G, Sartiani C (2019) Parametric schema inference for massive JSON Datasets. VLDB J 28(4):497–521

Bleifuß T, Bülow S, Frohnhofen J, Risch J, Wiese G, Kruse S, Papenbrock T, Naumann F (2016) Approximate discovery of functional dependencies for large datasets. In: CIKM

Boehm M, Kumar A, Yang J (2019) Data management in machine learning systems. Synthesis lectures on data management. Morgan & Claypool Publishers,

Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58. https://doi.org/10.1145/1541880.1541882


Dimitriadou K, Papaemmanouil O, Diao Y (2014) Explore-by-example: an automatic query steering framework for interactive data exploration. SIGMOD


Dong XL, Halevy A, Yu C (2009) Data integration with uncertainty. VLDB J 18(2):469–500

Dong XL, Srivastava D (2013) Big data integration. In: Proc. ICDE. IEEE

Furche T, Gottlob G, Libkin L, Orsi G, Paton NW (2016) Data wrangling for big data: challenges and opportunities. In: Proc. EDBT, vol 16

García S, Luengo J, Herrera F (2015) Data preprocessing in data mining. Intelligent systems reference library, vol 72. Springer,

Golshan B, Halevy AY, Mihaila GA, Tan W (2017) Data integration: after the teenage years. In: Proc. PODS. ACM

Grafberger S, Stoyanovich J, Schelter S (2021) Lightweight inspection of data preprocessing in native machine learning pipelines. In: Proc. CIDR

Halevy A, Rajaraman A, Ordille J (2006) Data integration: the teenage years. In: Proc. VLDB

Hillenbrand A, Levchenko M, Störl U, Scherzinger S, Klettke M (2019) Migcast: putting a price tag on data model evolution in NoSQL data stores. In: Proc. SIGMOD

Hodge VJ, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126

Idreos S, Papaemmanouil O, Chaudhuri S (2015) Overview of data exploration techniques. In: SIGMOD

Ilyas IF, Chu X (2015) Trends in cleaning relational data: consistency and deduplication. Found Trends Databases 5(4):281–393

Inmon WH (2005) Building the data warehouse, 4th edn. Wiley,

Kim W, Seo J (1991) Classifying schematic and data heterogeneity in multidatabase systems. Computer 24(12):12–18

Klettke M (1998) Akquisition von Integritätsbedingungen in Datenbanken. Infix Verlag, St. Augustin

Klettke M, Awolin H, Störl U, Müller D, Scherzinger S (2017) Uncovering the evolution history of data lakes. In: Proc. SCDM@IEEE BigData

Klettke M, Störl U, Scherzinger S (2015) Schema extraction and structural outlier detection for JSON-based NoSQL data stores. In: Proc. BTW

Kruse S, Papenbrock T, Dullweber C, Finke M, Hegner M, Zabel M, Zöllner C, Naumann F (2017) Fast approximate discovery of inclusion dependencies. In: BTW

Lenzerini M (2002) Data integration: a theoretical perspective. In: Proc. PODS

Moh C, Lim E, Ng WK (2000) DTD-miner: a tool for mining DTD from XML documents. In: Proc. WECWIS

Möller ML, Berton N, Klettke M, Scherzinger S, Störl U (2019) jHound: large-scale profiling of open JSON data. In: Proc. BTW

Morik K, Kotthaus H, Heppe L, Heinrich D, Fischer R, Pauly A, Piatkowski N (2021) The care label concept: a certification suite for trustworthy and resource-aware machine learning. In: CoRR

Nargesian F, Zhu E, Miller RJ, Pu KQ, Arocena PC (2019) Data lake management: challenges and opportunities. In: Proc. VLDB Endow

Naumann F, Herschel M (2010) An introduction to duplicate detection. Synth Lect Data Manag. https://doi.org/10.2200/S00262ED1V01Y201003DTM003


Panse F (2014) Duplicate detection in probabilistic relational databases. Ph.D. thesis, Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky

Press G (2016) Cleaning big data: Most time-consuming, least enjoyable data science task. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=5bf8476f637d . Accessed: 18 Dec 2021

Rahm E, Bernstein PA (2001) A survey of approaches to automatic schema matching. VLDB J 10(4):334–350

Ruiz DS, Morales SF, Molina JG (2015) Inferring versioned schemas from NoSQL databases and its applications. In: Proc. ER, vol 9381. Springer,

Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo J, Dennison D (2015) Hidden technical debt in machine learning systems. In: Advances in neural information processing systems

Seifert C, Scherzinger S, Wiese L (2019) Towards generating consumer labels for machine learning models. In: Proc. CogMI. IEEE

Shang Z, Zgraggen E, Buratti B, Kossmann F, Eichmann P, Chung Y, Binnig C, Upfal E, Kraska T (2019) Democratizing data science through interactive curation of ML pipelines. In: SIGMOD

Terrizzano IG, Schwarz PM, Roth M, Colino JE (2015) Data wrangling: the challenging journey from the wild to the lake. In: Proc. CIDR

Wang H, Bah MJ, Hammad M (2019) Progress in outlier detection techniques: a survey. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2932769

Zloof MM (1975) Query-by-example: the invocation and definition of tables and forms. In: VLDB


Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

University of Rostock, Rostock, Germany
Meike Klettke

University of Hagen, Hagen, Germany
Uta Störl

Corresponding author: Correspondence to Meike Klettke.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Klettke, M., Störl, U. Four Generations in Data Engineering for Data Science. Datenbank Spektrum 22 , 59–66 (2022). https://doi.org/10.1007/s13222-021-00399-3

Download citation

Received : 29 May 2021

Accepted : 22 November 2021

Published : 22 December 2021

Issue Date : March 2022

DOI : https://doi.org/10.1007/s13222-021-00399-3

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Data cleaning
  • Data integration
  • Data engineering pipelines
  • Data curation
  • Find a journal
  • Publish with us
  • Track your research


Data engineering 101

Fundamentals of data engineering.

In today's data-driven environment, businesses continuously face the challenge of harnessing and interpreting vast amounts of information. Data engineering is a crucial intersection of technology and business intelligence and plays a critical role in everything from data science to machine learning and artificial intelligence.

So, what makes data engineering indispensable? In a nutshell: its ability to convert raw data into actionable insights.

With the explosion of data sources – from website interactions, transactions, and social media engagements to sensor readings – businesses are generating data at an unparalleled rate. Data engineering equips us with the tools and methodologies needed to gather, process, and structure the data, ensuring it is ready for analysis and decision-making.

This fundamentals of data engineering guide offers a broad overview, preparing readers for a more detailed exploration of data engineering principles.

Summary of fundamentals of data engineering

Concept | Description
Data engineering | The discipline that focuses on preparing "big data" for analytical or operational uses.
Use cases | Practical scenarios where data engineering plays a pivotal role, such as e-commerce analytics or real-time monitoring.
Data engineering lifecycle | The stages, from data ingestion to analytics, encompassing integration, transformation, warehousing, and maintenance.
Data pipelines | A visual flow of the entire data engineering process, highlighting how data moves through each stage.
Batch vs. stream processing | Distinguishing between processing data in large sets (batches) versus real-time (stream) processing.
Data engineering best practices | Established methods and strategies in data engineering to ensure data integrity, efficiency, and security.
Data engineering vs. artificial intelligence | Differentiating the process of preparing data for AI applications from using AI to enhance data engineering tasks.

What is data engineering?

Data engineering is the process of designing, building, and maintaining systems within a business that enable it to derive meaningful insights from operational data. In an era where data is frequently likened to oil or gold, data engineering is the refining process that turns raw data into potent fuel for innovation and strategy.

Data engineering uses various tools, techniques, and best practices to achieve its end goals. Data is collected from diverse sources: human-generated forms; documents, images, and videos; transaction logs; IoT systems; geolocation and tracking data; and application logs and events. The resulting data falls into three broad categories.

  • Structured data organized in databases with a clear schema, often in tabular formats like SQL databases.
  • Unstructured data like images, videos, emails, and text documents that cannot fit into schemas.
  • Semi-structured data that includes both structured and unstructured elements.

Each dataset and its use case for analysis requires a different strategy. For example, some data types are processed infrequently in batches, while others are processed continuously as soon as they are generated. Sometimes, data integration is done from several sources, and all data is stored centrally for analytics. At other times, subsets of data are pulled from different sources and prepared for analytics.

Tools and frameworks like Apache Hadoop, Apache Spark™, Apache Kafka®, Airflow, Redpanda, Apache Beam®, Apache Flink®, and more exist to implement the different data engineering approaches. The diverse landscape of tools ensures flexibility, scalability, and performance, regardless of the nature or volume of data.

Overview of the layers involved in data engineering

Data engineering remains a dynamic and indispensable field in our data-centric world.


Data engineering use cases

Data engineering is required in almost all aspects of modern-day computing.

Real-time analytics

Real-time analytics offer valuable information for businesses requiring immediate insights that can drive rapid decision-making processes. It is indispensable in everything from monitoring customer engagement to tracking supply chain efficiency.

Customer 360

Data engineering enables businesses to develop comprehensive customer profiles by collating data from multiple touchpoints. This can include purchase history, online interactions, and social media engagement, helping to offer more personalized experiences.

Fraud detection

Financial, gaming, and similar applications rely on complex algorithms to detect abnormal patterns and potentially fraudulent activities. Data engineering provides the structure and pipelines to analyze vast amounts of transaction data, often in near real-time.

Health monitoring systems

In healthcare, data engineering is vital in developing systems that can aggregate and analyze patient data from various sources, such as wearable devices, electronic health records, and even genomic data for more accurate diagnoses and treatment plans.

Data migration

Transitioning data between systems, formats, or storage architectures is complex. Data engineering provides tools and methodologies to ensure smooth, lossless data migration, enabling businesses to evolve their infrastructure without data disruption.

Artificial intelligence

The era of digitization has ushered in an exponential surge in data generation. Businesses looking to harness the power of this data are increasingly turning to artificial intelligence (AI) and machine learning (ML) technologies. However, the success of AI and ML hinges predominantly on the quality and structure of data the system receives.

This has inherently magnified the importance and complexity of data engineering. AI models require timely and consistent data feeds to function optimally. Data engineering establishes the pipelines feeding these algorithms, ensuring that AI/ML models train on high-quality datasets for optimal performance.

The data engineering lifecycle

The data engineering lifecycle is one of the key fundamentals of data engineering. It focuses on the stages a data engineer controls. Undercurrents are key principles or methodologies that overlap across the stages.

Diagram of the data engineering lifecycle and key principles

Stages of the cycle

Data ingestion incorporates data from generating sources into the processing system. In the push model, the source system writes data to the desired destination; in the pull model, the ingestion system retrieves data from the source. The line separating push and pull methodologies blurs as data transits the numerous stages of a pipeline. Nevertheless, mastering data ingestion is paramount to ensuring the seamless flow and preparation of data for subsequent analytical stages.

Data transformation refines raw data through operations that enhance its quality and utility. For example, it normalizes values to a standard scale, fills gaps where data is missing, converts between data types, or applies more complex operations to extract specific features. The goal is to mold the data into a structured, standardized format primed for analytical operations.
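To make these operations concrete, here is a minimal Pandas sketch; the column names, defaults, and scaling choice are hypothetical and stand in for whatever rules a real pipeline would apply.

```python
import pandas as pd

# Hypothetical raw order records straight from ingestion
raw = pd.DataFrame({
    "order_total": ["19.99", "5.00", None, "42.50"],
    "country": ["us", "US", "de", None],
})

transformed = (
    raw.assign(
        # Convert types: numeric strings become floats, bad values become NaN
        order_total=lambda df: pd.to_numeric(df["order_total"], errors="coerce"),
        # Standardize categorical values to a single representation
        country=lambda df: df["country"].str.upper(),
    )
    # Fill gaps with simple defaults (a real pipeline would apply business rules)
    .fillna({"order_total": 0.0, "country": "UNKNOWN"})
)

# Scale the numeric field to a 0..1 range for downstream analytics or models
transformed["order_total_scaled"] = (
    transformed["order_total"] / transformed["order_total"].max()
)
print(transformed)
```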

Data serving makes processed and transformed data available for end-users, applications, or downstream processes. It delivers data in a structured and accessible manner, often through APIs. It ensures that data is timely, reliable, and accessible to support various analytical, reporting, and operational needs of an organization.

Data storage is the underlying technology that stores data through the various data engineering stages. It bridges diverse and often isolated data sources—each with its own fragmented data sets, structure, and format. Storage merges the disparate sets to offer a cohesive and consistent data view. The goal is to ensure data is reliable, available, and secure.

Key considerations

There are several key considerations or “undercurrents” applicable throughout the process. They have been elaborated in detail in the book "Fundamentals of Data Engineering" (which you can download for free). Here’s a quick overview:

Security

Data engineers prioritize security at every stage so that data is accessible only to authorized users. They adhere to the principle of least privilege as a best practice, so users access only what is necessary for their work and only for as long as they need it. Data is often encrypted both as it moves through the stages and in storage.

Data management

Data management provides frameworks that incorporate a broader perspective of data utility across the organization. It encompasses various facets like data governance, modeling, lineage, and meeting ethical and privacy considerations. The goal is to align data engineering processes with an organization's broader legal, financial, and cultural policies.

DataOps

DataOps applies principles from Agile, DevOps, and statistical process control to enhance data product quality and release efficiency. It combines people, processes, and technology for improved collaboration and rapid innovation. It fosters transparency, efficiency, and cost control at every stage.

Data architecture

Data architecture supports an organization’s long-term business goals and strategy. This involves knowing the trade-offs and making informed choices about design patterns, technologies, and tools that balance cost and innovation.

Software engineering

While data engineering has become more abstract and tool-driven, data engineers still need to write core data processing code proficiently in different frameworks and languages. They must also employ proper code-testing methodologies and may need to solve custom coding problems beyond their chosen tools, especially when managing infrastructure in cloud environments through Infrastructure as Code (IaC) frameworks.

Data engineering best practices

Navigating the data engineering world demands precision and a deep understanding of best practices. Low-quality data leads to skewed analytics, resulting in poor business decisions.

Best practice | Importance
Proactive data monitoring | Regularly checks datasets for anomalies to maintain data integrity. This includes identifying missing, duplicate, or inconsistent data entries.
Schema drift management | Detects and addresses changes in data structure, ensuring compatibility and reducing data pipeline breaks.
Continuous documentation | Manages descriptive information about data, aiding in discoverability and comprehension.
Data security measures | Controls and monitors access to data sources, enhancing security and compliance.
Version control and backups | Tracks changes to datasets over time, aiding in reproducibility and audit trails.

Proactive data monitoring

Monitoring data quality should be an ongoing, active process, not a passive one. Regularly checking datasets for anomalies ensures that issues like missing or duplicate data are identified swiftly. Implementing automated data quality checks during data ingestion and transformation is crucial. Leveraging tools that notify of discrepancies allows for immediate intervention and corrections.

A tool like Apache Griffin can be used to measure data quality across platforms in real time, providing visibility into data health. Data engineers also perform rigorous validation checks at every data ingestion point, leveraging frameworks like Apache Beam® or Deequ. An example in practice is e-commerce platforms validating email formats and phone number entries.
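As a rough illustration of an automated check at an ingestion point, here is a small Pandas sketch; the field names and rules are hypothetical stand-ins for what a framework like Deequ or Griffin would formalize.

```python
import pandas as pd

# Hypothetical batch of customer records arriving at an ingestion point
batch = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@example.com", "not-an-email", "b@example.com", None],
})

issues = []

# Duplicate keys
duplicates = batch["customer_id"].duplicated().sum()
if duplicates:
    issues.append(f"{duplicates} duplicate customer_id value(s)")

# Missing values
missing = batch["email"].isna().sum()
if missing:
    issues.append(f"{missing} record(s) missing an email")

# Invalid email format (a simple pattern, not a full RFC check)
emails = batch["email"].dropna()
invalid = (~emails.str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")).sum()
if invalid:
    issues.append(f"{invalid} record(s) with an invalid email format")

# In production these findings would feed an alerting tool rather than print
for issue in issues:
    print("data quality issue:", issue)
```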

Schema drift management

Schema drift—unexpected changes in data structure—can disrupt data pipelines or lead to incorrect data analysis. It can result from scenarios like an API update altering data fields. To handle schema drift, data engineers can:

  • Utilize dynamic schema solutions that adjust to data changes in real time.
  • Perform regular audits and validate data sources.
  • Integrate version controls for schemas, maintaining a historical record.

In a Python-based workflow using Pandas, detecting schema drift might look like the sketch below (the expected schema and column names are hypothetical).
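```python
import pandas as pd

# Hypothetical expected schema: column names mapped to expected dtypes
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "customer_email": "object",
    "order_total": "float64",
}

def detect_schema_drift(df: pd.DataFrame, expected: dict) -> list:
    """Return human-readable findings describing how df deviates from expected."""
    findings = []
    findings += [f"missing column: {c}" for c in sorted(set(expected) - set(df.columns))]
    findings += [f"unexpected column: {c}" for c in sorted(set(df.columns) - set(expected))]
    for col, dtype in expected.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            findings.append(f"dtype drift in {col}: expected {dtype}, got {df[col].dtype}")
    return findings

# Stand-in for a freshly ingested batch: order_total arrived as strings,
# and an upstream API change added a new "currency" field
incoming = pd.DataFrame({
    "order_id": [1, 2],
    "customer_email": ["a@example.com", "b@example.com"],
    "order_total": ["19.99", "5.00"],
    "currency": ["USD", "USD"],
})

for finding in detect_schema_drift(incoming, EXPECTED_SCHEMA):
    print(finding)
```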

Continuous documentation

Maintaining up-to-date documentation becomes vital with the increasing complexity of data architectures and workflows. It ensures transparency, reduces onboarding times, and aids in troubleshooting. When multiple departments intersect, such as engineers processing data for a marketing team, a well-documented process ensures trust and clarity in data interpretation for all stakeholders.

Data engineers use platforms like Confluence or GitHub Wiki to ensure comprehensive documentation for all pipelines and architectures. Making documentation a mandatory step in your data pipeline development process is one of the key fundamentals of data engineering. Use tools that allow for automated documentation updates when changes in processes or schemas occur.

Data security measures

As data sources grow in number and variety, ensuring the right people have the right access becomes crucial for both data security and efficiency. Understanding a data piece's origin and journey is critical for maintaining transparency and aiding in debugging.

Tools like Apache Atlas offer insights into data lineage—a necessity in sectors where compliance demands tracing data back to its origin. Systems like Apache Kafka® append changes as new records, a practice especially crucial in sectors like banking. Automated testing frameworks like Pytest and monitoring tools like Grafana all contribute to proactive data security.
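For example, a minimal Pytest-style data test might look like the sketch below; the dataset and assertions are hypothetical stand-ins for real pipeline outputs.

```python
# test_orders.py: minimal pytest-style data tests over a hypothetical dataset
import pandas as pd

def load_orders() -> pd.DataFrame:
    # Stand-in for reading the real pipeline output (a table, file, or topic)
    return pd.DataFrame({"order_id": [1, 2, 3], "order_total": [19.99, 5.00, 42.50]})

def test_order_ids_are_unique():
    assert load_orders()["order_id"].is_unique

def test_order_totals_are_positive():
    assert (load_orders()["order_total"] > 0).all()
```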

Some security best practices include:

  • Implement a federated access management system that centralizes data access controls.
  • Regularly review and update permissions to reflect personnel changes and evolving data usage requirements.
  • Avoid direct data edits that can corrupt data.

In a world of increasing cyber threats, data breaches like the Marriott incident of 2018 underscore the importance of encrypting sensitive data and frequent access audits to comply with regulations like GDPR.

Version control and backups

As with software development, version control in data engineering allows for tracking changes, reverting to previous states, and ensuring smooth collaboration among data engineering teams. Integrate version control systems like Git into your data engineering workflow. Regularly back up not just data but also transformation logic, configurations, and schemas.

Incorporating these best practices into daily operations bolsters data reliability and security—it elevates the value that data engineering brings to an organization. Adopting and refining these practices will position you at the forefront of the discipline, paving the way for innovative approaches and solutions in the field.

Emerging trends & challenges

As data sources multiply, the process of ingesting, processing, and transforming data becomes cumbersome. Systems must scale to avoid becoming bottlenecks. Automation tools are stepping in to streamline data engineering processes, ensuring data pipelines remain robust and efficient. Data engineers are increasingly adopting distributed data storage and processing systems like Hadoop or Spark. Netflix's adoption of a microservices architecture to manage increasing data is a testament to the importance of scalable designs.

The shift towards cloud-based storage and processing solutions has also revolutionized data engineering. Platforms like AWS, Google Cloud, and Azure offer scalable storage and high-performance computing capabilities. These platforms support the vast computational demands of data engineering algorithms and ensure data is available and consistent across global architectures.

AI's rise has paralleled the evolution of data-driven decision-making in businesses. Advanced algorithms can sift through vast datasets, identify patterns, and offer previously inscrutable insights. However, these insights are only as good as the data they're based on. The fundamentals of data engineering are evolving with AI.

Using data engineering in AI

AI applications process large amounts of visual data. For example, optical character recognition converts typed or handwritten text images into machine-encoded text. Computer vision applications train machines to interpret and understand visual data. Images and videos from different sources, resolutions, and formats need harmonization. The input images must be of sufficient quality, and data engineers often need to preprocess these images to enhance clarity. Many computer vision tasks require labeled data, demanding efficient tools for annotating vast amounts of visual data.

AI applications can also learn and process human language. For instance, they can identify hidden sentiments in content, summarize and sort documents, and translate from one language to another. These AI applications require data engineers to convert text into numerical vectors using embeddings. The resulting vectors can be extensive, demanding efficient storage solutions. Real-time applications require rapid conversion into these embeddings, challenging the processing speed of the data infrastructure. Data pipelines also have to maintain the context of textual data, which calls for infrastructure capable of handling varied linguistic structures and scripts.

Large language models (LLMs) like OpenAI's GPT series are pushing the boundaries of what's possible in natural language understanding and generation. These models, trained on extensive and diverse text corpora, require:

  • Scale: The sheer size of these models necessitates data storage and processing capabilities at a massive scale.
  • Diversity: To ensure the models understand the varied nuances of languages, data sources need to span numerous domains, languages, and contexts.
  • Quality: Incorrect or biased data can lead LLMs to produce misleading or inappropriate outputs.

Using AI for data engineering

The relationship between AI and data engineering is bidirectional. While AI depends on data engineering for quality inputs, data engineers also employ AI tools to refine and enhance their processes. The inter-dependency underscores the profound transformation businesses are undergoing. As AI continues to permeate various sectors, data engineering expectations also evolve, necessitating a continuous adaptation of skills, tools, and methodologies.


Here's a deeper dive into how AI is transforming the fundamentals of data engineering:

Automated data cleansing

AI models can learn the patterns and structures of clean data. They can automatically identify and correct anomalies or errors by comparing incoming data to known structures. This ensures that businesses operate with clean, reliable data without manual intervention, thereby increasing efficiency and reducing the risk of human error.

Predictive data storage

AI algorithms analyze the growth rate and usage patterns of stored data. By doing so, they can predict future storage requirements. This foresight allows organizations to make informed decisions about storage infrastructure investments, avoiding overprovisioning and potential storage shortages.
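A minimal sketch of this idea, assuming scikit-learn is available and using made-up monthly usage figures, is a simple trend forecast:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly storage usage (TB) for the past eight months
months = np.arange(8).reshape(-1, 1)
usage_tb = np.array([4.1, 4.6, 5.0, 5.7, 6.1, 6.9, 7.4, 8.2])

# Fit a simple trend and forecast six months ahead for capacity planning
model = LinearRegression().fit(months, usage_tb)
future = np.arange(8, 14).reshape(-1, 1)
print(model.predict(future).round(1))
```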

Anomaly detection

Machine learning models can be trained to recognize "normal" behavior within datasets. When data deviates from this norm, it's flagged as anomalous. Early detection of anomalies can warn businesses of potential system failures, security breaches, or even changing market trends. (Tip: check out this tutorial on how to build real-time anomaly detection using Redpanda and Bytewax.)
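A minimal sketch of this idea with scikit-learn's IsolationForest, on made-up transaction counts:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical hourly transaction counts; the final value is a suspicious spike
counts = np.array([120, 118, 125, 130, 122, 119, 127, 640]).reshape(-1, 1)

# The model learns what "normal" looks like; -1 flags anomalous readings
detector = IsolationForest(contamination=0.15, random_state=42)
print(detector.fit_predict(counts))
```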

Along with detecting anomalies, AI can also help with discovering and completing missing data points in a given dataset. Machine learning models can predict and fill in missing data based on patterns and relationships in previously known data. For instance, if a dataset of weather statistics had occasional missing values for temperature, an ML model could use other related parameters like humidity, pressure, and historical temperature data to estimate the missing value.
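Continuing the weather example, a hedged sketch using scikit-learn's KNNImputer on made-up readings:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical weather readings with gaps in the temperature column
readings = pd.DataFrame({
    "humidity": [0.41, 0.55, 0.62, 0.58, 0.47],
    "pressure": [1013, 1009, 1005, 1007, 1012],
    "temperature": [21.5, np.nan, 17.8, np.nan, 20.9],
})

# Estimate each missing temperature from the rows most similar in the
# other columns (in practice the features would be scaled first)
imputer = KNNImputer(n_neighbors=2)
completed = pd.DataFrame(imputer.fit_transform(readings), columns=readings.columns)
print(completed)
```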

Data categorization and tagging

NLP models can automatically categorize and tag unstructured data like text, ensuring it's stored appropriately and is easily retrievable. This automates and refines data organization, allowing businesses to derive insights faster and more accurately.
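As a lightweight sketch of automated tagging (the snippets and labels are hypothetical; real systems would use far richer models and training data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled snippets used to auto-tag incoming documents
texts = [
    "invoice due next month",
    "server outage in eu-west region",
    "quarterly revenue report attached",
    "disk failure detected on node 7",
]
labels = ["finance", "ops", "finance", "ops"]

# TF-IDF features plus a linear classifier as a lightweight tagging model
tagger = make_pipeline(TfidfVectorizer(), LogisticRegression())
tagger.fit(texts, labels)
print(tagger.predict(["overdue invoice reminder"]))
```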

Optimizing data pipelines

AI algorithms can analyze data flow through various pipelines, identifying bottlenecks or inefficiencies. By optimizing the pipelines , businesses can ensure faster data processing and lower computational costs.

Semantic data search

Rather than relying on exact keyword matches, AI-driven semantic searches understand the context and intent behind search queries, allowing users to find data based on its meaning. This provides a more intuitive and comprehensive data search experience, especially in vast data lakes.

Data lineage tracking

AI models can trace the journey of data from its source to its final destination, detailing all transformations along the way. This ensures transparency, aids in debugging, and ensures regulatory compliance.

In essence, the integration of AI into data engineering is a game-changer. As AI simplifies and enhances complex data engineering tasks, professionals can focus on strategic activities, pushing the boundaries of what's possible in data-driven innovation. The potential of this synergy is vast, promising unprecedented advancements in data efficiency, accuracy, and utility.

For those seeking to harness the power of data in today's digital age, mastering data engineering becomes not just an advantage—but a necessity. Data engineering converts data into meaningful, actionable insights that drive decision-making and innovation. The fundamentals of data engineering include its real-world applications, lifecycle, and best practices. While the principles remain the same, the technology is continuously evolving as the requirements and challenges of data applications change.

Each chapter in this guide offers a deep dive into specific facets of data engineering, allowing readers to understand and appreciate its complexities and nuances.

Institute for Data Engineering and Science

IDEaS: The New Home of Big Data Research and Solutions

Recognizing the importance of big data and high-performance computing, Georgia Tech led the development of the new Coda complex, a 21-story building with an 80,000-square-foot data center. IDEaS is an anchor tenant of the new building, located in Georgia Tech's Technology Square in Midtown Atlanta. Coda's design facilitates collaboration between researchers in various disciplines, accelerating the computing industry's transition from its computer-centric roots to its data-centric future. This endeavor entails restructuring the modern computing ecosystem around the secure and timely acquisition, distillation, storage, modeling, and analysis of data to drive decisions in all sectors of the economy.


Research Overview

Data Science is an interdisciplinary field concerned with systems, storage, software, algorithms, and applications for extracting knowledge or insights from data. Data-driven research is also commonplace in many fields of science and engineering, where direct observations (astronomy), instrumentation (sensors, DNA sequencers, electron microscopes), or simulations (molecular dynamics trajectories) generate datasets that must be analyzed with domain-specific knowledge. Recently, our ability to collect and store massive datasets, typically characterized by high volume, velocity, or variety, combined with the inadequacy of earlier techniques to handle such large data sizes, led to the coining of the term “Big Data.”

IDEAS Research Areas

Machine Learning

Underpins the transformation of data to knowledge to actionable insights. Research in unstructured and dynamic data, deep learning, data mining, and interactive machine learning advances foundations and big data applications in many domains.

High Performance Computing

Critical technology for big data analysis. High performance systems, middleware, algorithms, applications, software, and frameworks support data-driven computing at all levels.

Algorithms and Optimization

Algorithms, optimization, and statistics are laying the foundations for large-scale data analysis. Streaming and sublinear algorithms, sampling and sketching techniques, high-dimensional analysis are enabling big data analytics.

Health and Life Sciences

Big data sets abound in genomics, systems biology, and proteomics. Advances in electronic medical records, computational phenotyping, personalized genomics, and precision medicine are driving predictive, preventive, and personalized healthcare.

Materials and Manufacturing

Large-scale data sets providing a microscopic view of materials, and scalable modeling and simulation technologies, are paving the way for accelerated development of new materials.

Energy Infrastructure

Advances in sensors and the Internet of Things enable energy infrastructure monitoring. Data analytics brings unparalleled efficiencies to energy production, transmission, distribution, and utilization.

Smart Cities

Achieving efficient use of resources and services, safety, affordability, and a higher quality of life using data-based research. Internet of Things research uses big data and analytics from massive streams of real-time data and applies it to smart city initiatives.


  • Master of Science in Engineering Data Science

Data science is poised to play a vital role in research and innovation in the 21st century. Google, Facebook, Amazon, and YouTube are just a few prominent examples that highlight the increasing impact of data science on our day-to-day lives. The so-called data revolution is rapidly transforming modern society, especially in the field of engineering.

Engineering Data Science is a broad field that encompasses predictive modeling and data-driven design of engineering systems. Applications range from health sciences and environmental sciences, to materials science, manufacturing, autonomous cars, image processing, and cybersecurity.

The demand for graduates with a data science background is already high and is growing rapidly across a wide range of industries worldwide. Houston, being the energy capital of the world as well as the home of a thriving healthcare industry, is also seeing a persistent demand for a workforce well-trained in data science. To provide state-of-the-art training for a data-centric workforce, the Cullen College of Engineering offers a Master of Science in Engineering Data Science.

The Master of Science in Engineering Data Science at the University of Houston is a 10-course graduate program with both non-thesis and thesis options.

A four-year bachelor's degree in engineering or a related field, or in computer science, data science, or statistics, is required to apply for the Engineering Data Science program.

The degree plan consists of courses in three primary categories. Courses may be offered online or face-to-face in a classroom setting (a fully online M.S. in Engineering Data Science is not available at this time).

The Master of Science in Engineering Data Science is a Science, Technology, Engineering, and Mathematics (STEM) degree. The STEM OPT extension provides a 24-month period of temporary training for F-1 visa students in an approved STEM field. Please visit this link for more information.

Application

For information on admission requirements and the application process, please click here.

Application Deadlines | International | Domestic
Fall Semester | March 15 (*Priority), May 15 (Regular) | March 15 (*Priority), May 15 (Regular)
Spring Semester | September 15 | September 15

*** This program does not offer summer intake ***

Degree Plan

The MS in Engineering Data Science program requires 30 credit hours (10 courses). We offer both thesis and non-thesis options.

Core Courses

9 Credit Hours/ 3 courses  (for both thesis and non-thesis options)

Course Code | Course Name | Credit Hours
EDS 6333 | Probability and Statistics | 3
EDS 6340 | Introduction to Data Science | 3
EDS 6342 | Introduction to Machine Learning | 3

Note: Students in their first semester of the degree should enroll in these three core courses. 

Prescribed Elective Courses

9 Credit Hours/ any 3 courses of the following (for both thesis and non-thesis options)

Course Code | Course Name | Credit Hours
INDE 7397 or PETR 6397 | Big Data and Analytics or Big Data Analytics | 3
ECE 6364 | Digital Image Processing | 3
ECE 6397 | Signal Processing and Networking for Big Data Applications | 3
EDS 6344 | AI for Engineers | 3
EDS 6346 | Data Mining for Engineers | 3
EDS 6348 | Introduction to Cloud Computing | 3
ECE 6342 | Digital Signal Processing | 3
INDE 7397 | Engineering Analytics | 3
INDE 6372 | Advanced Linear Optimization | 3
EDS 6397 | Information Visualization | 3

Elective Courses 

Non-thesis option: 12 credit hours / any 4 courses of the following
Thesis option: 3 credit hours / any 1 course of the following

Course Code | Course Name | Credit Hours
BIOE 6305 | Brain Machine Interfacing | 3
BIOE 6306 | Advanced Artificial Neural Networks | 3
BIOE 6309 | Neural Interfaces | 3
BIOE 6340 | Quantitative Systems Biology & Disease | 3
BIOE 6342 | Biomedical Signal Processing | 3
BIOE 6346 | Advanced Medical Imaging | 3
BIOE 6347 | Introduction to Optical Sensing and Biophotonics | 3
BIOE 6345 | Biomedical Informatics | 3
BZAN 6354 | Database Management for Business Analytics | 3
CIVE 6393 | Geostatistics | 3
CIVE 6380 | Introduction to Geomatics/Geosensing | 3
CIVE 6382 | Lidar Systems and Applications | 3
CIS 6397 | Python for Data Analytics | 3
CHEE 6367 | Advanced Proc Control | 3
ECE 6376 | Digital Pattern Recognition | 3
ECE 6397 | Sparse Representations in Signal Processing | 3
ECE 6337 | Stochastic Processes in Signal Processing and Data Science | 3
ECE 6378 | Power System Analysis | 3
ECE 6342 | Digital Signal Processing | 3
ECE 6333 | Signal Detection and Estimation Theory | 3
ECE 6315 | Neural Computation | 3
ECE 6397 | GPU Programming | 3
ECE 6397 | High Performance Computing | 3
ECE 6325 | State-Space Control Systems | 3
INDE 6370 | Operation Research-Digital Simulation | 3
INDE 6336 | Reliability Engineering | 3
INDE 7340 | Integer Programming | 3
INDE 7342 | Nonlinear Optimization | 3
INDE 6363 | Statistical Process Control | 3
IEEM 6360 | Data Analytics for Engineering Managers | 3
MECE 6379 | Computer Methods in Mechanical Design | 3
MECE 6397 | Data Analysis Methods | 3

Courses For Thesis Option 

(9 Credit Hours: research / thesis work)

Course Code | Course Name | Credit Hours
EDS 6398 | Research Credit Hours | 3
EDS 6399 | Thesis Credit Hours | 3
ECE 7399 | Thesis Credit Hours | 3

Note:  The research/thesis credit hours for the thesis option may be taken over two or three semesters. The thesis examination committee must be approved by the Program Director prior to the defense date. The committee must consist of at least three tenure-track faculty members with at least two committee members from within the College of Engineering.

To learn more about the thesis option, or if you have an MS advisor and wish to add the thesis option to your degree plan, please contact the academic advisor at egrhpc@uh.edu.

Academic Requirements

Students must have an overall GPA of 3.0 or higher in order to graduate with an MS degree in Engineering Data Science.

Each student should assume responsibility for being familiar with the academic program requirements as stated in the current catalogs of the college, university and this website.

  • For further information on academic requirements, please review the  UH Graduate Catalog
  • For further information on student rules and regulations, please review the UH Student Handbook

Note: Students must receive a grade of C- and above to pass a course. If a student receives a grade of D+ or below, that course will not be counted towards the completion of the degree plan. However, the grade will always be counted in the calculation of the cumulative GPA. 

Tuition and Cost

The MS in Engineering Data Science is a 30-credit-hour (10-course) program. Students with full-time enrollment typically complete the program in a year and a half to two years. Here is the link to the Graduate Tuition Calculator, which will give you an estimate of the costs.

Frequently Asked Questions

If you have questions, please look at our extensive list of FAQs. If your question is not included here, please contact us at egrhpc@uh.edu.

What is the status of my application? When will I hear about the decision? All the applications will be reviewed close to the application deadline. You will be informed about the decision around that time.

Can I receive an application fee waiver? Unfortunately, we are not able to offer an application fee waiver for MS students at this time. However, we do offer a competitive Dean’s scholarship of $1,000 to qualified students for the first year. If you are awarded this scholarship, you become eligible to pay in-state tuition, which is a significant saving. The scholarship may be renewable if your cumulative GPA is 3.75 or higher after your first academic year of study. Please visit this link for more information.

How long does it take to complete the degree program? The MS in Engineering Data Science is a 30-credit-hour (10-course) program. Full-time students typically complete the program in a year and a half to two years.

What is full-time enrollment? Students will need to enroll in a minimum of 3 courses (9 credit hours) for full-time enrollment. Students will need to maintain full-time enrollment if they are an international student on a visa AND/OR they are a recipient of a scholarship.

Are there resources for internships and jobs? The Engineering Career Center hosts career fairs every semester for Engineering students, as well as for graduates for up to six months after graduation. You can view the Career Center website for more information about its employers and partners.

How can I transfer to the Engineering Data Science program? If you are already enrolled in another program at UH, at the end of this semester, you can petition to change from your current program to Engineering Data Science. Your petition will have to be approved by both programs (current program stating you are in good standing and Engineering Data Science stating you are admissible). Transfer is not guaranteed. The petition will be reviewed and the decision will be based on the grades and courses taken in the current program and your application package when applying to UH.  Petition: Graduate Student General Petition

What is the status of my I-20? The I-20 is being processed by our Graduate School. If they have questions or need any additional documents, they will contact you directly via email. Once the I-20 is complete, you can view and print it for your visa interview at ApplyWeb under your account. Should you need to contact the Graduate School, the email address is gradschool@uh.edu. Please always include your full name and PSID# in your emails. Please be advised that the Graduate School handles I-20s for all international students at UH, which may take some time. Thank you in advance for your patience.

I was offered admission but with no scholarship. When will I get the scholarship? The Dean’s scholarship is very competitive.  A student may be eligible for scholarship in their second year (or third semester), if the student maintains a GPA of 3.75 or above and based on availability of scholarship funds. If you do receive the scholarship as described above, in that semester you will be eligible for in-state tuition. 

How can I get TA/RA assistantships? Typically, we do not offer assistantships to MS students. Teaching and research assistantships are available to PhD students only. If you are interested in student employment, please visit the University Career Services website, where you can find the types of employment that are available as well as eligibility requirements.

I have a PDV hold on my account. How can I remove it to enroll in classes? According to our Graduate School’s policy, you need to submit your official transcripts/marksheets and degree certificate to the UH Graduate School in a sealed envelope with the university stamp, via a carrier such as the postal service, DHL, or FedEx. Otherwise, the Prior Degree Verification (PDV) hold will not be removed. You may bring these official documents with you and hand-deliver them to the Graduate School office upon your arrival, but you will not be able to register for courses in advance. If you wish to temporarily remove the PDV hold, a petition must be filed with the Graduate School along with an official letter from your university explaining why you cannot obtain your official documents in time. Petition: Graduate Student General Petition

I am transferring from another institution in the US. Can I transfer credits for some courses I have already taken? As per engineering graduate program policies, we can transfer up to 6 credit hours or 2 courses. Please send us a list of courses that you took at your previous institution and the grades you received for each course. Based on the course content and equivalency with courses in our engineering data science degree plan, we will make a decision about whether the course can be transferred or not. Once a course has been approved, to complete the credit transfer, you need to submit a petition. Petition: Graduate Student General Petition

Do I have to take the TOEFL? If you are an international student and English is not your native language, you also need to submit proof of English proficiency such as TOEFL or IELTS or Duolingo test result. You can upload the unofficial test result for evaluation first and ask the testing center to send the official test result to UH electronically. English Language Proficiency Requirement

Is GRE required? GRE is optional for Fall 2023.

How long should my essay be? There is no length limit for the essay; write as much as you need to describe your background, interests, experience, and other relevant information. Typically, essays run 1.5 to 2 pages.

  • Certificate in Engineering Data Science
  • High Performance Computing


Data Science


Innovation in Data Engineering and Science (IDEAS)

IDEAS will commit $60 million in resources for faculty hiring and research in the areas of data-driven scientific discovery and experimentation, design and engineering of safe, explainable and trustable autonomous systems, and data science for neuro engineering and bio-inspired computing.

IDEAS Faculty Search in Artificial Intelligence, Data Science and Machine Learning

The School of Engineering and Applied Science at the University of Pennsylvania is growing its faculty thanks to a major $750M investment in science, engineering and medicine. As part of this initiative, the  Center for Innovation in Data Engineering and Science (IDEAS)   is engaged in an aggressive hiring effort for multiple tenured or tenure-track faculty positions in Artificial Intelligence, Data Science and Machine Learning. Areas of interest include but are not limited to:

  • Foundations of AI/ML/DS : understanding the mathematical foundations of AI/ML/DS to enable the development of the next generation of data-driven methods.
  • Scientific AI/ML/DS : data-driven approaches that can transform scientific discovery and modeling of new phenomena across engineering and science.
  • Bio-inspired AI/ML/DS:  developing new paradigms for bridging the gap between human and machine learning, bio-inspired computing, and AI in health.
  • Trustworthy AI/ML/DS : design and engineering of fair, ethical, explainable, robust, safe, and trustworthy autonomous systems.

Post-Doctoral Positions

IDEAS is also engaged in hiring efforts for Post-Doctoral Positions in AI, Data Engineering, and Science.


René Vidal, Rachleff and Penn Integrates Knowledge University Professor in the Department of Electrical and Systems Engineering in the School of Engineering and Applied Science and the Department of Radiology in the Perelman School of Medicine, is IDEAS' inaugural director. Vidal's research focuses on the mathematical foundations of deep learning and its applications in computer vision and biomedical data science. In addition to bridging data engineering and science activities across all of Penn’s schools, IDEAS is actively recruiting core faculty with expertise in those respective areas.


Amy Gutmann Hall is the new home for data science at Penn, serving as a hub for infusing data-driven approaches  into every department and sparking new collaborations,  both with industry partners and the next generation of data scientists in Philadelphia’s public schools.

Affiliated Faculty:

  • Shivani Agarwal
  • Rajeev Alur
  • Dani S. Bassett
  • Chris Callison-Burch
  • Damon Centola
  • Pratik Chaudhari
  • Konstantinos Daniilidis
  • Susan B. Davidson
  • Thomas Farmer
  • Robert Ghrist
  • Justin Gottschlich
  • Sharath Chandra Guntuku
  • Andreas Haeberlen
  • Hamed Hassani
  • Brett Hemenway
  • Zachary Ives
  • Dinesh Jayaraman
  • Kevin B. Johnson
  • Michael Kearns
  • Konrad Paul Kording
  • Boon Thau Loo
  • Nikolai Matni
  • Paris Perdikaris
  • Jennifer E. Phillips-Cremins
  • Linh Thi Xuan Phan
  • Victor M. Preciado
  • Alejandro Ribeiro
  • Robert Riggleman
  • Shirin Saeedi Bidokhti
  • Saswati Sarkar
  • Camillo Jose Taylor
  • Lyle H. Ungar
  • Rakesh V. Vohra
  • Aleksandra Vojvodic
  • Duncan J. Watts


Latest News:

  • Penn Engineering Announces First Ivy League Undergraduate Degree in Artificial Intelligence
  • ‘Topping Off’ Amy Gutmann Hall
  • Penn Launches New Center for Quantum Information, Engineering, Science and Technology (Penn QUIEST)


Cornell University


Data Science in Engineering

Strategic Plan

Revolutionary advances in the ability to capture data at massive scales and to extract actionable information have had a transformative impact on the world. Technology has capitalized on these developments to fuel new industries and to reshape society in fundamental ways, accelerating scientific discovery and guiding engineering design across a broad cross-section of application domains.


A water sensor computer chip that measures moisture levels in soil and can be embedded in plant stems for accurate information on water stress, developed by Abraham Stroock, professor of chemical and biomolecular engineering (CBE). (photo by Lindsay France, Cornell University)

Cornell Engineering is uniquely positioned to educate the trans-disciplinary engineers of the future by providing a data science-infused perspective across the engineering landscape for a truly 21st-century education. Additionally, radical collaborations with Cornell Tech, Weill Cornell Medicine, and other colleges will exploit the interplay between data-driven decisions and other disciplines to propel advances in autonomous vehicles, materials discovery, medicine and biomedical devices.

Duffield Hall in the snow.

Photo by Jason Koski, Cornell University.


Research Method


Research Data – Types, Methods and Examples


Research Data

Research data refers to any information or evidence gathered through systematic investigation or experimentation to support or refute a hypothesis or answer a research question.

It includes both primary and secondary data, and can be in various formats such as numerical, textual, audiovisual, or visual. Research data plays a critical role in scientific inquiry and is often subject to rigorous analysis, interpretation, and dissemination to advance knowledge and inform decision-making.

Types of Research Data

There are generally four types of research data:

Quantitative Data

This type of data involves the collection and analysis of numerical data. It is often gathered through surveys, experiments, or other types of structured data collection methods. Quantitative data can be analyzed using statistical techniques to identify patterns or relationships in the data.

Qualitative Data

This type of data is non-numerical and often involves the collection and analysis of words, images, or sounds. It is often gathered through methods such as interviews, focus groups, or observation. Qualitative data can be analyzed using techniques such as content analysis, thematic analysis, or discourse analysis.

Primary Data

This type of data is collected by the researcher directly from the source. It can include data gathered through surveys, experiments, interviews, or observation. Primary data is often used to answer specific research questions or to test hypotheses.

Secondary Data

This type of data is collected by someone other than the researcher. It can include data from sources such as government reports, academic journals, or industry publications. Secondary data is often used to supplement or support primary data or to provide context for a research project.

Research Data Formats

There are several formats in which research data can be collected and stored. Some common formats include:

  • Text : This format includes any type of written data, such as interview transcripts, survey responses, or open-ended questionnaire answers.
  • Numeric : This format includes any data that can be expressed as numerical values, such as measurements or counts.
  • Audio : This format includes any recorded data in an audio form, such as interviews or focus group discussions.
  • Video : This format includes any recorded data in a video form, such as observations of behavior or experimental procedures.
  • Images : This format includes any visual data, such as photographs, drawings, or scans of documents.
  • Mixed media: This format includes any combination of the above formats, such as a survey response that includes both text and numeric data, or an observation study that includes both video and audio recordings.
  • Sensor Data: This format includes data collected from various sensors or devices, such as GPS, accelerometers, or heart rate monitors.
  • Social Media Data: This format includes data collected from social media platforms, such as tweets, posts, or comments.
  • Geographic Information System (GIS) Data: This format includes data with a spatial component, such as maps or satellite imagery.
  • Machine-Readable Data : This format includes data that can be read and processed by machines, such as data in XML or JSON format.
  • Metadata: This format includes data that describes other data, such as information about the source, format, or content of a dataset.

Data Collection Methods

Some common research data collection methods include:

  • Surveys : Surveys involve asking participants to answer a series of questions about a particular topic. Surveys can be conducted online, over the phone, or in person.
  • Interviews : Interviews involve asking participants a series of open-ended questions in order to gather detailed information about their experiences or perspectives. Interviews can be conducted in person, over the phone, or via video conferencing.
  • Focus groups: Focus groups involve bringing together a small group of participants to discuss a particular topic or issue in depth. The group is typically led by a moderator who asks questions and encourages discussion among the participants.
  • Observations : Observations involve watching and recording behaviors or events as they naturally occur. Observations can be conducted in person or through the use of video or audio recordings.
  • Experiments : Experiments involve manipulating one or more variables in order to measure the effect on an outcome of interest. Experiments can be conducted in a laboratory or in the field.
  • Case studies: Case studies involve conducting an in-depth analysis of a particular individual, group, or organization. Case studies typically involve gathering data from multiple sources, including interviews, observations, and document analysis.
  • Secondary data analysis: Secondary data analysis involves analyzing existing data that was collected for another purpose. Examples of secondary data sources include government records, academic research studies, and market research reports.

Analysis Methods

Some common research data analysis methods include:

  • Descriptive statistics: Descriptive statistics involve summarizing and describing the main features of a dataset, such as the mean, median, and standard deviation. Descriptive statistics are often used to provide an initial overview of the data.
  • Inferential statistics: Inferential statistics involve using statistical techniques to draw conclusions about a population based on a sample of data. Inferential statistics are often used to test hypotheses and determine the statistical significance of relationships between variables.
  • Content analysis : Content analysis involves analyzing the content of text, audio, or video data to identify patterns, themes, or other meaningful features. Content analysis is often used in qualitative research to analyze open-ended survey responses, interviews, or other types of text data.
  • Discourse analysis: Discourse analysis involves analyzing the language used in text, audio, or video data to understand how meaning is constructed and communicated. Discourse analysis is often used in qualitative research to analyze interviews, focus group discussions, or other types of text data.
  • Grounded theory : Grounded theory involves developing a theory or model based on an analysis of qualitative data. Grounded theory is often used in exploratory research to generate new insights and hypotheses.
  • Network analysis: Network analysis involves analyzing the relationships between entities, such as individuals or organizations, in a network. Network analysis is often used in social network analysis to understand the structure and dynamics of social networks.
  • Structural equation modeling: Structural equation modeling involves using statistical techniques to test complex models that include multiple variables and relationships. Structural equation modeling is often used in social science research to test theories about the relationships between variables.

Purpose of Research Data

Research data serves several important purposes, including:

  • Supporting scientific discoveries : Research data provides the basis for scientific discoveries and innovations. Researchers use data to test hypotheses, develop new theories, and advance scientific knowledge in their field.
  • Validating research findings: Research data provides the evidence necessary to validate research findings. By analyzing and interpreting data, researchers can determine the statistical significance of relationships between variables and draw conclusions about the research question.
  • Informing policy decisions: Research data can be used to inform policy decisions by providing evidence about the effectiveness of different policies or interventions. Policymakers can use data to make informed decisions about how to allocate resources and address social or economic challenges.
  • Promoting transparency and accountability: Research data promotes transparency and accountability by allowing other researchers to verify and replicate research findings. Data sharing also promotes transparency by allowing others to examine the methods used to collect and analyze data.
  • Supporting education and training: Research data can be used to support education and training by providing examples of research methods, data analysis techniques, and research findings. Students and researchers can use data to learn new research skills and to develop their own research projects.

Applications of Research Data

Research data has numerous applications across various fields, including social sciences, natural sciences, engineering, and health sciences. The applications of research data can be broadly classified into the following categories:

  • Academic research: Research data is widely used in academic research to test hypotheses, develop new theories, and advance scientific knowledge. Researchers use data to explore complex relationships between variables, identify patterns, and make predictions.
  • Business and industry: Research data is used in business and industry to make informed decisions about product development, marketing, and customer engagement. Data analysis techniques such as market research, customer analytics, and financial analysis are widely used to gain insights and inform strategic decision-making.
  • Healthcare: Research data is used in healthcare to improve patient outcomes, develop new treatments, and identify health risks. Researchers use data to analyze health trends, track disease outbreaks, and develop evidence-based treatment protocols.
  • Education : Research data is used in education to improve teaching and learning outcomes. Data analysis techniques such as assessments, surveys, and evaluations are used to measure student progress, evaluate program effectiveness, and inform policy decisions.
  • Government and public policy: Research data is used in government and public policy to inform decision-making and policy development. Data analysis techniques such as demographic analysis, cost-benefit analysis, and impact evaluation are widely used to evaluate policy effectiveness, identify social or economic challenges, and develop evidence-based policy solutions.
  • Environmental management: Research data is used in environmental management to monitor environmental conditions, track changes, and identify emerging threats. Data analysis techniques such as spatial analysis, remote sensing, and modeling are used to map environmental features, monitor ecosystem health, and inform policy decisions.

Advantages of Research Data

Research data has numerous advantages, including:

  • Empirical evidence: Research data provides empirical evidence that can be used to support or refute theories, test hypotheses, and inform decision-making. This evidence-based approach helps to ensure that decisions are based on objective, measurable data rather than subjective opinions or assumptions.
  • Accuracy and reliability: Research data is typically collected using rigorous scientific methods and protocols, which helps to ensure its accuracy and reliability. Data can be validated and verified using statistical methods, which further enhances its credibility.
  • Replicability: Research data can be replicated and validated by other researchers, which helps to promote transparency and accountability in research. By making data available for others to analyze and interpret, researchers can ensure that their findings are robust and reliable.
  • Insights and discoveries: Research data can provide insights into complex relationships between variables, identify patterns and trends, and reveal new discoveries. These insights can lead to the development of new theories, treatments, and interventions that can improve outcomes in various fields.
  • Informed decision-making: Research data can inform decision-making in a range of fields, including healthcare, business, education, and public policy. Data analysis techniques can be used to identify trends, evaluate the effectiveness of interventions, and inform policy decisions.
  • Efficiency and cost-effectiveness: Research data can help to improve efficiency and cost-effectiveness by identifying areas where resources can be directed most effectively. By using data to identify the most promising approaches or interventions, researchers can optimize the use of resources and improve outcomes.

Limitations of Research Data

Research data has several limitations that researchers should be aware of, including:

  • Bias and subjectivity: Research data can be influenced by biases and subjectivity, which can affect the accuracy and reliability of the data. Researchers must take steps to minimize bias and subjectivity in data collection and analysis.
  • Incomplete data: Research data can be incomplete or missing, which can affect the validity of the findings. Researchers must ensure that data is complete and representative to ensure that their findings are reliable.
  • Limited scope: Research data may be limited in scope, which can limit the generalizability of the findings. Researchers must carefully consider the scope of their research and ensure that their findings are applicable to the broader population.
  • Data quality: Research data can be affected by issues such as measurement error, data entry errors, and missing data, which can affect the quality of the data. Researchers must ensure that data is collected and analyzed using rigorous methods to minimize these issues.
  • Ethical concerns: Research data can raise ethical concerns, particularly when it involves human subjects. Researchers must ensure that their research complies with ethical standards and protects the rights and privacy of human subjects.
  • Data security: Research data must be protected to prevent unauthorized access or use. Researchers must ensure that data is stored and transmitted securely to protect the confidentiality and integrity of the data.


How to Build a Portfolio of Data Engineering Projects

In the realm of data-driven insights and innovation, data engineering emerges as a cornerstone discipline, empowering organizations to harness the true potential of their information assets. Aspiring data engineers, eager to establish a strong foothold in this competitive landscape, often grapple with the challenge of building a compelling portfolio that showcases their skills and expertise. This comprehensive guide aims to illuminate the path to success, providing actionable steps and invaluable insights to curate a data engineering project portfolio that stands out from the crowd.

Understanding the Significance of a Data Engineering Portfolio

In essence, a data engineering portfolio serves as a tangible representation of an individual’s capabilities, demonstrating their proficiency in designing, constructing, and maintaining the intricate pipelines that facilitate the seamless flow of data within an organization. It goes beyond mere technical prowess, encapsulating an individual’s problem-solving abilities, creativity, and their capacity to translate complex business requirements into robust and scalable data solutions. A well-crafted portfolio acts as a powerful testament to one’s dedication to the field, showcasing a commitment to continuous learning and a passion for unraveling the intricacies of data engineering.

Laying the Foundation: Essential Skills and Knowledge

Before delving into the intricacies of project selection and execution, it is imperative to cultivate a strong foundation in the core principles and practices that underpin data engineering. This entails acquiring proficiency in various programming languages, such as Python, Java, or Scala, which serve as the building blocks for crafting data pipelines. A solid grasp of database management systems, including both relational and NoSQL databases, is equally crucial, as data engineers frequently interact with these repositories to store, retrieve, and manipulate data.

Furthermore, familiarity with distributed computing frameworks, such as Apache Spark or Apache Hadoop, empowers data engineers to process massive volumes of data efficiently, leveraging the power of parallel processing. Cloud computing platforms, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP), have also become indispensable tools in the modern data engineering landscape, enabling seamless scalability and cost-effective resource management.
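
To make the distributed-processing idea concrete, the short PySpark sketch below computes a daily count of events by type; the input path and the event_time and event_type fields are hypothetical placeholders, so treat it as an illustrative pattern under those assumptions rather than a prescribed schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical input location and field names, used for illustration only.
EVENTS_PATH = "s3a://example-bucket/events/*.json"

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

# Read semi-structured event records; Spark distributes the files across executors.
events = spark.read.json(EVENTS_PATH)

# Aggregate in parallel: events per day and per type.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("event_date", "event_type")
    .count()
    .orderBy("event_date")
)

daily_counts.show(truncate=False)
spark.stop()
```

The same job reads megabytes on a laptop or terabytes on a cluster without changing the code, which is the practical appeal of these frameworks for portfolio projects.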

Project Ideation: Finding Inspiration and Identifying Opportunities

Once equipped with the essential skills and knowledge, aspiring data engineers can embark on the exciting journey of project ideation. Inspiration can be drawn from a multitude of sources, including real-world challenges encountered in various industries, open-source datasets available on platforms like Kaggle or UCI Machine Learning Repository, or even personal interests and passions. It is crucial to choose projects that align with one’s career aspirations and showcase a diverse range of skills, encompassing data ingestion, transformation, storage, and visualization.

Project Execution: Translating Ideas into Reality

The execution phase of a data engineering project demands meticulous planning, careful consideration of architectural choices, and effective communication with stakeholders. It typically involves the following key stages:

  • Data Collection and Ingestion: Identifying relevant data sources, extracting data from various formats (structured, semi-structured, or unstructured), and loading it into a suitable storage system.
  • Data Cleaning and Transformation: Addressing data quality issues, handling missing values, transforming data into a consistent format, and performing necessary aggregations or calculations (the first two stages are sketched in code after this list).
  • Data Storage and Management: Designing an appropriate data storage architecture, choosing a database or data lake solution, and implementing efficient indexing and partitioning strategies.
  • Data Processing and Analysis: Leveraging distributed computing frameworks or cloud-based services to process and analyze data at scale, applying machine learning algorithms or statistical models to extract insights.
  • Data Visualization and Reporting: Creating interactive dashboards or reports to present findings in a visually appealing and easily understandable manner, enabling stakeholders to make informed decisions.
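
As a concrete, if simplified, illustration of the first two stages, the Python sketch below ingests a CSV extract with pandas and applies routine cleaning steps; the raw_orders.csv file and its order_date column are hypothetical, so adapt the logic to your own sources.

```python
import pandas as pd

# Hypothetical source file and column name, used for illustration only.
SOURCE_CSV = "raw_orders.csv"


def ingest(path: str) -> pd.DataFrame:
    """Load raw records from a CSV export into a DataFrame."""
    return pd.read_csv(path)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply routine quality fixes: normalize column names, drop duplicates,
    fill missing numeric values, and parse timestamps."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.drop_duplicates()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    if "order_date" in df.columns:
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    return df


if __name__ == "__main__":
    tidy = clean(ingest(SOURCE_CSV))
    print(tidy.head())
```

In a real project the same steps would typically sit inside an orchestrated, scheduled workflow and be validated before anything is written to the storage layer.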

Throughout the execution phase, it is imperative to document the project meticulously, capturing key decisions, challenges encountered, and lessons learned. This documentation serves as a valuable resource for future reference and enhances the portfolio’s overall credibility.

Showcase Your Work: Building an Impressive Portfolio

Once the project is complete, it’s time to showcase your accomplishments in a compelling portfolio. A well-structured portfolio typically comprises the following elements:

  • Project Summary: A concise overview of the project, highlighting the problem statement, objectives, and key outcomes.
  • Technical Details: A detailed description of the technologies, tools, and frameworks employed in the project, showcasing your technical expertise.
  • Data Flow Diagram: A visual representation of the data pipeline, illustrating the various stages of data ingestion, transformation, storage, and analysis.
  • Code Samples: Snippets of code that demonstrate your proficiency in programming languages and your ability to write clean and efficient code.
  • Results and Insights: A summary of the key findings and insights derived from the project, emphasizing the impact and value generated.
  • Lessons Learned: A reflective analysis of the challenges encountered during the project and the lessons learned, showcasing your capacity for continuous improvement.

Beyond the Basics: Enhancing Your Portfolio

To make your portfolio stand out from the crowd, consider incorporating the following elements:

  • Real-World Impact: Quantify the impact of your project, highlighting specific metrics or KPIs that demonstrate tangible business value.
  • Collaboration and Teamwork: If you worked on a team project, emphasize your ability to collaborate effectively with others, highlighting your communication and interpersonal skills.
  • Open-Source Contributions: If you have contributed to open-source projects or communities, showcase your commitment to knowledge sharing and collaboration.
  • Blogging or Technical Writing: Share your insights and experiences through blog posts or technical articles, demonstrating your thought leadership and communication skills.
  • Personal Branding: Cultivate a strong online presence through platforms like LinkedIn or GitHub, showcasing your professional profile and connecting with other data engineers.

Conclusion: Embrace the Journey of Continuous Learning

Building a compelling data engineering portfolio is an ongoing process that requires dedication, perseverance, and a passion for continuous learning. As the field of data engineering evolves at a rapid pace, it is crucial to stay abreast of emerging technologies and industry trends, constantly expanding your skill set and knowledge base.

By following the steps outlined in this guide, aspiring data engineers can embark on a fulfilling journey of project ideation, execution, and presentation, culminating in a portfolio that reflects their true potential and opens doors to exciting career opportunities. Remember, the pursuit of excellence in data engineering is a lifelong endeavor, and each project undertaken contributes to a tapestry of knowledge and expertise that will shape your future success.

Examples of Data Engineering Projects for Your Portfolio

To spark your creativity and provide concrete examples, here are a few project ideas that can be tailored to your specific interests and skill level:

  • Building a Data Warehouse: Design and implement a data warehouse to consolidate data from disparate sources, enabling efficient reporting and analytics.
  • Real-Time Data Streaming: Develop a real-time data streaming pipeline using Apache Kafka or similar technologies to process and analyze data as it arrives.
  • Machine Learning Model Deployment: Build and deploy a machine learning model to production, utilizing cloud-based services or containerization technologies.
  • Data Lake Implementation: Create a data lake using cloud storage and data cataloging tools to store and manage large volumes of raw data.
  • ETL Pipeline Development: Design and implement an ETL (Extract, Transform, Load) pipeline to move data between different systems and transform it into a usable format (a minimal sketch follows this list).
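
As a starting point for the ETL idea, here is a minimal Python sketch that chains extract, transform, and load steps using pandas and SQLite; the sales_raw.csv source and its sale_date, quantity, and unit_price columns are assumptions made purely for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical source file, target database, and column names for illustration only.
SOURCE_CSV = "sales_raw.csv"
TARGET_DB = "warehouse.db"


def extract(path: str) -> pd.DataFrame:
    """Pull raw records from the source system (here, a CSV export)."""
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize types, derive a revenue column, and drop unusable rows."""
    df = df.copy()
    df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
    df["revenue"] = df["quantity"] * df["unit_price"]
    return df.dropna(subset=["sale_date"])


def load(df: pd.DataFrame, db_path: str) -> None:
    """Write the cleaned records into a target table for downstream queries."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("sales", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract(SOURCE_CSV)), TARGET_DB)
```

Swapping the CSV for an API or message queue, and SQLite for a cloud warehouse, turns the same skeleton into a production-shaped pipeline worth writing up in your portfolio.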

Remember, these are just a few examples, and the possibilities are endless. The key is to choose projects that challenge you, showcase your skills, and align with your career goals.

Additional Tips for Building a Successful Data Engineering Portfolio

  • Start Small: Don’t feel pressured to tackle complex projects right away. Start with smaller, manageable projects that allow you to build confidence and gain experience.
  • Seek Mentorship: Connect with experienced data engineers who can offer guidance and support as you build your portfolio.
  • Network with Peers: Attend industry events, join online communities, and connect with other data engineers to share knowledge and learn from each other.
  • Stay Curious: Embrace a mindset of continuous learning, explore new technologies, and experiment with different approaches to data engineering challenges.
  • Be Passionate: Let your passion for data engineering shine through in your portfolio. Showcase your enthusiasm for the field and your commitment to excellence.

By following these guidelines and infusing your portfolio with your unique personality and aspirations, you can create a compelling narrative that captures the attention of potential employers and sets you on the path to a rewarding career in data engineering.


Active funding opportunity

NSF 23-614: Smart Health and Biomedical Research in the Era of Artificial Intelligence and Advanced Data Science (SCH)

  • Posted: August 14, 2023
  • Replaces: NSF 21-530

Program Solicitation NSF 23-614



Directorate for Computer and Information Science and Engineering
     Division of Information and Intelligent Systems
     Division of Computer and Network Systems
     Division of Computing and Communication Foundations

Directorate for Engineering
     Division of Civil, Mechanical and Manufacturing Innovation

Directorate for Mathematical and Physical Sciences
     Division of Mathematical Sciences

Directorate for Social, Behavioral and Economic Sciences
     Division of Behavioral and Cognitive Sciences



National Institutes of Health

    Office of Data Science Strategy

    Office of Behavioral and Social Sciences Research

    National Center for Complementary and Integrative Health

    National Cancer Institute

    National Eye Institute

    National Institute on Aging

    National Institute of Allergy and Infectious Diseases

    National Institute of Arthritis and Musculoskeletal and Skin Disease

    National Institute of Biomedical Imaging and Bioengineering

    Eunice Kennedy Shriver National Institute of Child Health and Human Development

    National Institute on Drug Abuse

    National Institute on Deafness and Other Communication Disorders

    National Institute of Dental and Craniofacial Research

    National Institute of Diabetes and Digestive and Kidney Diseases

    National Institute of Environmental Health Sciences

    National Institute of Mental Health

    National Institute of Neurological Disorders and Stroke

    National Institute of Nursing Research

    National Library of Medicine

    National Heart, Lung, and Blood Institute

    National Institute on Minority Health and Health Disparities

    National Center for Advancing Translational Sciences

    Office of Disease Prevention

    Office of Nutrition Research

    Office of Research on Women's Health

    Office of Science Policy

    Sexual and Gender Minority Research Office

    Office of AIDS Research

Full Proposal Deadline(s) (due by 5 p.m. submitting organization’s local time):

     November 09, 2023

     October 03, 2024

     October 3, Annually Thereafter

Important Information And Revision Notes

  • Theme areas have been revised;
  • Changes have been made in participating National Institutes of Health’s Institutes and Centers; and,
  • Proposal deadlines have been revised.

Any proposal submitted in response to this solicitation should be submitted in accordance with the NSF Proposal & Award Policies & Procedures Guide (PAPPG) that is in effect for the relevant due date to which the proposal is being submitted. The NSF PAPPG is regularly revised and it is the responsibility of the proposer to ensure that the proposal meets the requirements specified in this solicitation and the applicable version of the PAPPG. Submitting a proposal prior to a specified deadline does not negate this requirement.

Summary Of Program Requirements

General Information

Program Title:

Smart Health and Biomedical Research in the Era of Artificial Intelligence and Advanced Data Science (SCH)
The purpose of this interagency program solicitation is to support the development of transformative high-risk, high-reward advances in computer and information science, engineering, mathematics, statistics, behavioral and/or cognitive research to address pressing questions in the biomedical and public health communities. Transformations hinge on scientific and engineering innovations by interdisciplinary teams that develop novel methods to intuitively and intelligently collect, sense, connect, analyze and interpret data from individuals, devices and systems to enable discovery and optimize health. Solutions to these complex biomedical or public health problems demand the formation of interdisciplinary teams that are ready to address these issues, while advancing fundamental science and engineering.

Cognizant Program Officer(s):

Please note that the following information is current at the time of publishing. See program website for any updates to the points of contact.

Goli Yamini, Directorate for Computer and Information Science and Engineering, Division of Information and Intelligent Systems, telephone: 703-292-5367, email: [email protected]

Thomas Martin, Directorate for Computer and Information Science and Engineering, Division of Information and Intelligent Systems, telephone: 703-292-2170, email: [email protected]

Steven J. Breckler, Social, Behavioral and Economic Sciences Directorate, Division of Behavioral and Cognitive Sciences, telephone: 703-292-7369, email: [email protected]

Yulia Gel, Mathematics and Physical Sciences Directorate, Division of Mathematical Sciences, telephone: 703-292-7888, email: [email protected]

Georgia-Ann Klutke, Directorate for Engineering, Division of Civil, Mechanical and Manufacturing Innovation, telephone: 703-292-2443, email: [email protected]

Tatiana Korelsky, Directorate for Computer and Information Science and Engineering, Division of Information and Intelligent Systems, telephone: 703-292-8930, email: [email protected]

Shivani Sharma, Directorate for Engineering, Division of Civil, Mechanical and Manufacturing Innovation, telephone: 703-292-4204, email: [email protected]

Vishal Sharma, Directorate for Computer and Information Science and Engineering, Division of Computer and Network Systems, telephone: 703-292-8950, email: [email protected]

Sylvia Spengler, Directorate for Computer and Information Science and Engineering, Division of Information and Intelligent Systems, telephone: 703-292-8930, email: [email protected]

Betty K. Tuller, Social, Behavioral and Economic Sciences Directorate, Division of Behavioral and Cognitive Sciences, telephone: 703-292-7238, email: [email protected]

Christopher C. Yang, Directorate for Computer and Information Science and Engineering, Division of Information and Intelligent Systems, telephone: 703-292-8111, email: [email protected]

James E. Fowler, Computer and Information Science and Engineering, Computing and Communication Foundations, telephone: 703-292-8910, email: [email protected]

Dana Wolff-Hughes, National Cancer Institute (NCI), telephone: 240-620-0673, email: [email protected]

James Gao, National Eye Institute (NEI), NIH, telephone: 301-594-6074, email: [email protected]

Julia Berzhanskaya, National Heart, Lung, and Blood Institute (NHLBI), NIH, telephone: 301-443-3707, email: [email protected]

Lyndon Joseph, National Institute on Aging (NIA), telephone: 301-496-6761, email: [email protected]

Aron Marquitz, National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), telephone: 301-435-1240, email: [email protected]

Qi Duan, National Institute of Biomedical Imaging and Bioengineering (NIBIB), NIH, telephone: 301-827-4674, email: [email protected]

Samantha Calabrese, Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), NIH, telephone: 301-827-7568, email: [email protected]

Susan Wright, National Institute on Drug Abuse (NIDA), NIH, telephone: 301-402-6683, email: [email protected]

Roger Miller, National Institute on Deafness and Other Communication Disorders (NIDCD), NIH, telephone: 301-402-3458, email: [email protected]

Noffisat Oki, National Institute of Dental and Craniofacial Research (NIDCR), telephone: 301-402-6778, email: [email protected]

Xujing Wang, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), NIH, telephone: 301-451-2682, email: [email protected]

Christopher Duncan, National Institute of Environmental Health Sciences (NIEHS), NIH, telephone: 984-287-3256, email: [email protected]

David Leitman, National Institute of Mental Health (NIMH), telephone: 301-827-6131, email: [email protected]

Deborah Duran, National Institute on Minority Health and Health Disparities (NIMHD), NIH, telephone: 301-594-9809, email: [email protected]

Leslie C. Osborne, National Institute of Neurological Disorders and Stroke (NINDS), telephone: 240-921-135, email: [email protected]

Bill Duval, National Institute of Nursing Research (NINR), NIH, telephone: 301-435-0380, email: [email protected]

Goutham Reddy, National Library of Medicine (NLM), telephone: 301-827-6728, email: [email protected]

Emrin Horguslouglu, National Center for Complementary and Integrative Health (NCCIH), telephone: 240-383-5302, email: [email protected]

Christopher Hartshorn, National Center for Advancing Translational Sciences (NCATS), telephone: 301-402-0264, email: [email protected]

Joseph Monaco, Brain Research Through Advancing Innovative Neurotechnologies (BRAIN) Initiative, telephone: 301-402-3823, email: [email protected]

Robert Cregg, Office of AIDS Research (OAR), telephone: 301-761-7557, email: [email protected]

Jacqueline Lloyd, Office of Disease Prevention (ODP), telephone: 301-827-5559, email: [email protected]

Fenglou Mao, Office of Data Science Strategy (ODSS), Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI), telephone: 301-451-9389, email: [email protected]

Nicholas Jury, Office of Nutrition Research (ONR), telephone: 301-827-1234, email: [email protected]

Jamie White, Office of Research on Women’s Health (ORWH), telephone: 301-496-9200, email: [email protected]

Adam Berger, Office of Science Policy (OSP), telephone: 301-827-9676, email: [email protected]

Christopher Barnhart, Sexual and Gender Minority Research Office (SGMRO), telephone: 301-594-8983, email: [email protected]

  • 47.041 --- Engineering
  • 47.049 --- Mathematical and Physical Sciences
  • 47.070 --- Computer and Information Science and Engineering
  • 47.075 --- Social Behavioral and Economic Sciences
  • 93.173 --- National Institute on Deafness and Other Communication Disorders
  • 93.213 --- National Center for Complementary and Integrative Health
  • 93.242 --- National Institute of Mental Health
  • 93.279 --- National Institute on Drug Abuse
  • 93.286 --- National Institute of Biomedical Imaging and Bioengineering
  • 93.310 --- NIH Office of Data Science
  • 93.350 --- National Center for Advancing Translational Sciences
  • 93.361 --- National Institute of Nursing Research
  • 93.396 --- National Cancer Institute
  • 93.846 --- National Institute of Arthritis and Musculoskeletal and Skin Disease
  • 93.847 --- National Institute of Diabetes and Digestive and Kidney Diseases
  • 93.853 --- National Institute of Neurological Disorders and Stroke
  • 93.865 --- Eunice Kennedy Shriver National Institute of Child Health and Human Development
  • 93.866 --- National Institute on Aging
  • 93.867 --- National Eye Institute
  • 93.879 --- National Library of Medicine

Award Information

Anticipated Type of Award: Standard Grant or Continuing Grant or Cooperative Agreement

Estimated Number of Awards: 10 to 16 per year, subject to the availability of funds.

Projects will be funded for up to four years for a total of $1,200,000 ($300,000 per year).

Anticipated Funding Amount: $15,000,000 to $20,000,000 will be invested in proposals submitted to this solicitation in each year of the solicitation, subject to the availability of funds and the quality of the proposals received.

Eligibility Information

Who May Submit Proposals:

Proposals may only be submitted by the following:

  • Institutions of Higher Education (IHEs): Two- and four-year IHEs (including community colleges) accredited in, and having a campus located in, the US, acting on behalf of their faculty members. Special Instructions for International Branch Campuses of US IHEs: If the proposal includes funding to be provided to an international branch campus of a US institution of higher education (including through use of subawards and consultant arrangements), the proposer must explain the benefit(s) to the project of performance at the international branch campus, and justify why the project activities cannot be performed at the US campus.
  • Non-profit, non-academic organizations: Independent museums, observatories, research laboratories, professional societies and similar organizations located in the U.S. that are directly associated with educational or research activities.

Who May Serve as PI:

There are no restrictions or limits.

Limit on Number of Proposals per Organization:

Limit on Number of Proposals per PI or co-PI: 2

In each annual competition, an investigator may participate as Principal Investigator (PI), co-Principal Investigator (co-PI), Project Director (PD), Senior/Key Personnel or Consultant in no more than two proposals submitted in response to each annual deadline in this solicitation. These eligibility constraints will be strictly enforced in order to treat everyone fairly and consistently. In the event that an individual exceeds this limit, proposals received within the limit will be accepted based on earliest date and time of proposal submission (i.e., the first two proposals received will be accepted, and the remainder will be returned without review). No exceptions will be made.

Proposals submitted in response to this solicitation may not duplicate or be substantially similar to other proposals concurrently under consideration by NSF or NIH programs or study sections. Duplicate or substantially similar proposals will be returned without review. NIH will not accept any application that is essentially the same as one already reviewed within the past 37 months (as described in the NIH Grants Policy Statement), except for submission:

  • To an NIH Requests for Applications (RFA) of an application that was submitted previously as an investigator-initiated application but not paid;
  • Of an NIH investigator-initiated application that was originally submitted to an RFA but not paid; or
  • Of an NIH application with a changed grant activity code.

Proposal Preparation and Submission Instructions

A. Proposal Preparation Instructions

  • Letters of Intent: Not required
  • Preliminary Proposal Submission: Not required

Full Proposals:

  • Full Proposals submitted via Research.gov: NSF Proposal and Award Policies and Procedures Guide (PAPPG) guidelines apply. The complete text of the PAPPG is available electronically on the NSF website at: https://www.nsf.gov/publications/pub_summ.jsp?ods_key=pappg .
  • Full Proposals submitted via Grants.gov: NSF Grants.gov Application Guide: A Guide for the Preparation and Submission of NSF Applications via Grants.gov guidelines apply (Note: The NSF Grants.gov Application Guide is available on the Grants.gov website and on the NSF website at: https://www.nsf.gov/publications/pub_summ.jsp?ods_key=grantsgovguide ).

B. Budgetary Information

Cost Sharing Requirements:

Inclusion of voluntary committed cost sharing is prohibited.

Indirect Cost (F&A) Limitations:

For NSF, Proposal & Award Policies & Procedures Guide (PAPPG) Guidelines apply.

For NIH, indirect costs on foreign subawards/subcontracts will be limited to eight (8) percent.

Other Budgetary Limitations:

Other budgetary limitations apply. Please see the full text of this solicitation for further information.

C. Due Dates

Proposal Review Information Criteria

Merit Review Criteria:

National Science Board approved criteria. Additional merit review criteria apply. Please see the full text of this solicitation for further information.

Award Administration Information

Award Conditions:

Additional award conditions apply. Please see the full text of this solicitation for further information.

Reporting Requirements:

Additional reporting requirements apply. Please see the full text of this solicitation for further information.

I. Introduction

The need for a significant transformation in medical, public health and healthcare delivery approaches has been recognized by numerous organizations and captured in a number of reports. For example, the Networking and Information Technology Research and Development (NITRD) program released the Federal Health Information Technology Research and Development Strategic Framework in 2020 that pointed to an overwhelming need for integration between the computing, informatics, engineering, mathematics and statistics, and behavioral and social science disciplines and the biomedical and public health research communities to produce the innovation necessary to improve the health of the country. Recent developments and significant advances in machine learning (ML), artificial intelligence (AI), deep learning, high performance and cloud computing, and availability of new datasets make such integration achievable, as documented in the 2023 updated National Artificial Intelligence Research and Development Strategic Plan .

These anticipated transformations hinge on scientific and engineering innovations by interdisciplinary teams that intelligently collect, connect, analyze and interpret data from individuals, devices, and systems to enable discovery and optimize health. Technical challenges include a range of issues, including effective data generation, analysis and automation for a range of biomedical devices (from imaging through medical devices) and systems (e.g., electronic health records) and consumer devices (including the Internet of Things), as well as new technology to generate knowledge. Underlying these challenges are many fundamental scientific and engineering issues that require investment in interdisciplinary research to actualize the transformations, which is the goal of this solicitation.

II. Program Description

This interagency solicitation is a collaboration between NSF and the NIH. The Smart Health program supports innovative, high-risk/high-reward research with the promise of disruptive transformations in biomedical and public health research, which can only be achieved by well-coordinated, convergent, and interdisciplinary approaches that draw from multiple domains of computer and information science, engineering, mathematical sciences and the biomedical, social, behavioral, and economic sciences. Therefore, the work to be funded by this solicitation must make fundamental scientific or engineering contributions to two or more disciplines, such as computer or information sciences, engineering, mathematical sciences, statistics, social, behavioral, or cognitive sciences to improve fundamental understanding of human biological, biomedical, public health and/or health-related processes and address a key health problem. The research teams must include members with appropriate and demonstrable expertise in the major areas involved in the work. Traditional disease-centric medical, clinical, pharmacological, biological or physiological studies and evaluations are outside the scope of this solicitation. In addition, fundamental biological research with humans that does not also advance other fundamental science or engineering areas is out of scope for this program. Finally, proposals addressing health indirectly in the education or work environment are also out of scope.

Generating these transformations will require fundamental research and development of new tools, workflows and methods across many dimensions; some of the themes are highlighted below. These themes should be seen as examples and not exhaustive.

1. Fairness and Trustworthiness: Advancing fairness and trustworthiness in modeling in AI/ML is a highly interdisciplinary endeavor. Real world considerations go beyond the analytics and can inform new directions for computational science to better realize the benefits of algorithmic and data fairness and trustworthiness. The complexity of biomedical and health systems requires deeper understanding of causality in AI/ML models; new ways of integrating social and economic data to address disparities and improve equity, such as disease heterogeneity, disease prevention, resilience, and treatment response, while systematically accounting for a broad range of uncertainties; and new insights into human-AI systems for clinical decision support. In general, this thrust supports the conduct of fundamental computational research into theories, techniques, and methodologies that go well beyond today's capabilities and are motivated by challenges and requirements in biomedical applications.

2. Transformative Analytics in Biomedical and Behavioral Research: As biomedical and behavioral research continues to generate large amounts of multi-level and multi-scale data (e.g., clinical, imaging, personal, social, contextual, environmental, and organizational data), challenges remain. New developments in areas such as artificial intelligence and machine learning (AI/ML), natural language technologies (NLT), mathematics and statistics and/or quantum information science (QIS) also bring opportunities to address important biomedical and behavioral problems. This theme will support efforts to push forward the current frontline of AI/ML and advanced analytics for biomedical and behavioral research including:

  • novel data reduction methods;
  • new robust knowledge representations, visualizations, reasoning algorithms, optimization, modeling and inference methods to support development of innovative models for the study of health and disease;
  • new computational approaches with provable mathematical guarantees for fusion and analysis of complex behavioral, biomedical and image data to improve inference accuracy, especially in scenarios of noisy and limited data records;
  • novel explainable and interpretable AI/ML model development;
  • advanced data management systems with the capability to deal with security, privacy and provenance issues;
  • novel data systems to build a connected and modernized biomedical data ecosystem;
  • development of novel technologies to extract information from unstructured text data such as clinical notes, radiology and pathology reports;
  • development of novel simulation and modeling methods to aid in the design and evaluation of new assessments, treatments and medical devices; and
  • novel QIS approaches to unique challenges in biomedical and behavioral research.

3. Next Generation Multimodal and Reconfigurable Sensing Systems: This theme addresses the need for new multimodal and reconfigurable sensing systems/platforms and analytics to generate predictive and personalized models of health. The next generation of sensor systems for smart health must have just-in-time monitoring of biomarkers from multiple sensor modalities (e.g., electrochemical, electromagnetic, mechanical, optical, acoustic, etc.) interfaced with different platforms (e.g., mobile, wearable, and implantable). Existing sensor systems generally operate either in discrete formats or with limited inter-connectivity, and are limited in accuracy, selectivity, reliability and data throughput. Those limitations can be overcome by integrating heterogeneous sensing modalities and by having field-adaptive reconfigurable sensor microsystems. This theme encourages the design and fabrication of multimodal and/or reconfigurable sensor systems through innovative research on novel functional materials, devices and circuits for sensing or active interrogation of system states, imaging, communications, and computing. Areas of interest include miniaturized sensor microsystems with integrated signal processing and communication functionalities. Another area of interest is multimodal or reconfigurable sensor systems with dramatically reduced power consumption to extend battery lifetime and enable self-powered operation, making the sensor systems suitable for wearable and implantable applications. Other areas of interest include real-time monitoring of analytes and new biorecognition elements that can be reconfigured to target different analytes on-demand. This thrust also requires researchers to integrate data generated by the multimodal sensor systems, as well as data from other sources, such as laboratory generated data (e.g., genomics, proteomics, etc.), patient-reported outcomes, electronic health records, and existing data sources.

4. Cyber-Physical Systems: Development and adoption of automation has lagged in the biomedical and public health communities. Cyber-physical systems (CPS) are controlled systems built from, and dependent upon, the seamless integration of computation and physical components. These data-rich systems enable new and higher degrees of automation and autonomy. Thus, this theme supports work that enables the creation of closed-loop or human-in-the-loop CPS systems to assess, treat and reduce adverse health events or behaviors, with core research areas including control, data analytics, and machine learning including real-time learning for control, autonomy, design, Internet of Things (IoT), networking, privacy, real-time systems, safety, security, and verification. Finally, development of automated technology that can be utilized across a range of settings (e.g., home, primary care, schools, criminal justice system, child welfare agencies, community-based organizations) to optimize the delivery of effective health interventions is also within scope of the theme.

5. Robotics: This theme addresses the need for novel robotics to provide support and/or automation to enhance health, lengthen lifespan and reduce illness, enhance social connectedness and reduce disabilities. The theme encourages research on robotic systems that exhibit significant levels of both computational capability and physical complexity. Robots are defined as intelligence embodied in an engineered construct, with the ability to process information, sense, plan, and move within or substantially alter its working environment. Here intelligence includes a broad class of methods that enable a robot to solve problems or to make contextually appropriate decisions and act upon them. Currently, robotic devices in health have focused on limited areas (e.g., surgical robotics and exoskeletons). This theme welcomes a wide range of robotic areas, as well as research that considers inextricably interwoven questions of intelligence, computation, and embodiment. The next generation of robotic systems for smart health will also need to consider human-robot interaction to enhance usability and effectiveness.

6. Biomedical Image Interpretation. This theme's goal is to determine how characteristics of human pattern recognition, visual search, perceptual learning, attentional biases, etc. can inform and improve image interpretation. This theme would include using and developing presentation modalities (e.g., pathologists reading optical slides through a microscope vs. digital whole-slide imagery) and identifying the sources of inter- and intra-observer variability. The theme encourages development of models of how multi-modal contextual information (e.g., integrating patient history, omics, etc. with imaging data) changes the perception of complex images. It also supports new methods to exploit experts' implicit knowledge to improve perceptual decision making (e.g., via rapid gist extraction, context-guided search, etc.). Research on optimal methods for conveying 3D (and 4D) information about anatomy and physiology to human observers is also welcome. This theme also supports advances in image data compression algorithm development to enable more efficient data storage.

7. Unpacking Health Disparities and Health Equity. The National Academies of Sciences, Engineering, and Medicine report, Communities in Action: Pathways to Health Equity (2017), offers a broader context to understand health disparities. In this theme, proposals should seek to develop holistic, data-driven AI/ML or mathematical models to address the structural and/or social determinants of health. Proposers can also develop novel and effective strategies to measure, reduce, and mitigate the effects and impacts of discrimination on health outcomes. The theme also supports new interdisciplinary computational and engineering approaches and models to better understand culture, context and person-centered solutions with diverse communities. Generating technology that is usable and effective will require development of new approaches that support users across socio-economic status, digital and health literacy, technology and broadband access, geography, gender, and ethnicity. Finally, the theme supports development of novel methods of distinguishing the complex pathways between and/or among levels of influence and domains as outlined by the National Institute of Minority Health and Health Disparities Research Framework.

The themes listed above provide examples of possible research activities that may be supported by this solicitation, but the proposed research activities are by no means restricted to these themes. These research themes are clearly not mutually exclusive, and a given project may address multiple themes. This solicitation aims to support research activities that complement rather than duplicate the core programs of the NSF Directorates and the NIH Institutes and Centers and the research efforts supported by other agencies such as the Agency for Healthcare Research and Quality.

NSF supports investigation of fundamental research questions with broadly applicable results. The Smart Health program supports research evaluation with humans. Because advancing fundamental science is early-stage research, randomized control trials are not appropriate for this solicitation and will not be funded. Research that has advanced to a stage that requires randomized control trials should be submitted to an agency whose mission is to improve health.

NIH supports research and discovery that improve human health and save lives. This joint program focuses on fundamental research of generalizable, disease-agnostic approaches with broadly applicable results that align with NIH’s Strategic Plan for Data Science and National Institute of Minority Health and Health Disparities Research Framework.

Integrative Innovation:

Proposals submitted to this solicitation must be integrative and undertake research addressing key application areas by solving problems in multiple scientific domains. The work must make fundamental scientific or engineering contributions to two or more disciplines, such as computer or information sciences, engineering, mathematical sciences, social, behavioral, cognitive and/or economic sciences and address a key health problem. For example, these projects are expected to advance understanding of how computing, engineering and mathematics, combined with advances in behavioral and social science research, would support transformations in health, medicine and/or healthcare and improve the quality of life. Projects are expected to include students and postdocs. Project descriptions must be comprehensive and well-integrated, and should make a convincing case that the collaborative contributions of the project team will be greater than the sum of each of their individual contributions. Collaborations with researchers in the health application domains are required. Such collaborations typically involve multiple institutions, but this is not required. Because the successes of collaborative research efforts are known to depend on thoughtful collaboration mechanisms that regularly bring together the various participants of the project, a Collaboration Plan is required for ALL proposals. Projects will be funded for up to a four-year period and for up to a total of $300,000 per year. The proposed budget should be commensurate with the corresponding scope of work. Rationale must be provided to explain why a budget of the requested size is required to carry out the proposed work.

III. Award Information

Estimated program budget, number of awards and average award size/duration are subject to the availability of funds. An estimated 10 to 16 projects will be funded, subject to availability of funds. Up to $15,000,000-20,000,000 of NSF funds will be invested in proposals submitted to this solicitation. The number of awards and program budgets are subject to the availability of funds.

All awards under this solicitation made by NSF will be as grants or cooperative agreements as determined by the supporting agency. All awards under this solicitation made by NIH will be as grants or cooperative agreements.

Scientists from all disciplines are encouraged to participate. Projects will be awarded depending on the availability of funds and with consideration for creating a balanced overall portfolio.

IV. Eligibility Information

V. Proposal Preparation and Submission Instructions

Full Proposal Preparation Instructions : Proposers may opt to submit proposals in response to this Program Solicitation via Research.gov or Grants.gov.

  • Full Proposals submitted via Research.gov: Proposals submitted in response to this program solicitation should be prepared and submitted in accordance with the general guidelines contained in the NSF Proposal and Award Policies and Procedures Guide (PAPPG). The complete text of the PAPPG is available electronically on the NSF website at: https://www.nsf.gov/publications/pub_summ.jsp?ods_key=pappg . Paper copies of the PAPPG may be obtained from the NSF Publications Clearinghouse, telephone (703) 292-8134 or by e-mail from [email protected] . The Prepare New Proposal setup will prompt you for the program solicitation number.
  • Full proposals submitted via Grants.gov: Proposals submitted in response to this program solicitation via Grants.gov should be prepared and submitted in accordance with the NSF Grants.gov Application Guide: A Guide for the Preparation and Submission of NSF Applications via Grants.gov . The complete text of the NSF Grants.gov Application Guide is available on the Grants.gov website and on the NSF website at: ( https://www.nsf.gov/publications/pub_summ.jsp?ods_key=grantsgovguide ). To obtain copies of the Application Guide and Application Forms Package, click on the Apply tab on the Grants.gov site, then click on the Apply Step 1: Download a Grant Application Package and Application Instructions link and enter the funding opportunity number, (the program solicitation number without the NSF prefix) and press the Download Package button. Paper copies of the Grants.gov Application Guide also may be obtained from the NSF Publications Clearinghouse, telephone (703) 292-8134 or by e-mail from [email protected] .

In determining which method to utilize in the electronic preparation and submission of the proposal, please note the following:

Collaborative Proposals. All collaborative proposals submitted as separate submissions from multiple organizations must be submitted via Research.gov. PAPPG Chapter II.E.3 provides additional information on collaborative proposals.

See PAPPG Chapter II.D.2 for guidance on the required sections of a full research proposal submitted to NSF. Please note that the proposal preparation instructions provided in this program solicitation may deviate from the PAPPG instructions.

The following information SUPPLEMENTS (note that it does NOT replace) the guidelines provided in the NSF Proposal & Award Policies & Procedures Guide (PAPPG).

Proposal Titles: Proposal titles must begin with SCH, followed by a colon and the title of the project (i.e., SCH: Title). If you submit a proposal as part of a set of collaborative proposals, the title of the proposal should begin with Collaborative Research followed by a colon, then SCH followed by a colon, and the title. For example, if you are submitting a collaborative set of proposals, then the title of each would be Collaborative Research: SCH: Title.

Proposals from PIs in institutions that have Research in Undergraduate Institutions (RUI) eligibility should have a proposal title that begins with Collaborative Research (if applicable), followed by a colon, then SCH followed by a colon, then RUI followed by a colon, and then the title, for example, Collaborative Research: SCH: RUI: Title.

Project Summary (1 page limit): At the beginning of the Overview section of the Project Summary enter the title of the Smart Health project, the name of the PI and the lead institution. The Project Summary must include three labeled sections: Overview, Intellectual Merit and Broader Impacts. The overview includes a description of the project. Intellectual Merit should describe the transformative research and the potential of the proposed activity to advance knowledge. Broader Impacts should describe the potential of the proposed activity to benefit society and contribute to the achievement of specific, desired societal outcomes. The Broader Impacts can include education goals, and the community (communities) that will be impacted by its results.

Project Description: There is a 15-page limit for all proposals. Within the project description, include a section labeled 'Evaluation Plan' that includes a description of how the team will evaluate the proposed science/engineering. This plan could include results from applications of the research to specific outcomes in the health domain, efficacy studies, assessments of learning and engagement, and other such evaluations. The proposed Evaluation Plan should be appropriate for the size and scope of the project.

Please note that the Collaboration Plan must be submitted as a Supplementary Document for this solicitation; see guidance below.

Proposal Budget: It is expected that the PIs, co-PIs, and other team members funded by the project will attend an SCH PI meeting annually to present project research findings and capacity-building or community outreach activities. Requested budgets should include funds for travel to this annual event for at least one project PI.

Supplementary Documents: In the Supplementary Documents Section, upload the following:

  • Collaboration Plan. Proposals must include a Collaboration Plan. The Collaboration Plan must be submitted as a supplementary document and cannot exceed two pages. Proposals that do not include a Collaboration Plan will be returned without review. The Collaboration Plan must be labeled "Collaboration Plan" and must include: 1) the specific roles of the collaborating PIs, co-PIs, other Senior/Key Personnel and paid consultants at all organizations involved; 2) how the project will be managed across institutions and disciplines; 3) identification of the specific collaboration mechanisms that will enable cross-institution and/or cross-discipline scientific integration (e.g., workshops, graduate student exchange, project meetings at conferences, use of videoconferencing and other communication tools, software repositories, etc.); and 4) specific references to the budget line items that support these collaboration mechanisms.
  • Human Subjects Protection. Proposals involving human subjects should include a supplementary document of no more than two pages in length summarizing potential risks to human subjects; plans for recruitment and informed consent; inclusion of women, minorities, and children; and planned procedures to protect against or minimize potential risks. Human subjects plans must include the NIH enrollment table ( https://era.nih.gov/erahelp/assist/Content/ASSIST_Help_Topics/3_Form_Screens/PHS_HS_CT/Incl_Enroll_Rprt.htm , please see the Planned enrollment table for the expected format).
  • Vertebrate Animals. Proposals involving the use of vertebrate animals should include a supplementary document that provides: a detailed description and justification of the proposed use of the animals, including species, strains, ages, sex, and number to be used; information on the veterinary care of the animals; a description of procedures for minimizing discomfort, distress, pain, and injury; and the method of euthanasia and the reasons for its selection.
  • Data Management and Sharing Plan (required). This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. See Chapter II.D.2 of the PAPPG for full policy implementation and the section on Data Management and Sharing Plans. For additional information on the Dissemination and Sharing of Research Results, see: https://www.nsf.gov/bfa/dias/policy/dmp.jsp . For specific guidance for Data Management and Sharing Plans submitted to the Directorate for Computer and Information Science and Engineering (CISE) see: https://www.nsf.gov/cise/cise_dmp.jsp .
  • Documentation of Collaborative Arrangements of Significance to the Proposal through Letters of Collaboration. There are two types of collaboration, one involving individuals/organizations that are included in the budget, and the other involving individuals/organizations that are not included in the budget. Collaborations that are included in the budget should be described in the Project Description and the Collaboration Plan. Any substantial collaboration with individuals/organizations not included in the budget should be described in the Facilities, Equipment and Other Resources section of the proposal (see PAPPG Chapter II.D.2.g). In either case, whether or not the collaborator is included in the budget, a letter of collaboration from each named participating organization other than the submitting lead, non-lead, and/or subawardee institutions must be provided at the time of submission of the proposal. Such letters describe the nature of the collaboration and what the collaborator(s) brings to the project. They must explicitly state the nature of the collaboration, appear on the organization's letterhead, and be signed by the appropriate organizational representative. These letters must not otherwise deviate from the restrictions and requirements set forth in the PAPPG, Chapter II.D.2.i. Please note that letters of support, that do not document collaborative arrangements of significance to the project, but primarily convey a sense of enthusiasm for the project and/or highlight the qualifications of the PI or co-PI may not be submitted, and reviewers will be instructed not to consider these letters in reviewing the merits of the proposal.
  • List of Project Personnel and Partner Institutions (Note - In collaborative proposals, only the lead institution should provide this information). Provide current, accurate information for all personnel and institutions involved in the project. NSF staff will use this information in the merit review process to manage reviewer selection. The list should include all PIs, co-PIs, Senior/Key Personnel, paid/unpaid Consultants or Collaborators, Subawardees, Postdocs, and project-level advisory committee members. This list should be numbered and include (in this order) Full name, Organization(s), and Role in the project, with each item separated by a semi-colon. Each person listed should start a new numbered line. For example:
    1. Mei Lin; XYZ University; PI
    2. Jak Jabes; University of PQR; Senior/Key Personnel
    3. Jane Brown; XYZ University; Postdoctoral Researcher
    4. Rakel Ademas; ABC Inc.; Paid Consultant
    5. Maria Wan; Welldone Institution; Unpaid Collaborator
    6. Rimon Greene; ZZZ University; Subawardee
  • Mentoring Plan (if applicable). Each proposal that requests funding to support postdoctoral scholars or graduate students must include, as a supplementary document, a description of the mentoring activities that will be provided for such individuals. Please be advised that if required, NSF systems will not permit submission of a proposal that is missing a Mentoring Plan. See Chapter II.D.2.i of the PAPPG for further information about the implementation of this requirement.
  • Other Specialized Information. RUI Proposals: PIs from predominantly undergraduate institutions should follow the instructions at https://new.nsf.gov/funding/opportunities/facilitating-research-primarily-undergraduate.

Single Copy Documents:

(1) Collaborators and Other Affiliations Information.

Proposers should follow the guidance specified in Chapter II.D.2.h of the NSF PAPPG. Grants.gov Users: The COA information must be provided through use of the COA template and uploaded as a PDF attachment.

Note the distinction from the list of Project Personnel and Partner Institutions specified above under Supplementary Documents: the listing of all project participants is collected by the project lead and entered as a Supplementary Document, which is then automatically included with all proposals in a project. The Collaborators and Other Affiliations information is entered for each participant within each proposal and, as Single Copy Documents, is available only to NSF staff.

SCH Proposal Preparation Checklist:

The following checklist is provided as a reminder of the items that should be verified before submitting a proposal to this solicitation. This checklist is a summary of the requirements described above and in the PAPPG and does not replace the complete set of requirements in the PAPPG. For the items marked with (RWR), the proposal will be returned without review if the required item is noncompliant at the submission deadline.

  • (RWR) A two-page Collaboration Plan must be included as a Supplementary Document.
  • Letters of Collaboration are permitted as Supplementary Documents.
  • (RWR) Project Summary not to exceed one page that includes three labeled sections: Overview, Intellectual Merit and Broader Impacts.
  • (RWR) Within the Project Description, a section labeled “Broader Impacts” that describes the potential to benefit society and contribute to the achievement of specific, desired societal outcomes.
  • Within the Project Description, a section labeled Evaluation Plan that details how the project will be evaluated.
  • (RWR) Within the Project Description, a description of "Results from Prior NSF Support".
  • (RWR) If the budget includes postdoctoral scholars or graduate students, a one-page Mentoring Plan must be included as a Supplementary Document.
  • A list of Project Personnel and Partner Institutions is required as a Supplementary Document.
  • (RWR) A Data Management and Sharing Plan, not to exceed two pages, must be included as a Supplementary Document.
  • Proposals involving human subjects or vertebrate animals should include a Human Subjects Plan (with an NIH Enrollment Table) or a Vertebrate Animals Plan, respectively, as a Supplementary Document.
  • Collaborators and Other Affiliations (COA) information for each PI, co-PI, and Senior/Key Personnel should be submitted using the COA spreadsheet template and uploaded as Single Copy Documents.

Cost Sharing:

Indirect Cost (F&A) Limitations:

Budgets should include travel funds to attend one SCH PI meeting annually for the project PIs, co-PIs and other team members as appropriate from all collaborating institutions.

D. Research.gov/Grants.gov Requirements

For Proposals Submitted Via Research.gov:

To prepare and submit a proposal via Research.gov, see detailed technical instructions available at: https://www.research.gov/research-portal/appmanager/base/desktop?_nfpb=true&_pageLabel=research_node_display&_nodePath=/researchGov/Service/Desktop/ProposalPreparationandSubmission.html . For Research.gov user support, call the Research.gov Help Desk at 1-800-381-1532 or e-mail [email protected] . The Research.gov Help Desk answers general technical questions related to the use of the Research.gov system. Specific questions related to this program solicitation should be referred to the NSF program staff contact(s) listed in Section VIII of this funding opportunity.

For Proposals Submitted Via Grants.gov:

Before using Grants.gov for the first time, each organization must register to create an institutional profile. Once registered, the applicant's organization can then apply for any federal grant on the Grants.gov website. Comprehensive information about using Grants.gov is available on the Grants.gov Applicant Resources webpage: https://www.grants.gov/web/grants/applicants.html . In addition, the NSF Grants.gov Application Guide (see link in Section V.A) provides instructions regarding the technical preparation of proposals via Grants.gov. For Grants.gov user support, contact the Grants.gov Contact Center at 1-800-518-4726 or by email: [email protected] . The Grants.gov Contact Center answers general technical questions related to the use of Grants.gov. Specific questions related to this program solicitation should be referred to the NSF program staff contact(s) listed in Section VIII of this solicitation.

Submitting the Proposal: Once all documents have been completed, the Authorized Organizational Representative (AOR) must submit the application to Grants.gov and verify the desired funding opportunity and agency to which the application is submitted. The AOR must then sign and submit the application to Grants.gov. The completed application will be transferred to Research.gov for further processing.

The NSF Grants.gov Proposal Processing in Research.gov informational page provides submission guidance to applicants and links to helpful resources including the NSF Grants.gov Application Guide , Grants.gov Proposal Processing in Research.gov how-to guide , and Grants.gov Submitted Proposals Frequently Asked Questions . Grants.gov proposals must pass all NSF pre-check and post-check validations in order to be accepted by Research.gov at NSF.

When submitting via Grants.gov, NSF strongly recommends applicants initiate proposal submission at least five business days in advance of a deadline to allow adequate time to address NSF compliance errors and resubmissions by 5:00 p.m. submitting organization's local time on the deadline. Please note that some errors cannot be corrected in Grants.gov. Once a proposal passes pre-checks but fails any post-check, an applicant can only correct and submit the in-progress proposal in Research.gov.

Proposers that submitted via Research.gov may use Research.gov to verify the status of their submission to NSF. For proposers that submitted via Grants.gov, until an application has been received and validated by NSF, the Authorized Organizational Representative may check the status of an application on Grants.gov. After proposers have received an e-mail notification from NSF, Research.gov should be used to check the status of an application.

VI. NSF Proposal Processing And Review Procedures

Proposals received by NSF are assigned to the appropriate NSF program for acknowledgement and, if they meet NSF requirements, for review. All proposals are carefully reviewed by a scientist, engineer, or educator serving as an NSF Program Officer, and usually by three to ten other persons outside NSF either as ad hoc reviewers, panelists, or both, who are experts in the particular fields represented by the proposal. These reviewers are selected by Program Officers charged with oversight of the review process. Proposers are invited to suggest names of persons they believe are especially well qualified to review the proposal and/or persons they would prefer not review the proposal. These suggestions may serve as one source in the reviewer selection process at the Program Officer's discretion. Submission of such names, however, is optional. Care is taken to ensure that reviewers have no conflicts of interest with the proposal. In addition, Program Officers may obtain comments from site visits before recommending final action on proposals. Senior NSF staff further review recommendations for awards. A flowchart that depicts the entire NSF proposal and award process (and associated timeline) is included in PAPPG Exhibit III-1.

A comprehensive description of the Foundation's merit review process is available on the NSF website at: https://www.nsf.gov/bfa/dias/policy/merit_review/ .

Proposers should also be aware of core strategies that are essential to the fulfillment of NSF's mission, as articulated in Leading the World in Discovery and Innovation, STEM Talent Development and the Delivery of Benefits from Research - NSF Strategic Plan for Fiscal Years (FY) 2022 - 2026 . These strategies are integrated in the program planning and implementation process, of which proposal review is one part. NSF's mission is particularly well-implemented through the integration of research and education and broadening participation in NSF programs, projects, and activities.

One of the strategic objectives in support of NSF's mission is to foster integration of research and education through the programs, projects, and activities it supports at academic and research institutions. These institutions must recruit, train, and prepare a diverse STEM workforce to advance the frontiers of science and participate in the U.S. technology-based economy. NSF's contribution to the national innovation ecosystem is to provide cutting-edge research under the guidance of the Nation's most creative scientists and engineers. NSF also supports development of a strong science, technology, engineering, and mathematics (STEM) workforce by investing in building the knowledge that informs improvements in STEM teaching and learning.

NSF's mission calls for the broadening of opportunities and expanding participation of groups, institutions, and geographic regions that are underrepresented in STEM disciplines, which is essential to the health and vitality of science and engineering. NSF is committed to this principle of diversity and deems it central to the programs, projects, and activities it considers and supports.

A. Merit Review Principles and Criteria

The National Science Foundation strives to invest in a robust and diverse portfolio of projects that creates new knowledge and enables breakthroughs in understanding across all areas of science and engineering research and education. To identify which projects to support, NSF relies on a merit review process that incorporates consideration of both the technical aspects of a proposed project and its potential to contribute more broadly to advancing NSF's mission "to promote the progress of science; to advance the national health, prosperity, and welfare; to secure the national defense; and for other purposes." NSF makes every effort to conduct a fair, competitive, transparent merit review process for the selection of projects.

1. Merit Review Principles

These principles are to be given due diligence by PIs and organizations when preparing proposals and managing projects, by reviewers when reading and evaluating proposals, and by NSF program staff when determining whether or not to recommend proposals for funding and while overseeing awards. Given that NSF is the primary federal agency charged with nurturing and supporting excellence in basic research and education, the following three principles apply:

  • All NSF projects should be of the highest quality and have the potential to advance, if not transform, the frontiers of knowledge.
  • NSF projects, in the aggregate, should contribute more broadly to achieving societal goals. These "Broader Impacts" may be accomplished through the research itself, through activities that are directly related to specific research projects, or through activities that are supported by, but are complementary to, the project. The project activities may be based on previously established and/or innovative methods and approaches, but in either case must be well justified.
  • Meaningful assessment and evaluation of NSF funded projects should be based on appropriate metrics, keeping in mind the likely correlation between the effect of broader impacts and the resources provided to implement projects. If the size of the activity is limited, evaluation of that activity in isolation is not likely to be meaningful. Thus, assessing the effectiveness of these activities may best be done at a higher, more aggregated, level than the individual project.

With respect to the third principle, even if assessment of Broader Impacts outcomes for particular projects is done at an aggregated level, PIs are expected to be accountable for carrying out the activities described in the funded project. Thus, individual projects should include clearly stated goals, specific descriptions of the activities that the PI intends to do, and a plan in place to document the outputs of those activities.

These three merit review principles provide the basis for the merit review criteria, as well as a context within which the users of the criteria can better understand their intent.

2. Merit Review Criteria

All NSF proposals are evaluated through use of the two National Science Board approved merit review criteria. In some instances, however, NSF will employ additional criteria as required to highlight the specific objectives of certain programs and activities.

The two merit review criteria are listed below. Both criteria are to be given full consideration during the review and decision-making processes; each criterion is necessary but neither, by itself, is sufficient. Therefore, proposers must fully address both criteria. (PAPPG Chapter II.D.2.d(i) contains additional information for use by proposers in development of the Project Description section of the proposal). Reviewers are strongly encouraged to review the criteria, including PAPPG Chapter II.D.2.d(i), prior to the review of a proposal.

When evaluating NSF proposals, reviewers will be asked to consider what the proposers want to do, why they want to do it, how they plan to do it, how they will know if they succeed, and what benefits could accrue if the project is successful. These issues apply both to the technical aspects of the proposal and the way in which the project may make broader contributions. To that end, reviewers will be asked to evaluate all proposals against two criteria:

  • Intellectual Merit: The Intellectual Merit criterion encompasses the potential to advance knowledge; and
  • Broader Impacts: The Broader Impacts criterion encompasses the potential to benefit society and contribute to the achievement of specific, desired societal outcomes.

The following elements should be considered in the review for both criteria:

  • What is the potential for the proposed activity to: (a) advance knowledge and understanding within its own field or across different fields (Intellectual Merit); and (b) benefit society or advance desired societal outcomes (Broader Impacts)?
  • To what extent do the proposed activities suggest and explore creative, original, or potentially transformative concepts?
  • Is the plan for carrying out the proposed activities well-reasoned, well-organized, and based on a sound rationale? Does the plan incorporate a mechanism to assess success?
  • How well qualified is the individual, team, or organization to conduct the proposed activities?
  • Are there adequate resources available to the PI (either at the home organization or through collaborations) to carry out the proposed activities?

Broader impacts may be accomplished through the research itself, through the activities that are directly related to specific research projects, or through activities that are supported by, but are complementary to, the project. NSF values the advancement of scientific knowledge and activities that contribute to achievement of societally relevant outcomes. Such outcomes include, but are not limited to: full participation of women, persons with disabilities, and other underrepresented groups in science, technology, engineering, and mathematics (STEM); improved STEM education and educator development at any level; increased public scientific literacy and public engagement with science and technology; improved well-being of individuals in society; development of a diverse, globally competitive STEM workforce; increased partnerships between academia, industry, and others; improved national security; increased economic competitiveness of the United States; and enhanced infrastructure for research and education.

Proposers are reminded that reviewers will also be asked to review the Data Management and Sharing Plan and the Mentoring Plan, as appropriate.

Additional Solicitation Specific Review Criteria

The proposals will also be evaluated based on:

Collaboration and Management: The work to be funded by this solicitation must make fundamental contributions to two or more disciplines and address a key health problem. The collaboration plan should demonstrate active participation of this multidisciplinary group, which includes, but is not limited to: fundamental science and engineering researchers; biomedical, health and/or clinical researchers; other necessary research expertise; client groups; and technology vendors/commercial enterprises. Reviewers will assess the roles of the participants, the extent to which the group is integrated and has a common focus, and the overall quality of the collaboration plan.

Additional NIH Review Criteria:

The mission of the NIH is to support science in pursuit of knowledge about the biology and behavior of living systems and to apply that knowledge to extend healthy life and reduce the burdens of illness and disability. In their evaluations of scientific merit, reviewers will be asked to consider the following criteria that are used by NIH:

Overall Impact. Reviewers will provide an overall impact/priority score and criterion scores to reflect their assessment of the likelihood for the project to exert a sustained, powerful influence on the research field(s) involved, in consideration of the following five core review criteria, and additional review criteria (as applicable for the project proposed).

Significance. Does the project address an important problem or a critical barrier to progress in the field? If the aims of the project are achieved, how will scientific knowledge, technical capability, and/or clinical practice be improved? How will successful completion of the aims change the concepts, methods, technologies, treatments, services, or preventative interventions that drive this field?

Investigator(s). Are the PD/PIs, collaborators, and other researchers well suited to the project? If Early Stage Investigators or New Investigators, do they have appropriate experience and training? If established, have they demonstrated an ongoing record of accomplishments that have advanced their field(s)? If the project is collaborative or multi-PD/PI, do the investigators have complementary and integrated expertise; are their leadership approach, governance and organizational structure appropriate for the project?

Innovation. Does the application challenge and seek to shift current research or clinical practice paradigms by utilizing novel theoretical concepts, approaches or methodologies, instrumentation, or interventions? Are the concepts, approaches or methodologies, instrumentation, or interventions novel to one field of research or novel in a broad sense? Is a refinement, improvement, or new application of theoretical concepts, approaches or methodologies, instrumentation, or interventions proposed?

Approach. Are the overall strategy, methodology, and analyses well-reasoned and appropriate to accomplish the specific aims of the project? Are potential problems, alternative strategies, and benchmarks for success presented? If the project is in the early stages of development, will the strategy establish feasibility and will particularly risky aspects be managed? If the project involves clinical research, are the plans for 1) protection of human subjects from research risks, and 2) inclusion of women, minorities, and individuals across the lifespan justified in terms of the scientific goals and research strategy proposed?

Environment. Will the scientific environment in which the work will be done contribute to the probability of success? Are the institutional support, equipment and other physical resources available to the investigators adequate for the project proposed? Will the project benefit from unique features of the scientific environment, subject populations, or collaborative arrangements?

Where applicable, the following items will also be considered:

Protections for Human Subjects. For research that involves human subjects but does not involve one of the six categories of research that are exempt under 45 CFR Part 46, the committee will evaluate the justification for involvement of human subjects and the proposed protections from research risk relating to their participation according to the following five review criteria: 1) risk to subjects, 2) adequacy of protection against risks, 3) potential benefits to the subjects and others, 4) importance of the knowledge to be gained, and 5) data and safety monitoring. For research that involves human subjects and meets the criteria for one or more of the six categories of research that are exempt under 45 CFR Part 46, the committee will evaluate: 1) the justification for the exemption, 2) human subjects involvement and characteristics, and 3) sources of materials. For additional information on review of the Human Subjects section, please refer to the Human Subjects Protection and Inclusion Guidelines.

Inclusion of Women, Minorities, and Individuals Across the Lifespan . When the proposed project involves human subjects and/or NIH-defined clinical research, the committee will evaluate the proposed plans for the inclusion (or exclusion) of individuals on the basis of sex/gender, race, and ethnicity, as well as the inclusion (or exclusion) of individuals of all ages (including children and older adults) to determine if it is justified in terms of the scientific goals and research strategy proposed. For additional information on review of the Inclusion section, please refer to the Guidelines for the Review of Inclusion in Clinical Research .

Vertebrate Animals. The committee will evaluate the involvement of live vertebrate animals as part of the scientific assessment according to the following four points: 1) description and justification of the proposed use of the animals, and species, strains, ages, sex, and numbers to be used; 2) adequacy of veterinary care; 3) procedures for minimizing discomfort, distress, pain and injury; and 4) methods of euthanasia and reason for selection if not consistent with the AVMA Guidelines on Euthanasia. For additional information, see http://grants.nih.gov/grants/olaw/VASchecklist.pdf .

Biohazards. Reviewers will assess whether materials or procedures proposed are potentially hazardous to research personnel and/or the environment, and if needed, determine whether adequate protection is proposed.

Budget and Period of Support (non-score-driving). Reviewers will consider whether the budget and the requested period of support are fully justified and reasonable in relation to the proposed research.

B. Review and Selection Process

Proposals submitted in response to this program solicitation will be reviewed by Ad hoc Review and/or Panel Review, or joint agency review.

Reviewers will be asked to evaluate proposals using two National Science Board approved merit review criteria and, if applicable, additional program specific criteria. A summary rating and accompanying narrative will generally be completed and submitted by each reviewer and/or panel. The Program Officer assigned to manage the proposal's review will consider the advice of reviewers and will formulate a recommendation.

NSF will manage and conduct the peer review process for this solicitation; the National Institutes of Health will only observe the review process and have access to proposal and review information, such as unattributed reviews and panel summaries. NIH and NSF will make independent award decisions.

After scientific, technical and programmatic review and consideration of appropriate factors, the NSF Program Officer recommends to the cognizant Division Director whether the proposal should be declined or recommended for award. NSF strives to be able to tell applicants whether their proposals have been declined or recommended for funding within six months. Large or particularly complex proposals or proposals from new awardees may require additional review and processing time. The time interval begins on the deadline or target date, or receipt date, whichever is later. The interval ends when the Division Director acts upon the Program Officer's recommendation.

After programmatic approval has been obtained, the proposals recommended for funding will be forwarded to the Division of Grants and Agreements or the Division of Acquisition and Cooperative Support for review of business, financial, and policy implications. After an administrative review has occurred, Grants and Agreements Officers perform the processing and issuance of a grant or other agreement. Proposers are cautioned that only a Grants and Agreements Officer may make commitments, obligations or awards on behalf of NSF or authorize the expenditure of funds. No commitment on the part of NSF should be inferred from technical or budgetary discussions with a NSF Program Officer. A Principal Investigator or organization that makes financial or personnel commitments in the absence of a grant or cooperative agreement signed by the NSF Grants and Agreements Officer does so at their own risk.

Once an award or declination decision has been made, Principal Investigators are provided feedback about their proposals. In all cases, reviews are treated as confidential documents. Verbatim copies of reviews, excluding the names of the reviewers or any reviewer-identifying information, are sent to the Principal Investigator/Project Director by the Program Officer. In addition, the proposer will receive an explanation of the decision to award or decline funding.

Review Process and Deviations from the NSF PAPPG

This section provides agency-specific guidance for the SCH program.

NSF will take the lead in organizing and conducting the review process in compliance with the Federal Advisory Committee Act.

In addition to any conflict forms required by NSF, an NIH Post-Review Certification Form will be circulated at or near the end of the second day of the review meeting and collected by the NIH Scientific Review Officer (SRO). By signing the Post-Review Certification Form, panelists will certify for NIH that confidentiality and conflict-of-interest procedures have been followed. Conflicts of interest are handled in a manner similar to NSF procedures: those in conflict will be asked to step out of the room, or as appropriate, NSF's Designated Ethics Official or designee may recommend remedies to resolve specific conflicts on a case-by-case basis. Co-investigators and investigators who would directly benefit should the grant be awarded are ineligible to serve as reviewers.

Approximately seven to ten review panels, equivalent to NIH study sections, will be organized each year, with the exact number and topical clustering of panels determined by the number and topical areas of the proposals received. Panel management will be conducted by the four NSF directorates, with the majority conducted by CISE. Co-review across clusters, divisions and directorates will be performed where appropriate. SROs from the Center for Scientific Review (CSR) at the NIH will be assigned to work cooperatively with NSF staff on each proposal panel. Together, they will have the responsibility to work out the details of the review process such that all agencies' needs are met. Before the review panel meetings, the NIH SROs will work with NSF staff to prepare written instructions for the reviewers and to develop and implement an NIH-like scoring system (1-9) for NIH use on proposal panels. Representatives of all participating NIH Institutes and Centers will also be invited to attend the review meetings to ensure that the review is conducted in a manner that is consistent with the agreements between NSF and the NIH.

After scientific, technical and programmatic review and consideration of appropriate factors, the NSF Program Officer recommends to the cognizant Division Director whether the proposal should be declined or recommended for award. NSF is striving to be able to tell applicants whether their proposals have been declined or recommended for funding within six months. The time interval begins on the deadline or target date, or receipt date, whichever is later. The interval ends when the Division Director accepts the Program Officer's recommendation.

A summary rating and accompanying narrative will be completed and submitted by each reviewer. In all cases, reviews are treated as confidential documents. Verbatim copies of reviews, excluding the names of the reviewers, are sent to the Principal Investigator/Project Director by the Program Officer. In addition, the proposer will receive an explanation of the decision to award or decline funding.

In all cases, after programmatic approval has been obtained, the proposals recommended for funding will be forwarded to the Division of Grants and Agreements for review of business, financial, and policy implications and the processing and issuance of a grant or other agreement. Proposers are cautioned that only a Grants and Agreements Officer may make commitments, obligations or awards on behalf of NSF or authorize the expenditure of funds. No commitment on the part of NSF should be inferred from technical or budgetary discussions with a NSF Program Officer. A Principal Investigator or organization that makes financial or personnel commitments in the absence of a grant or cooperative agreement signed by the NSF Grants and Agreements Officer does so at their own risk.

For those proposals that are selected for potential funding by participating NIH Institutes, the PI will be required to resubmit the proposal in an NIH-approved format. Applicants must then complete the submission process and track the status of the application in the eRA Commons, NIH’s electronic system for grants administration. PIs invited to resubmit to NIH will receive further information on this process from the NIH.

Consistent with the NIH Policy for Data Management and Sharing, when data management and sharing is applicable to the award, recipients will be required to adhere to the Data Management and Sharing requirements as outlined in the NIH Grants Policy Statement. Once a Data Management and Sharing Plan is approved, recipients are required to implement the plan as described. All applicants planning NIH-funded research that will generate scientific data must comply with the instructions for the NIH Data Management and Sharing Plan. All applications, regardless of the amount of direct costs requested for any one year, must include a Data Management and Sharing Plan.

An applicant will not be allowed to increase the proposed budget or change the scientific content of the application in the reformatted submission to the NIH. Indirect costs on any foreign subawards/subcontracts will be limited to eight (8) percent. Applicants will be expected to utilize the Multiple Principal Investigator option at the NIH ( https://grants.nih.gov/grants/multi_PI/ ) as appropriate.

To fulfill NIH's need for a list of participating reviewers for Summary Statements without disclosing the specific reviewers of each proposal, NSF will provide an aggregated list of the full set of reviewers for the SCH program to CSR.

Following the NSF peer review, recommended applications that have been resubmitted to the NIH are required to go to second level review by the Advisory Council or Advisory Board of the awarding Institute or Center. The following will be considered in making funding decisions:

  • Scientific and technical merit of the proposed project as determined by scientific peer review.
  • Availability of funds.
  • Relevance of the proposed project to program priorities.
  • Adequacy of data management and sharing plans.

Subsequent grant administration procedures for NIH awardees, including those related to New and Early Stage Investigators ( https://grants.nih.gov/policy/early-investigators/index.htm ), will be in accordance with the policies of NIH. Applications selected for NIH funding will use the NIH R or U funding mechanisms. At the end of the project period, renewal applications for projects funded by the NIH are expected to be submitted directly to the NIH as Renewal Applications.

VII. Award Administration Information

A. Notification of the Award

Notification of the award is made to the submitting organization by an NSF Grants and Agreements Officer. Organizations whose proposals are declined will be advised as promptly as possible by the cognizant NSF Program administering the program. Verbatim copies of reviews, not including the identity of the reviewer, will be provided automatically to the Principal Investigator. (See Section VI.B. for additional information on the review process.)

B. Award Conditions

An NSF award consists of: (1) the award notice, which includes any special provisions applicable to the award and any numbered amendments thereto; (2) the budget, which indicates the amounts, by categories of expense, on which NSF has based its support (or otherwise communicates any specific approvals or disapprovals of proposed expenditures); (3) the proposal referenced in the award notice; (4) the applicable award conditions, such as Grant General Conditions (GC-1)*; or Research Terms and Conditions* and (5) any announcement or other NSF issuance that may be incorporated by reference in the award notice. Cooperative agreements also are administered in accordance with NSF Cooperative Agreement Financial and Administrative Terms and Conditions (CA-FATC) and the applicable Programmatic Terms and Conditions. NSF awards are electronically signed by an NSF Grants and Agreements Officer and transmitted electronically to the organization via e-mail.

*These documents may be accessed electronically on NSF's Website at https://www.nsf.gov/awards/managing/award_conditions.jsp?org=NSF . Paper copies may be obtained from the NSF Publications Clearinghouse, telephone (703) 292-8134 or by e-mail from [email protected] .

More comprehensive information on NSF Award Conditions and other important information on the administration of NSF awards is contained in the NSF Proposal & Award Policies & Procedures Guide (PAPPG) Chapter VII, available electronically on the NSF Website at https://www.nsf.gov/publications/pub_summ.jsp?ods_key=pappg .

Administrative and National Policy Requirements

Build America, Buy America

As expressed in Executive Order 14005, Ensuring the Future is Made in All of America by All of America’s Workers (86 FR 7475), it is the policy of the executive branch to use terms and conditions of Federal financial assistance awards to maximize, consistent with law, the use of goods, products, and materials produced in, and services offered in, the United States.

Consistent with the requirements of the Build America, Buy America Act (Pub. L. 117-58, Division G, Title IX, Subtitle A, November 15, 2021), no funding made available through this funding opportunity may be obligated for an award unless all iron, steel, manufactured products, and construction materials used in the project are produced in the United States. For additional information, visit NSF’s Build America, Buy America webpage.

Special Award Conditions:

Attribution of support in publications must acknowledge the joint program, as well as the funding organization and award number, by including the phrase, "as part of the NSF/NIH Smart Health and Biomedical Research in the Era of Artificial Intelligence and Advanced Data Science Program."

NIH-Specific Award Conditions: Grants made by NSF will be subject to NSF's award conditions. Grants made by NIH will be subject to NIH's award conditions (see http://grants.nih.gov/grants/policy/awardconditions.htm ).

C. Reporting Requirements

For all multi-year grants (including both standard and continuing grants), the Principal Investigator must submit an annual project report to the cognizant Program Officer no later than 90 days prior to the end of the current budget period. (Some programs or awards require submission of more frequent project reports). No later than 120 days following expiration of a grant, the PI also is required to submit a final annual project report, and a project outcomes report for the general public.

Failure to provide the required annual or final annual project reports, or the project outcomes report, will delay NSF review and processing of any future funding increments as well as any pending proposals for all identified PIs and co-PIs on a given award. PIs should examine the formats of the required reports in advance to assure availability of required data.

PIs are required to use NSF's electronic project-reporting system, available through Research.gov, for preparation and submission of annual and final annual project reports. Such reports provide information on accomplishments, project participants (individual and organizational), publications, and other specific products and impacts of the project. Submission of the report via Research.gov constitutes certification by the PI that the contents of the report are accurate and complete. The project outcomes report also must be prepared and submitted using Research.gov. This report serves as a brief summary, prepared specifically for the public, of the nature and outcomes of the project. This report will be posted on the NSF website exactly as it is submitted by the PI.

More comprehensive information on NSF Reporting Requirements and other important information on the administration of NSF awards is contained in the NSF Proposal & Award Policies & Procedures Guide (PAPPG) Chapter VII, available electronically on the NSF Website at https://www.nsf.gov/publications/pub_summ.jsp?ods_key=pappg .

Additional data may be required for NSF sponsored Cooperative Agreements.

Proposals that are initially funded by NSF at a level of $300,000 in total costs per year for up to four years will be evaluated periodically throughout the term of the project by teams of experts, based on the proposed work plan, to determine performance levels. All publications, reports, data and other output from all awards must be prepared in digital format and meet general requirements for storage, indexing, searching and retrieval.

Contact the cognizant organization program officer for additional information.

VIII. Agency Contacts

Please note that the program contact information is current at the time of publishing. See program website for any updates to the points of contact.

General inquiries regarding this program should be made to:

For questions related to the use of NSF systems contact:

For questions relating to Grants.gov contact:

  • Grants.gov Contact Center: If the Authorized Organizational Representative (AOR) has not received a confirmation message from Grants.gov within 48 hours of submission of the application, please contact via telephone: 1-800-518-4726; e-mail: [email protected].

IX. Other Information

The NSF website provides the most comprehensive source of information on NSF Directorates (including contact information), programs and funding opportunities. Use of this website by potential proposers is strongly encouraged. In addition, "NSF Update" is an information-delivery system designed to keep potential proposers and other interested parties apprised of new NSF funding opportunities and publications, important changes in proposal and award policies and procedures, and upcoming NSF Grants Conferences . Subscribers are informed through e-mail or the user's Web browser each time new publications are issued that match their identified interests. "NSF Update" also is available on NSF's website .

Grants.gov provides an additional electronic capability to search for Federal government-wide grant opportunities. NSF funding opportunities may be accessed via this mechanism. Further information on Grants.gov may be obtained at https://www.grants.gov .

Main Websites for the Participating Agencies:

NATIONAL SCIENCE FOUNDATION: https://www.nsf.gov

NATIONAL INSTITUTES OF HEALTH: http://nih.gov/

PUBLIC BRIEFINGS: One or more collaborative webinar briefings with question and answer functionality will be held prior to the first submission deadline date. Schedules will be posted on the sponsor solicitation web sites.

About The National Science Foundation

The National Science Foundation (NSF) is an independent Federal agency created by the National Science Foundation Act of 1950, as amended (42 USC 1861-75). The Act states the purpose of the NSF is "to promote the progress of science; [and] to advance the national health, prosperity, and welfare by supporting research and education in all fields of science and engineering."

NSF funds research and education in most fields of science and engineering. It does this through grants and cooperative agreements to more than 2,000 colleges, universities, K-12 school systems, businesses, informal science organizations and other research organizations throughout the US. The Foundation accounts for about one-fourth of Federal support to academic institutions for basic research.

NSF receives approximately 55,000 proposals each year for research, education and training projects, of which approximately 11,000 are funded. In addition, the Foundation receives several thousand applications for graduate and postdoctoral fellowships. The agency operates no laboratories itself but does support National Research Centers, user facilities, certain oceanographic vessels and Arctic and Antarctic research stations. The Foundation also supports cooperative research between universities and industry, US participation in international scientific and engineering efforts, and educational activities at every academic level.

Facilitation Awards for Scientists and Engineers with Disabilities (FASED) provide funding for special assistance or equipment to enable persons with disabilities to work on NSF-supported projects. See the NSF Proposal & Award Policies & Procedures Guide Chapter II.F.7 for instructions regarding preparation of these types of proposals.

The National Science Foundation has Telephonic Device for the Deaf (TDD) and Federal Information Relay Service (FIRS) capabilities that enable individuals with hearing impairments to communicate with the Foundation about NSF programs, employment or general information. TDD may be accessed at (703) 292-5090 and (800) 281-8749, FIRS at (800) 877-8339.

The National Science Foundation Information Center may be reached at (703) 292-5111.

The National Science Foundation promotes and advances scientific progress in the United States by competitively awarding grants and cooperative agreements for research and education in the sciences, mathematics, and engineering.

To get the latest information about program deadlines, to download copies of NSF publications, and to access abstracts of awards, visit the NSF Website at https://www.nsf.gov

Location: 2415 Eisenhower Avenue, Alexandria, VA 22314

NSF Information Center: (703) 292-5111

TDD: (703) 292-5090

To order publications or forms, send an e-mail or telephone: (703) 292-8134

Privacy Act And Public Burden Statements

The information requested on proposal forms and project reports is solicited under the authority of the National Science Foundation Act of 1950, as amended. The information on proposal forms will be used in connection with the selection of qualified proposals; and project reports submitted by proposers will be used for program evaluation and reporting within the Executive Branch and to Congress. The information requested may be disclosed to qualified reviewers and staff assistants as part of the proposal review process; to proposer institutions/grantees to provide or obtain data regarding the proposal review process, award decisions, or the administration of awards; to government contractors, experts, volunteers and researchers and educators as necessary to complete assigned work; to other government agencies or other entities needing information regarding proposers or nominees as part of a joint application review process, or in order to coordinate programs or policy; and to another Federal agency, court, or party in a court or Federal administrative proceeding if the government is a party. Information about Principal Investigators may be added to the Reviewer file and used to select potential candidates to serve as peer reviewers or advisory committee members. See System of Record Notices , NSF-50 , "Principal Investigator/Proposal File and Associated Records," and NSF-51 , "Reviewer/Proposal File and Associated Records.” Submission of the information is voluntary. Failure to provide full and complete information, however, may reduce the possibility of receiving an award.

An agency may not conduct or sponsor, and a person is not required to respond to, an information collection unless it displays a valid Office of Management and Budget (OMB) control number. The OMB control number for this collection is 3145-0058. Public reporting burden for this collection of information is estimated to average 120 hours per response, including the time for reviewing instructions. Send comments regarding the burden estimate and any other aspect of this collection of information, including suggestions for reducing this burden, to:

Suzanne H. Plimpton
Reports Clearance Officer
Policy Office, Division of Institution and Award Support
Office of Budget, Finance, and Award Management
National Science Foundation
Alexandria, VA 22314
