1. Data Acquisition and Integration:
- Source, gather, and integrate data from various internal and external platforms and databases.
- Collaborate with data providers and stakeholders to ensure data availability and quality.
2. Data Cleaning:
- Identify, diagnose, and resolve any data inconsistencies, anomalies, and missing data.
- Design and implement data cleaning procedures to enhance data quality and reliability.
- Document any data transformations, anomalies, and resolutions to maintain data integrity.
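The cleaning duties above can be illustrated with a minimal sketch: deduplicating rows and imputing missing numeric values with a column median. The record shape and field names ("id", "value") are hypothetical, chosen only for illustration.

```python
from statistics import median

def clean_records(records):
    """Deduplicate rows and impute missing numeric values.
    Field names ("id", "value") are illustrative, not a fixed schema."""
    # Drop exact duplicates while preserving order.
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Impute missing "value" fields with the median of observed values.
    observed = [r["value"] for r in unique if r.get("value") is not None]
    fill = median(observed) if observed else 0
    for r in unique:
        if r.get("value") is None:
            r["value"] = fill
    return unique
```

In practice each imputation or deduplication decision would also be logged, per the documentation duty above.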
3. Database Management:
- Design and maintain scalable and optimized database schemas for storing and retrieving machine learning datasets.
4. Query Development:
- Develop complex SQL queries to extract, transform, and load (ETL) data tailored to specific machine learning tasks.
- Create data views and aggregations to simplify data access and usage by machine learning teams.
- Optimize query performance to ensure swift data retrieval.
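A small self-contained sketch of the second bullet, using an in-memory SQLite database: a view that pre-aggregates per-user features so the ML team can query them directly. The table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, label TEXT, score REAL);
INSERT INTO events VALUES (1, 'pos', 0.9), (1, 'neg', 0.2), (2, 'pos', 0.7);

-- A view that pre-aggregates per-user features for the ML team.
CREATE VIEW user_features AS
SELECT user_id,
       COUNT(*)           AS n_events,
       AVG(score)         AS mean_score,
       SUM(label = 'pos') AS n_pos
FROM events
GROUP BY user_id;
""")
rows = conn.execute("SELECT * FROM user_features ORDER BY user_id").fetchall()
```

Materializing such aggregations as views (or, for heavy workloads, materialized tables with indexes) is one common way to meet the "swift data retrieval" goal above.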
5. Process Optimization & Data Pipelines:
- Design, implement, and maintain ETL pipelines for seamless data flow across systems.
- Optimize existing data processes for speed, cost-efficiency, and reliability.
- Automate recurring tasks and jobs to ensure timely data availability for machine learning projects.
- Monitor and ensure the smooth running of data pipelines, troubleshoot any issues, and provide quick resolutions.
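The pipeline duties above can be sketched as a minimal extract-transform-load chain with logging and retries, standing in for a full orchestrator such as Apache Airflow. The CSV input and the "amount" column are illustrative assumptions.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract(raw_csv):
    # Parse CSV text into dict rows (stand-in for pulling from a source system).
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    # Normalize types; the "amount" column is illustrative.
    return [{"name": r["name"], "amount": float(r["amount"])} for r in rows]

def load(rows, sink):
    sink.extend(rows)
    return len(rows)

def run_pipeline(raw_csv, sink, retries=2):
    """Run extract -> transform -> load, retrying transient failures."""
    for attempt in range(retries + 1):
        try:
            n = load(transform(extract(raw_csv)), sink)
            log.info("loaded %d rows", n)
            return n
        except Exception:
            log.warning("attempt %d failed, retrying", attempt + 1)
    raise RuntimeError("pipeline failed after retries")
```

A real deployment would schedule this as a DAG task, emit metrics per stage, and alert on repeated failures rather than silently retrying.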
6. Support for Machine Learning Testing:
- Work with the team to understand data needs for model development and testing.
- Assist in debugging data-related issues in machine learning pipelines, such as data leakage, imbalances, or missing values.
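Two of the data issues named above, leakage and class imbalance, can be screened for mechanically. A minimal sketch, assuming ID lists per split and a flat label list; the imbalance threshold is an illustrative default, not a standard value.

```python
from collections import Counter

def check_split(train_ids, test_ids, labels, imbalance_ratio=4.0):
    """Flag two common pre-training data issues:
    row leakage (IDs shared between train and test splits) and
    class imbalance beyond a majority/minority ratio.
    The threshold and input shapes are illustrative."""
    issues = []
    # Leakage: any record ID present in both splits.
    leaked = set(train_ids) & set(test_ids)
    if leaked:
        issues.append(f"leakage: {len(leaked)} ids appear in both splits")
    # Imbalance: majority class outnumbers minority beyond the ratio.
    counts = Counter(labels)
    if counts:
        majority, minority = max(counts.values()), min(counts.values())
        if majority / minority > imbalance_ratio:
            issues.append("imbalance: class ratio exceeds threshold")
    return issues
```

Running such checks before every training job turns "assist in debugging" into an automated gate rather than a manual hunt.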
Requirements:
Qualifications:
- Bachelor's or higher degree in a related field (e.g., Data Science, Computer Science, Statistics).
- Knowledge of ETL tools (e.g., Apache NiFi, Talend, Informatica) for data acquisition and transformation.
- Experience using APIs and connectors for data extraction from various sources.
- Proficiency with database management systems (e.g., MySQL, PostgreSQL, MongoDB) and strong SQL skills.
- Ability to design and maintain optimized database schemas.
- Proficiency in writing complex SQL queries for data extraction and transformation.
- Experience with task automation and orchestration tools, such as Apache Airflow.
- Ability to use data monitoring and logging tools to ensure proper data flow.
- Capability to troubleshoot and provide efficient solutions in case of data flow interruptions.
- Understanding of data requirements for machine learning projects and the ability to identify data issues in ML pipelines.
- Ability to document data transformations and issue resolutions to maintain data integrity.
- Skill in optimizing ETL processes and queries for efficient performance.
- Knowledge of programming languages such as Python or Java is beneficial for automation and customization tasks.
- Familiarity with data cleaning tools.
- Awareness of data security best practices and the ability to implement security measures in data flows.
- Strong problem-solving skills and attention to detail.
- Excellent communication and collaboration skills to work with cross-functional teams.