This article discusses the data science skills necessary at every stage of the lifecycle of production-level data science and machine learning systems. It is a result of a cumulative professional experience of authors at JobsInData.

There are two types of data science skills: core skills and enablers. Both are critical for data scientists to master, although some may be more important depending on your current data science career level.

Compared to other skill lists for data scientists, the top data science skills list presented in this article focuses on the following aspects:

Skills needed for production-grade data science systems
Skills enabling maximization of your lifetime data science salary
Skills required for different stages of a data science career path and their impact on career progression
Data science resources to improve your data science resume

Overview - the Data Science Skills Matrix

Data Science Skills

The core Data Science skills:

C1. Transforming business problems into data science frameworks
C2. Data preparation & activation:
C3. Model training:
C4. Inference and monitoring
C5. Long term development & maintenance of Data Science production systems

Let's deep-dive into every skill.

C1 Transforming business problems into data science frameworks

The ability to translate the needs of product teams and business teams into practical, solvable Machine Learning problems. In particular, it involves answering the following questions: - What is the target we are trying to predict? - What are the possible features? - Is there a logical causality between features and the target? - What is the logical path from the model outcomes into business decisions?

An illustrative example from e-commerce recommender systems (simplified for the ease of exposition):

Business Problem: Personalise user interaction with ecommerce website: show the most relevant items to maximise the likelihood of conversion weighted by the item value. The underlying goal is to maximise the GMV - gross merchandise value.

Model specification: Target: Given that the user was shown the product, has she bought it or not? (Classification problem) Features: The historical user’s purchases up until the moment of the given impression expressed as a product-level buy2buy matrix.

Causality between features and target: Yes. Historical purchases give insight into a user's personal taste and needs, and hence could be a good predictor for future purchases.

Logical path from the model into business decision: At the moment of the request for product recommendations, rank and display Top N products according to the following formula:

E(GMV_i) = P(C_i) * Price_i
#where: 
#E(GMV_i) - expected Gross Merchandise Value from showing product i to the user
#P(C_i) - probability of conversion for product i (the model’s output)
#P_i - price of product i

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers

Resources to improve:

This is a skill that comes mostly from experience & industry knowledge. However, there are quality materials out there on this topic:

Google ML problem framing course
Yahoo Paper on Practical Framework for CTR prediction
Various courses at universities involving ML Case Study solving

C2. Data preparation & activation

C2a Data sources extension

Data sources extension is about bringing new data sources relevant to the problem. It is connected to, yet conceptually differt from Feature Engineering:

Feature Engineering is about extracting the signal from the existing data
Data sources extension is about adding new data to the problem. It might come in different forms:
- More records of the same nature
- New characteristics to enrich existing records in the database

Data sources externsion is generally more subtle and complicated than feature engineering, but unlocks signals that were not yet present in the dataset.

Critical for:

Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve:

This skill comes with industry knowledge and discussions with industry experts
A careful consideration of all possible signals influencing the target in the model, potentially using consulting - like tools like a driver tree analysis

C2b ETL - extract, transform & load

Loading the data and preparing if for analysies - visualizations, feature engineering and model training. The key skills here are:

Matching the data type vs business need vs the proper ETL tool. Cost considersations are very important. You want to have the most important data flexibly available for Data Science team to experiment with, however flexible solutions are typically expensive, so you put ALL of your data into flexible tools, your cloud budget will explode. Managing this tradeoff is the most desirable skill here.
Knowledge of all of the most popular ETL tools - otherwise your recommendations will be biased by what you already know.

Critical for:

Senior Data Scientists
Data Scientists

Resources to improve:

The key here is the experience coming from development and maintenance of large-scale data pipelines, hence the most prominent resource here is the time spend on such problems with top data engineers.
Various online courses on ETL tools

C2c Data Labeling

Generating target values for training data samples. Some industries get it for free (e.g. a history of purchases in an e-commerce website is a natural target for the recommender their system models), while for others it is the most effort-intensive step in their Machine Learning pipelines (e.g. segmentation models for images or videos with many classes).

Critical for:

Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve:

It largely depends on the model class, but if you want to do it yourself, there are tools and techniques to streamline Data Labeling:

If you start completely from scratch, then using simple heuristics will typically get you far (for example based on the image metadata). Remember that for a first, simple model, labels may contain noise, as you will use it to generate higher quality labels later on.
Once you train the first model, it allows pseudo-labelling and iteration to generate higher quality labels
Sometimes you don't have labelled data available for your exact problem, but for a similar problem. Using your creativity, you can take advantage of it. Here is a great example from microscoping images, of moving from weak image-level labels to cell level labels.
For a general data labeling strategy please consider an Active Learning approach

There are also companies that will do you for you (but not for free):

C3. Model training:

C3a State Of The Art (SOTA) Machine Learning models

For many the most exciting part of Data Science, often confused with Data Science or Machine Learning as entire domains. The core skill here is to quickly assess which model class is relevant to solve the task at hand.

If you are not sure which model best suits your use-case, we propose the following heuristic for selecting a model class:

Data type	Proposed Model
Tabular, Time series, Events stream	LGBM
Image, Text , Sound	Neural Networks

The above heuristic is based on winning solutions to open problems on Kaggle, as well as combined experience of authors at gradientcareers. Why Kaggle? We believe in the free market, where all Machine Learning model classes compete on an equal basis, and thousands of teams spend months researching the optimal models. In the table below we summarize the SOTA Machine Learning models based on Kaggle competitions in 2022.

State of the art (SOTA) Machine Learning models in 2022

Competition Name	Numer of competing teams	Data Type	Machine learning problem class	Winning (SOTA) Machine Learning model
Feedback Prize - English Language Learning	2654	Text	Regression	Neural networks: - microsoft-deberta-v3-base - deberta-v3-large - deberta-v2-xlarge - roberta-large - distilbert-base-uncased
Open Problems - Multimodal Single-Cell Integration	1220	Single-cell multiomics data	Multimodal + Correlation	Neural networks - custom architecture
RSNA 2022 Cervical Spine Fracture Detection	883	Images	Classification	Neural networks: - resnet 18d unet - effv2 - convnext tiny - convnext nano - convnext pico - convnext tiny - nfnet l0
Google Universal Image Embedding	1022	Images	Creating image representations	Neural Networks: - ViT-H - ViT-L
Mayo Clinic - STRIP AI	888	Images	Classification	Neural Networks: - swin_large_patch4_window12_384
HuBMAP + HPA - Hacking the Human Body	1175	Images	Segmentation	Neural networks: -1x SegFormer mit-b3 - 2x SegFormer mit-b4 - 1x SegFormer mit-b5 - SegFormer mit-b5
American Express - Default Predictio	4874	Tabular	Classification	Gradient Boosted Decision Trees: - LGBM
Feedback Prize - Predicting Effective Arguments	1557	Text	Classification	Neural networks: - deberta-(v3)-large
Google AI4Code – Understand Code in Python Notebooks	1135	Text	Regression	Neural networks: - deberta-(v3)-large
UW-Madison GI Tract Image Segmentation	1548	Images	Segmentation	Neural networks: 4 unets, with efficientnet b4, b5, b6, b7 backbones
Foursquare - Location Matchin	1079	Tabular	Classification	Gradient Boosted Decision Trees + Neural Networks: - LGBM - Bert
Kore 2022	469	-	Game	Rule-based
JPX Tokyo Stock Exchange Prediction	2033	Time series	Regression	Gradient Boosted Decision Trees: - LGBM
Image Matching Challenge 2022	642	Images	Classification	Neural networks: LoFTR SuperGlue DKM
U.S. Patent Phrase to Phrase Matching	1889	Text	Regression	Neural networks: - microsoft/deberta-v3-large - anferico/bert-for-patents - ahotrod/electra_large_discriminator_squad2_512 - Yanhao/simcse-bert-for-patent - funnel-transformer/large
March Machine Learning Mania 2022 - Men’s	930	Tabular	Classification	Gradient Boosted Decision Trees: - XGB
March Machine Learning Mania 2022 - Women's	651	Tabular	Classification	Gradient Boosted Decision Trees: - LGBM
BirdCLEF 2022	807	Sound	Classification	Neural Networks: - tf_efficientnet_b3_ns - eca_nfnet_l0
H&M Personalized Fashion Recommendations	2952	Events' stream	Recommendation system (classification)	Gradient Boosted Decision Trees: - LGBM
NBME - Score Clinical Patient Notes	1471	Text	Named Entity Recognition	Neural Networks: - deberta-v3-large - deberta-v2-large - deberta-large
Happywhale - Whale and Dolphin Identification	1588	Images	Classification	Neural Networks: - efficientnet_b5 - efficientnet_b6 - efficientnet_b7 - efficientnetv2_m - efficientnetv2_l
Ubiquant Market Prediction	2893	Time series	Regression	Gradient Boosted Decision Trees: - LGBM Neural Networks: - Tabnet

To sum up, the key skills of Data Scientists with regards to SOTA Machine Learning models are as follows :

General knowledge which Machine Learning models are currently State of the Art for a range of different use-cases
Practical, working knowledge with ALL of the SOTA Machine Learning methods allowing to effectively apply them to a task at hand

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve:

Participation in Kaggle competitions is by far and large the best method to thoroughly improve your Machine Learning skills and become truly intimate with the State of the Art Machine Learning models. Spending time on Kaggle, especially for the Machine Learning beginners, is way more productive for learning Data Science than any other activity.

The most significant benefits of Kaggle for learning Data Science:

Kaggle is the Stack Overflow of Data Science. We are sorry to break it for you, but in >80% of the cases (a subjective estimate based on our experience), the exact problem you are trying to solve with Machine Learning has already been solved in Kaggle competitions. Even if the problem is coming from a different domain, if you translate it properly into Data Science frameworks, you will realize that it shares the key aspects with already available solutions. When trying to determine which historical Kaggle competition resembles the problem at hand, we suggest using the following table as a heuristic:

Data Type	Classification	Regression
Tabular	x	x
Time series	x	x
Event stream	x	x
Video	x	x
Text	x	x
Sound	x	x

You should first decide where would you put the x for the problem you are trying to solve, and then look for the exact combination of objective-data type among Kaggle competitions.

Kaggle is truly state of the art. Top solutions 10 break the current world record according to Jeremy Howard.
Kaggle accumulates a database of problems with matching solutions with code. However, in order to take full advantage of it, you need to actually participate in a couple of competitions to understand its mechanics.
Kaggle teaches modesty by allowing you to quickly compare your ideas with “the free market”.
Kaggle shows how different models work on practical problems
If you are a competitive person, you will become addicted and it motivate you to go an extra mile and progress quickly
Kaggle is best to efficiently learn Data Science: first you need to commit a lot of effor to participate on a high level, and after the competition finishes you can read summaries of the top solution with code.
Compared to reading papers, please consider that for every competition the community has already spent thousands of hours studying papers to get an extra 1%. of their Machine Learning models. .

In order to take full advantage of Kaggle and progress quickly, before you start a competition, you may consider completing the below Machine Learning courses and reading one book. Those resources were cited to me by multiple Kaggle Grandmasters as their introductory journeys to Data Science:

Fast AI. The AI course by J. Howard and R. Thomas. It is focused purely on Deep Learning - which will not cover all of your use-cases, but in general it is brilliant (in large part because of the teachers).
The coursera ML course - an overall intro intro Machine Learning by Andrew NG.
The Elements of Statistical Learning - a book by Trevor Hastie, Robert Tibshirani and Jerome Friedman covering the most important basic theoretical aspects of Machine Learning.

C3b Feature Engineering

Feature Engineering is considered the art of Data Science, as it aims to uncover the hidden signal in the data. The modern feature engineering techniques depend mostly on the data type:

For images, video, text and sound data, automated feature extraction using Neural Networks have dominated the field of feature engineering.
For tabular, time series, and event stream data, it is an open question whether to use Neural Networks for automatic feature extraction, or build features manually using heuristics and business insights. Both approaches have pros and cons, as we present below. The typical evolution of Machine Learning systems is to start with manually engineered features + GBDT models to establish a first working system, and then try to beat it with automated feature extraction as the ML system becomes more mature.

This paper offers some intuition on why tree based models outperform deep learning on tabular data.

Manual feature engineering techniques for tabular, time series and event stream data:

Error analysis
Driver Tree analysis
Interviews with industry experts

Comparison of manual vs automated feature engineering for tabular, time series and event stream data

Criterion	Manual feature engineering + GBDT models	Automated feature engineering using Neural Networks
Ability to incorporate existing business insights	Yes	No
Data preprocessing	Light	Heavy
Required data to work	Small	Big
Iteriation cycles	Quick	Slow
SOTA results	Majority	Minority
Guarantee of completeness	No	Yes
Benefit from additional training samples	Lower	Higher
Scability	Worse	Better

Resources to improve

Participation in Machine Learning competitions on Kaggle. Careful analysis of similar competitions typically yields lots of good ideas quickly.
Air BnB paper on using entity embeddings as features for Gradient Boosted Decision Trees model in a recommender system

Critical for:

Senior Data Scientists
Data Scientists

C3c Machine Learning validation and strategic evaluation

Machine Learning validation

Key skills related to Machine Learning validation:

Understanding the importance of it. Proper validation enables iteration. Without it, you cannot make progress
Knowing the train - test - validation split mechanics
Understanding the Target Leak problem
Understanding that any data you would like to use in train, you need to have available at the exact moment of making predictions
Understaning various concepts related to different data structures:
- Time based validation
- User based validation
- Stratified validation

Machine Learning strategic evaluation

The most difficult question about Machine Learning and Data Science that you will ever get is often coming directly from the CTO:

Is our Machine Learning model good or bad?

Answering such a question in a data-driven, non-subjective manner is a skill called Machine Learning evaluation.

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve

Participation in Kaggle competitions. Kaggle teaches you to be very rigorous with the validation of your models.
Benchmarking to your competition. In most cases, you will not know how their models work exactly or what is their accuracy, but you can get surprisingly far by listening to their podcasts or reading their technical blog.
In some industries, for example Adtech, if you employ your creativity, you can reverse-engineer the exact success rates of your competitors' Machine Learning models and use it as a reference point.

C4 Inference and monitoring of Data Science production systems

Inference is about deploying your models which were trained offline, into an online traffic. Key considerations:

Feature mismtach between training and inference
Models complexity - inference determines cost and hence allowed complexity at training time
Monitoring features distribution over time
Resolving incidents - - Funnel analysis. looking at the graphs to identify the closest metric influenced by us, and searching slack on correlated deployments.

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve:

Libraries focused on improving inference performance. For example, Treelite fo GBDT models.
Building your own app that handles real time traffic

C5 Long term development & maintenance of Data Science production systems

It is the broad area covering a range of topics related to all aspects of Data Science systems functioning on production. Truly mastering this skill is the most difficult of all ML production skills, in part because the knowledge here is very sparse, and mostly embedded in few large organisations that are running Machine Learnign systems on production long enough to encounter and tackle those problems.

The key Data Science skills related to long-term Data Science technical vision & development:

Skill	Why difficult
Building the Data Science technical vision, and roadmap and prioritising Machine Learning projects.	Truly big ideas cannot be a/b tested without an expensive prior investment. Consequently, the priotization of Machine Learning projects involves application of judgment and is prone to cognitive biases.
Deploying New Machine Learning Models that were trained on traffic generated by old Machine Learning models	There are many ways for incumbent ML models to accumulate knowledge which is not transferable into new models during training. It is even difficult to realize the magnitude of the challenge here until you have tried yourself to improve any Machine Learning model that has been deployed on production for a long time. For an in-depth discussion of this topic please see this paper.
Servicing data science tech debt	The business orientation on growth typically tramples any other considerations, including reduction of the tech debt. This is magnified by the fact that developing and deploying ML systems is (relatively) fast and cheap, but maintaining them over time is difficult and expensive. Hence, the temptation is always to deploy a new model rather than refactor the code for the old one, leading inevitably to tech debt accumulation. Not all tech debt is bad, but all debt needs to be serviced. This paper discusses it in details.

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers

Resources to improve:

This skills comes mostly from experience, however some practical papers were also published on this topic:

Pinterest paper on the evolution of their recommender system
Google ML rules

E1 Industry Knowledge

A vital, yet often neglected enabler for Data Scientists, who benefit much more from industry knowledge compared to e.g. Software Developers. One of the underlying reasons is that Data Science is still a relatively young domain:

In software development, business analysts/architects are responsible to come up with a solution, and Software Developers are mosly responsible to deploy it
In Data Science, Data Scientists are responsible for both aspects of the system: conceptual problem framing and deployment as well

When discussing this with senior Data Science leaders the sentiment that often we get is following:

"Learning how the industry works is difficult. Training the model is easy."

Being such a valuable skill, it is used as a distinguishing factor for Data Science candidates in their recruitment process. Especially when trying to land your first Data Science job, industry knowledge acquired through own research and individual projects makes even the junior Data Scientist resume stand out among other candidates.

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve

On the job experience: Data Science models built within a particular industry
Individual projects and research

E2 Coding

Conceptually, coding is not a core data science skill, but rather the measure to tell a computer what you want it to do. In a distant future it might theoretically be replaced by code-free solutions, if they become mature enough. At the moment, however, programming pervades every aspect of Data Science and apt coding skills are vital for proficient Data Scientists. Various programming languanges play their roles at different stages of Data Science systems lifecycle:

Data Preparation and Activation: SQL
Model Training: Python (Flexibility)
Inference: Scala / Java / C++ (Scalability)

Critical for:

Data Science Managers
Senior Data Scientists
Data Scientists

Resources to improve:

Courses at datacamp.com and other platforms, which focus on use-cases for Data Science:

E3 Data Science Talent Management & Team Management

Arguably the most important skill for senior Data Science managers. It involves answering the following questions:

How to attract skilled Data Scientists to the organization? -- Where are the biggest pools of candidates? -- What should the be value proposition of the organization for Data Scientists?
How should the Data Science recruitment process should like?
How should the Data Science onboarding look like?
How should the Data Science career ladder look like?
What should be the main KPIs for Data Scientists?
How to keep the Data Science team excited?
How to resolve conflicts between Data Scientists?
How to structure the long-term Data Science career progression to align value generated with the total compensation?
How to prevent churn of the top Data Scientists in the team?

The answer to the above questions depends largely on the context of a given organization, and requires both Data Science skills as well as team management skills.

Critical for:

Chief Data Scientists
Data Science Directors
Data Science Managers

Resources to improve:

This skill comes mostly from experience, therefore the most useful resource here is the experience from reputable Data Science organizations and Data Science senior leaders who are known to do these things well.