Data Science Skills


This article discusses the data science skills necessary at every stage of the lifecycle of production-level data science and machine learning systems. It is the result of the cumulative professional experience of the authors at JobsInData.

There are two types of data science skills: core skills and enablers. Both are critical for data scientists to master, although some may be more important depending on your current data science career level.

Compared to other skill lists for data scientists, the top data science skills list presented in this article focuses on the following aspects:

  • Skills needed for production-grade data science systems
  • Skills enabling maximization of your lifetime data science salary
  • Skills required for different stages of a data science career path and their impact on career progression
  • Data science resources to improve your data science resume

Overview - the Data Science Skills Matrix


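The core Data Science skills:

  • C1 Transforming business problems into data science frameworks
  • C2 Data preparation & activation
  • C3 Model training
  • C4 Inference and monitoring of Data Science production systems
  • C5 Long term development & maintenance of Data Science production systems

The enabling Data Science skills:

  • E1 Industry Knowledge
  • E2 Coding
  • E3 Data Science Talent Management & Team Management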

Let's deep-dive into every skill.

C1 Transforming business problems into data science frameworks

The ability to translate the needs of product teams and business teams into practical, solvable Machine Learning problems. In particular, it involves answering the following questions:

  • What is the target we are trying to predict?
  • What are the possible features?
  • Is there a logical causality between features and the target?
  • What is the logical path from the model outcomes into business decisions?

An illustrative example from e-commerce recommender systems (simplified for the ease of exposition):

Business Problem: Personalise user interaction with an e-commerce website: show the most relevant items to maximise the likelihood of conversion weighted by the item value. The underlying goal is to maximise the GMV (gross merchandise value).

Model specification:

  • Target: Given that the user was shown the product, did she buy it or not? (Classification problem)
  • Features: The user's historical purchases up to the moment of the given impression, expressed as a product-level buy2buy matrix.

Causality between features and target: Yes. Historical purchases give insight into a user's personal taste and needs, and hence could be a good predictor for future purchases.

Logical path from the model into business decision: At the moment of the request for product recommendations, rank and display Top N products according to the following formula:

E(GMV_i) = P(C_i) * Price_i
#where: 
#E(GMV_i) - expected Gross Merchandise Value from showing product i to the user
#P(C_i) - probability of conversion for product i (the model’s output)
#Price_i - price of product i
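
The ranking rule E(GMV_i) = P(C_i) * Price_i can be sketched in a few lines of Python. This is a minimal illustration, not a production recommender: the `Candidate` class, the product names, and the probabilities and prices are all hypothetical, and the conversion probabilities are assumed to come from an already trained model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    product_id: str
    p_conversion: float  # P(C_i): the model's predicted probability of conversion
    price: float         # Price_i


def expected_gmv(c: Candidate) -> float:
    """E(GMV_i) = P(C_i) * Price_i"""
    return c.p_conversion * c.price


def top_n(candidates: list[Candidate], n: int) -> list[Candidate]:
    """Return the n candidates with the highest expected GMV, best first."""
    return sorted(candidates, key=expected_gmv, reverse=True)[:n]


# Hypothetical model outputs and prices, for illustration only.
candidates = [
    Candidate("socks", p_conversion=0.20, price=5.0),
    Candidate("jacket", p_conversion=0.02, price=120.0),
    Candidate("t-shirt", p_conversion=0.10, price=15.0),
]
ranking = [c.product_id for c in top_n(candidates, n=2)]
print(ranking)
```

Note that the item with the highest conversion probability (the socks) does not make the Top 2 here, because its low price gives it the lowest expected GMV.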

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers

Resources to improve:

This is a skill that comes mostly from experience & industry knowledge. However, there are quality materials out there on this topic:

C2 Data preparation & activation

C2a Data sources extension

Data sources extension is about bringing in new data sources relevant to the problem. It is connected to, yet conceptually different from, Feature Engineering:

  • Feature Engineering is about extracting the signal from the existing data
  • Data sources extension is about adding new data to the problem. It might come in different forms:
    • More records of the same nature
    • New characteristics to enrich existing records in the database

Data sources extension is generally more subtle and complicated than feature engineering, but it unlocks signals that were not yet present in the dataset.
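
The two forms of data sources extension can be illustrated with plain Python dictionaries. This is only a sketch: the field names (`user_id`, `purchases`, `country`) and the `enrich` helper are hypothetical, and a real pipeline would typically do the same thing with SQL joins or a dataframe library.

```python
# Existing dataset: one record per user (hypothetical fields).
existing = [
    {"user_id": 1, "purchases": 3},
    {"user_id": 2, "purchases": 0},
]

# Form 1: more records of the same nature (same fields, new users).
more_records = [{"user_id": 3, "purchases": 7}]
extended_rows = existing + more_records

# Form 2: new characteristics from another source, keyed by user_id.
new_source = {1: {"country": "PL"}, 3: {"country": "DE"}}


def enrich(rows, extra_by_user):
    """Left-join extra attributes onto rows.

    Users missing from the new source keep only their original fields.
    """
    return [{**row, **extra_by_user.get(row["user_id"], {})} for row in rows]


extended = enrich(extended_rows, new_source)
print(extended)
```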

Critical for:

  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve:

  • This skill comes with industry knowledge and discussions with industry experts
  • A careful consideration of all possible signals influencing the target in the model, potentially using consulting-style tools such as driver tree analysis

C2b ETL - extract, transform & load

Loading the data and preparing it for analysis: visualizations, feature engineering and model training. The key skills here are:

  • Matching the data type and the business need with the proper ETL tool. Cost considerations are very important. You want to have the most important data flexibly available for the Data Science team to experiment with. However, flexible solutions are typically expensive, so if you put ALL of your data into flexible tools, your cloud budget will explode. Managing this tradeoff is the most desirable skill here.
  • Knowledge of all of the most popular ETL tools - otherwise your recommendations will be biased by what you already know.

Critical for:

  • Senior Data Scientists
  • Data Scientists

Resources to improve:

  • The key here is the experience that comes from developing and maintaining large-scale data pipelines, hence the most prominent resource is the time spent on such problems with top data engineers.
  • Various online courses on ETL tools

C2c Data Labeling

Generating target values for training data samples. Some industries get it for free (e.g. a history of purchases on an e-commerce website is a natural target for their recommender system models), while for others it is the most effort-intensive step in their Machine Learning pipelines (e.g. segmentation models for images or videos with many classes).

Critical for:

  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve:

It largely depends on the model class, but if you want to do it yourself, there are tools and techniques to streamline Data Labeling:

  • If you start completely from scratch, then using simple heuristics will typically get you far (for example based on the image metadata). Remember that for a first, simple model the labels may contain noise, since you will use that model to generate higher quality labels later on.
  • Once you train the first model, it allows pseudo-labelling and iteration to generate higher quality labels
  • Sometimes you don't have labelled data available for your exact problem, but you do for a similar problem. Using your creativity, you can take advantage of it. A great example comes from microscopy images: moving from weak image-level labels to cell-level labels.
  • For a general data labeling strategy please consider an Active Learning approach
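
The pseudo-labelling and Active Learning ideas above can be combined in a small sketch. It assumes a binary problem and a first model that already outputs P(positive) for each unlabelled sample; the function name, the 0.9 confidence threshold and the annotation budget are hypothetical choices for illustration, not recommended values.

```python
def split_by_confidence(probabilities, confident=0.9, budget=2):
    """Split unlabelled samples using a first model's P(positive) scores.

    Returns (pseudo_labels, to_annotate):
      pseudo_labels: {sample_id: 0 or 1} for samples the model is confident about
      to_annotate:   the `budget` most uncertain remaining samples (closest to 0.5),
                     which are the candidates for manual labelling
    """
    pseudo_labels = {}
    uncertain = []
    for sample_id, p in probabilities.items():
        if p >= confident:
            pseudo_labels[sample_id] = 1
        elif p <= 1 - confident:
            pseudo_labels[sample_id] = 0
        else:
            uncertain.append((abs(p - 0.5), sample_id))
    to_annotate = [sample_id for _, sample_id in sorted(uncertain)[:budget]]
    return pseudo_labels, to_annotate


# Hypothetical scores from a first, noisy model.
scores = {"img_a": 0.97, "img_b": 0.52, "img_c": 0.04, "img_d": 0.70, "img_e": 0.45}
pseudo_labels, to_annotate = split_by_confidence(scores)
print(pseudo_labels, to_annotate)
```

Confident predictions become training labels for the next iteration, while the limited human labelling effort goes to the samples where the model is least sure.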

There are also companies that will do it for you (but not for free):

C3 Model training

C3a State Of The Art (SOTA) Machine Learning models

For many, this is the most exciting part of Data Science, and it is often confused with Data Science or Machine Learning as entire domains. The core skill here is to quickly assess which model class is relevant to solve the task at hand.

If you are not sure which model best suits your use-case, we propose the following heuristic for selecting a model class:

| Data type | Proposed model |
|---|---|
| Tabular, Time series, Event stream | LGBM |
| Image, Text, Sound | Neural Networks |

The above heuristic is based on winning solutions to open problems on Kaggle, as well as combined experience of authors at gradientcareers. Why Kaggle? We believe in the free market, where all Machine Learning model classes compete on an equal basis, and thousands of teams spend months researching the optimal models. In the table below we summarize the SOTA Machine Learning models based on Kaggle competitions in 2022.

State of the art (SOTA) Machine Learning models in 2022

| Competition name | Number of competing teams | Data type | Machine learning problem class | Winning (SOTA) Machine Learning model |
|---|---|---|---|---|
| Feedback Prize - English Language Learning | 2654 | Text | Regression | Neural networks: microsoft-deberta-v3-base, deberta-v3-large, deberta-v2-xlarge, roberta-large, distilbert-base-uncased |
| Open Problems - Multimodal Single-Cell Integration | 1220 | Single-cell multiomics data | Multimodal + Correlation | Neural networks: custom architecture |
| RSNA 2022 Cervical Spine Fracture Detection | 883 | Images | Classification | Neural networks: resnet 18d unet, effv2, convnext tiny, convnext nano, convnext pico, convnext tiny, nfnet l0 |
| Google Universal Image Embedding | 1022 | Images | Creating image representations | Neural networks: ViT-H, ViT-L |
| Mayo Clinic - STRIP AI | 888 | Images | Classification | Neural networks: swin_large_patch4_window12_384 |
| HuBMAP + HPA - Hacking the Human Body | 1175 | Images | Segmentation | Neural networks: 1x SegFormer mit-b3, 2x SegFormer mit-b4, 1x SegFormer mit-b5, SegFormer mit-b5 |
| American Express - Default Prediction | 4874 | Tabular | Classification | Gradient Boosted Decision Trees: LGBM |
| Feedback Prize - Predicting Effective Arguments | 1557 | Text | Classification | Neural networks: deberta-(v3)-large |
| Google AI4Code – Understand Code in Python Notebooks | 1135 | Text | Regression | Neural networks: deberta-(v3)-large |
| UW-Madison GI Tract Image Segmentation | 1548 | Images | Segmentation | Neural networks: 4 unets, with efficientnet b4, b5, b6, b7 backbones |
| Foursquare - Location Matching | 1079 | Tabular | Classification | Gradient Boosted Decision Trees + Neural networks: LGBM, Bert |
| Kore 2022 | 469 | - | Game | Rule-based |
| JPX Tokyo Stock Exchange Prediction | 2033 | Time series | Regression | Gradient Boosted Decision Trees: LGBM |
| Image Matching Challenge 2022 | 642 | Images | Classification | Neural networks: LoFTR, SuperGlue, DKM |
| U.S. Patent Phrase to Phrase Matching | 1889 | Text | Regression | Neural networks: microsoft/deberta-v3-large, anferico/bert-for-patents, ahotrod/electra_large_discriminator_squad2_512, Yanhao/simcse-bert-for-patent, funnel-transformer/large |
| March Machine Learning Mania 2022 - Men’s | 930 | Tabular | Classification | Gradient Boosted Decision Trees: XGB |
| March Machine Learning Mania 2022 - Women's | 651 | Tabular | Classification | Gradient Boosted Decision Trees: LGBM |
| BirdCLEF 2022 | 807 | Sound | Classification | Neural networks: tf_efficientnet_b3_ns, eca_nfnet_l0 |
| H&M Personalized Fashion Recommendations | 2952 | Event stream | Recommendation system (classification) | Gradient Boosted Decision Trees: LGBM |
| NBME - Score Clinical Patient Notes | 1471 | Text | Named Entity Recognition | Neural networks: deberta-v3-large, deberta-v2-large, deberta-large |
| Happywhale - Whale and Dolphin Identification | 1588 | Images | Classification | Neural networks: efficientnet_b5, efficientnet_b6, efficientnet_b7, efficientnetv2_m, efficientnetv2_l |
| Ubiquant Market Prediction | 2893 | Time series | Regression | Gradient Boosted Decision Trees: LGBM; Neural networks: Tabnet |

To sum up, the key skills of Data Scientists with regard to SOTA Machine Learning models are as follows:

  • General knowledge of which Machine Learning models are currently State of the Art for a range of different use-cases
  • Practical, working knowledge of ALL of the SOTA Machine Learning methods, sufficient to apply them effectively to the task at hand

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve:

Participation in Kaggle competitions is by far the best method to thoroughly improve your Machine Learning skills and become truly intimate with the State of the Art Machine Learning models. Spending time on Kaggle, especially for Machine Learning beginners, is far more productive for learning Data Science than any other activity.

The most significant benefits of Kaggle for learning Data Science:

  • Kaggle is the Stack Overflow of Data Science. We are sorry to break it to you, but in >80% of cases (a subjective estimate based on our experience), the exact problem you are trying to solve with Machine Learning has already been solved in Kaggle competitions. Even if the problem comes from a different domain, if you translate it properly into Data Science frameworks, you will realize that it shares its key aspects with already available solutions. When trying to determine which historical Kaggle competition resembles the problem at hand, we suggest using the following table as a heuristic:

| Data Type | Classification | Regression |
|---|---|---|
| Tabular | x | x |
| Time series | x | x |
| Event stream | x | x |
| Video | x | x |
| Text | x | x |
| Sound | x | x |

You should first decide where you would put the x for the problem you are trying to solve, and then look for that exact combination of objective and data type among Kaggle competitions.

  • Kaggle is truly state of the art. Top 10 solutions break the current world record, according to Jeremy Howard.
  • Kaggle accumulates a database of problems with matching solutions with code. However, in order to take full advantage of it, you need to actually participate in a couple of competitions to understand its mechanics.
  • Kaggle teaches modesty by allowing you to quickly compare your ideas with “the free market”.
  • Kaggle shows how different models work on practical problems
  • If you are a competitive person, you will become addicted, and it will motivate you to go the extra mile and progress quickly
  • Kaggle is an efficient way to learn Data Science: first you need to commit a lot of effort to participate at a high level, and after the competition finishes you can read summaries of the top solutions with code.
  • Compared to reading papers, please consider that for every competition the community has already spent thousands of hours studying papers to squeeze an extra 1% out of their Machine Learning models.
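
The lookup heuristic described above (decide on a data type and a problem class, then search for competitions with the same combination) can be sketched as a small index. The rows are copied from the 2022 competition table in this article; the `similar_competitions` helper is a hypothetical name, and a real search would of course cover more years and competitions.

```python
from collections import defaultdict

# A few rows from the 2022 competition table in this article:
# (competition, data type, problem class)
COMPETITIONS = [
    ("American Express - Default Prediction", "Tabular", "Classification"),
    ("JPX Tokyo Stock Exchange Prediction", "Time series", "Regression"),
    ("Ubiquant Market Prediction", "Time series", "Regression"),
    ("Feedback Prize - Predicting Effective Arguments", "Text", "Classification"),
    ("BirdCLEF 2022", "Sound", "Classification"),
]

# Index competitions by the (data type, problem class) cell they fall into.
index = defaultdict(list)
for name, data_type, problem in COMPETITIONS:
    index[(data_type.lower(), problem.lower())].append(name)


def similar_competitions(data_type, problem):
    """Return competitions that share the given data type and problem class."""
    return index.get((data_type.lower(), problem.lower()), [])


print(similar_competitions("Time series", "Regression"))
```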

In order to take full advantage of Kaggle and progress quickly, you may consider completing the Machine Learning courses below and reading one book before you start a competition. Those resources were cited to me by multiple Kaggle Grandmasters as their introductory journeys to Data Science:

  • Fast AI. The AI course by J. Howard and R. Thomas. It is focused purely on Deep Learning - which will not cover all of your use-cases, but in general it is brilliant (in large part because of the teachers).
  • The Coursera ML course - an overall intro to Machine Learning by Andrew Ng.
  • The Elements of Statistical Learning - a book by Trevor Hastie, Robert Tibshirani and Jerome Friedman covering the most important basic theoretical aspects of Machine Learning.

C3b Feature Engineering

Feature Engineering is considered the art of Data Science, as it aims to uncover the hidden signal in the data. The modern feature engineering techniques depend mostly on the data type:

  • For images, video, text and sound data, automated feature extraction using Neural Networks has come to dominate the field of feature engineering.
  • For tabular, time series, and event stream data, it is an open question whether to use Neural Networks for automatic feature extraction, or build features manually using heuristics and business insights. Both approaches have pros and cons, as we present below. The typical evolution of Machine Learning systems is to start with manually engineered features + GBDT models to establish a first working system, and then try to beat it with automated feature extraction as the ML system becomes more mature.

This paper offers some intuition on why tree based models outperform deep learning on tabular data.

Manual feature engineering techniques for tabular, time series and event stream data:

  • Error analysis
  • Driver Tree analysis
  • Interviews with industry experts
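
As an illustration of the first technique, error analysis, the sketch below groups validation errors by a segment to see where the model performs worst. It is only one possible reading of error analysis; the data, the `weekday` segment and the function name are hypothetical. A segment with unusually high error hints at a signal the current features do not capture.

```python
from collections import defaultdict


def mean_abs_error_by_segment(rows, segment_key):
    """Group prediction errors by a segment and return segments sorted worst-first."""
    errors = defaultdict(list)
    for row in rows:
        errors[row[segment_key]].append(abs(row["actual"] - row["predicted"]))
    mae = {segment: sum(v) / len(v) for segment, v in errors.items()}
    return sorted(mae.items(), key=lambda item: item[1], reverse=True)


# Hypothetical validation predictions.
rows = [
    {"weekday": "Sat", "actual": 10, "predicted": 4},
    {"weekday": "Sat", "actual": 12, "predicted": 6},
    {"weekday": "Mon", "actual": 5, "predicted": 5},
    {"weekday": "Mon", "actual": 6, "predicted": 4},
]
worst_first = mean_abs_error_by_segment(rows, "weekday")
print(worst_first)
```

In this made-up example the model underestimates Saturdays, which would suggest adding a weekend or day-of-week feature.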

Comparison of manual vs automated feature engineering for tabular, time series and event stream data

| Criterion | Manual feature engineering + GBDT models | Automated feature engineering using Neural Networks |
|---|---|---|
| Ability to incorporate existing business insights | Yes | No |
| Data preprocessing | Light | Heavy |
| Required data to work | Small | Big |
| Iteration cycles | Quick | Slow |
| SOTA results | Majority | Minority |
| Guarantee of completeness | No | Yes |
| Benefit from additional training samples | Lower | Higher |
| Scalability | Worse | Better |

Resources to improve

  • Participation in Machine Learning competitions on Kaggle. Careful analysis of similar competitions typically yields lots of good ideas quickly.
  • Airbnb paper on using entity embeddings as features for a Gradient Boosted Decision Trees model in a recommender system

Critical for:

  • Senior Data Scientists
  • Data Scientists

C3c Machine Learning validation and strategic evaluation

Machine Learning validation

Key skills related to Machine Learning validation:

  • Understanding its importance. Proper validation enables iteration; without it, you cannot make progress.
  • Knowing the mechanics of the train-validation-test split
  • Understanding the Target Leak problem
  • Understanding that any data you use in training must also be available at the exact moment of making predictions
  • Understanding the validation schemes that suit different data structures:
    • Time based validation
    • User based validation
    • Stratified validation
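
Two of the schemes listed above, time based and user based validation, can be sketched with the standard library alone. This is a minimal illustration with hypothetical field names (`user`, `day`); in practice a library such as scikit-learn provides ready-made splitters, and stratified validation is omitted here for brevity.

```python
import random
from datetime import date


def time_based_split(rows, cutoff):
    """Train on events strictly before `cutoff`, validate on the rest.

    This mimics production, where the model only ever sees the past.
    """
    train = [r for r in rows if r["day"] < cutoff]
    valid = [r for r in rows if r["day"] >= cutoff]
    return train, valid


def user_based_split(rows, valid_fraction=0.5, seed=0):
    """Assign whole users to one side so that no user appears in both sets."""
    users = sorted({r["user"] for r in rows})
    random.Random(seed).shuffle(users)
    n_valid = max(1, int(len(users) * valid_fraction))
    valid_users = set(users[:n_valid])
    train = [r for r in rows if r["user"] not in valid_users]
    valid = [r for r in rows if r["user"] in valid_users]
    return train, valid


# Hypothetical event log.
rows = [
    {"user": "u1", "day": date(2022, 1, 1)},
    {"user": "u2", "day": date(2022, 1, 2)},
    {"user": "u1", "day": date(2022, 1, 3)},
    {"user": "u3", "day": date(2022, 1, 4)},
]
train_t, valid_t = time_based_split(rows, cutoff=date(2022, 1, 3))
train_u, valid_u = user_based_split(rows)
```

Both splits guard against forms of target leak: the first prevents training on the future, the second prevents the model from memorising individual users seen in training.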

Machine Learning strategic evaluation

The most difficult question about Machine Learning and Data Science that you will ever get often comes directly from the CTO:

  • Is our Machine Learning model good or bad?

Answering such a question in a data-driven, non-subjective manner is a skill called Machine Learning evaluation.

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve

  • Participation in Kaggle competitions. Kaggle teaches you to be very rigorous with the validation of your models.
  • Benchmarking against your competitors. In most cases you will not know exactly how their models work or how accurate they are, but you can get surprisingly far by listening to their podcasts or reading their technical blogs.
  • In some industries, for example Adtech, if you employ your creativity, you can reverse-engineer the exact success rates of your competitors' Machine Learning models and use it as a reference point.

C4 Inference and monitoring of Data Science production systems

Inference is about deploying models that were trained offline so that they serve online traffic. Key considerations:

  • Feature mismatch between training and inference
  • Model complexity: inference determines cost, and hence the complexity allowed at training time
  • Monitoring feature distributions over time
  • Resolving incidents: funnel analysis, looking at the graphs to identify the closest metric influenced by us, and searching Slack for correlated deployments.
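
One common way to monitor a feature's distribution over time is the Population Stability Index (PSI), which compares the binned distribution seen at training time with the one seen at inference. The article does not prescribe a specific metric, so PSI is an assumption here, as are the bin edges and sample values; the sketch also assumes both samples are non-empty.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    `edges` are the interior bin boundaries, so values fall into
    len(edges) + 1 bins. Empty bins are floored at `eps` to avoid log(0).
    """
    def bin_shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Hypothetical feature values at training time and at inference time.
training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
same = list(training)
shifted = [v + 5 for v in training]
edges = [3.5, 6.5]

print(psi(training, same, edges))     # identical distributions give 0.0
print(psi(training, shifted, edges))  # a shifted distribution gives a large value
```

A PSI of zero means the binned distributions match; the alerting threshold is a judgment call that depends on the feature and the number of bins.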

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve:

  • Libraries focused on improving inference performance. For example, Treelite for GBDT models.
  • Building your own app that handles real time traffic

C5 Long term development & maintenance of Data Science production systems

It is a broad area covering a range of topics related to all aspects of Data Science systems running in production. Truly mastering this skill is the most difficult of all ML production skills, in part because the knowledge here is very sparse and mostly embedded in a few large organisations that have been running Machine Learning systems in production long enough to encounter and tackle those problems.

The key Data Science skills related to long-term Data Science technical vision & development:

| Skill | Why difficult |
|---|---|
| Building the Data Science technical vision and roadmap, and prioritising Machine Learning projects | Truly big ideas cannot be A/B tested without an expensive prior investment. Consequently, the prioritization of Machine Learning projects involves the application of judgment and is prone to cognitive biases. |
| Deploying new Machine Learning models that were trained on traffic generated by old Machine Learning models | There are many ways for incumbent ML models to accumulate knowledge which is not transferable into new models during training. It is difficult to even realize the magnitude of the challenge here until you have tried to improve a Machine Learning model that has been deployed in production for a long time. For an in-depth discussion of this topic please see this paper. |
| Servicing data science tech debt | The business orientation on growth typically tramples any other considerations, including reduction of the tech debt. This is magnified by the fact that developing and deploying ML systems is (relatively) fast and cheap, but maintaining them over time is difficult and expensive. Hence, the temptation is always to deploy a new model rather than refactor the code for the old one, leading inevitably to tech debt accumulation. Not all tech debt is bad, but all debt needs to be serviced. This paper discusses it in detail. |

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers

Resources to improve:

This skill comes mostly from experience; however, some practical papers have also been published on this topic:

E1 Industry Knowledge

A vital, yet often neglected enabler for Data Scientists, who benefit much more from industry knowledge compared to e.g. Software Developers. One of the underlying reasons is that Data Science is still a relatively young domain:

  • In software development, business analysts/architects are responsible for coming up with a solution, and Software Developers are mostly responsible for deploying it
  • In Data Science, Data Scientists are responsible for both aspects of the system: conceptual problem framing and deployment

When discussing this with senior Data Science leaders, the sentiment we often hear is the following:

"Learning how the industry works is difficult. Training the model is easy."

Because it is such a valuable skill, it is used as a distinguishing factor for Data Science candidates in the recruitment process. Especially when trying to land your first Data Science job, industry knowledge acquired through your own research and individual projects makes even a junior Data Scientist's resume stand out among other candidates.

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve

  • On the job experience: Data Science models built within a particular industry
  • Individual projects and research

E2 Coding

Conceptually, coding is not a core data science skill, but rather a means of telling a computer what you want it to do. In the distant future it might theoretically be replaced by code-free solutions, if they become mature enough. At the moment, however, programming pervades every aspect of Data Science, and strong coding skills are vital for proficient Data Scientists. Various programming languages play their roles at different stages of the Data Science system lifecycle:
  • Data Preparation and Activation: SQL
  • Model Training: Python (Flexibility)
  • Inference: Scala / Java / C++ (Scalability)

Critical for:

  • Data Science Managers
  • Senior Data Scientists
  • Data Scientists

Resources to improve:

Courses at datacamp.com and other platforms, which focus on use-cases for Data Science:

E3 Data Science Talent Management & Team Management

Arguably the most important skill for senior Data Science managers. It involves answering the following questions:

  • How to attract skilled Data Scientists to the organization?
    • Where are the biggest pools of candidates?
    • What should be the value proposition of the organization for Data Scientists?
  • What should the Data Science recruitment process look like?
  • What should the Data Science onboarding look like?
  • What should the Data Science career ladder look like?
  • What should be the main KPIs for Data Scientists?
  • How to keep the Data Science team excited?
  • How to resolve conflicts between Data Scientists?
  • How to structure the long-term Data Science career progression to align value generated with the total compensation?
  • How to prevent churn of the top Data Scientists in the team?

The answers to the above questions depend largely on the context of a given organization, and require both Data Science skills and team management skills.

Critical for:

  • Chief Data Scientists
  • Data Science Directors
  • Data Science Managers

Resources to improve:

This skill comes mostly from experience, therefore the most useful resource here is the experience from reputable Data Science organizations and Data Science senior leaders who are known to do these things well.