Top 100 Data Science Interview Questions and Answers (Updated 2019)
If you want to succeed as a data scientist, you need knowledge of statistics, linear algebra, data science, and machine learning. To become an expert, you should also learn about deep learning, natural language processing, computer vision, and more.
Statistics interview questions:
What is statistics?
- It is a branch of mathematics concerned with the collection, analysis, interpretation, and presentation of masses of numerical data.
How many types of statistics are there?
- Descriptive Statistics
- Inferential Statistics
What is descriptive statistics?
It helps organize data and focus on its main characteristics, and it provides a numerical and graphical summary of the data (for example, mean, mode, standard deviation, correlation).
What is inferential statistics?
It generalizes from a sample to the larger population and applies probability theory to draw conclusions.
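What is the mean value in statistics?
The mean is the average value of the data set: the sum of the values divided by the number of values.
What is the mode value in statistics?
The most frequently repeated value in the data set.
What is the median value in statistics?
The middle value of the data set once the values are sorted. With an even number of values, it is the average of the two middle values.
What is variance in statistics?
Variance measures how far the numbers in the set are spread from the mean. It is the average of the squared deviations from the mean.
What is standard deviation in statistics?
It is the square root of the variance.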
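As a quick illustration, the measures above can be computed with Python's standard `statistics` module. This is only a sketch: the data set is made up for the example, and the population forms (`pvariance`, `pstdev`) are used here. The sample forms (`variance`, `stdev`) divide by n − 1 instead of n.

```python
import statistics

# Made-up data set, used only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequently repeated value
variance = statistics.pvariance(data)   # average squared deviation from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

print(mean, median, mode, variance, std_dev)
```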
How many types of variables are there in statistics?
- Categorical variable
- Confounding variable
- Continuous variable
- Control variable
- Dependent variable
- Discrete variable
- Independent variable
- Nominal variable
- Ordinal variable
- Qualitative variable
- Quantitative variable
- Random variable
- Ratio variable
- Ranked variable
How many types of distributions are there?
- Bernoulli Distribution
- Uniform Distribution
- Binomial Distribution
- Normal Distribution
- Poisson Distribution
- Exponential Distribution
What is the normal distribution?
It is a symmetric, bell-shaped distribution in which the mean, median, and mode are equal. Many quantities in statistics are approximately normally distributed.
What is the standard normal distribution?
A normal distribution with a mean of 0 and a standard deviation of 1 is called the standard normal distribution.
What is the binomial distribution?
It is the distribution of the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (success or failure) and the probability of success is the same for every trial.
What is the Bernoulli distribution?
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial.
What is the Poisson distribution?
A distribution is called a Poisson distribution when the following assumptions are true:
1. Any successful event should not influence the outcome of another successful event.
2. The average rate of events is constant, so the probability of success in an interval is proportional to the length of the interval.
3. The probability of success in an interval approaches zero as the interval becomes smaller.
What is the central limit theorem?
a) The mean of the sample means is close to the mean of the population.
b) The standard deviation of the sampling distribution of the mean equals the population standard deviation divided by the square root of the sample size n. It is also known as the standard error of the mean.
c) Even if the population is not normally distributed, the sampling distribution of the sample means approximates a normal distribution when the sample size is large enough (a common rule of thumb is greater than 30).
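The three statements above can be checked with a small simulation. The sketch below is illustrative only: it assumes an exponential population with rate 1 (so the population mean and standard deviation are both 1), and the sample size, number of samples, and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 40     # larger than the rule-of-thumb threshold of 30
NUM_SAMPLES = 2000   # how many independent samples to draw

# Skewed, non-normal population: exponential with mean 1 and std dev 1
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.mean(sample_means)
std_of_means = statistics.stdev(sample_means)
standard_error = 1.0 / math.sqrt(SAMPLE_SIZE)  # population std dev / sqrt(n)

print("mean of sample means:", round(mean_of_means, 3))
print("std dev of sample means:", round(std_of_means, 3))
print("predicted standard error:", round(standard_error, 3))
```

The mean of the sample means should land close to 1 and their standard deviation close to the predicted standard error, even though the population itself is strongly skewed.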
What is the p-value, and how is it useful?
The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. It is compared with a chosen significance level, commonly 0.05.
- If the p-value is at or below 0.05 (p <= 0.05), it indicates strong evidence against the null hypothesis, so you can reject the null hypothesis.
- If the p-value is higher than 0.05 (p > 0.05), it indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
What is the Z value or Z score (standard score), and how is it useful?
The Z score indicates how many standard deviations an element is from the mean. It is also called the standard score.
Z score formula
z = (X – μ) / σ
- It is useful in statistical testing.
- For normally distributed data, about 99.7% of Z scores fall between -3 and 3.
- It is useful for finding outliers in large data sets: values with a Z score outside that range are commonly treated as outliers.
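A minimal sketch of the Z score formula and the -3 to 3 outlier rule follows. The function names and the data are made up for the example, and the population standard deviation is used for σ.

```python
import statistics

def z_scores(values):
    """Return the Z score (x - mean) / std dev for every value."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Return the values whose absolute Z score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up data: 19 ordinary readings and one extreme reading
data = [10] * 6 + [11] * 7 + [12] * 6 + [50]
outliers = z_outliers(data)
print(outliers)
```

Note that this rule needs a reasonably large data set: with only a handful of points, no value can reach a Z score of 3.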
What is the t-score, and what is its use?
- It is the ratio of the difference between two groups to the variability within the groups. The larger the t-score, the greater the difference between the groups. The smaller the t-score, the more similar the groups are.
- We can use the t-score in statistical testing when the sample size is less than 30.
What is the IQR (interquartile range), and how is it used?
- It is the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles.
- It is also called the midspread or middle 50%.
- It is mainly used to find outliers in data: observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.
Formula: IQR = Q3 − Q1
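The IQR rule can be sketched with `statistics.quantiles` from the standard library (Python 3.8 or later). The function name and the data are hypothetical, and quartile conventions differ slightly between tools, so the exact Q1 and Q3 values may not match other libraries.

```python
import statistics

def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

# Made-up data with one obvious extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(data)
print(outliers)
```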
What is hypothesis testing?
Hypothesis testing is a statistical method used to make statistical decisions from experimental data. A hypothesis is an assumption that we make about a population parameter, and the test checks whether the sample data are consistent with it.
How many types of hypotheses are there?
- Null hypothesis, alternative hypothesis
What is a Type 1 error?
FP – False Positive (in statistics, it is the rejection of a true null hypothesis)
What is a Type 2 error?
FN – False Negative (in statistics, it is failing to reject a false null hypothesis)
What is a population?
It is a discrete group of people, animals or things that can be identified by at least one common characteristic for the purposes of data collection and analysis.
What is sampling?
Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population.
What are the types of sampling techniques?
There are two major types of sampling:
1. PROBABILITY SAMPLING
- Simple Random Sampling
- Stratified Random Sampling
- Systematic Sampling
- Cluster Sampling
- Multi-stage Sampling
2. NON-PROBABILITY SAMPLING
- Purposive Sampling
- Convenience Sampling
- Snowball Sampling
- Quota Sampling
What is sample bias?
It is a type of bias caused by choosing non-random data for statistical analysis.
What is selection bias?
Selection bias is an error introduced when the sample selected for analysis is not properly randomized.
What is univariate, bivariate, and multivariate analysis?
Univariate means a single variable – analysis of single-variable data
Bivariate means two variables – analysis of the relationship between two variables
Multivariate means multiple variables – analysis of more than two variables at once
Linear Algebra Interview Questions:
What are eigenvalues and eigenvectors?
Data Science and Machine learning Interview Questions:
What is data science?
Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies. Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize new market opportunities and increase the organization’s competitive advantage.
What is machine learning?
Machine learning is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.
What is deep learning?
Deep learning is a subfield of machine learning concerned with algorithms, called artificial neural networks, that are inspired by the structure and function of the brain.
What is supervised learning?
In supervised learning the data is labeled, and the algorithm learns from the labeled data to predict the output for new inputs.
What is unsupervised learning?
Unsupervised learning is a branch of machine learning that learns from data that has not been labeled, classified or categorized.
What is reinforcement learning?
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
What is transfer learning?
Transfer learning makes use of the knowledge gained while solving one problem and applies it to a different but related problem.
What is regression?
In statistics, regression is a measure of the relation between the mean value of one variable (e.g. output) and corresponding values of other variables. In machine learning, it refers to predicting a continuous numeric value.
What is classification?
In machine learning and statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
What is clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
What is variance?
Variance is the variability of the model prediction for a given data point; it tells us how widely the predictions are spread.
What is EDA?
Exploratory data analysis: In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
What is overfitting, underfitting, and the trade-off?
Overfitting – The model works well on training data but performs poorly on test data.
Underfitting – The model is not able to capture the patterns in the data, so it performs poorly even on training data.
Trade-off – We need to balance bias and variance, because reducing one tends to increase the other.
What are the steps in building a machine learning model?
- Problem Statement
- Gathering Data
- Data Preparation
- Model Training
- Performance Tuning
- Model Deployment
What is data pre-processing?
Data preprocessing is an important step in the data mining process. The phrase “garbage in, garbage out” is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc.
What is data cleaning?
What is data preparation?
What is data munging?
What are standardization and normalization?
Both convert variables with different ranges to a common scale. Standardization rescales a variable to a mean of 0 and a standard deviation of 1; normalization (min-max scaling) rescales it to a fixed range, usually 0 to 1.
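A minimal sketch of two common ways to put variables on the same scale, assuming standardization means Z-score scaling and normalization means min-max scaling to the range 0 to 1. The function names and data are made up for the example.

```python
import statistics

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (Z-score scaling)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def normalize(values):
    """Rescale to the range 0..1 (min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

heights_cm = [150, 160, 170, 180, 190]  # made-up data
standardized = standardize(heights_cm)
normalized = normalize(heights_cm)

print(standardized)
print(normalized)
```

Both functions assume the values are not all identical; otherwise the denominator would be zero.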
How to deal with missing values in data?
It depends on the type of data. You can fill missing values with the mean or median, or, if only a small amount of data is missing, you can remove those records.
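The two options above (fill or remove) can be sketched in plain Python, with `None` standing in for a missing value. The function names and the data are hypothetical. In practice a library such as pandas is typically used for this.

```python
import statistics

def fill_missing(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]

def drop_missing(values):
    """Remove None entries; reasonable when very little data is missing."""
    return [v for v in values if v is not None]

ages = [25, None, 31, 40, None, 28]  # made-up data
filled_median = fill_missing(ages, "median")
filled_mean = fill_missing(ages, "mean")
dropped = drop_missing(ages)

print(filled_median)
print(filled_mean)
print(dropped)
```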
How to find outliers in data?
You can find outliers by using box plots. If the data set is large, you can use Z scores and flag values outside the range -3 to 3. You can also use the IQR and flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
How many types of regression algorithms are there?
- Linear Regression
- Logistic Regression
- Polynomial Regression
- Stepwise Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
What is linear regression, how does it work, and when should you use it?
What is logistic regression, how does it work, and when should you use it?
What is a support vector machine, how does it work, and when should you use it?
What is SVR (support vector regressor)?
What is SVC (support vector classification)?
What is KNN (k-nearest neighbours algorithm)?
KNN is a supervised learning algorithm that predicts the label of a new point from the labels of its k nearest neighbours in the training data.
How to choose the k value in KNN?
A common rule of thumb is k = sqrt(n), where n is the number of samples.
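A minimal KNN classifier sketch using Euclidean distance and the sqrt(n) rule of thumb follows. The function names and the toy data are made up for illustration. Rounding k up to an odd number is an extra assumption added here to reduce tied votes, and it is not part of the rule itself. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def rule_of_thumb_k(n_samples):
    """k = sqrt(n), rounded, then bumped to an odd number to reduce ties."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1

def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k nearest training points."""
    # Sort training points by Euclidean distance to the query point
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    top_labels = [label for _, label in neighbours[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy training data: two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9), (10, 10)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

k = rule_of_thumb_k(len(points))
pred_near_a = knn_predict(points, labels, (1.5, 1.5), k)
pred_near_b = knn_predict(points, labels, (9, 9.5), k)
print(k, pred_near_a, pred_near_b)
```

In practice, k is usually tuned by comparing validation scores for several candidate values rather than taken from the rule alone.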
What is Euclidean distance?
What is the Naive Bayes algorithm? How does it work?
What is ensemble learning?
What is the decision tree algorithm? How does the tree split?
What is the random forest algorithm? How do you pick the number of trees?
What is bagging?
What is boosting?
How many types of boosting algorithms are there?
- Gradient Boosting
What is the XGBoost algorithm?
What is the AdaBoost algorithm?
What is the gradient boosting algorithm?
How does gradient boosting help to optimize the cost function?
What is a time series?
How many types of algorithms are there in time series?
What is the ARIMA model (autoregressive integrated moving average)?
What is customer segmentation? How can you do it with machine learning?
What is K-means?
K-means clustering is an unsupervised algorithm that partitions the data into k clusters by assigning each point to the nearest cluster centre. The goal of the algorithm is to find groups within the data.
How to choose the k value in the K-means algorithm?
We can use the elbow method to determine the optimal number of clusters (the k value).
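The elbow method can be sketched with a small pure-Python K-means. This is illustrative only: the function names and the data are made up, and a deterministic farthest-point initialization is used so the run is repeatable, whereas typical implementations use random or k-means++ initialization. The quantity tracked is the inertia, the sum of squared distances from each point to its nearest centre.

```python
import math

def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def kmeans(points, k, iterations=20):
    """Basic K-means (Lloyd's algorithm); returns (centroids, inertia)."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Made-up data: two well-separated groups of four points
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

inertias = {k: kmeans(points, k)[1] for k in range(1, 4)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

The inertia drops sharply from k = 1 to k = 2 and barely improves at k = 3, so the "elbow" of the curve suggests two clusters for this data.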
How many types of clustering techniques are there?
- Partitioning methods.
- Hierarchical clustering.
- Fuzzy clustering.
- Density-based clustering.
- Model-based clustering.
What is hierarchical clustering?
What is dimensionality reduction? How does it work?
How many types of dimensionality reduction techniques are there?
What is PCA (principal component analysis)?
Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It’s mainly used to reduce the dimensionality of a data set.
What metrics are used in regression?
RMSE – Root Mean square error
MSE – Mean square error
MAE – Mean absolute Error
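The three regression metrics can be written out directly from their definitions. The function names and the sample values below are made up for illustration. Libraries such as scikit-learn provide equivalent functions.

```python
import math

def mse(y_true, y_pred):
    """Mean square error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, -0.5, 2.0, 7.0]  # made-up actual values
y_pred = [2.5, 0.0, 2.0, 8.0]   # made-up predictions

print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

RMSE is in the same units as the target, and both MSE and RMSE penalize large errors more heavily than MAE does.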
How to improve the model accuracy?
By using feature selection, dimensionality reduction, ensemble methods (bagging and boosting algorithms) and hyperparameter tuning.
How many types of loss functions or cost functions are there in machine learning?
- log loss
- focal loss
- KL Divergence/Relative entropy
- Exponential loss
- Hinge Loss
- mean square error
- mean absolute error
- huber loss/ smooth mean absolute error
- log cosh loss
- quantile loss
Which one do you prefer while building a model: model performance or model accuracy?
I prefer model performance, because model accuracy is only a subset of model performance.
What is mean square error, its formula and criteria?
What is root mean square error?
What is the R2 score?
What metrics are used in classification?
- Confusion matrix = a table of prediction counts, laid out as [[TP, FN], [FP, TN]]
- Accuracy score = (TP + TN) / (TP + TN + FP + FN)
- Recall (true positive rate) = TP / (TP + FN)
- Precision = TP / (TP + FP)
- F1 score = 2 × (precision × recall) / (precision + recall)
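The formulas above can be sketched for a binary problem as follows. The function name and the label lists are made up for the example, with 1 as the positive class. The sketch assumes at least one predicted and one actual positive, so no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        "confusion_matrix": [[tp, fn], [fp, tn]],
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # made-up actual labels
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]  # made-up predicted labels

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```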
How can you overcome overfitting?
How can you overcome underfitting?
What is meant by normalization?
What is meant by dummy variables?
What is regularization?
What is the difference between L1 regularization and L2 regularization?
How can you deal with different types of seasonality in time series modelling?
What is multicollinearity?
What is the ROC curve?
What is the sigmoid function?
Which one should I learn for data science: Python or R?
What is data visualization with different charts in Python?
2. Bar plots
4. Pie Chart
5. Scatter Plot
6. Box plots
What are the best programming languages and libraries for machine learning?
- R, Python, NumPy, pandas, scikit-learn, TensorFlow, Keras, PyTorch, Matplotlib, Seaborn
What is the current version of Python?
Why Python for data science?
What is the difference between lists and tuples?
How can you do web scraping in Python?
What are libraries in Python?
- Scikit Learn
What is the scikit-learn library?
What is the SciPy library?
NumPy Interview Questions and Answers
What are NumPy arrays?
What is the difference between NumPy and arrays?
Pandas Interview Questions and Answers
What is Pandas?
What is the use of Pandas?
How to find and remove duplicate values in data by using pandas?
How to find null values in data by using pandas?
How to sort data using pandas?
How can you fill null values?
How to convert a string to a date object?
Deep Learning Interview Questions
What is a neural network?
What are the types of neural networks?
What is MLP?
What is CNN?
What is RNN?
What is LSTM?
What is GRU?
Why is LSTM better than a plain recurrent neural network?
What are encoders?
What are GANs?
What is a deep belief network?
What is an activation function?
How many types of activation functions are there?
What is dropout in NN?
What are the unsupervised learning algorithms in deep learning?
Natural Language Processing Interview Questions:
What is NLP?
What is tokenizing?
What is stemming?
What is lemmatizing?
What is POS tagging?
What is Gensim?
What is the Word2Vec model?
What is a bigram and a trigram?
What are some applications of machine learning?
When to use deep learning methods?
Tensorflow Interview questions and Answers:
What is TensorFlow?
What is a tensor?
What is a session?
What is a constant in TensorFlow?
What is TensorBoard?
Tableau Interview Questions:
What is Tableau?
What is the difference between Tableau and Power BI?
SQL Interview Questions:
What is SQL?
What are the types of joins in SQL?
What are the types of clauses in SQL?
These are not mandatory for a data science interview, but they are good to know.
Conclusion: In every interview, experienced candidates will be asked to describe their projects (regression, classification, clustering or time series) from start to end: the problem statement, how you pulled the data, how you did the preprocessing, whether you did EDA, how you applied the algorithm, and how you improved the model performance.
Note: Do one or two projects from scratch and you will be ready for the interview. Best of luck.