Machine Learning Interview Questions and Answers

In the last article, we saw some of the interview questions that recruiters might ask in a Machine Learning interview. Let's take a look at more Machine Learning interview questions and answers.

Machine Learning Interview Questions and Answers

1. List out the factors that make boosting a more stable algorithm in comparison with other ensemble algorithms?

Boosting focuses on the errors found in previous iterations and keeps correcting them until they become negligible. Bagging, on the other hand, has no such corrective loop. This corrective mechanism is the factor that makes boosting a more stable algorithm compared to other ensemble algorithms.

2. List out the ways that assist you to handle outliers in the data?

We can detect outliers using tools and plots such as the box plot, the scatter plot, the Z-score, and the IQR rule, among others. After identifying them, we handle them based on what the visualization reveals. To handle outliers, we can cap them at some threshold value, apply transformations to reduce the skewness of the data, or remove them if they are anomalies or errors (a small IQR-based sketch follows).
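
As a minimal sketch (the column name and values below are made up for the example), outliers can be flagged and capped with the IQR rule in pandas:

import pandas as pd

# Hypothetical numeric column; 100 is an obvious outlier.
df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 3, 4, 4, 100]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print(outliers)

# Capping (winsorizing) the outliers at the IQR fences.
df["value_capped"] = df["value"].clip(lower, upper)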

3. What are some of the popular cross-validation techniques that you know?

There are mainly six kinds of cross-validation (and related search) techniques; a few of them are sketched in code after the list. They are:

  • K fold
  • Stratified k fold
  • Leave one out
  • Bootstrapping
  • Random search cv
  • Grid search cv
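
As a minimal sketch of the first three techniques, assuming scikit-learn and its built-in iris toy dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-fold: the data is split into k equal parts; each part serves once as the test fold.
print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())

# Stratified k-fold: each fold preserves the class proportions of the full dataset.
print(cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)).mean())

# Leave-one-out: one sample is held out per iteration (expensive on large datasets).
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())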

4. Describe a popular dimensionality reduction algorithm of your choice.

Some of the popular dimensionality reduction algorithms are Principal Component Analysis (PCA) and Factor Analysis.
Principal Component Analysis creates one or more index variables (principal components) from a larger set of measured variables. Factor Analysis, on the other hand, is a model for the measurement of a latent variable. A short PCA sketch follows.
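
A minimal PCA sketch with scikit-learn, using the iris toy dataset; the features are standardized first because PCA is scale-sensitive:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)          # share of variance captured by each component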

5. How are we able to make use of a dataset without the target variable in supervised learning algorithms?

We can feed the dataset into a clustering algorithm, generate optimal clusters, and then use the cluster labels as a new target variable. The dataset then contains both independent variables and a target variable, which makes it ready for use in supervised learning algorithms (see the sketch below).
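
A minimal sketch of this idea, assuming scikit-learn and a synthetic, unlabelled dataset generated with make_blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Unlabelled data: make_blobs is used only as a stand-in for a dataset without a target.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The cluster ids produced by KMeans become the new target variable.
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The dataset now has features X and a target, so a supervised model can be trained on it.
clf = RandomForestClassifier(random_state=0).fit(X, pseudo_labels)
print(clf.score(X, pseudo_labels))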

6. What are the different types of popular recommendation systems that you know? Name and describe two personalized recommendation systems and comment on their ease of implementation.

Popularity-based recommendation, content-based recommendation, user-based collaborative filtering, and item-based collaborative filtering are some of the popular types of recommendation systems.

Content-based recommendation, user-based collaborative filtering, and item-based collaborative filtering are personalized recommendation systems. Content-based recommendation relies only on item features and a user's own history, so it is relatively easy to implement; collaborative filtering needs a user-item interaction matrix and similarity computations across users or items, which makes it somewhat harder to implement and scale.

7. What factors could help us in dealing with sparsity issues in recommendation systems? How do you think that we are able to measure its effectiveness?

We can use Singular Value Decomposition (SVD) to generate the prediction matrix, which addresses the sparsity of the user-item matrix. RMSE is the measure that tells us how close the prediction matrix is to the original matrix (a small sketch follows).
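
A minimal NumPy sketch with a made-up rating matrix, using a truncated-SVD reconstruction as the prediction matrix and RMSE over the observed entries:

import numpy as np

# Toy user-item rating matrix (made-up values); 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

# Rank-k reconstruction via truncated SVD acts as the prediction matrix.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# RMSE over the observed (non-zero) entries measures how close the prediction
# matrix is to the original matrix.
mask = R > 0
rmse = np.sqrt(np.mean((R[mask] - R_pred[mask]) ** 2))
print(round(rmse, 3))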

8. List out the techniques that we can use to look out for similarities in the recommendation system.

Pearson correlation and cosine similarity are some of the techniques that we can use to find similarities in recommendation systems.

9. List out the limitations associated with Fixed Basis Function.

Linear separability in feature space does not imply linear separability in input space. As a consequence, the inputs are non-linearly transformed using vectors of basis functions with increased dimensionality.

Some of the limitations of Fixed basis functions are listed below:

a. Non-linear transformations cannot remove the overlap between two classes; they can, however, increase the overlap.

b. In certain situations, it is quite difficult to identify which basis functions are the best fit for a given task. As a consequence, learning the basis functions can be preferable to using fixed basis functions.

10. Describe and elaborate the concept of Inductive Bias with the help of some examples.

Inductive bias is the set of assumptions that a learning algorithm uses to predict outputs for inputs it has not yet seen. When we attempt to learn Y from X and the hypothesis space for Y is very large (potentially infinite), we need to narrow it down using our beliefs/assumptions about the hypothesis space; these assumptions are known as the inductive bias. For example, linear regression assumes that the relationship between X and Y is linear, and k-nearest neighbors assumes that points close to each other in feature space belong to the same class.

11. What do you understand by the term instance-based learning?

Instance-based learning is a set of procedures for regression and classification that generates a prediction for a query point based on its resemblance to its nearest neighbors in the training data set. These algorithms simply store all the data and compute an answer only when we query them.

12. From the point of view of train and test split criteria, what is your view on performing scaling before the split or after the split?

Ideally, scaling should be done after the train-test split: the scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into training. If the train and test data are very similarly distributed, scaling before or after the split will not make much practical difference.

13. How would you explain the Bayes’ Theorem? Describe at least 1 use case within the machine learning context.

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if the risk of cancer is related to age, then, by using Bayes' theorem, we can use a person's age to assess more accurately the probability that they have cancer (a small worked example follows).
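
A worked example with made-up numbers (every probability below is purely illustrative):

# All numbers below are made up purely for illustration.
p_cancer = 0.01              # P(cancer): prior probability
p_older_given_cancer = 0.70  # P(age > 60 | cancer)
p_older = 0.20               # P(age > 60): marginal probability

# Bayes' theorem: P(cancer | age > 60) = P(age > 60 | cancer) * P(cancer) / P(age > 60)
p_cancer_given_older = p_older_given_cancer * p_cancer / p_older
print(p_cancer_given_older)  # 0.035 -- the age evidence raises the prior from 1% to 3.5%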

14. What do you understand by Naive Bayes? Why do you think is it Naive?

Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. They share a common principle: every pair of features is treated as independent of the others, given the class, while the algorithm classifies the data.
Naive Bayes is "naive" because it assumes the attributes are independent of one another within the same class. This assumed absence of dependence between attributes of the same class is what gives the method its naiveness.

15. How would you describe the working of a Naive Bayes Classifier?

Naive Bayes classifiers are a family of algorithms based on Bayes' theorem of probability. They work on the fundamental assumption that every pair of features being classified is independent of each other, and that every feature makes an equal and independent contribution to the outcome (a small scikit-learn sketch follows).
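
A minimal scikit-learn sketch using GaussianNB on the iris toy dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB applies Bayes' theorem with the "naive" conditional-independence assumption,
# modelling each feature as a Gaussian within each class.
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))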

16. What do you understand by the terms prior probability and marginal likelihood in the context of the Naive Bayes theorem?

The prior probability can be thought of as the proportion of each class of the dependent variable in the dataset, before any features are observed. For example, if the dependent variable is binary and 65% of the rows are 1 while 35% are 0, then the prior probability that any new input belongs to class 1 is 65%.

The marginal likelihood is the denominator of the Bayes equation; it normalizes the result so that the posterior probabilities are valid, i.e., they sum to 1.

17. How would you differentiate between Lasso and Ridge?

Lasso (L1) and Ridge (L2) are regularization techniques in which we penalize the coefficients to find the optimum solution. In ridge, the penalty function is the sum of the squares of the coefficients, while in lasso we penalize the sum of the absolute values of the coefficients. As a result, lasso can shrink some coefficients exactly to zero, whereas ridge only shrinks them toward zero (see the sketch below).
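
A minimal sketch contrasting the two, assuming scikit-learn and a synthetic regression problem:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem where only 3 of the 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: can set coefficients exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))     # expect several exact zeros here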

18. How would you differentiate between probability and likelihood?

Probability is the measure of how likely it is that a particular event will occur, that is, how strong our conviction is that a specific outcome will happen. A likelihood function, on the other hand, is a function of the parameters within the parameter space that describes how probable the observed data are under those parameter values.

So, the fundamental difference is that probability attaches to possible results, while likelihood attaches to hypotheses (parameter values).

19. What makes the Pruning of your tree obligatory?

Decision trees are very prone to overfitting; pruning the tree reduces its size and thus minimizes the chance of overfitting. Pruning involves turning branches of a decision tree into leaf nodes and removing the leaf nodes of the original branch. It serves as a tool for managing the bias-variance tradeoff: a pruned tree has slightly higher bias but much lower variance (a small sketch follows).
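
As an illustration, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter; the value 0.01 below is only an example, not a recommendation:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

# The pruned tree is much smaller and usually generalizes at least as well.
print(full_tree.get_n_leaves(), full_tree.score(X_test, y_test))
print(pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))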

20. What are the advantages and limitations of the Temporal Difference Learning Method?

Some of the advantages of the temporal difference learning method are:

  • It can learn at every step, online or offline.
  • It can learn from incomplete sequences as well.
  • It can work in continuous environments.
  • It has lower variance than the Monte Carlo (MC) method and tends to be more efficient than it.

Some of the limitations of the temporal difference learning method are:

  • Its estimates are biased.
  • It is more sensitive to initialization.

21. What factors would help you in handling an imbalanced dataset?

Sampling techniques enable us to handle an imbalanced dataset. There are two main ways in which we can perform sampling: under-sampling and over-sampling.

In under-sampling, we reduce the size of the majority class to match the minority class, which improves performance in terms of storage and run-time execution; however, it potentially discards crucial information. In over-sampling, we replicate minority-class samples (with replacement) to match the majority class, which keeps all the information but can lead to overfitting (a small sketch of both follows).
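
A minimal sketch of both approaches, using a hypothetical 90/10 data frame and scikit-learn's resample utility:

import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced frame: 90 rows of class 0 and 10 rows of class 1.
df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Under-sampling: shrink the majority class to the size of the minority class.
under = pd.concat([resample(majority, replace=False, n_samples=len(minority), random_state=0),
                   minority])

# Over-sampling: replicate minority rows (with replacement) up to the majority size.
over = pd.concat([majority,
                  resample(minority, replace=True, n_samples=len(majority), random_state=0)])

print(under["label"].value_counts())
print(over["label"].value_counts())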

22. What are some of the widely used EDA Techniques?

Exploratory Data Analysis (EDA) enables analysts to understand the data better and forms the foundation of better models.

Visualization

  • Univariate visualization
  • Bivariate visualization
  • Multivariate visualization

Missing Value Treatment – Replacing the missing values with the mean or median (or the mode for categorical features)

Outlier Detection – Using a box plot to identify the distribution of outliers, then applying the IQR rule to set the boundaries beyond which points are treated as outliers

Transformation – Applying a transformation to the features based on their distribution

23. Elaborate on the significance of Gamma and Regularization in SVM.

Gamma defines how far the influence of a single training example reaches: low values mean a far reach, high values mean a close reach. If gamma is too large, the radius of the area of influence of the support vectors includes only the support vectors themselves, and no amount of regularization with C will be able to avoid overfitting. If gamma is very small, the model becomes too constrained and is unable to capture the complexity of the data.

The regularization parameter (C, or equivalently lambda) sets the degree of importance given to misclassifications: a large C penalizes misclassifications heavily and leads to a tighter fit, while a small C allows a wider margin at the cost of some training errors (see the sketch below).
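
A minimal sketch of how gamma and C change an RBF-kernel SVM, assuming scikit-learn and the breast-cancer toy dataset (the parameter values are arbitrary examples):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Small gamma = far-reaching influence (smoother boundary); large gamma = very local influence.
# Large C = heavy penalty on misclassifications (tighter fit); small C = wider margin.
for gamma, C in [(0.001, 1.0), (10.0, 1.0), (0.001, 100.0)]:
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
    print(gamma, C, cross_val_score(model, X, y, cv=5).mean())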

24. Differentiate a generative from a discriminative model.

A generative model learns the distribution of each category of data, that is, how the data is generated. A discriminative model, on the other hand, only learns the distinctions (decision boundaries) between the different categories of data. Discriminative models tend to perform better than generative models on pure classification tasks.

25. Which method is the default method that we use for splitting in decision trees?

The default method used for splitting in decision trees is the Gini index, which is a measure of the impurity of a particular node. It can be changed through the classifier's splitting-criterion parameter (for example, to entropy). A small Gini computation is sketched below.
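
A minimal sketch of how the Gini impurity of a node can be computed; the helper function name is just for illustration:

import numpy as np

def gini_impurity(labels):
    # Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0 -> a pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5 -> maximally impure for two classes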

26. State the significance of the p-value.

Ans. The p-value is the probability of obtaining results at least as extreme as the ones observed, assuming that the null hypothesis is true. It provides the statistical significance of our results: the smaller the p-value, the stronger the evidence against the null hypothesis.

27. Are we able to use logistic regression for classes more than 2?

Ans. Plain logistic regression is a binary classifier, so by itself it handles only two classes. However, it can be extended to more than two classes using one-vs-rest or multinomial (softmax) logistic regression. For multi-class problems, algorithms like decision trees and Naive Bayes classifiers also work natively.

28. What do you understand by the hyperparameters of a logistic regression model?

Ans. The penalty, the solver, and C (the inverse of the regularization strength) are the main hyperparameters of a logistic regression classifier. We can specify a grid of values for them in a grid search to tune the classifier, as sketched below.
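
A minimal grid-search sketch, assuming scikit-learn and the breast-cancer toy dataset (the grid values are arbitrary examples):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter grid: penalty type, C (inverse regularization strength), and solver.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear"],  # liblinear supports both l1 and l2 penalties
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)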

29. What do you understand by Heteroscedasticity?

Ans. Heteroscedasticity is a situation in which the variance of the residuals is unequal across the range of values of the predictor variable.

Ideally, we should avoid it in regression, as it violates the constant-variance assumption and makes the standard errors of the estimates unreliable.

30. Do you think the ARIMA model is a good fit for every time series problem?

Ans. No, the ARIMA model is not a suitable option for every type of time-series problem. There are situations where the ARMA model and other models come in handy instead.
ARIMA is ideal when the time series exhibits standard temporal structures (such as trend and autocorrelation) that the model needs to capture.

31. What factors help you in dealing with the class imbalance in a classification problem?

Ans. We can deal with class imbalance in the following ways (a class-weight sketch follows the list):

  • Use class weights
  • Use Sampling
  • Use SMOTE
  • Choose loss functions like Focal Loss
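
A minimal sketch of the class-weights option, assuming scikit-learn and a synthetic 90/10 dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 90/10 imbalanced classification problem.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" re-weights errors on the minority class instead of resampling the data.
plain = LogisticRegression(max_iter=1000)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")

print(cross_val_score(plain, X, y, cv=5, scoring="f1").mean())
print(cross_val_score(weighted, X, y, cv=5, scoring="f1").mean())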

32. What role does cross-validation play?

Ans. Cross-validation is a technique used to estimate, and thereby help improve, how well a machine learning model generalizes: the model is trained and evaluated several times on different samples drawn from the same data. The dataset is broken into several parts with an equal number of rows; in each round, one part is chosen as the test set while all the other parts form the training set.

33. What do you understand by a voting model?

Ans. A voting model is an ensemble model that combines several classifiers. To produce the final result in a classification setting, it collects the class predicted by each individual model for a given data point and selects the class that receives the most votes among all the classes in the target column (see the sketch below).
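
A minimal hard-voting sketch with scikit-learn, combining three arbitrary base classifiers on the iris toy dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hard voting: each base classifier votes for a class and the majority wins.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
print(cross_val_score(voter, X, y, cv=5).mean())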

34. How will you deal with the problem of very few data samples? Do you think it is possible to make a model out of it?

Ans. If very few data samples are present, we can use oversampling (or other data-augmentation techniques) to generate new data points. By doing so, we obtain enough data points to build a model, although such a model should be validated carefully.

35. List out the hyperparameters of an SVM.

Ans. The gamma value, the C value, and the type of kernel are the hyperparameters of an SVM model.

36. What do you understand by Pandas Profiling?

Ans. Pandas profiling is a step for finding out how much of the data is effectively usable. It provides statistics on the NULL values and the usable values, and thus makes variable selection and data selection for building models in the preprocessing phase very effective (a small sketch follows).
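
A minimal sketch of this kind of profiling using plain pandas on a made-up data frame:

import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["NY", "LA", None, "NY"]})

print(df.isnull().sum())           # count of NULL values per column
print(df.isnull().mean() * 100)    # percentage of missing values per column
print(df.describe(include="all"))  # basic statistics for every column

# The pandas-profiling / ydata-profiling package automates this kind of summary
# with ProfileReport(df), producing a full HTML report (installed separately).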

37. Describe the impact of correlation on PCA.

Ans. PCA actually relies on correlation: when the variables are correlated, they share variance, so a few principal components can capture most of the total variance and the dimensionality can be reduced effectively. If the variables are uncorrelated, PCA cannot compress the data well, because almost every component is needed to explain the variance.

38. Differentiate PCA from LDA

Ans. At the core, PCA is unsupervised whereas LDA is supervised.
PCA takes into consideration the variance of the features, while LDA takes into account the distribution (separation) of the classes.

39. List out the various distance metrics that we are able to use in KNN.

Ans. We can use the following distance metrics in KNN (a few of them are sketched in code after the list):

  • Manhattan
  • Minkowski
  • Tanimoto
  • Jaccard
  • Mahalanobis
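
A minimal sketch comparing a few of these metrics (plus Euclidean) with scikit-learn's KNN on the iris toy dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a few distance metrics; Mahalanobis and Jaccard need extra parameters or binary data,
# so only the simpler metrics are shown here.
for metric in ["euclidean", "manhattan", "minkowski", "chebyshev"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    print(metric, cross_val_score(knn, X, y, cv=5).mean())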

40. Which metrics are we able to use for measuring the correlation of categorical data?

Ans. We can use the chi-square test of independence for this task. It provides a measure of the association between categorical predictors (a small sketch follows).
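
A minimal sketch with SciPy, using made-up categorical columns:

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical columns used only for illustration.
df = pd.DataFrame({
    "gender":    ["M", "F", "F", "M", "F", "M", "F", "M"],
    "purchased": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
})

table = pd.crosstab(df["gender"], df["purchased"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # a small p-value suggests the two variables are associated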

41. Which algorithm are we able to use in value imputation in both categorical and continuous categories of data?

Ans. KNN-based imputation is the algorithm commonly used for imputing both categorical and continuous variables (categorical features need to be encoded numerically first). A small sketch follows.
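
A minimal sketch with scikit-learn's KNNImputer on a small made-up numeric matrix:

import numpy as np
from sklearn.impute import KNNImputer

# Numeric matrix with missing values; categorical columns would have to be encoded first.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [8.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))  # missing entries are filled from the 2 nearest rows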

42. In what situations should we prefer ridge regression over lasso?

Ans. We should prefer ridge regression when we want to keep all the predictors and cannot drop any of them: ridge reduces the coefficient values but does not nullify them, whereas lasso can eliminate predictors entirely.

43. List out the algorithms that we can use for important variable selection.

Ans. Random Forest and XGBoost feature importances, along with variable-importance plots, can serve the purpose of variable selection.

44. What ensemble technique do the Random forests make use of?

Ans. Random Forest makes use of the bagging technique. A random forest is a collection of trees that work on bootstrapped samples of the original dataset, with the final prediction being the majority vote of all the trees (or their average, for regression).

45. When are we able to treat a categorical value as a continuous variable and what effect does it have when we have done so?

Ans. We can treat a categorical predictor as a continuous one when the data points it represents are ordinal in nature. If the predictor variable holds ordinal data, then treating it as continuous and including it in the model can enhance the performance of the model.

46. Describe the role of maximum likelihood in logistic regression.

Ans. Maximum likelihood estimation helps us find the coefficient values of the predictors that make the observed data most probable, that is, the values that generate predictions closest to the true outcomes.

47. Which type of distance are we supposed to measure in the case of KNN?

Ans. KNN typically measures Euclidean (or Manhattan/Minkowski) distance for continuous features and Hamming distance for categorical features when determining the nearest neighbors. K-means likewise uses Euclidean distance.

48. According to you, which sampling technique will be the most suitable when working with time-series data?

Ans. We can use a custom iterative sampling scheme in which we keep adding samples to the training set in time order. The sample used for validation in one iteration is added to the training set of the next iteration, and a new, later sample is used for validation, so that we never train on data from the future (see the sketch below).
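
A minimal sketch of this expanding-window idea, using scikit-learn's TimeSeriesSplit on a toy sequence:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ordered observations standing in for a time series

# Each split trains on an expanding window of past data and validates on the next block,
# so the model never sees the future during training.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "test:", test_idx)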

49. List out the benefits of pruning.

Ans. Pruning assists us in the following:

  • Reducing overfitting
  • Shortening the size of the tree
  • Reducing the complexity of the model
  • Trading a small increase in bias for a reduction in variance

50. What do you understand by the 68 percent rule in normal distribution?

Ans. The normal distribution curve is a bell-shaped, symmetric curve, so there is no skewness and the mean and median coincide, with most of the data points lying around them. Approximately 68 percent of the data falls within one standard deviation of the mean (a quick numerical check follows).
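
A quick numerical check of the rule with NumPy, sampling from a standard normal distribution:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of samples within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(samples - samples.mean()) <= samples.std())
print(round(within_one_sd, 3))  # approximately 0.68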

Conclusion

We have thus reached the end of the article, where we discussed important Machine Learning interview questions. These questions will help you ace your interviews and strengthen your knowledge of the basic concepts.
