ML Interview Questions with Answers


In the last article, we saw some of the interview questions that recruiters might ask in a Machine Learning interview. In this article, PythonGeeks brings you another set of interview questions, this time exploring the topics in greater depth.

Machine Learning Intermediate Interview Questions

1. Differentiate between stochastic gradient descent (SGD) and gradient descent (GD).

Gradient Descent and Stochastic Gradient Descent are both optimization algorithms that search for the set of parameters that minimizes a loss function.

The difference lies in how much data each update uses: Gradient Descent evaluates all training samples before making a single parameter update, whereas Stochastic Gradient Descent updates the parameters using only one (randomly chosen) training sample at a time.
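As a rough illustration (the toy linear-regression data and variable names below are ours, not part of any standard), one GD update uses the whole dataset while one SGD update uses a single sample:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)
w = np.zeros(3)                                    # parameters to learn
lr = 0.1                                           # learning rate

# Gradient Descent: one update evaluates ALL training samples
grad = X.T @ (X @ w - y) / len(y)
w_gd = w - lr * grad

# Stochastic Gradient Descent: one update evaluates a SINGLE random sample
i = rng.integers(len(y))
grad_i = X[i] * (X[i] @ w - y[i])
w_sgd = w - lr * grad_i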

2. What do you understand about the exploding gradient problem while using backpropagation techniques?

When large error gradients accumulate and cause very large updates to the neural network weights during training, we call it the exploding gradient problem. The weight values can grow so large that they overflow and produce NaN values. This makes the model unstable and can halt learning altogether, much like the vanishing gradient problem.

3. What according to you are the differentiating factors between Random Forest and Gradient Boosting machines?

Random forests build a substantial number of decision trees and pool their outputs at the end, using averaging or majority voting. Gradient boosting machines also combine decision trees, but they do so sequentially from the start of training rather than aggregating independent trees at the end, unlike random forests.

A random forest builds each tree independently of the others, whereas gradient boosting builds one tree at a time, with each new tree correcting the errors of the previous ones. Gradient boosting tends to produce better results than random forests if the parameters are tuned meticulously.

However, gradient boosting is not advisable when the dataset contains many outliers, anomalies, or noise, since it may overfit to them. Random forests tend to perform better for multiclass object detection, while gradient boosting performs well on imbalanced data, as in real-time risk assessment.

4. What do you understand about the confusion matrix? State its significance

A confusion matrix, also known as an error matrix, is a table used to illustrate the performance of a classification model, i.e. a classifier evaluated on a set of test data for which the true labels are known.

It enables us to accurately visualize the performance of an algorithm/model. It even enables us to easily identify the confusion caused amongst different classes. We may even use it as a performance measure of a model/algorithm.

A confusion matrix summarizes the predictions of a classification model. The counts of correct and incorrect predictions are broken down by class label, which tells us not only how many errors the classifier makes but also which types of errors they are.
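A minimal sketch with scikit-learn (the toy labels are made up for illustration):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # known labels of the test set
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by the classifier

# Rows correspond to true classes, columns to predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)   # for a binary problem this breaks down into TN, FP, FN, TP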

5. What do you mean by a Fourier transform?

The Fourier Transform is a mathematical technique that transforms a function of time into a function of frequency; it is closely related to the Fourier series. Given any time-based signal as input, it computes the strength, rotation speed, and cycle offset of every possible constituent cycle.

We can apply the Fourier transform to waveforms because they are functions of time or space. Applying a Fourier transform to a waveform decomposes it into a sum of sinusoids.
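For example, a small NumPy sketch (the 5 Hz test signal is our own choice) that recovers the dominant frequency of a waveform:

import numpy as np

fs = 100                                   # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                # 1 second of samples
signal = np.sin(2 * np.pi * 5 * t)         # a 5 Hz sine wave

# The FFT decomposes the waveform into its constituent sinusoids
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
print(freqs[np.argmax(np.abs(spectrum))])  # ~5.0, the dominant frequency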

6. What is Associative Rule Mining (ARM)?

Association Rule Mining is one of the well-known techniques for uncovering patterns in data, such as features (dimensions of the dataset) that occur together and features that are correlated.

ARM is used extensively in market-basket analysis to find how frequently an itemset occurs in transactions. Association rules must satisfy a minimum support and a minimum confidence threshold. Association rule generation generally involves two distinct steps (a short sketch follows the list):

  • Apply a minimum support threshold to obtain all frequent itemsets in the database.
  • Apply a minimum confidence constraint to these frequent itemsets in order to form the association rules.
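A minimal sketch of the two quantities using plain pandas (the toy baskets are our own; libraries such as mlxtend offer full Apriori implementations):

import pandas as pd

# One-hot encoded transactions: each row is a basket, each column an item
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 0]],
    columns=["bread", "butter", "milk"],
)
n = len(baskets)

# Step 1: support of the itemset {bread, butter}
support = ((baskets["bread"] == 1) & (baskets["butter"] == 1)).sum() / n

# Step 2: confidence of the rule bread -> butter
confidence = support / (baskets["bread"].sum() / n)

print(f"support={support:.2f}, confidence={confidence:.2f}")   # 0.75 and 1.00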

7. What do you understand about Marginalization? Explain the process in your own way.

Marginalization is the process of summing the probability of a random variable X over the values of another variable, given the joint probability distribution of X with that variable. It is an application of the law of total probability.

P(X = x) = Σ_y P(X = x, Y = y)

If we have the joint probability P(X = x, Y = y), marginalization lets us compute P(X = x). It is therefore useful for finding the distribution of one random variable by exhausting the cases of the remaining random variables.
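A tiny numerical example (the joint table is invented for illustration):

import numpy as np

# Joint distribution P(X, Y): rows index values of X, columns values of Y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Marginalize out Y by summing over the Y axis: P(X = x) = sum_y P(X = x, Y = y)
p_x = joint.sum(axis=1)
print(p_x)   # [0.3, 0.7]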

8. How would you explain the phrase “Curse of Dimensionality”?

The Curse of Dimensionality refers to the situation where the data has too many features.
We use this phrase to express the difficulty of using brute force or grid search to optimize a function with too many inputs.
It can also refer to several other issues, such as:

  • If the dataset has more features than observations, we run the risk of overfitting the model.
  • When the dataset has too many features, observations become harder to cluster. Too many dimensions can make every observation in the dataset appear equidistant from all the others, so no meaningful clusters can be formed.

9. How would you describe the Principal Component Analysis or PCA?

The idea behind this technique is to reduce the dimensionality of the dataset by reducing the number of variables that are correlated with each other, while retaining the variation present in the data to the maximum extent.

The algorithm then transforms the variables into a new set of variables that we collectively refer to as Principal Components. These components are the eigenvectors of a covariance matrix at their core and therefore are orthogonal in nature.
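A short scikit-learn sketch (the random data with one deliberately correlated column is ours):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # 200 observations, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # make one column correlated with another

pca = PCA(n_components=2)                        # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)             # share of total variance each component retains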

10. State the significance of rotation of components in Principal Component Analysis (PCA).

Rotation in PCA is crucial because it maximizes the separation of the variance captured by the components, which makes the components easier to interpret. If we do not rotate the components, we may need many more components to describe the same amount of variance.

11. What do you understand about outliers? Mention three methods that you would use to deal with outliers.

An outlier is a data point that lies considerably far from the other, similar data points. Outliers may arise from experimental errors or from variability in measurement. They are quite challenging because they can mislead the training process, which may result in longer training times, inaccurate models, and poor, low-precision results.

The three methods we generally prefer to deal with outliers are:

  • Univariate method – tends to find data points having extreme values on a single variable
  • Multivariate method – tends to find unusual combinations on all the variables
  • Minkowski error – significantly reduces the involvement of potential outliers in the training process

12. Differentiate between regularization and normalization.

At its core, normalization adjusts the data, while regularization adjusts the prediction function. If the features are on very different scales (especially low to high), we want to normalize the data, transforming each column so that it has comparable basic statistics. This helps ensure that no accuracy is lost simply because of scale differences.

One of the major goals of model training is to capture the signal and ignore the noise. If the model is given free rein to minimize error, it may overfit. Regularization imposes some control on this by favoring simpler fitting functions over complex ones.

13. Differentiate between Normalization and Standardization.

Normalization and Standardization are two well-known methods of feature scaling. Normalization re-scales values so that they fit into the range [0, 1]. Standardization re-scales data to have a mean of 0 and a standard deviation of 1 (unit variance).

Normalization is useful when all parameters need to be on the same positive scale; however, the outliers in the data get squashed into that range and their information is lost. Hence, standardization is preferred for most applications.
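A quick comparison with scikit-learn (the toy column of values is ours):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])

# Normalization: rescale the values into the range [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: rescale to mean 0 and unit variance
print(StandardScaler().fit_transform(X).ravel())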

14. List out the methods to check the normality of a data set or a feature

At the most basic visual level, we can simply check normality using plots. There is also a list of formal normality tests (a couple of them are sketched after the list):

  • Shapiro-Wilk W Test
  • Anderson-Darling Test
  • Martinez-Iglewicz Test
  • Kolmogorov-Smirnov Test
  • D’Agostino Skewness Test
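Two of these tests as a SciPy sketch (the normally distributed sample is generated just for the demonstration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=500)

# Shapiro-Wilk W test: a large p-value means normality cannot be rejected
stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.3f}")

# D'Agostino-style test based on skewness and kurtosis
stat, p = stats.normaltest(sample)
print(f"D'Agostino: stat={stat:.3f}, p={p:.3f}")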

15. State the difference between regression and classification.

We can categorize Regression and Classification under the same umbrella of supervised machine learning. The key difference between these two algorithms is that the output variable in the regression is a numerical or continuous value while that for classification is a categorical or discrete value.

Example: predicting the exact temperature of a geographical region is a regression problem, whereas predicting whether the day will be sunny, cloudy, or rainy is a classification problem.

16. What are all the assumptions for data to be met before starting with linear regression?

Before beginning with linear regression, the assumptions that we ought to meet are as follows:

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

17. In what situation does the linear regression line stop rotating, i.e. find the optimal spot where it fits the data?

The line comes to a halt at the place where the R-squared value is highest. R-squared measures the amount of variance captured by the fitted linear regression line relative to the total variance present in the dataset.

18. Why do you think logistic regression is a type of classification technique and not a regression? Name the function that you can derive from it.

Since the target column is categorical, logistic regression uses linear regression to build an odds function, which it wraps with a log function in order to use regression machinery as a classifier. This is why we consider it a classification technique and not a regression. The logistic (sigmoid) function, which maps the log-odds to a probability, can be derived from it.
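As a small sketch, the sigmoid maps any linear combination of inputs (the log-odds) into a probability between 0 and 1 (the sample values of z are arbitrary):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps the log-odds to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])   # example log-odds values
print(sigmoid(z))                # ~[0.12, 0.5, 0.88]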

19. What in your view could be the issue when the beta value for a certain variable varies too much in each subset when regression is run on different subsets of the given dataset?

Variations in the beta values in every subset suggest that the dataset is heterogeneous in nature. In an attempt to overcome this problem, we ought to use a different model for each of the clustered subsets of the dataset. As an alternative, we may even use a non-parametric model like decision trees.

20. What do you understand about the term Variance Inflation Factor?

The Variance Inflation Factor (VIF) is the ratio of the variance of the full model to the variance of a model containing only that single independent variable. VIF estimates the amount of multicollinearity present in a set of regression variables.

In simple words, VIF = variance of the full model / variance of the model with a single independent variable.
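A sketch assuming the statsmodels library is available (the nearly collinear toy columns are our own):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
df["x3"] = rng.normal(size=100)

# VIF for each predictor: values well above 5-10 signal strong multicollinearity
for i, col in enumerate(df.columns):
    print(col, variance_inflation_factor(df.values, i))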

21. Which machine learning algorithm is also known by the name lazy learner and what according to you is the reason behind it?

KNN, or K-Nearest Neighbors, is the machine learning algorithm also known as a lazy learner. K-NN is a lazy learner because it does not learn any model parameters from the training data; instead, it memorizes the training dataset and dynamically computes distances each time a new point needs to be classified.

22. According to you, is it possible to make use of KNN for image processing?

Yes, it is possible to use KNN for image processing. We can do so by flattening the 3-dimensional image into a single-dimensional feature vector and using that vector as the input to KNN.

23. What factors help the SVM algorithm to deal with self-learning?

SVM has a learning rate and an expansion rate that take care of this. The learning rate compensates or penalizes the hyperplanes for the wrong moves they make, while the expansion rate deals with finding the maximum separation area between classes.

24. What do you mean by Kernels in SVM? List popular kernels that you use in SVM.

The kernel takes data as input and transforms it into the required form. A few widely used kernels in SVM are the RBF, linear, sigmoid, polynomial, hyperbolic tangent, and Laplace kernels, among others.

25. How would you define Kernel Trick in an SVM Algorithm?

The kernel trick uses a mathematical function which, when applied to data points, computes their similarity as if they had been mapped into a higher-dimensional space, making it possible to find a region of separation between two classes without performing that mapping explicitly. Based on the choice of function, linear or radial, which depends largely on the distribution of the data, one can build a classifier.
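A brief scikit-learn sketch of why the choice of kernel matters (the concentric-circles dataset is a standard toy example):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in the original feature space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)       # kernel trick: implicit non-linear mapping

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))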

26. What do you understand about ensemble models? Describe how ensemble techniques may yield better learning as compared to traditional classification ML algorithms.

An ensemble is a group of models used collectively for prediction, in both classification and regression. Ensemble learning helps improve ML results because combining several models yields better predictive performance than the output of any single model.

They tend to be superior to individual models as they are capable of reducing variance, averaging out biases, and having lesser chances of overfitting.

27. What do you understand by overfitting and underfitting? What according to you are the reasons that the decision tree algorithm suffers often overfitting problems?

Overfitting refers to a statistical model or machine learning algorithm that captures the noise of the data. Underfitting refers to a model or algorithm that cannot fit the data well enough; it emerges when the model shows low variance but high bias.

In decision trees, overfitting emerges when the tree is designed to perfectly fit all samples in the training data set. This results in branches with strict rules on sparse data, and it hurts accuracy when predicting samples that are not part of the training set.

28. What do you understand about the OOB error and how does it emerge?

For each bootstrap sample, about one-third of the data is not used in building the tree; this is the out-of-bag (OOB) data. To get an unbiased estimate of the model's accuracy on unseen data, we use the out-of-bag error: each tree is evaluated on its own out-of-bag data, and the outputs are aggregated to give the OOB error.

29. Why do you think boosting is a more stable algorithm as compared to other ensemble algorithms?

Boosting focuses on the errors found in previous iterations until they become negligible relative to the size of the input data, whereas in bagging there is no such corrective loop. This is the major reason why boosting is considered more stable than other ensemble algorithms.

30. What steps would you perform to handle outliers in the data?

We can discover outliers using tools and methods such as box plots, scatter plots, Z-scores, and the IQR score, and then handle them based on what the visualization shows. To handle outliers, we can cap them at some threshold value, apply transformations to reduce the skewness of the data, or remove them altogether if they are anomalies or errors. A small sketch follows.
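A minimal IQR-based sketch (the toy series with one obvious outlier is ours):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])   # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])   # detect the outliers
capped = s.clip(lower, upper)         # or cap them at the threshold values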

31. List out the widely used cross-validation techniques.

There are six cross-validation techniques that we use most often (a small sketch follows the list):

  • K fold
  • Stratified k fold
  • Leave one out
  • Bootstrapping
  • Random search cv
  • Grid search cv
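A short scikit-learn sketch of stratified k-fold cross-validation (the Iris dataset and logistic regression are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified k-fold keeps the class proportions the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())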

32. In your view, is it plausible to test for the probability of enhancing model accuracy without cross-validation techniques? If yes, please explain your choice.

Yes, it is feasible to test for the probability of improving model accuracy without cross-validation. We can do so by running the ML model for an arbitrary number n of iterations and recording the accuracy each time. We then plot all the accuracies and discard the extreme 5% of values.

Next, we measure the left (low) cut-off and right (high) cut-off of the remaining values. With 95% confidence, we can then say how low or how high the model's accuracy can go.

33. You are required to train a 12GB dataset with the help of a neural network with a machine that has only 3GB RAM. How would you proceed?

We can use NumPy arrays to solve this issue. NumPy's memory-mapped arrays can map the complete dataset on disk without loading it entirely into memory. We can then index into the array to divide the data into batches, fetch only the data we need, and feed those batches to the neural network we have built, keeping the batch size reasonable.
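A sketch with numpy.memmap (the file name, shape, and batch size are placeholders, not prescriptions):

import numpy as np

# Memory-map a large binary file of float32 features without loading it into RAM
data = np.memmap("big_dataset.dat", dtype=np.float32, mode="r",
                 shape=(3_000_000, 100))          # placeholder file and shape

batch_size = 1024
for start in range(0, data.shape[0], batch_size):
    batch = np.asarray(data[start:start + batch_size])   # only this slice is read from disk
    # ...feed `batch` to the neural network here...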

34. State the advantages and disadvantages of using Neural Networks for solving our problems.

Advantages:

Information is stored across the entire network rather than in a single database, and the network can still produce reasonably precise results even with incomplete information. A neural network also has parallel processing ability and distributed memory, which work to its advantage.

Disadvantages:

Neural networks demand processors that are capable of parallel processing. The unexplained, black-box functioning of the network is also troublesome, as it reduces trust in the network in situations where we need to explain the decision it reached. The duration of training the network is also largely unpredictable.

35. Define a hash table according to your understanding of it.

Hashing is a technique used to identify unique objects within a group of similar objects. A hash function converts large keys into small keys, and the resulting values are stored in data structures known as hash tables.

36. What do you understand about the meshgrid() method and the contourf() method?

The meshgrid() function in NumPy takes two arguments as input, the range of x-values and the range of y-values of the grid, and returns coordinate matrices for that grid. The meshgrid output must be built before calling the contourf() function in matplotlib, which takes the x-values, y-values, and the fitted surface (z-values) as input. We use contourf() to draw filled contours from the given x-axis inputs, y-axis inputs, contour levels, colors, and other such options.
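A small NumPy/matplotlib sketch (the Gaussian surface is only an example function):

import numpy as np
import matplotlib.pyplot as plt

# Build the grid from the range of x-values and the range of y-values
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)

# Evaluate a surface on the grid and draw filled contours
Z = np.exp(-(X**2 + Y**2))
plt.contourf(X, Y, Z, levels=20, cmap="viridis")
plt.colorbar()
plt.show()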

37. What do you understand by the terms hyperparameters? Differentiate them from parameters.

A parameter represents a variable that is internal to the model and we tend to estimate their value from the training data.

A hyperparameter is a variable that is external to the model and whose value cannot be estimated from the data; we set it before training, and it governs how the model parameters are estimated.

38. Differentiate between a generative and discriminative model.

A generative model learns the distribution of each category of data. A discriminative model, on the other hand, only learns the distinctions between the different categories present in the input. Discriminative models generally perform better than generative models on classification tasks.

39. How does the ROC curve work?

The ROC curve is a graphical representation of the contrast between the true positive rate and the false positive rate at various classification thresholds. We use it as a proxy for the trade-off between true positives and false positives.
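A compact scikit-learn sketch (the synthetic dataset and logistic regression model are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # TPR vs FPR at every threshold
print("AUC:", roc_auc_score(y_te, scores))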

40. What according to you are the differentiating factors between Statistical Modeling and Machine Learning?

Machine learning models deal with making accurate predictions about the situations such as FootFall in restaurants, Stock-Price, and many such examples. Whereas, we design Statistical models for extrapolation about the relationships between variables.

41. Develop a simple code to binarize data.

Conversion of data into binary values based on a certain threshold is known as binarizing of data.

Code:

from sklearn.preprocessing import Binarizer
import pandas
import numpy

# Column names for the CSV file; url is a placeholder and should point to your dataset
names_list = ['Alaska', 'Pratyush', 'Pierce', 'Sandra', 'Soundarya', 'Meredith', 'Richard', 'Jackson', 'Tom', 'Joe']
url = 'data.csv'  # placeholder path/URL to a CSV file with 10 numeric columns
data_frame = pandas.read_csv(url, names=names_list)
array = data_frame.values
# Splitting the array into input (first 7 columns) and output (8th column)
A = array[:, 0:7]
B = array[:, 7]
# Values above the threshold map to 1, the rest to 0
binarizer = Binarizer(threshold=0.0).fit(A)
binaryA = binarizer.transform(A)
numpy.set_printoptions(precision=5)
print(binaryA[0:7, :])

42. How will you define functions in Python?

Functions in Python are blocks of organized, reusable code that perform a single, related action. Functions are crucial for building better modularity in applications and for a high degree of code reuse.

43. How will you define data frames?

A pandas DataFrame is a mutable, two-dimensional data structure. Pandas supports heterogeneous data, arranged across two axes (rows and columns).

44. State the advantages of using an Array.

  • It enables Random access
  • Saves memory
  • Cache friendly
  • Predictable compile timing
  • Assists in re-usability of code

45. Describe Eigenvectors and Eigenvalues.

Eigenvectors are best understood through linear transformations; they find their prime usage in computing covariance and correlation matrices in data science. In plain words, eigenvectors are the directional entities along which a linear transformation acts only by stretching, compressing, or flipping, and the corresponding eigenvalues are the factors by which the transformation scales those directions.
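A quick NumPy check of the defining property A·v = λ·v (the 2×2 matrix is arbitrary):

import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Each eigenvector v satisfies A @ v = eigenvalue * v
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)
print(eigenvectors)                               # columns are the eigenvectors

v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))     # True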

46. What are the performance metrics that you can make use of in an attempt to estimate the efficiency of a linear regression model?

The performance metrics that we ought to use in this case are:

  1. Mean Squared Error
  2. R2 score
  3. Adjusted R2 score
  4. Mean Absolute Error

47. What according to you is the default method of splitting in decision trees?

The default method used for splitting in decision trees is the Gini Index. The Gini Index measures the impurity of a particular node: Gini = 1 − Σᵢ pᵢ², where pᵢ is the proportion of samples of class i at that node.

48. In your view, how will you justify the use of p-value?

The p-value gives the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true. It indicates the statistical significance of the model's results: in simple words, a small p-value gives us confidence that a particular output of the model is not due to chance.

49. List out a few hyper-parameters of decision trees

The major features which one can tune in decision trees are:

  • Splitting criteria
  • Min_leaves
  • Min_samples
  • Max_depth

50. What are the ways in which we can deal with multicollinearity?

We are able to deal with multicollinearity with the help of the following steps:

  • Removing highly correlated predictors from the model.
  • Deploying Partial Least Squares (PLS) regression or Principal Component Analysis (PCA).

Conclusion

We have thus reached the end of the article, where we discussed important Machine Learning interview questions. These questions should help you ace your interviews and strengthen your knowledge of the basic concepts.



PythonGeeks Team

