	{
	"metadata": {
	"model_name": "llama-3.3-70b-instruct",
	"ntrain": 0,
	"ntest": -1,
	"timestamp": "20241212-084645"
	},
	"subject": "machine_learning",
	"category": "STEM",
	"correct_rate": 56.25,
	"evaluation_results": [
	{
	"question": "Statement 1| Linear regression estimator has the smallest variance among all unbiased estimators. Statement 2| The coefficients α assigned to the classifiers assembled by AdaBoost are always non-negative.",
	"model_answer": "C",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 32.805498123168945
	},
	{
	"question": "Statement 1| RoBERTa pretrains on a corpus that is approximately 10x larger than the corpus BERT pretrained on. Statement 2| ResNeXts in 2018 usually used tanh activation functions.",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 32.805607318878174
	},
	{
	"question": "Statement 1| Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example. Statement 2| We would expect the support vectors to remain the same in general as we move from a linear kernel to higher order polynomial kernels.",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 32.805702209472656
	},
	{
	"question": "A machine learning problem involves four attributes plus a class. The attributes have 3, 2, 2, and 2 possible values each. The class has 3 possible values. How many maximum possible different examples are there?",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 32.805793046951294
	},
	{
	"question": "As of 2020, which architecture is best for classifying high-resolution images?",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 32.805885314941406
	},
	{
	"question": "Statement 1| The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. Statement 2| One disadvantage of Q-learning is that it can only be used when the learner has prior knowledge of how its actions affect its environment.",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 32.805976152420044
	},
	{
	"question": "Let us say that we have computed the gradient of our cost function and stored it in a vector g. What is the cost of one gradient descent update given the gradient?",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 32.806068420410156
	},
	{
	"question": "Statement 1| For a continuous random variable x and its probability distribution function p(x), it holds that 0 ≤ p(x) ≤ 1 for all x. Statement 2| Decision tree is learned by minimizing information gain.",
	"model_answer": "C",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 32.806156158447266
	},
	{
	"question": "Consider the Bayesian network given below. How many independent parameters are needed for this Bayesian Network H -> U <- P <- W?",
	"model_answer": "B",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 32.80624222755432
	},
	{
	"question": "As the number of training examples goes to infinity, your model trained on that data will have:",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 32.80632710456848
	},
	{
	"question": "Statement 1| The set of all rectangles in the 2D plane (which includes non-axis-aligned rectangles) can shatter a set of 5 points. Statement 2| The VC-dimension of k-Nearest Neighbour classifier when k = 1 is infinite.",
	"model_answer": "D",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.305633783340454
	},
	{
	"question": "_ refers to a model that can neither model the training data nor generalize to new data.",
	"model_answer": "C.",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 31.305754899978638
	},
	{
	"question": "Statement 1| The F1 score can be especially useful for datasets with high class imbalance. Statement 2| The area under the ROC curve is one of the main metrics used to assess anomaly detectors.",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.305847883224487
	},
	{
	"question": "Statement 1| The back-propagation algorithm learns a globally optimal neural network with hidden layers. Statement 2| The VC dimension of a line should be at most 2, since I can find at least one case of 3 points that cannot be shattered by any line.",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 31.305938959121704
	},
	{
	"question": "High entropy means that the partitions in classification are",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 31.306030750274658
	},
	{
	"question": "Statement 1| Layer Normalization is used in the original ResNet paper, not Batch Normalization. Statement 2| DCGANs use self-attention to stabilize training.",
	"model_answer": "C",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 31.30612301826477
	},
	{
	"question": "In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that",
	"model_answer": "A",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 31.306216955184937
	},
	{
	"question": "For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 31.306306838989258
	},
	{
	"question": "For polynomial regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 31.306394815444946
	},
	{
	"question": "Statement 1| As of 2020, some models attain greater than 98% accuracy on CIFAR-10. Statement 2| The original ResNets were not optimized with the Adam optimizer.",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 31.30648183822632
	},
	{
	"question": "The K-means algorithm:",
	"model_answer": "D",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 30.132917881011963
	},
	{
	"question": "Statement 1| VGGNets have convolutional kernels of smaller width and height than AlexNet's first-layer kernels. Statement 2| Data-dependent weight initialization procedures were introduced before Batch Normalization.",
	"model_answer": "D",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 30.133347988128662
	},
	{
	"question": "What is the rank of the following matrix? A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 30.13394594192505
	},
	{
	"question": "Statement 1| Density estimation (using say, the kernel density estimator) can be used to perform classification. Statement 2| The correspondence between logistic regression and Gaussian Naive Bayes (with identity class covariances) means that there is a one-to-one correspondence between the parameters of the two classifiers.",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 30.13504195213318
	},
	{
	"question": "Suppose we would like to perform clustering on spatial data such as the geometrical locations of houses. We wish to produce clusters of many different sizes and shapes. Which of the following methods is the most appropriate?",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 30.135695934295654
	},
	{
	"question": "Statement 1| In AdaBoost weights of the misclassified examples go up by the same multiplicative factor. Statement 2| In AdaBoost, weighted training error e_t of the tth weak classifier on training data with weights D_t tends to increase as a function of t.",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 30.13606095314026
	},
	{
	"question": "MLE estimates are often undesirable because",
	"model_answer": "A.",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 30.136688947677612
	},
	{
	"question": "Computational complexity of Gradient descent is,",
	"model_answer": "B.",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 30.137227058410645
	},
	{
	"question": "Averaging the output of multiple decision trees helps _.",
	"model_answer": "B",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 30.138023853302002
	},
	{
	"question": "The model obtained by applying linear regression on the identified subset of features may differ from the model obtained at the end of the process of identifying the subset during",
	"model_answer": "D",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 30.13900399208069
	},
	{
	"question": "Neural networks:",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 30.223735094070435
	},
	{
	"question": "Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for \"tests positive.\" Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(TP), the prior probability of testing positive?",
	"model_answer": "A",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 30.223953008651733
	},
	{
	"question": "Statement 1| After being mapped into feature space Q through a radial basis kernel function, 1-NN using unweighted Euclidean distance may be able to achieve better classification performance than in original space (though we can’t guarantee this). Statement 2| The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.",
	"model_answer": "C",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 30.224052906036377
	},
	{
	"question": "The disadvantage of Grid search is",
	"model_answer": "C",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 30.22414207458496
	},
	{
	"question": "Predicting the amount of rainfall in a region based on various cues is a ______ problem.",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 30.22423219680786
	},
	{
	"question": "Which of the following sentences is FALSE regarding regression?",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 30.2243230342865
	},
	{
	"question": "Which one of the following is the main reason for pruning a Decision Tree?",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 30.224412202835083
	},
	{
	"question": "Statement 1| The kernel density estimator is equivalent to performing kernel regression with the value Yi = 1/n at each point Xi in the original data set. Statement 2| The depth of a learned decision tree can be larger than the number of training examples used to create the tree.",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 30.224501848220825
	},
	{
	"question": "Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 30.22459101676941
	},
	{
	"question": "Statement 1| The softmax function is commonly used in multiclass logistic regression. Statement 2| The temperature of a nonuniform softmax distribution affects its entropy.",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 30.22468090057373
	},
	{
	"question": "Which of the following is/are true regarding an SVM?",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 34.56947612762451
	},
	{
	"question": "Which of the following is the joint probability of H, U, P, and W described by the given Bayesian Network H -> U <- P <- W? [note: as the product of the conditional probabilities]",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 34.56960988044739
	},
	{
	"question": "Statement 1| Since the VC dimension for an SVM with a Radial Base Kernel is infinite, such an SVM must be worse than an SVM with polynomial kernel which has a finite VC dimension. Statement 2| A two layer neural network with linear activation functions is essentially a weighted combination of linear separators, trained on a given dataset; the boosting algorithm built on linear separators also finds a combination of linear separators, therefore these two algorithms will give the same result.",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 34.569709062576294
	},
	{
	"question": "Statement 1| The ID3 algorithm is guaranteed to find the optimal decision tree. Statement 2| Consider a continuous probability distribution with density f() that is nonzero everywhere. The probability of a value x is equal to f(x).",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 34.56980299949646
	},
	{
	"question": "Given a Neural Net with N input nodes, no hidden layers, one output node, with Entropy Loss and Sigmoid Activation Functions, which of the following algorithms (with the proper hyper-parameters and initialization) can be used to find the global optimum?",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 34.56989574432373
	},
	{
	"question": "Adding more basis functions in a linear model, pick the most probable option:",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 34.57005596160889
	},
	{
	"question": "Consider the Bayesian network given below. How many independent parameters would we need if we made no assumptions about independence or conditional independence H -> U <- P <- W?",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 34.57017397880554
	},
	{
	"question": "Another term for out-of-distribution detection is?",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 34.57037591934204
	},
	{
	"question": "Statement 1| We learn a classifier f by boosting weak learners h. The functional form of f’s decision boundary is the same as h’s, but with different parameters. (e.g., if h was a linear classifier, then f is also a linear classifier). Statement 2| Cross validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting.",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 34.57126593589783
	},
	{
	"question": "Statement 1| Highway networks were introduced after ResNets and eschew max pooling in favor of convolutions. Statement 2| DenseNets usually cost more memory than ResNets.",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 34.571375131607056
	},
	{
	"question": "If N is the number of instances in the training dataset, nearest neighbors has a classification run time of",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 30.35732102394104
	},
	{
	"question": "Statement 1| The original ResNets and Transformers are feedforward neural networks. Statement 2| The original Transformers use self-attention, but the original ResNet does not.",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 30.357688903808594
	},
	{
	"question": "Statement 1| RELUs are not monotonic, but sigmoids are monotonic. Statement 2| Neural networks trained with gradient descent with high probability converge to the global optimum.",
	"model_answer": "C",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 30.358047008514404
	},
	{
	"question": "The numerical output of a sigmoid node in a neural network:",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 30.35838794708252
	},
	{
	"question": "Which of the following can only be used when training data are linearly separable?",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 30.358869791030884
	},
	{
	"question": "Which of the following are the spatial clustering algorithms?",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 30.359225749969482
	},
	{
	"question": "Statement 1| The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers. Statement 2| Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel of degree less than or equal to three.",
	"model_answer": "B",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 30.359570026397705
	},
	{
	"question": "Statement 1| L2 regularization of linear models tends to make models more sparse than L1 regularization. Statement 2| Residual connections can be found in ResNets and Transformers.",
	"model_answer": "D",
	"correct_answer": 3,
	"is_correct": true,
	"response_time": 30.359920978546143
	},
	{
	"question": "Suppose we like to calculate P(H|E, F) and we have no conditional independence information. Which of the following sets of numbers are sufficient for the calculation?",
	"model_answer": "C",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 30.360265970230103
	},
	{
	"question": "Which among the following prevents overfitting when we perform bagging?",
	"model_answer": "A",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 30.360610723495483
	},
	{
	"question": "Statement 1| PCA and Spectral Clustering (such as Andrew Ng’s) perform eigendecomposition on two different matrices. However, the sizes of these two matrices are the same. Statement 2| Since classification is a special case of regression, logistic regression is a special case of linear regression.",
	"model_answer": "C",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 31.28680396080017
	},
	{
	"question": "Statement 1| The Stanford Sentiment Treebank contained movie reviews, not book reviews. Statement 2| The Penn Treebank has been used for language modeling.",
	"model_answer": "D",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.288477182388306
	},
	{
	"question": "What is the dimensionality of the null space of the following matrix? A = [[3, 2, −9], [−6, −4, 18], [12, 8, −36]]",
	"model_answer": "B",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 31.29107403755188
	},
	{
	"question": "What are support vectors?",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 31.29118514060974
	},
	{
	"question": "Statement 1| Word2Vec parameters were not initialized using a Restricted Boltzmann Machine. Statement 2| The tanh function is a nonlinear activation function.",
	"model_answer": "D",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.291295051574707
	},
	{
	"question": "If your training loss increases with number of epochs, which of the following could be a possible issue with the learning process?",
	"model_answer": "C.",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 31.291388273239136
	},
	{
	"question": "Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for \"tests positive.\" Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(D | TP), the posterior probability that you have disease D when the test is positive?",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 31.291479110717773
	},
	{
	"question": "Statement 1| Traditional machine learning results assume that the train and test sets are independent and identically distributed. Statement 2| In 2017, COCO models were usually pretrained on ImageNet.",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 31.29157304763794
	},
	{
	"question": "Statement 1| The values of the margins obtained by two different kernels K1(x, x0) and K2(x, x0) on the same training set do not tell us which classifier will perform better on the test set. Statement 2| The activation function of BERT is the GELU.",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.291664123535156
	},
	{
	"question": "Which of the following is a clustering algorithm in machine learning?",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 31.291754007339478
	},
	{
	"question": "You've just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?",
	"model_answer": "C.",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.903929948806763
	},
	{
	"question": "K-fold cross-validation is",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 31.904037952423096
	},
	{
	"question": "Statement 1| Industrial-scale neural networks are normally trained on CPUs, not GPUs. Statement 2| The ResNet-50 model has over 1 billion parameters.",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 31.904130935668945
	},
	{
	"question": "Given two Boolean random variables, A and B, where P(A) = 1/2, P(B) = 1/3, and P(A | ¬B) = 1/4, what is P(A | B)?",
	"model_answer": "B",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 31.904222011566162
	},
	{
	"question": "Existential risks posed by AI are most commonly associated with which of the following professors?",
	"model_answer": "C.",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 31.90431809425354
	},
	{
	"question": "Statement 1| Maximizing the likelihood of logistic regression model yields multiple local optimums. Statement 2| No classifier can do better than a naive Bayes classifier if the distribution of the data is known.",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 31.90440797805786
	},
	{
	"question": "For Kernel Regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 31.90450096130371
	},
	{
	"question": "Statement 1| The SVM learning algorithm is guaranteed to find the globally optimal hypothesis with respect to its objective function. Statement 2| After being mapped into feature space Q through a radial basis kernel function, a Perceptron may be able to achieve better classification performance than in its original space (though we can’t guarantee this).",
	"model_answer": "D",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 31.90459704399109
	},
	{
	"question": "For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 31.9046950340271
	},
	{
	"question": "Statement 1| Overfitting is more likely when the set of training data is small. Statement 2| Overfitting is more likely when the hypothesis space is small.",
	"model_answer": "C",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 31.90478491783142
	},
	{
	"question": "Statement 1| Besides EM, gradient descent can be used to perform inference or learning on a Gaussian mixture model. Statement 2| Assuming a fixed number of attributes, a Gaussian-based Bayes optimal classifier can be learned in time linear in the number of records in the dataset.",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 30.592442750930786
	},
	{
	"question": "Statement 1| In a Bayesian network, the inference results of the junction tree algorithm are the same as the inference results of variable elimination. Statement 2| If two random variables X and Y are conditionally independent given another random variable Z, then in the corresponding Bayesian network, the nodes for X and Y are d-separated given Z.",
	"model_answer": "A",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 30.592967748641968
	},
	{
	"question": "Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments. What kind of learning problem is this?",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 30.593397855758667
	},
	{
	"question": "What would you do in PCA to get the same projection as SVD?",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 30.593926906585693
	},
	{
	"question": "Statement 1| The training error of 1-nearest neighbor classifier is 0. Statement 2| As the number of data points grows to infinity, the MAP estimate approaches the MLE estimate for all possible priors. In other words, given enough data, the choice of prior is irrelevant.",
	"model_answer": "D",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 30.595048904418945
	},
	{
	"question": "When doing least-squares regression with regularisation (assuming that the optimisation can be done exactly), increasing the value of the regularisation parameter λ the testing error.",
	"model_answer": "B.",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 30.595489978790283
	},
	{
	"question": "Which of the following best describes what discriminative approaches try to model? (w are the parameters in the model)",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 30.596614837646484
	},
	{
	"question": "Statement 1| CIFAR-10 classification performance for convolutional neural networks can exceed 95%. Statement 2| Ensembles of neural networks do not improve classification accuracy since the representations they learn are highly correlated.",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 30.599475860595703
	},
	{
	"question": "Which of the following points would Bayesians and frequentists disagree on?",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 30.600045680999756
	},
	{
	"question": "Statement 1| The BLEU metric uses precision, while the ROUGE metric uses recall. Statement 2| Hidden Markov models were frequently used to model English sentences.",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 30.600421905517578
	},
	{
	"question": "Statement 1| ImageNet has images of various resolutions. Statement 2| Caltech-101 has more images than ImageNet.",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 29.973453044891357
	},
	{
	"question": "Which of the following is more appropriate to do feature selection?",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 29.973824977874756
	},
	{
	"question": "Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify?",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 29.974172830581665
	},
	{
	"question": "For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 29.975064039230347
	},
	{
	"question": "Statement 1| For any two variables x and y having joint distribution p(x, y), we always have H[x, y] ≥ H[x] + H[y] where H is entropy function. Statement 2| For some directed graphs, moralization decreases the number of edges present in the graph.",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 29.97546887397766
	},
	{
	"question": "Which of the following is NOT supervised learning?",
	"model_answer": "A.",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 29.975868940353394
	},
	{
	"question": "Statement 1| A neural network's convergence depends on the learning rate. Statement 2| Dropout multiplies randomly chosen activation values by zero.",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 29.979493856430054
	},
	{
	"question": "Which one of the following is equal to P(A, B, C) given Boolean random variables A, B and C, and no independence or conditional independence assumptions between any of them?",
	"model_answer": "D",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 29.99234390258789
	},
	{
	"question": "Which of the following tasks can be best solved using Clustering.",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 29.992503881454468
	},
	{
	"question": "After applying a regularization penalty in linear regression, you find that some of the coefficients of w are zeroed out. Which of the following penalties might have been used?",
	"model_answer": "B",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 29.992599964141846
	},
	{
	"question": "A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?",
	"model_answer": "D",
	"correct_answer": 1,
	"is_correct": false,
	"response_time": 50.51357913017273
	},
	{
	"question": "Statement 1| When learning an HMM for a fixed set of observations, assume we do not know the true number of hidden states (which is often the case), we can always increase the training data likelihood by permitting more hidden states. Statement 2| Collaborative filtering is often a useful model for modeling users' movie preference.",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 50.51396989822388
	},
	{
	"question": "You are training a linear regression model for a simple estimation task, and notice that the model is overfitting to the data. You decide to add in $\\ell_2$ regularization to penalize the weights. As you increase the $\\ell_2$ regularization coefficient, what will happen to the bias and variance of the model?",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 50.514410972595215
	},
	{
	"question": "Which PyTorch 1.8 command(s) produce $10\\times 5$ Gaussian matrix with each entry i.i.d. sampled from $\\mathcal{N}(\\mu=5,\\sigma^2=16)$ and a $10\\times 10$ uniform matrix with each entry i.i.d. sampled from $U[-1,1)$?",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 50.51485204696655
	},
	{
	"question": "Statement 1| The ReLU's gradient is zero for $x<0$, and the sigmoid gradient $\\sigma(x)(1-\\sigma(x))\\le \\frac{1}{4}$ for all $x$. Statement 2| The sigmoid has a continuous gradient and the ReLU has a discontinuous gradient.",
	"model_answer": "C",
	"correct_answer": 0,
	"is_correct": false,
	"response_time": 50.51497292518616
	},
	{
	"question": "Which is true about Batch Normalization?",
	"model_answer": "B.",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 50.51506996154785
	},
	{
	"question": "Suppose we have the following objective function: $\\argmin_{w} \\frac{1}{2} \\norm{Xw-y}^2_2 + \\frac{1}{2}\\gamma \\norm{w}^2_2$ What is the gradient of $\\frac{1}{2} \\norm{Xw-y}^2_2 + \\frac{1}{2}\\lambda \\norm{w}^2_2$ with respect to $w$?",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 50.51516795158386
	},
	{
	"question": "Which of the following is true of a convolution kernel?",
	"model_answer": "B",
	"correct_answer": 1,
	"is_correct": true,
	"response_time": 50.515406131744385
	},
	{
	"question": "Which of the following is false?",
	"model_answer": "A",
	"correct_answer": 3,
	"is_correct": false,
	"response_time": 50.51551795005798
	},
	{
	"question": "Which of the following is false?",
	"model_answer": "B",
	"correct_answer": 2,
	"is_correct": false,
	"response_time": 50.5156090259552
	},
	{
	"question": "We are training a fully connected network with two hidden layers to predict housing prices. Inputs are $100$-dimensional, and have several features such as the number of square feet, the median family income, etc. The first hidden layer has $1000$ activations. The second hidden layer has $10$ activations. The output is a scalar representing the house price. Assuming a vanilla network with affine transformations and with no batch normalization and no learnable parameters in the activation function, how many parameters does this network have?",
	"model_answer": "A",
	"correct_answer": 0,
	"is_correct": true,
	"response_time": 10.288448095321655
	},
	{
	"question": "Statement 1| The derivative of the sigmoid $\\sigma(x)=(1+e^{-x})^{-1}$ with respect to $x$ is equal to $\\text{Var}(B)$ where $B\\sim \\text{Bern}(\\sigma(x))$ is a Bernoulli random variable. Statement 2| Setting the bias parameters in each layer of a neural network to 0 changes the bias-variance trade-off such that the model's variance increases and the model's bias decreases",
	"model_answer": "C",
	"correct_answer": 2,
	"is_correct": true,
	"response_time": 10.288880109786987
	}
	]
	}