# Adv Data Science using R Quiz and Assignment 3

These are the solutions to Advanced Data Science Using R, Assignment 3.

## Quiz

Q1. Which of the following languages is used in data science?
R
C
C++
Ruby

Q2. What is the primary file type of R?
Vector
Text file
RScripts
Statistical file

Q3. Which one of the following R packages is used for data products?
haven
igraph
slidify
forecast

Q4. Which of the following is valid for checking a categorical variable?
Level
Table
Unique
All of the above

Q5. Suppose ABC is a matrix of 3 rows and 4 columns. Choose the correct option(s) to rename its rows:
row_names(ABC)= c("row1","row2","row3")
rownames(ABC)=c("row1","row2")
row(ABC)=c("row1","row2")
rownames(ABC)=c("row","row2","row3")
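As a quick check of the options above, the following sketch builds a 3 x 4 matrix and names its rows and columns. The matrix contents and the column names are made up for illustration; only `rownames()` and `colnames()` come from base R.

```r
# Illustrative 3 x 4 matrix (the values 1..12 are arbitrary)
ABC <- matrix(1:12, nrow = 3, ncol = 4)

# rownames() needs exactly one name per row
rownames(ABC) <- c("row1", "row2", "row3")

# colnames() is the counterpart for columns
colnames(ABC) <- c("col1", "col2", "col3", "col4")
print(ABC)

# A names vector of the wrong length is rejected with an error
wrong_length <- tryCatch({
  rownames(ABC) <- c("row1", "row2")
  "no error"
}, error = function(e) "error")
print(wrong_length)
```

The failed assignment leaves the existing row names untouched.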

Q6. Arrange the data types in the proper order:
Logical, integer, numeric, character
Integer, numeric, character, logical
Character, logical, integer, numeric
Numeric, integer, character, logical

Q7. What is the output of the code below?
A=10
B=20
print(A,B)

10 20
Error
(10, 20)
None of the above
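The behaviour in Q7 can be tried directly. A minimal sketch: `print()` displays a single object, and a second positional argument is matched to the `digits` argument of `print.default()`, so it is not printed as a value.

```r
A <- 10
B <- 20

# B is taken as the 'digits' argument, so only A is shown
print(A, B)

# To display both values, combine them first or use cat()
print(c(A, B))
cat(A, B, "\n")
```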

Q8. A return statement is compulsory when writing a function in R
True
False

Q9. The last variable in a function is by default the return value in R
True
False
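Q8 and Q9 can be checked with two small functions. This is only an illustrative sketch; the function names are made up. In R, a function returns the value of the last expression it evaluates unless `return()` is called earlier.

```r
# No return(): the value of the last evaluated expression is returned
add_implicit <- function(a, b) {
  a + b
}

# Explicit return(): same result, just spelled out
add_explicit <- function(a, b) {
  return(a + b)
}

print(add_implicit(2, 3))
print(add_explicit(2, 3))
```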

Q10. What package needs to be installed for reading?

Q11. What is the output of the code below?
logic1=c(T,F,F,T,F,T)
print(which(logic1))

1 4 6
2 3 6
6 4 1
1 2 3

Q12. If A = c(1, 13, 42, 13, 4), what is A after A = A[-4]?
1, 13, 42, 4
1, 13, 42, 13
13
1, 42, 13, 4
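Q11 and Q12 can both be verified in a few lines. A small sketch using the vectors from the questions: `which()` returns the positions of the `TRUE` elements, and a negative index drops the element at that position.

```r
# Q11: positions of the TRUE values
logic1 <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
true_positions <- which(logic1)
print(true_positions)

# Q12: a negative index removes the element at that position
A <- c(1, 13, 42, 13, 4)
A <- A[-4]
print(A)
```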

Q13. Which function can be used to split the string?
Output will be: "Navin" "Mr. Naresh J"

strsplit(name,"[.]")
charsplit(name,"[,]")
stringsplit(name)
strsplit(name,"[,]")
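The question does not show the input string, so the sketch below assumes `name` holds `"Navin,Mr. Naresh J"`, which is consistent with the stated output. `strsplit()` is a base R function; it returns a list with one character vector per input string.

```r
# Assumed input (not given in the question)
name <- "Navin,Mr. Naresh J"

# Split on the comma; "[,]" is a regular-expression character class
parts <- strsplit(name, "[,]")[[1]]
print(parts)

# Splitting on "." instead breaks the string after "Mr"
parts_dot <- strsplit(name, "[.]")[[1]]
print(parts_dot)
```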

Q14. Given i = 100, how do you find the data type of i?
Option 1
type(i)
class(i)
none of the above
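For Q14, a short sketch of how a value's type can be inspected in base R. `class()` reports the class of the object and `typeof()` reports its internal storage type.

```r
i <- 100

# A plain number is stored as a double and has class "numeric"
i_class <- class(i)
i_type <- typeof(i)
print(i_class)
print(i_type)

# The L suffix creates an integer instead
int_class <- class(100L)
print(int_class)
```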

Q15. dt = "01-12-2020" is stored as a character string. Which option converts it into a date in "MM-DD-YYYY" format?
To_date (dt, "MM - DD - YYYY")
date( x = dt, format = "%m / %d / %Y")
Date ( x = dt, format = "%m / %d / %Y")
none of the above

Answer: Date ( x = dt, format = "%m / %d / %Y")
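The options above are quoted as they appear in the quiz. In base R, the usual way to parse a character string into a date is `as.Date()`, with a format string whose separators match the input. The sketch below assumes the string is already in month-day-year order; that reading of the question is an assumption.

```r
dt <- "01-12-2020"

# Parse the string; the format uses "-" because the input uses "-"
d <- as.Date(dt, format = "%m-%d-%Y")
print(class(d))

# Dates print in ISO order (year-month-day) by default
iso <- format(d)
print(iso)

# format() turns the Date back into an MM-DD-YYYY string
mdy <- format(d, "%m-%d-%Y")
print(mdy)
```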

## Assignment 3

### Q1] What Is KNN Algorithm? Features Of KNN Algorithm. How Does KNN Algorithm Work? Write KNN algorithm pseudocode and Practical Implementation Of KNN Algorithm In R.

KNN, which stands for K Nearest Neighbor, is a Supervised Machine Learning algorithm that classifies a new data point into a target class, depending on the features of its neighboring data points.

The KNN algorithm has the following features:

• KNN is a Supervised Learning algorithm that uses a labeled input data set to predict the output for new data points.
• It is one of the simplest Machine Learning algorithms and can be easily implemented for a varied set of problems.
• It is mainly based on feature similarity. KNN checks how similar a data point is to its neighbors and assigns the data point to the class it is most similar to.
• Unlike parametric algorithms, KNN is a non-parametric model, which means that it does not make any assumptions about the distribution of the data set. This helps the algorithm handle realistic data.
• KNN is a lazy algorithm. This means that it memorizes the training data set instead of learning a discriminative function from the training data.
• KNN can be used for solving both classification and regression problems.

## KNN Algorithm Pseudocode

Consider the set (Xi, Ci),

• Where Xi denotes the feature variables and ‘i’ indexes the data points, i = 1, 2, ..., n
• Ci denotes the output class of Xi for each i

The condition Ci ∈ {1, 2, 3, ..., c} holds for all values of ‘i’, where ‘c’ denotes the total number of classes.

Now suppose there is a data point ‘x’ whose output class needs to be predicted. This can be done with the K-Nearest Neighbour (KNN) algorithm:

Calculate D(x, Xi) for i = 1, 2, ..., n, where ‘D’ is the Euclidean distance between the data points.

Arrange the calculated Euclidean distances in ascending order.

Initialize k and take the first k distances from the sorted list.

Find the k points corresponding to these k distances.

Calculate ki, the number of data points belonging to the ith class among the k points, i.e. ki ≥ 0.

If ki > kj ∀ i ≠ j, put x in class i.
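The pseudocode above can be sketched in base R as follows. This is a minimal illustration rather than a tuned implementation; the function name `knn_predict` and the toy training points are made up for the example.

```r
# Classify one query point x by majority vote among its k nearest neighbours
knn_predict <- function(train_x, train_y, x, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance D(x, Xi) from x to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  # Sort ascending and keep the indices of the k closest points
  nearest <- order(dists)[seq_len(k)]
  # ki: number of neighbours that fall in each class
  votes <- table(train_y[nearest])
  # Class with the most votes (ties go to the first class listed)
  names(votes)[which.max(votes)]
}

# Toy data: two well-separated groups labelled "A" and "B"
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(7, 6), c(6, 7))
train_y <- c("A", "A", "A", "B", "B", "B")

pred_near_a <- knn_predict(train_x, train_y, x = c(2, 2), k = 3)
pred_near_b <- knn_predict(train_x, train_y, x = c(6.5, 6.5), k = 3)
print(pred_near_a)
print(pred_near_b)
```

The tie-breaking rule here is a design choice of the sketch; the pseudocode only covers the case where one class has strictly more neighbours than every other.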

## Practical Implementation Of KNN Algorithm In R

Problem Statement: To study a bank credit dataset and build a Machine Learning model that predicts whether an applicant’s loan can be approved based on their socio-economic profile.

Dataset Description: The bank credit dataset contains information about thousands of applicants, including their account balance, credit amount, age, occupation, loan records, etc. Using this data, we can predict whether or not to approve an applicant’s loan.

Logic: This problem can be solved with the KNN algorithm, which will classify the applicant’s loan request into one of two classes:

• Approved
• Disapproved
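The bank credit data itself is not included here, so the following sketch uses R's built-in `iris` data purely as a stand-in to show the shape of a KNN workflow: split, scale, classify, evaluate. It assumes the `class` package (normally shipped with R) is available, and the choices of k = 5 and a 70:30 split are illustrative.

```r
# Stand-in data: iris replaces the bank credit dataset, which is not available here
if (requireNamespace("class", quietly = TRUE)) {
  set.seed(42)
  train_idx <- sample(nrow(iris), 105)          # 70% of the 150 rows

  # Scale features using the training set's mean and standard deviation
  train_x <- scale(iris[train_idx, 1:4])
  test_x <- scale(iris[-train_idx, 1:4],
                  center = attr(train_x, "scaled:center"),
                  scale = attr(train_x, "scaled:scale"))

  # Classify each test row by a vote among its 5 nearest training rows
  pred <- class::knn(train = train_x, test = test_x,
                     cl = iris$Species[train_idx], k = 5)

  # Confusion matrix and overall accuracy on the held-out rows
  print(table(predicted = pred, actual = iris$Species[-train_idx]))
  accuracy <- mean(pred == iris$Species[-train_idx])
  print(accuracy)
}
```

For the loan problem, the class labels would be Approved and Disapproved and the feature columns would come from the applicant records.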

## Problem Statement:

To study a heart disease data set and build a classifier that predicts whether or not a patient is suffering from heart disease.

We’ll be using the caret package. caret, short for Classification And REgression Training, has many functions that help build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, unsupervised learning algorithms, etc.

To use it, we first need to install it with this command:

install.packages("caret")

The caret package is very helpful because it gives us direct access to functions for training models with various machine learning algorithms such as KNN, SVM, decision trees, linear regression, etc.

After installing it, we load the package into our R session:

library(caret)

Our next step is to load the data set.

For this demo, we’ll be using a Heart Disease data set which consists of various attributes such as the person’s age, sex, cholesterol level, etc. The same data set contains a target variable, which indicates whether or not a patient is suffering from heart disease.

In short, we’ll be using SVM to classify whether a person is going to be prone to heart disease or not.

This data set has around 14 attributes and the last attribute is the target variable which we’ll be predicting using our SVM model.

Now it’s time to load the data set:
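A loading step along these lines would fit here. The original file path is not given, so this sketch writes a few made-up rows to a temporary CSV file purely so that it runs on its own; with real data, `csv_path` would point at the heart disease file instead.

```r
# Hypothetical stand-in file: three made-up rows with 14 comma-separated values each
csv_path <- tempfile(fileext = ".csv")
writeLines(c("50,1,1,120,200,0,0,150,0,1.0,1,0,3,0",
             "60,0,2,130,250,1,1,140,1,2.0,2,1,6,1",
             "45,1,3,110,180,0,0,160,0,0.5,1,0,3,0"),
           csv_path)

# header = FALSE makes read.csv() name the columns V1 to V14
heart <- read.csv(csv_path, sep = ",", header = FALSE)
print(dim(heart))
print(names(heart))
```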

In the above line of code, we’re reading the dataset, which is stored in CSV format; that’s why we’ve used the read.csv() function to read it from the specified path.

The ‘sep’ argument specifies the field separator, a comma in this case, since the data is stored as CSV (comma-separated values).

Now that we’ve imported our dataset, let’s check its structure by calling the str() function:

The output shows us that our dataset consists of 300 observations each with 14 attributes.

If you want to display the top 5-6 rows of the data set, use the head() function:
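Both calls can be tried on a stand-in data frame. The data below is randomly generated with the same shape as described (300 rows, columns V1 to V14) and is not the real heart disease data.

```r
# Synthetic stand-in with the same shape as the data set described in the text
set.seed(1)
heart <- as.data.frame(matrix(round(runif(300 * 13, 0, 100)), nrow = 300))
heart$V14 <- sample(c(0, 1), 300, replace = TRUE)

# Structure: number of observations, column names and column types
str(heart)

# First rows of the data frame (head() shows 6 rows by default)
first_rows <- head(heart)
print(first_rows)
```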

Our next step is to split the data into a training set and a testing set. This is also called data splitting.

We’ll use the training set for building the model and the testing set for evaluating it:

The caret package provides the function createDataPartition(), which partitions our data into training and test sets.

We pass 3 parameters to createDataPartition():

The “y” parameter takes the variable according to which the data is partitioned. In our case, the target variable is V14, so we pass heart$V14.

The “p” parameter holds a decimal value in the range 0-1 that sets the proportion of the split. We use p = 0.7, so 70% of the data is used for training and the remaining 30% for testing the model.

The “list” parameter controls whether the result is returned as a list or a matrix. We pass FALSE so that a matrix is returned.

createDataPartition() returns a matrix, “intrain”, holding the row indices of the training observations. We use these indices to store the selected 70% of the rows in the ‘training’ variable and the remaining 30% in the ‘testing’ variable.

Next, for checking the dimensions of our training data frame and testing data frame, we can use these:
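The split and the dimension check described above might look like the sketch below. The data is a synthetic stand-in, the seed value is arbitrary, and the fallback branch (a plain random sample when caret is not installed) is an addition for this sketch only.

```r
# Synthetic stand-in for the heart data (300 rows, columns V1 to V14)
set.seed(3033)
heart <- as.data.frame(matrix(round(runif(300 * 13, 0, 100)), nrow = 300))
heart$V14 <- sample(c(0, 1), 300, replace = TRUE)

if (requireNamespace("caret", quietly = TRUE)) {
  # Row indices for roughly 70% of the data, returned as a one-column matrix
  intrain <- caret::createDataPartition(y = heart$V14, p = 0.7, list = FALSE)
} else {
  # Fallback for this sketch only: a simple random 70% sample of row indices
  intrain <- sample(nrow(heart), size = round(0.7 * nrow(heart)))
}

training <- heart[intrain, ]    # rows selected for training
testing <- heart[-intrain, ]    # all remaining rows

# Rows and columns of each part
print(dim(training))
print(dim(testing))
```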

Our next step is to clean the data: any missing or inconsistent values have to be dealt with before we train the model.

We’ll use the anyNA() function, which checks whether the data contains any missing (NA) values:

Next, we check the summary of our data using the summary() function.
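A short sketch of both checks on a synthetic stand-in data frame. The second data frame, `heart_na`, is made up here to show what `anyNA()` reports when a value is missing.

```r
# Synthetic stand-in for the heart data
set.seed(1)
heart <- as.data.frame(matrix(round(runif(300 * 13, 0, 100)), nrow = 300))
heart$V14 <- sample(c(0, 1), 300, replace = TRUE)

# FALSE means no missing values anywhere in the data frame
has_missing <- anyNA(heart)
print(has_missing)

# Per-column minimum, quartiles, mean and maximum
print(summary(heart))

# For contrast: one injected NA makes anyNA() return TRUE
heart_na <- heart
heart_na$V1[5] <- NA
has_missing_na <- anyNA(heart_na)
print(has_missing_na)
```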

We can also customize the selection of the C value (cost) in the linear classifier by supplying candidate values to a grid search.

The next code snippet builds and tunes an SVM classifier with different values of C.

We put some values of C into a “grid” data frame using expand.grid(). This data frame is then passed to the train() method through the tuneGrid parameter, so that the classifier is evaluated at those specific C values.
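A tuning step of this kind could be sketched as follows. The C values, the 5-fold cross-validation setting, the seed and the small synthetic data frame are all illustrative assumptions, and the training part only runs when both caret and kernlab (which backs caret's `"svmLinear"` method) are installed.

```r
# Candidate cost values; expand.grid() returns a data frame with one column, C
grid <- expand.grid(C = c(0.01, 0.1, 1, 10))
print(grid)

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("kernlab", quietly = TRUE)) {
  library(caret)
  set.seed(3233)

  # Small synthetic stand-in; the target must be a factor for classification
  heart <- as.data.frame(matrix(runif(100 * 4), nrow = 100))
  heart$V14 <- factor(sample(c(0, 1), 100, replace = TRUE))

  # Evaluate a linear SVM at each C in the grid using 5-fold cross-validation
  ctrl <- trainControl(method = "cv", number = 5)
  svm_grid <- train(V14 ~ ., data = heart, method = "svmLinear",
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    tuneGrid = grid)

  # Cross-validated accuracy for every candidate C
  print(svm_grid$results[, c("C", "Accuracy")])
}
```

Because the stand-in labels are random, the accuracies printed here carry no meaning; with the real data, the row with the highest accuracy indicates the preferred C.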

### Q3] Write down the step-by-step procedure for Naïve Bayes classification in R.

Step 1: Import the required libraries.

Step 2: Load the data set.

Step 3: Check the structure of the dataset.

Step 4: Check the summary.

Step 5: Train-test split.

Step 6: Separate the test labels from the test data.

Step 7: Train the model.

Step 8: Make predictions.

Step 9: Compare the predicted and actual values.
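The nine steps can be sketched as follows. The answer does not name a package or a data set, so this example assumes the e1071 package (for `naiveBayes()`) and uses the built-in `iris` data as a stand-in; the 70:30 split and the seed are also illustrative.

```r
if (requireNamespace("e1071", quietly = TRUE)) {
  library(e1071)                                   # Step 1: import library
  data(iris)                                       # Step 2: load the data (stand-in)
  str(iris)                                        # Step 3: check the structure
  print(summary(iris))                             # Step 4: check the summary

  set.seed(123)                                    # Step 5: 70:30 train-test split
  train_idx <- sample(nrow(iris), round(0.7 * nrow(iris)))
  train <- iris[train_idx, ]
  test <- iris[-train_idx, ]

  test_labels <- test$Species                      # Step 6: separate test labels
  test_features <- test[, names(test) != "Species"]

  model <- naiveBayes(Species ~ ., data = train)   # Step 7: train the model
  pred <- predict(model, test_features)            # Step 8: make predictions

  # Step 9: compare predicted and actual values
  print(table(predicted = pred, actual = test_labels))
  accuracy <- mean(pred == test_labels)
  print(accuracy)
}
```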
