Adv Data Science using R Quiz and Assignment 3

These are the solutions for Advanced Data Science Using R, Assignment 3.

Quiz

Q1. Which of the following languages is used in data science?
R
C
C++
Ruby

Answer: R


Q2. What is the primary file type of R?
Vector
Text file
RScripts
Statistical file

Answer: RScripts


Q3. Which one of the following R packages is used for data products?
haven
igraph
slidify
forecast

Answer: slidify




Q4. Which of the following is valid for checking a categorical variable?
Level
Table
Unique
All of the above

Answer: All of the above


Q5. Suppose ABC is a matrix with 3 rows and 4 columns. Choose the correct option(s) to rename its rows:
row_names(ABC) = c("row1", "row2", "row3")
rownames(ABC) = c("row1", "row2")
row(ABC) = c("row1", "row2")
rownames(ABC) = c("row", "row2", "row3")

Answer: rownames(ABC) = c("row", "row2", "row3")


Q6. Arrange the data types in the proper order (from least to most general in R's coercion hierarchy):
Logical, integer, numeric, character
Integer, numeric, character, logical
Character, logical, integer, numeric
Numeric, integer, character, logical

Answer: Logical, integer, numeric, character




Q7. What is the output of the code below?
A=10
B=20
print(A,B)

10 20
Error
(10, 20)
None of the above

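Answer: None of the above (R prints [1] 10, because print() matches the second argument, B, to its digits parameter rather than printing it)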


Q8. A return statement is compulsory when writing a function in R.
True
False

Answer: False


Q9. The value of the last evaluated expression in a function is returned by default in R.
True
False

Answer: True
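Questions 8 and 9 can be checked with a small sketch. The two function names below are hypothetical and exist only for this illustration; both functions should return the same value.

```r
# Hypothetical helper functions used only for illustration.
add_implicit <- function(a, b) {
  a + b            # the value of the last evaluated expression is returned
}

add_explicit <- function(a, b) {
  return(a + b)    # an explicit return() gives the same result
}

add_implicit(2, 3)
add_explicit(2, 3)
```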




Q10. Which package needs to be installed for reading Excel files?
read_excel
readxl
readcsv
read_csv

Answer: readxl


Q11. What is the output of the code below?
logic1 = c(T, F, F, T, F, T)
print(which(logic1))

1 4 6
2 3 6
6 4 1
1 2 3

Answer: 1 4 6


Q12. If A = c(1, 13, 42, 13, 4), what is A after A = A[-4]?
1, 13, 42, 4
1, 13, 42, 13
13
1, 42, 13, 4

Answer: 1, 13, 42, 4
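The behaviour behind Questions 11 and 12 can be reproduced with the short sketch below. It uses the vectors from the questions; the name true_positions is introduced here only for illustration.

```r
logic1 <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
true_positions <- which(logic1)   # indices of the TRUE elements: 1, 4, 6
print(true_positions)

A <- c(1, 13, 42, 13, 4)
A <- A[-4]                        # a negative index drops that position
print(A)                          # the fourth element (13) is gone: 1, 13, 42, 4
```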




Q13. Which function can be used to split the string?
Output will be: "Navin"      "Mr. Naresh J"

strsplit(name, "[.]")
charsplit(name, "[,]")
stringsplit(name)
strsplit(name, "[,]")

Answer: strsplit(name, "[,]")


Q14. If i = 100, how do you find the data type of i?
Option 1
type(i)
class(i)
none of the above

Answer: class(i)


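Q15. dt = "01-12-2020" is stored as a character string. Which option converts it to a date in "MM-DD-YYYY" form?
To_date (dt, "MM – DD – YYYY")
date( x = dt, format = "%m / %d / %Y")
Date ( x = dt, format = "%m / %d / %Y")
none of the above

Answer: Date ( x = dt, format = "%m / %d / %Y") (in base R this conversion is written with as.Date(), and the format string must match the separators in the input, i.e. "%m-%d-%Y")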
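As a minimal sketch of the conversion in Question 15, base R's as.Date() can parse the string. This assumes the string is in month-day-year order, as the question states; the variable name parsed is introduced only for illustration.

```r
dt <- "01-12-2020"                          # character string, assumed month-day-year

parsed <- as.Date(dt, format = "%m-%d-%Y")  # parse into a Date object
class(parsed)                               # "Date"

format(parsed, "%m-%d-%Y")                  # format the Date back as MM-DD-YYYY text
```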




Assignment 3

Q1] What Is KNN Algorithm? Features Of KNN Algorithm. How Does KNN Algorithm Work? Write KNN algorithm pseudocode and Practical Implementation Of KNN Algorithm In R.

KNN, which stands for K Nearest Neighbor, is a supervised machine learning algorithm that classifies a new data point into a target class based on the features of its neighboring data points.


The KNN algorithm has the following features:

  • KNN is a supervised learning algorithm that uses a labeled input data set to predict the output for new data points.
  • It is one of the simplest machine learning algorithms and can be easily implemented for a varied set of problems.
  • It is based mainly on feature similarity. KNN checks how similar a data point is to its neighbors and assigns it to the class it is most similar to.
  • Unlike many algorithms, KNN is a non-parametric model, which means it makes no assumptions about the underlying distribution of the data. This makes it well suited to real-world data that does not follow a convenient theoretical distribution.
  • KNN is a lazy algorithm: it memorizes the training data set instead of learning a discriminative function from the training data.
  • KNN can be used for solving both classification and regression problems.

KNN Algorithm Pseudocode

Consider the set, (Xi, Ci),

  • Where Xi denotes the feature variables of data point ‘i’, for i = 1, 2, ….., n
  • Ci denotes the output class of Xi for each i

Ci ∈ {1, 2, 3, ……, c} holds for all values of ‘i’, where ‘c’ denotes the total number of classes.

Now suppose there is a data point ‘x’ whose output class needs to be predicted. This can be done with the K-Nearest Neighbour (KNN) algorithm as follows.

Calculate D(x, xi) for i = 1, 2, ….., n, where ‘D’ is the Euclidean distance between the data points.

Arrange the calculated Euclidean distances in ascending order.

Initialize k and take the first k distances from the sorted list.

Find the k points corresponding to those k distances.

Calculate ki, the number of data points belonging to the ith class among the k points, i.e. ki ≥ 0.

If ki > kj ∀ i ≠ j, put x in class i.
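The pseudocode steps above can be sketched in base R as follows. This is a minimal illustration, not a tuned implementation: the function name knn_predict and the tiny data set are made up for the example, and ties between classes are broken by taking the first class with the highest count.

```r
# Minimal KNN classifier following the pseudocode steps (illustrative names).
knn_predict <- function(train_x, train_class, x, k = 3) {
  # Euclidean distance D(x, xi) from x to every training point
  distances <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  # Sort the distances in ascending order and keep the k nearest points
  nearest <- order(distances)[seq_len(k)]
  # Count ki, the number of neighbours that fall in each class
  votes <- table(train_class[nearest])
  # Put x in the class with the largest count
  names(votes)[which.max(votes)]
}

# Tiny synthetic data set: two features, two classes
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_class <- c("A", "A", "A", "B", "B", "B")

knn_predict(train_x, train_class, x = c(2, 2), k = 3)   # nearest points are all class A
knn_predict(train_x, train_class, x = c(6, 5), k = 3)   # nearest points are all class B
```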

Practical Implementation Of KNN Algorithm In R

Problem Statement: To study a bank credit dataset and build a machine learning model that predicts whether an applicant’s loan can be approved based on their socio-economic profile.

Dataset Description: The bank credit dataset contains information about 1000s of applicants. This includes their account balance, credit amount, age, occupation, loan records, etc. By using this data, we can predict whether or not to approve the loan of an applicant.


Logic: This problem can be solved using the KNN algorithm, which will classify each applicant’s loan request into one of two classes:

  • Approved
  • Disapproved
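A minimal sketch of such a two-class KNN model is shown below, using knn() from the class package (one of the recommended packages shipped with standard R installations). The original bank credit data is not included here, so the data frame, its column names, and the rule that generates the labels are all synthetic assumptions made only so that the example runs.

```r
library(class)   # provides knn()

set.seed(42)
n <- 200
# Synthetic stand-in for the bank credit data; the columns are hypothetical
credit <- data.frame(
  account_balance = rnorm(n, mean = 2000, sd = 800),
  credit_amount   = rnorm(n, mean = 3000, sd = 1200),
  age             = sample(21:65, n, replace = TRUE)
)
# Hypothetical rule used only to generate labels for the demo
credit$status <- factor(ifelse(credit$account_balance > 0.6 * credit$credit_amount,
                               "Approved", "Disapproved"))

# Scale the features so that no single one dominates the distance
features <- scale(credit[, c("account_balance", "credit_amount", "age")])

# 70% of the rows for training, the rest for testing
train_idx <- sample(seq_len(n), size = round(0.7 * n))
predicted <- knn(train = features[train_idx, ],
                 test  = features[-train_idx, ],
                 cl    = credit$status[train_idx],
                 k     = 5)

# Confusion matrix and accuracy on the held-out rows
actual <- credit$status[-train_idx]
table(Predicted = predicted, Actual = actual)
mean(predicted == actual)
```

The value k = 5 is an arbitrary choice for the sketch; in practice several values of k would be compared on held-out data.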


Q2] Develop a Machine Learning Model using SVM in R to solve A Business Problem. Add Screenshots of the graphs and code to validate your answer.

Problem Statement:

To study a heart disease data set and model a classifier that predicts whether or not a patient is suffering from heart disease.


We’ll be using the caret package. caret, short for Classification And REgression Training, has many functions that help to build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, unsupervised learning algorithms, etc.


So, to use it, we first need to install it using this command:

install.packages("caret")

The caret package is very helpful because it gives us direct access to functions for training models with various machine learning algorithms such as KNN, SVM, decision trees, and linear regression.

After installing it, we load the package into our session with this code:

library('caret')

Our next step is to load the data set.

For this demo, we’ll be using a heart disease data set which consists of various attributes such as the person’s age, sex, and cholesterol level. The same data set also contains a target variable, which indicates whether or not a patient is suffering from heart disease.

In short, we’ll be using SVM to classify whether a person is going to be prone to heart disease or not.

The data set looks like this:

[Image: preview of the heart disease data set]

This data set has 14 attributes, and the last attribute is the target variable, which we’ll be predicting using our SVM model.

Now it’s time to load the data set:

heart <- read.csv("/Users/HxR/Desktop/heart_dataset.csv", sep = ',', header = FALSE)

In the above line of code, we read the data set, which is stored in CSV format, from the specified path using the read.csv function.

The ‘sep’ argument specifies the field separator, which is a comma for CSV (Comma Separated Values) files. Because header = FALSE, the first line is treated as data and R names the columns V1, V2, and so on.
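The effect of header = FALSE can be seen in a small sketch. The file written here is a temporary, made-up CSV with four columns, not the heart data set; it only shows where the automatic column names such as V14 come from.

```r
# Write a tiny header-less CSV to a temporary file (synthetic values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("63,1,233,1",
             "41,0,204,0",
             "57,1,354,1"), csv_path)

# header = FALSE: the first line is data, so R names the columns V1, V2, ...
demo <- read.csv(csv_path, sep = ",", header = FALSE)
names(demo)
```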

Now that we’ve imported our data set, we can check its structure by calling the str() function on the data frame.

The output shows us that our dataset consists of 300 observations each with 14 attributes.

To display the first six rows of the data set, use the head() function.
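A brief sketch of str() and head() is given below on a small made-up data frame; the values and the three columns are placeholders, not the real heart data.

```r
# Synthetic stand-in for the heart data frame (values are made up)
demo <- data.frame(V1 = c(63, 41, 57, 52, 60, 45, 58, 49),
                   V2 = c(1, 0, 1, 1, 0, 0, 1, 1),
                   V3 = c(233, 204, 354, 199, 282, 236, 211, 266))

str(demo)                 # one line per column: type and first few values
first_rows <- head(demo)  # first 6 rows by default
head(demo, n = 3)         # pass n to change the number of rows shown
```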

Our next step is to split the data into a training set and a testing set; this is also called data splitting.

We’ll use the training set for building the model and the testing set for evaluating it.

The caret package provides a method, createDataPartition(), for partitioning our data into training and test sets.


We pass three parameters to the createDataPartition() function:

The “y” parameter takes the variable according to which the data needs to be partitioned. In our case, the target variable is V14, so we pass heart$V14.

The “p” parameter takes a decimal value in the range 0-1 that sets the proportion of the split. We use p = 0.7, so 70% of the data is used for training and the remaining 30% for testing the model.

The “list” parameter controls whether the result is returned as a list or a matrix. We pass FALSE so that a matrix is returned.

createDataPartition() returns a matrix, “intrain”, which holds the row indices of the training set. The rows it selects are stored in the ‘training’ variable, and the remaining 30% of the data is stored in the ‘testing’ variable.

Next, we can check the dimensions of the training and testing data frames with the dim() function.
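The partitioning and dimension check described above can be sketched as follows. The call uses the parameters named in the text (y, p, list) and the variable names intrain, training, and testing; the heart data frame itself is a synthetic stand-in with only three columns, generated here so that the example runs without the original file.

```r
library(caret)

set.seed(123)
# Synthetic stand-in for the heart data; only V1, V2 and V14 are generated here
heart <- data.frame(V1  = round(rnorm(300, mean = 54, sd = 9)),
                    V2  = rbinom(300, 1, 0.7),
                    V14 = rbinom(300, 1, 0.45))

# Row indices for a roughly 70% training sample, returned as a matrix
intrain  <- createDataPartition(y = heart$V14, p = 0.7, list = FALSE)
training <- heart[intrain, ]    # rows selected for training
testing  <- heart[-intrain, ]   # all remaining rows

dim(training)
dim(testing)
```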

Our next step is to clean the data: any missing or inconsistent values have to be dealt with before we build the model.

We’ll be using the anyNA() function, which checks whether the data contains any missing (NA) values.

Next, we check the summary of our data using the summary() function.
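The missing-value check and summary can be sketched on a small made-up data frame. The column names and values are placeholders, and dropping incomplete rows with na.omit() is just one simple option shown for illustration, not necessarily what was done with the heart data.

```r
# Synthetic data frame with one deliberately missing value
demo <- data.frame(age  = c(63, 41, NA, 52),
                   chol = c(233, 204, 354, 199))

anyNA(demo)                 # TRUE: at least one NA somewhere in the data
colSums(is.na(demo))        # number of NAs in each column

clean <- na.omit(demo)      # one simple option: drop incomplete rows
anyNA(clean)                # FALSE after the incomplete row is removed
summary(clean)              # min, quartiles, mean and max for each column
```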

We can also customize the selection of the C value (cost) for the linear classifier by supplying candidate values to a grid search.

The next step is building and tuning an SVM classifier with different values of C.

We put some values of C into a “grid” data frame using expand.grid(). This data frame is then passed to the train() method through its tuneGrid parameter, so that the classifier is evaluated at exactly those C values.
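The grid search described above might look like the sketch below. It assumes caret's "svmLinear" method (which additionally needs the kernlab package installed), 5-fold cross-validation, and an illustrative set of C values; the training data is synthetic, with V14 as a two-level factor, and none of these specific choices come from the original code.

```r
library(caret)   # method = "svmLinear" also needs the kernlab package installed

set.seed(3233)
# Synthetic two-class data standing in for the training set
n <- 120
training <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
training$V14 <- factor(ifelse(training$V1 + training$V2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# Candidate cost values; these particular numbers are illustrative
grid <- expand.grid(C = c(0.01, 0.1, 0.5, 1, 2))

# 5-fold cross-validation is used to score each C value
trctrl <- trainControl(method = "cv", number = 5)

svm_linear_grid <- train(V14 ~ ., data = training,
                         method     = "svmLinear",
                         trControl  = trctrl,
                         preProcess = c("center", "scale"),
                         tuneGrid   = grid)

svm_linear_grid$results[, c("C", "Accuracy")]  # cross-validated accuracy per C value
svm_linear_grid$bestTune                       # the C value with the best accuracy
```

The target is converted to a factor so that train() treats the task as classification rather than regression.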


Q3] Write down the step-by-step procedure for naïve Bayes classification in R.

Step 1: Import the required libraries.

Step 2: Load the data set.

Step 3: Check the structure of the data set.

Step 4: Check the summary.

Step 5: Split the data into training and test sets.

Step 6: Separate the test labels from the test data.

Step 7: Train the model.

Step 8: Make predictions.

Step 9: Compare the predicted and actual values.
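The nine steps above can be sketched in R as follows. This is one possible walk-through under stated assumptions: it uses naiveBayes() from the e1071 package and the built-in iris data set as a stand-in, since the answer does not name a package or a data set.

```r
# Step 1: import the library (e1071 is one package that provides naiveBayes())
library(e1071)

# Step 2: load the data set (the built-in iris data is used as a stand-in)
data(iris)

# Steps 3-4: check the structure and the summary
str(iris)
summary(iris)

# Step 5: 70/30 train-test split
set.seed(1)
train_idx  <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Step 6: separate the test labels from the test features
test_labels   <- test_data$Species
test_features <- test_data[, names(test_data) != "Species"]

# Step 7: train the model
nb_model <- naiveBayes(Species ~ ., data = train_data)

# Step 8: make predictions on the test features
predicted <- predict(nb_model, test_features)

# Step 9: compare the predicted and actual values
table(Predicted = predicted, Actual = test_labels)
mean(predicted == test_labels)
```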

