Top 50 Interview questions of R

R Interview Questions

A list of frequently asked R interview questions and answers is given below.

1) What is R?

R is an interpreted programming language created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It is also a software environment used for statistical analysis, graphical representation, reporting, and data modeling. R is an implementation of the S programming language combined with lexical scoping semantics.

2) Differentiate between vector, List, Matrix, and Data frame.

A vector is a sequence of data elements of the same basic type. The members of a vector are known as components.

A list is an R object that can contain elements of different types, such as numbers, strings, vectors, or another list.

A matrix is a two-dimensional data structure, typically built by binding vectors of the same length. All elements of a matrix have the same type.

A data frame is a more general form of a matrix: it is a list of equal-length columns arranged as a table, and different columns can hold different data types.
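As a minimal sketch of the four structures (the object names and values are made up for illustration):

```r
# A vector: every element has the same basic type.
v <- c(1.5, 2.5, 3.5)

# A list: elements may have different types, including another list.
l <- list(id = 1L, name = "a", scores = v, nested = list(TRUE))

# A matrix: two-dimensional, one type; here two vectors bound as columns.
m <- cbind(a = 1:3, b = 4:6)

# A data frame: equal-length columns that may differ in type.
df <- data.frame(n = 1:3, s = c("x", "y", "z"))

str(df)  # shows the type of each column
```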

3) Give names of those packages which are used for data imputation.

The following packages are used for data imputation:

  1. mice
  2. missForest
  3. mi
  4. Hmisc
  5. Amelia
  6. imputeR

4) Explain initialize() function in R?

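initialize() is the method that runs when a new object is created, for example through new() for S4 and reference classes. It is used to set the initial values of the object's data members (slots or fields).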
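One place where initialize() appears is in reference classes. The sketch below assumes a hypothetical Account class with a single balance field; the names are illustrative, not taken from any package.

```r
library(methods)

# Hypothetical reference class with an initialize() method.
Account <- setRefClass("Account",
  fields = list(balance = "numeric"),
  methods = list(
    initialize = function(...) {
      initFields(...)                          # apply any values passed to $new()
      if (length(balance) == 0) balance <<- 0  # default when none was supplied
      invisible(.self)
    }
  )
)

a <- Account$new()               # initialize() sets the default balance
b <- Account$new(balance = 100)  # initialize() keeps the supplied balance
c(a$balance, b$balance)
```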

5) How can we find the mean of one column with respect to another?

The iris dataset has five columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. We can calculate the mean of Sepal.Length for each species of iris flower using the formula interface of the mean() function from the mosaic package.

  library(mosaic)
  mean(Sepal.Length ~ Species, data = iris)
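If the mosaic package is not installed, the same group means can be obtained with base R. This is an alternative sketch using tapply(), not the mosaic approach described above.

```r
# Mean of Sepal.Length within each level of Species, using base R only.
means <- tapply(iris$Sepal.Length, iris$Species, mean)
means
```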

6) What is a Random Walk model?

A random walk is the simplest example of a non-stationary process. A random walk has no specified mean or variance, strong dependence over time, and its changes or increments are white noise. Simulating random walk in R:

rw <- arima.sim(model = list(order = c(0, 1, 0)), n = 40)
ts.plot(rw)

7) What is a White Noise model?

It is a basic time series model and a simple example of a stationary process. A white noise model has a fixed constant mean, a fixed constant variance, and no correlation over time. We can simulate a white noise model in the following way:

wn <- arima.sim(model = list(order = c(0, 0, 0)), n = 50)
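The link between white noise and the random walk of question 6 can be sketched without arima.sim(). This assumes Gaussian white noise with mean 0 and standard deviation 1, and the seed is arbitrary.

```r
set.seed(42)            # arbitrary seed so the run is reproducible

wn <- rnorm(50)         # white noise: independent draws, constant mean and variance
rw <- cumsum(wn)        # random walk: running sum of the white noise

# Differencing the random walk recovers the white noise increments.
increments <- diff(rw)
all.equal(increments, wn[-1])
```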

8) Give any five features of R.

  1. It is a simple and effective programming language.
  2. It is data analysis software.
  3. It provides effective data storage and handling facilities.
  4. It provides highly extensible graphical techniques.
  5. It is an interpreted language.

9) Differentiate between R and Python in terms of functionality?

For data analysis, R has built-in functionality, whereas in Python the data analysis functionality is not built in; it is provided by packages such as pandas and NumPy.

10) What are the applications of R?

R is used in practice by many organizations, including the following:

  1. Facebook
  2. Google
  3. Twitter
  4. HRDAG
  5. NDAA

11) Explain RStudio.

RStudio is an integrated development environment (IDE) that makes it easier to interact with R. It is similar to the standard RGui but is generally considered more user-friendly. The IDE has drop-down menus, windows with multiple tabs, and many customization options. When we open RStudio for the first time, we see three windows; the fourth window is hidden by default.

12) What are the advantages and disadvantages of R?

Advantages

  1. Open Source
  2. Data Wrangling
  3. Array of Packages
  4. Platform Independent
  5. Machine Learning Operations

Disadvantages

  1. Weak origin
  2. Data Handling
  3. Basic Security
  4. Complicated Language
  5. Lesser Speed

13) What is the purpose behind R and Hadoop integration?

  1. For using Hadoop to execute R code.
  2. For using R to access the data stored in Hadoop.

14) Give the name of the Hadoop integration methods.

  1. RHadoop
  2. Hadoop Streaming
  3. RHIPE
  4. ORCH

15) What will be the output of the expression all(NA==NA)?

[1] NA
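The result follows from how R treats missing values in comparisons. A short sketch:

```r
NA == NA                      # NA: comparing two unknown values is unknown
all(NA == NA)                 # NA: all() cannot decide without a known value
all(NA == NA, na.rm = TRUE)   # TRUE: nothing is left to contradict all()
is.na(NA)                     # TRUE: is.na() is the way to test for NA
```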

16) What is the difference between sample() and subset() in R?

The sample() function is used to choose a random sample of size n from a dataset, while the subset() function is used to select variables and observations that meet a condition.
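A small sketch of the difference, using the built-in iris data (the seed and the chosen columns are arbitrary):

```r
set.seed(1)                        # arbitrary seed for reproducibility
picked <- sample(1:10, size = 3)   # 3 values drawn at random, without replacement

# Rows where Species is "setosa", keeping only two columns.
setosa <- subset(iris, Species == "setosa", select = c(Sepal.Length, Species))
dim(setosa)
```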

17) Why do we use the command - install.packages(file.choose(), repos=NULL)?

This command is used to install an R package from the local directory by browsing and selecting the file.

18) Give the command to create a histogram and to remove a vector from the R workspace?

The hist() function creates a histogram, and the rm() function removes an object such as a vector from the R workspace.

19) Differentiate between “%%” and “%/%”.

The “%%” operator gives the remainder of the division of the first vector by the second, and the “%/%” operator gives the quotient (integer division) of the first vector by the second.
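For example, with made-up operands:

```r
7 %% 3             # 1      remainder
7 %/% 3            # 2      integer quotient
c(7, 8, 9) %% 3    # 1 2 0  both operators are vectorized
-7 %% 3            # 2      the remainder takes the sign of the divisor
-7 %/% 3           # -3     the quotient is rounded down
```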

20) Why do we use apply() function in R?

The apply() function applies the same function over the rows or columns (the margins) of an array or matrix, for example to find the mean of every row.
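A minimal sketch on a small made-up matrix, where the second argument selects the margin (1 for rows, 2 for columns):

```r
m <- matrix(1:6, nrow = 2)        # rows are (1, 3, 5) and (2, 4, 6)

row_means <- apply(m, 1, mean)    # 3 4
col_sums  <- apply(m, 2, sum)     # 3 7 11
```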

21) Differentiate between library() and require() functions.

If the desired package cannot be loaded, the library() function stops with an error, whereas the require() function gives a warning and returns FALSE. Because it returns a logical value, require() is typically used inside functions.
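The difference can be observed with a package that does not exist. The name "notARealPackage" below is made up, on the assumption that no such package is installed.

```r
# require() returns FALSE (with a warning) when the package is missing.
ok <- suppressWarnings(
  require("notARealPackage", character.only = TRUE, quietly = TRUE)
)
ok

# library() raises an error instead, which is caught here.
msg <- tryCatch(
  {
    library("notARealPackage", character.only = TRUE)
    "loaded"
  },
  error = function(e) "library() raised an error"
)
msg
```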

22) What is t.test() in R?

The t.test() function performs a t-test, for example to determine whether the means of two groups are equal.
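As an illustration, the built-in sleep dataset records the extra hours of sleep for two groups, so a two-sample test might look like this:

```r
# Compare mean extra sleep between the two groups (Welch two-sample t-test).
res <- t.test(extra ~ group, data = sleep)

res$p.value    # a small p-value is evidence against equal means
res$estimate   # the two group means
```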

23) What is the use of with() and by() functions in R?

The with() function evaluates an expression in the context of a dataset, and the by() function applies a function to each level of a factor or combination of factors.
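A short sketch of both functions on built-in datasets:

```r
# with(): refer to columns directly, without repeating mtcars$.
avg_mpg <- with(mtcars, mean(mpg))

# by(): apply mean() to Sepal.Length within each level of Species.
per_species <- by(iris$Sepal.Length, iris$Species, mean)
per_species
```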

24) Differentiate between lapply and sapply.

The lapply() function always returns its output as a list, whereas sapply() tries to simplify the output to a vector or matrix.
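For example, applying the same made-up function with each:

```r
squares_list <- lapply(1:3, function(x) x^2)   # a list with 3 elements
squares_vec  <- sapply(1:3, function(x) x^2)   # a numeric vector: 1 4 9

class(squares_list)
class(squares_vec)
```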

25) Explain aggregate() function.

The aggregate() function collapses data by one or more BY (grouping) variables, applying a summary function to each group. The groups are specified either with a formula or through the by argument, in which case the BY variables must be supplied in a list.
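Both ways of calling aggregate() can be sketched on the built-in mtcars data:

```r
# Formula interface: mean mpg for each number of cylinders.
by_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# by interface: the grouping variable must be wrapped in a list.
by_list <- aggregate(mtcars$mpg, by = list(cyl = mtcars$cyl), FUN = mean)

by_formula
```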

26) Explain the doBy package?

The doBy package provides functions, such as summaryBy(), for computing groupwise summary tables that are specified with a function and a model formula.

27) Explain the use of the table() function.

This function is used to create a frequency table in R, that is, a count of how often each distinct value occurs.
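For instance, counting cars by number of cylinders in the built-in mtcars data:

```r
counts <- table(mtcars$cyl)                 # one-way frequency table
counts

cross <- table(mtcars$cyl, mtcars$am)       # two-way table: cylinders by transmission
dim(cross)
```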

28) Explain fitdistr() function?

This function performs maximum-likelihood fitting of univariate distributions and is defined in the MASS package.
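A minimal sketch, assuming simulated data from a normal distribution (the sample size, parameters, and seed are arbitrary):

```r
library(MASS)

set.seed(7)                          # arbitrary seed
x <- rnorm(200, mean = 10, sd = 2)   # simulated sample

fit <- fitdistr(x, "normal")         # maximum-likelihood fit of a normal
fit$estimate                         # estimated mean and sd
```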

29) What are GGobi and iPlots?

GGobi is an open-source visualization program for exploring high-dimensional data, and iPlots is a package that provides bar plots, mosaic plots, box plots, parallel plots, histograms, and scatter plots.

30) Explain the lattice package.

The lattice package is meant to improve on base R graphics by providing better defaults and the ability to display multivariate relationships easily.

31) Explain anova() function.

The anova() function is used for comparing nested models.
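As a sketch, two nested linear models on the built-in mtcars data can be compared like this (the choice of predictors is arbitrary):

```r
small <- lm(mpg ~ wt, data = mtcars)        # smaller model
large <- lm(mpg ~ wt + hp, data = mtcars)   # adds one predictor to the smaller model

cmp <- anova(small, large)   # F-test of whether the extra term improves the fit
cmp
```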

32) Explain cv.lm() and stepAIC() function.

The cv.lm() function is defined in the DAAG package and is used for k-fold cross-validation, while the stepAIC() function is defined in the MASS package and performs stepwise model selection by exact AIC.

33) Explain leaps() function.

The leaps() function performs all-subsets regression and is defined in the leaps package.

34) Explain relaimpo and robust package.

The relaimpo package is used to measure the relative importance of each predictor in a model, and the robust package provides a library of robust methods, including regression.

35) Give full form of MANOVA and what is the use of it.

MANOVA stands for Multivariate Analysis of Variance, and it is used to test more than one dependent variable simultaneously.

36) Explain mshapiro.test() and bartlett.test().

The mshapiro.test() function is defined in the mvnormtest package and performs the Shapiro-Wilk test for multivariate normality. The bartlett.test() function provides a parametric k-sample test of the equality of variances.

37) Explain the use of the forecast package.

The forecast package provides functions for the automatic selection of exponential smoothing and ARIMA models.

38) Differentiate between qda() and lda() function.

Both functions are defined in the MASS package. The qda() function fits a quadratic discriminant function, while the lda() function fits linear discriminant functions based on centered variables.

39) Explain the auto.arima() and principal() function.

The auto.arima() function handles both seasonal and non-seasonal ARIMA models, and the principal() function is used for extracting and rotating principal components.

40) Explain FactoMineR.

FactoMineR is a package for exploratory multivariate analysis that handles both qualitative and quantitative variables. It also supports supplementary observations and supplementary variables.

41) What is the full form of SEM and CFA?

CFA stands for Confirmatory Factor Analysis, and SEM stands for Structural Equation Modeling.

42) Define the cluster.stats() and pvclust() functions.

The cluster.stats() function is defined in the fpc package and provides a method for comparing the similarity of two cluster solutions using different validation criteria. The pvclust() function is defined in the pvclust package and provides p-values for hierarchical clustering.

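43) Define the matlab and party packages.

The matlab package includes wrapper functions and variables that are used for replicating MATLAB function calls. The party package provides tools for recursive partitioning, such as conditional inference trees.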

44) Explain S3 and S4 systems.

S3 and S4 are two of R's object-oriented programming (OOP) systems. S3 is used to overload functions: a generic function is called by a single name and dispatches to a different method depending on the class of its input. S4 is a more formal system, with explicit class definitions and methods that can dispatch on the types of several arguments. However, it has a limitation: S4 code is quite difficult to debug. Reference classes are an optional system built on top of S4.
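The two systems can be contrasted with a small sketch. All class, generic, and field names here (area, circle, square, Point, norm2) are hypothetical and chosen only for illustration.

```r
library(methods)

# S3: a generic dispatches on the class attribute of its first argument.
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
area.square <- function(shape) shape$side^2

circle_area <- area(structure(list(r = 1), class = "circle"))     # pi
square_area <- area(structure(list(side = 2), class = "square"))  # 4

# S4: classes are declared formally, with typed slots.
setClass("Point", representation(x = "numeric", y = "numeric"))
setGeneric("norm2", function(p) standardGeneric("norm2"))
setMethod("norm2", "Point", function(p) sqrt(p@x^2 + p@y^2))

point_norm <- norm2(new("Point", x = 3, y = 4))                   # 5
```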

45) Give names of visualization packages.

There are the following packages of visualization in R:

  1. Plotly
  2. ggplot2
  3. tidyquant
  4. geofacet
  5. googleVis
  6. Shiny

46) Explain Chi-Square Test

The Chi-Square Test is used to analyze the frequency table (i.e., contingency table), which is formed by two categorical variables. The chi-square test evaluates whether there is a significant relationship between the categories of the two variables.
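In R the test is available as chisq.test(). The sketch below uses a hypothetical 2 x 2 contingency table; the counts and labels are made up.

```r
# Hypothetical contingency table of counts (made-up numbers).
counts <- matrix(c(30, 10, 20, 40), nrow = 2,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))

res <- chisq.test(counts)
res$p.value     # a small p-value suggests the two variables are related
res$expected    # counts expected if the variables were independent
```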

47) Explain Random Forest.

Random Forest is also known as Decision Tree Forest. It is one of the popular decision-tree-based ensemble models. The accuracy of these models is usually higher than that of a single decision tree. The algorithm is used for both classification and regression.

48) Explain Time Series Analysis.

A time series is a series of data points in which each data point is associated with a timestamp; any metric measured at regular time intervals forms a time series. Time series analysis is commercially important because of its industrial relevance, especially for forecasting (demand, supply, sales, etc.).
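In base R, a regularly spaced series can be stored with ts(). The sketch below assumes hypothetical monthly sales figures starting in January 2020.

```r
# Hypothetical monthly values (made-up numbers), 12 observations per year.
sales <- ts(c(12, 15, 14, 18, 20, 19), start = c(2020, 1), frequency = 12)

time(sales)     # the timestamp attached to each observation

# Extract March to May 2020.
spring <- window(sales, start = c(2020, 3), end = c(2020, 5))
spring
```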

49) Explain Pie chart in R.

R has several libraries for creating charts and graphs. A pie chart represents values as slices of a circle, drawn in different colors. In base R, a pie chart is created with the pie() function.

50) Explain Histogram.

A histogram is a type of bar chart that shows how many values fall into each of a set of value ranges (bins). A histogram shows the distribution of a single variable, whereas a bar chart compares different entities. In a histogram, the height of each bar represents the number of values present in the corresponding range.
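The counts behind the bars can be inspected without drawing anything. This sketch uses made-up values and unit-width bins; calling hist(x, breaks = 0:5) without plot = FALSE would draw the chart itself.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)            # made-up values

h <- hist(x, breaks = 0:5, plot = FALSE)     # compute the bins only
h$breaks   # bin edges: 0 1 2 3 4 5
h$counts   # number of values in each bin: 1 2 3 2 1
```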