NB: Introducing R

Programming for Data Science

What is R?

R is an open source programming language developed in the 1990s by and for statisticians.

It was based on the earlier language S, first developed at Bell Labs in the mid-1970s, also by and for statisticians.

It is a purpose-built, relatively low-code scripting language for exploring, visualizing, and modeling data.

Why Study R?

R is in many ways the original data science programming language.

Python's data science libraries borrow many concepts from R, including the data frame.

The R community provides insights into data processing through excellent documentation and well-designed code.

Although not as popular as it once was, it is still widely used — you may find yourself on a team that prefers it.

Many courses in the UVA SDS programs use R.

It’s not that hard, especially once you know basic programming concepts.

R’s Design

R was designed to support statistical computing above all.

In contrast to Python, it is not a general-purpose language, although it can be used for many things.

It has a very strong academic community which is reflected in its high quality documentation, the variety of its scientific libraries, and in its well-organized resources.

It has many statistical functions built into it, i.e. to get started with statistical computing you don’t need to import anything.
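
As a minimal sketch of what "built in" means here, the following computes a few summary statistics in a fresh R session without loading any package. The sample values and variable names are made up for illustration.

```r
# A small made-up sample; the values are illustrative only.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sample_mean   <- mean(x)    # arithmetic mean
sample_median <- median(x)  # middle value of the sorted data
sample_sd     <- sd(x)      # sample standard deviation (n - 1 denominator)

print(sample_mean)
print(sample_median)
print(sample_sd)
```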

It is based on what we might call vector-first thinking.
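
One way to see vector-first thinking is that ordinary operators work on whole vectors at once. The sketch below uses made-up values and hypothetical variable names.

```r
heights_cm <- c(150, 160, 170, 180)  # made-up values

# Arithmetic applies to every element at once; no explicit loop is needed.
heights_m <- heights_cm / 100

# Comparisons are vectorized too and return a logical vector.
is_tall <- heights_cm > 165

# A logical vector can be used to select elements.
tall_heights <- heights_cm[is_tall]

# Even a single number is a vector of length one.
single_length <- length(42)
```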

As with Python, everything is an object.
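
A quick way to explore this is to ask values for their class with class(). The sketch below also shows that functions are objects that can be assigned to new names; the name average is a hypothetical alias chosen for illustration.

```r
num_class  <- class(3.14)     # a number is an object of class "numeric"
text_class <- class("hello")  # a string has class "character"
fun_class  <- class(mean)     # functions are objects too: "function"

# Because functions are objects, they can be stored in variables
# and called through the new name.
average <- mean
result  <- average(c(1, 2, 3))
```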

R Syntax

Syntax loosely follows traditional C-style syntax.

It uses braces { and } to form code blocks.

It uses semicolons to end statements (optionally) or to separate them when they are on the same line.
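
The brace and semicolon rules can be sketched as follows. The function describe_sign is a hypothetical example, not part of R.

```r
# Braces group statements into blocks: here, the body of a function
# and the two branches of an if/else.
describe_sign <- function(n) {
  if (n < 0) {
    "negative"
  } else {
    "non-negative"
  }
}

# Semicolons separate two statements written on the same line.
a <- 1; b <- 2

# A trailing semicolon is allowed but not needed.
total <- a + b;
```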

Notably, assignments are made with <- or -> operators.
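
A short illustration of the assignment operators follows. The = form is not mentioned above; it is included here as an additional note, since it also works for top-level assignment even though <- is the usual style.

```r
x <- 10   # leftward assignment, the conventional form
20 -> y   # rightward assignment; same effect, less common
z = 30    # = also assigns at the top level, though <- is preferred in R style
```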

Dots . have no special meaning — they are not operators.

In effect, they are used like underscores _ in Python.
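
For example, a dot can appear inside a variable name, and several base R functions have dots in their names. The variable names below are made up for illustration.

```r
# The dot is just another character in the name, not attribute access.
my.total <- 3 + 4

# Base R functions such as data.frame() and is.numeric() use dots the same way.
df <- data.frame(id = 1:3)
total_is_numeric <- is.numeric(my.total)
```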

Single and double quotes have the same meaning, but double quotes tend to be preferred.

Use single quotes if you expect your string to contain double quotes.
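
A small sketch of the quoting rules, using made-up strings:

```r
s1 <- "data science"
s2 <- 'data science'
same_string <- identical(s1, s2)  # both quote styles produce the same string

# Single quotes on the outside let double quotes appear inside unescaped.
quoted <- 'She said "hello"'

# The same string written with double quotes and backslash escapes.
escaped <- "She said \"hello\""
```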

Backslash escapes apply in R strings. R 4.0.0 and later support raw strings, written r"(...)", comparable to Python’s r" "; in earlier versions, and in much existing code, regular expressions need double backslashes.
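
The double-backslash convention can be illustrated with a regular expression that matches digits. This is a minimal sketch using base R's grepl(); the pattern and test strings are made up.

```r
# In a string literal, "\\" stands for one backslash character,
# so the regex engine receives \d+ from the literal "\\d+".
pattern <- "\\d+"
pattern_length <- nchar(pattern)  # 3 characters: backslash, d, plus

# grepl() returns TRUE for each string that contains a match.
has_digits <- grepl(pattern, c("room 101", "no digits here"))

# writeLines() prints the string as the regex engine sees it.
writeLines(pattern)
```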

Using R

Although there are many ways to run R programs, by far the most common is to use RStudio.

RStudio provides a fully functional programming environment that includes an editor, a command line, access to the file system, a help system, an installation system, etc.

You may use other programs to run R too, though, such as VSCode and Jupyter.

In practice, however, RStudio is almost universally used.

R programs can be plain text files with an .R suffix (.r also works), R Markdown files (.Rmd), or several other kinds of file.

We will discuss these in a later module.

Command Line R

R can also be run from the command line.

R is invoked by calling R:

> R

If properly installed, that should produce a message like this:

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

Installing and Loading Packages

As with Python, R allows you to install and import packages to extend the program’s capabilities.

Packages can be installed from within a program as follows:

# install.packages("tm")

Here we install the Text Mining package, tm.

You only have to install a package once.

You can also install packages using RStudio’s Packages window or from the command line.

Once they are installed, you import them with the library() function:

library(tm)

Note that the package name is quoted when installing, but does not need to be quoted when calling library().
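
Because a package only needs to be installed once, a common pattern is to install it only when it is missing. The sketch below is one way to do that; load_or_install is a hypothetical helper name, not a function that R provides. It assumes a CRAN mirror is configured if an installation is actually needed.

```r
# Hypothetical helper: install a package only if it is missing,
# then attach it.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string.
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
load_or_install("stats")
```

Calling load_or_install("tm") would follow the same path, installing tm on the first run and only attaching it afterwards.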

Using R in Jupyter

If you want to use R within a Jupyter notebook file, you can create and load an R kernel using Anaconda’s package manager conda. In brief, here’s what you do.

First, at the command line:

conda create -n r_env r-essentials r-base
conda activate r_env
R # This opens the R shell

Then, in the R shell:

IRkernel::installspec(name = 'r_env', displayname = 'R Environment')
quit()

Now, start a JupyterLab or Jupyter Notebook instance from the Open OnDemand page and select the new kernel when you create a notebook.

Note that the name r_env can be replaced by whatever you want.