This is an English version of the article “R – Um ambiente de trabalho gratuito para análise e visualização de dados” intended for a course I will be teaching. Sorry, some graphs are in Portuguese, but their meaning is obvious.
R is a free software environment and language for data analysis and graphic display. Its broad user base means that a large amount of helpful information is available to users. Many packages are available in repositories and can be easily installed and loaded. Users can also create new packages to make more efficient use of this work environment.
The learning curve can be a little steep at the beginning, particularly for those who are not used to the command line. Nevertheless, R is used in a broad range of areas and is worth learning.
The objective of this article is not to teach R, but to offer a brief notion of what is possible with this tool. Several books are available on the language and on various aspects of data manipulation, statistical data analysis and graph preparation. See the recommendations at the end of the article.
R is available free of charge for various operating systems (e.g. Linux, Mac OS X and Windows) here:
We suggest, in addition to R, the use of RStudio, which is a free interface for R. After installing R, RStudio can be downloaded for various operating systems here:
The RStudio interface
The RStudio interface is organized in four areas, as shown in Fig. 1. The main area is window 2: the work area. Here you can type commands, which are executed when you press ENTER.
In area 1 you can edit R scripts – sequences of commands that can be saved as a file to be used repeatedly, for example, for the repetitive analysis of data. Commands in this window can be executed by pressing CTRL + ENTER.
Area 3 shows the variables that have been defined and the history of commands.
Area 4 shows graphs, lists the packages that are loaded or available, displays help, etc.
As a very simple example, click in window 2 and type the lines below, pressing ENTER at the end of each line. First, though, read the tips below:
example <- rnorm(n = 1000, mean = 56, sd = 4)
hist(example)
- The symbol “<-” (or “=”) assigns the value returned by the function rnorm to the variable “example”. The symbol can be typed manually as “<” (less than) followed by “-” (minus), but there are shortcuts to enter it automatically (Option + “-” on Mac and Alt + “-” on Windows; see shortcuts here).
- The function rnorm generates random numbers from a normal distribution. As typed above, the function generates 1000 values from a normal distribution with mean 56 and standard deviation 4.
- You don’t need to type “n =”, “mean =”, etc. by hand: press TAB after opening the parenthesis and the options will show up automatically.
- Note that, after pressing ENTER at the end of the first line, the variable “example” will show up in window 3.
- The function hist creates a histogram with the values in “example”.
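The points above can be checked with a short script. This is only a minimal sketch: the call to set.seed and the seed value 42 are our additions (they are not part of the original example) so that the same random numbers are drawn twice, and the variable names a and b are hypothetical.

```r
# Fix the seed so the random draws are reproducible (the seed value is arbitrary).
set.seed(42)
a <- rnorm(n = 1000, mean = 56, sd = 4)   # named arguments, "<-" assignment

# Reset the seed and repeat with positional arguments and "=" assignment.
set.seed(42)
b = rnorm(1000, 56, 4)

identical(a, b)   # both forms produce the same vector
hist(a)           # histogram of the simulated values
```

Named arguments are longer to type but make a script easier to read later, which is probably why the example above spells them out.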
Here is the graph obtained.
Other examples of graphs
To demonstrate two basic graphs in R, we will use data already included with R. The data can be loaded by typing:
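The command is not shown in this version of the text. Assuming the data set meant is Orange (the name used in the rest of this section), a minimal sketch would be:

```r
# Load the built-in Orange data set (shipped with R in the "datasets" package).
data(Orange)
```

After this call, Orange should appear in window 3 alongside any other variables.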
To see the information about the data, use the question mark:
The question mark can also be used to obtain help on other functions. Try, for example:
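The two help lookups described above are not shown in this version of the text. A minimal sketch follows, assuming the data set is Orange; the choice of hist as the function to look up is ours.

```r
# Open the documentation page of the Orange data set.
?Orange

# The same syntax works for functions; hist is our choice of example.
?hist
```

In RStudio the help pages open in area 4.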
The data “Orange” has three columns: the number of the tree, the age of the tree when it was measured, and the circumference of the tree. To see only a sample of the data, type:
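The command is not shown in this version of the text. One common way to look at a sample of a data set is the function head, so a minimal sketch (our assumption about what was intended) would be:

```r
# Show the first six rows of the data set.
head(Orange)

# List the column names: Tree, age and circumference.
names(Orange)
```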
To make a graph of the tree circumference versus age, type:
plot(circumference ~ age, Orange, subset = Tree==3)
The command plots the circumference as a function of age using the data “Orange” for the tree number 3.
This is only a quick example of what can be done. Another package, called “lattice”, allows more sophisticated graphs to be obtained:
To load the package, type:
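The command is not shown in this version of the text. Packages are normally loaded with the function library, so a minimal sketch would be:

```r
# Load the lattice package so that functions such as xyplot become available.
library(lattice)
```

The lattice package is normally distributed together with R, so a separate installation step should not be needed in most setups.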
To make the same graph, now for all trees, type:
xyplot(circumference ~ age | Tree, Orange)
Another example, shown below, can be obtained by typing:
xyplot(circumference ~ age, Orange, groups = Tree)
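As a possible refinement of the grouped graph, the sketch below joins the points of each tree with lines and adds a legend. The arguments type and auto.key and the variable name p are our additions for illustration; they are not part of the original example.

```r
library(lattice)

# Same grouped plot, with points joined by lines and a legend for the tree numbers.
p <- xyplot(circumference ~ age, data = Orange, groups = Tree,
            type = "b", auto.key = TRUE)

# Lattice graphs are objects; print() draws them on the current device.
print(p)
```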
User data can, of course, be imported and exported, and graphs can be saved in various formats (jpeg, pdf, etc.).
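A minimal sketch of a round trip through a CSV file and of saving a graph as PDF is shown below. The file names are hypothetical and point to a temporary directory; in practice you would use your own paths.

```r
# Hypothetical file names in a temporary directory.
csv_file <- file.path(tempdir(), "orange.csv")
pdf_file <- file.path(tempdir(), "orange.pdf")

# Export a data frame to CSV, then import it again.
write.csv(as.data.frame(Orange), csv_file, row.names = FALSE)
orange_copy <- read.csv(csv_file)

# Export a graph to PDF: open the device, draw, then close the device.
pdf(pdf_file)
plot(circumference ~ age, data = orange_copy)
dev.off()
```

Other formats follow the same pattern, for example with jpeg() or png() in place of pdf().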
The language also has programming constructs such as for and while loops. One of the main advantages, however, is the possibility of creating functions to be used in other parts of the program to automate the data analysis. These functions can be packaged and made available to other users.
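The sketch below illustrates these constructs with the Orange data. The function coef_var is a hypothetical helper written for this example, not something provided by R.

```r
# Hypothetical helper: coefficient of variation (standard deviation / mean).
coef_var <- function(x) {
  sd(x) / mean(x)
}

# A for loop that applies the helper to the measurements of each tree.
for (tree in levels(Orange$Tree)) {
  values <- Orange$circumference[Orange$Tree == tree]
  cat("Tree", tree, "- coefficient of variation:",
      round(coef_var(values), 2), "\n")
}

# A while loop: keep doubling x until it reaches at least 100.
x <- 1
while (x < 100) {
  x <- x * 2
}
x
```

Once a function like this exists, it can be reused on any numeric vector, which is what makes repetitive analyses easy to automate.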
There are also various functions and packages for data analysis. As a simple example, type the following to obtain statistics on the variable “example”:
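The command is not shown in this version of the text. The function summary is a likely candidate, so a minimal sketch would be the following. The variable is created again here so that the block runs on its own, and the seed is our addition.

```r
# Recreate the variable from the first example (the seed value is arbitrary).
set.seed(1)
example <- rnorm(n = 1000, mean = 56, sd = 4)

# Minimum, quartiles, median, mean and maximum in one call.
summary(example)

# Individual statistics are also available as separate functions.
mean(example)
sd(example)
```

The mean and standard deviation should come out close to, but not exactly, the values 56 and 4 requested from rnorm.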
As a final example, let’s look at the effect of two drugs on the hours of sleep of 10 patients. The data can be loaded by typing:
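The command is not shown in this version of the text. Assuming the built-in data set sleep is meant (the name used below), a minimal sketch would be:

```r
# Load the built-in sleep data set (shipped with R in the "datasets" package).
data(sleep)

# Show the first rows: columns extra, group and ID.
head(sleep)
```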
The data can be seen by clicking on the entry “sleep” in window 3.
The data show the increase in hours of sleep (extra) for each patient (ID) depending on the drug (group). To see the average increase as a function of the drug, type:
aggregate(extra ~ group, sleep, mean)
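The result can be cross-checked with a different base R function. The sketch below stores the output of aggregate in a variable (the name by_group is hypothetical) and computes the same averages with tapply.

```r
# Average extra sleep per drug, as a data frame.
by_group <- aggregate(extra ~ group, data = sleep, FUN = mean)
by_group

# The same averages computed with tapply, returned as a named vector.
tapply(sleep$extra, sleep$group, mean)
```

Both approaches report an average of 0.75 extra hours for drug 1 and 2.33 extra hours for drug 2.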
For other functions, we recommend the references below:
Some manuals can be downloaded from the R website:
Lists of commands and functions can be found at the links below. The last link shows cheat sheets for various areas.
RStudio shortcuts can be found here:
Some books are listed below. Note that the books can be found in digital format through some university libraries – check if your institution has access to them.
- Phil Spector. Data Manipulation with R. Springer, 2008.
- Deepayan Sarkar. Lattice – Multivariate Data Visualization with R. Springer, 2008.
- John M. Chambers. Software for Data Analysis – Programming with R. Springer, 2008.
- Sarah Stowell. Using R for Statistics. Apress.
Other internet resources: