This section will guide you through the process of creating, manipulating, and efficiently working with data frames.
A data frame in R is a table-like structure for storing data, and one of the most commonly used structures for data analysis. It arranges data in rows and columns: each column holds values of a single type (numeric, character, factor, logical, and so on), but different columns can have different types, much like a table or spreadsheet. Rows typically represent observations or records, while columns represent variables or features.
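We'll cover creating data frames, reading them from CSV files, how they differ from matrices, accessing their elements, commonly used functions, and mutability.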
To create a data frame in R, use the data.frame() function.
#Create a data frame
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(30, 35, 25),
Height = c(165, 175, 170),
Married = c(TRUE, FALSE, TRUE)
)
#Print the data frame
print(df)
     Name Age Height Married
1   Alice  30    165    TRUE
2     Bob  35    175   FALSE
3 Charlie  25    170    TRUE
To read a data frame from a CSV file in R, you can use the read.csv() function. Here’s the general syntax:
data_frame_name <- read.csv("path/to/your/file.csv", header = TRUE, sep = ",")
"path/to/your/file.csv": Replace this with the actual path to your CSV file.
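As a minimal sketch of the call above, the following writes a tiny CSV to a temporary file and reads it back. The file contents and the names `csv_path` and `people_csv` are made up for illustration; the column types noted in the comment assume the defaults of R 4.0 or later.

```r
# Hypothetical data: write a two-row CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Alice,30", "Bob,35"), csv_path)

# Read it back; header = TRUE treats the first line as column names
people_csv <- read.csv(csv_path, header = TRUE, sep = ",")
print(people_csv)
str(people_csv)  # with R >= 4.0 defaults, Name is character and Age is integer

# Clean up the temporary file
unlink(csv_path)
```

In older R versions, strings were converted to factors by default, so passing stringsAsFactors = FALSE explicitly may be needed there.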
Matrix vs. Dataframe
Matrix | Dataframe |
---|---|
A collection of values arranged in a two-dimensional rectangular layout. | Stores data tables that contain multiple data types in multiple columns called fields. |
It’s an m*n array whose elements all share a single data type. | It is a list of vectors of equal length. It is a generalized form of a matrix. |
It has a fixed number of rows and columns. | It has a variable number of rows and columns. |
The data stored in columns can be only of the same data type. | Each column can have its own data type, such as numeric, character, logical, or factor. |
Matrix is homogeneous. | DataFrames are heterogeneous. |
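The homogeneous versus heterogeneous distinction can be seen in a short sketch. The objects `m` and `d` are illustrative names, not part of the examples above.

```r
# A matrix holds a single type: mixing numbers and strings
# coerces every element to character
m <- cbind(c(30, 35), c("Alice", "Bob"))
class(m[1, 1])    # "character"

# A data frame keeps a separate type for each column
d <- data.frame(Age = c(30, 35), Name = c("Alice", "Bob"))
sapply(d, class)  # Age stays numeric
```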
You can access individual columns, rows, or elements within a data frame using several different methods:
1. Access by Column Name:
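#Access the "Age" column as a vector
df$Age
#Access the "Name" column with double brackets
df[["Name"]]
#Access the second element of the "Name" column
df[["Name"]][2]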
2. Access by Indexing:
#Access the first row
df[1, ]
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
#Access the first column
df[, 1]
In R, you can use negative indexing to exclude specific elements from a data structure, such as a data frame.
#Exclude the first column
df[, -1]
Age | Height | Married |
---|---|---|
<dbl> | <dbl> | <lgl> |
30 | 165 | TRUE |
35 | 175 | FALSE |
25 | 170 | TRUE |
#Access a specific element (row 2, column 3)
df[2, 3]
#Access rows 2 and 4, columns 1 and 2 (row 4 does not exist, so it is returned as NA)
df[c(2, 4), c(1, 2)]
| | Name | Age |
|---|---|---|
| | <chr> | <dbl> |
| 2 | Bob | 35 |
| NA | NA | NA |
#Access rows 2 to 3, columns 1 to 2
df[2:3, 1:2]
| | Name | Age |
|---|---|---|
| | <chr> | <dbl> |
| 2 | Bob | 35 |
| 3 | Charlie | 25 |
#Access the rows where Age is greater than 25
df[df$Age > 25, ]
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |
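The access methods above differ in what they return. The sketch below uses a small stand-in data frame (`people`, a hypothetical name chosen so the running `df` is left untouched) to show when the result is a vector and when it remains a data frame.

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25)
)

age_vec  <- people[["Age"]]                # numeric vector, same as people$Age
age_df   <- people["Age"]                  # one-column data frame
age_col  <- people[, "Age"]                # a single column is simplified to a vector
age_keep <- people[, "Age", drop = FALSE]  # drop = FALSE keeps the data frame

is.data.frame(age_df)   # TRUE
is.data.frame(age_col)  # FALSE
```

Passing drop = FALSE is useful when later code expects a data frame regardless of how many columns were selected.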
Function | Description |
---|---|
as.data.frame() | Converts a list (or another compatible object) to a data frame. |
str() | Displays the structure of the data frame, including data types and a preview of each column. |
nrow() and ncol() | Returns the number of rows and columns in a data frame. |
dim() | Returns the dimensions of the data frame (number of rows and columns). |
colnames() and rownames() | Get or set the column and row names of the data frame. |
head() and tail() | Displays the first few or last few rows of the data frame. |
summary() | Provides summary statistics for each column in the data frame. |
subset() | Subsets a data frame based on conditions. |
merge() | Merges two data frames by common columns or row names. |
rbind() and cbind() | Binds data frames by rows or columns. |
apply() | Applies a function to rows or columns of the data frame. |
is.data.frame() | Checks if an object is a data frame. |
order() | Returns the row order used to sort a data frame by one or more columns. |
duplicated() and unique() | Finds duplicate rows or returns unique rows. |
transform() | Adds new columns or transforms existing columns. |
as.data.frame(): Converts a list to a data frame.
lst <- list(Name = c("Alice", "Bob"), Age = c(25, 30))
df_lst <- as.data.frame(lst)
print(df_lst)
   Name Age
1 Alice  25
2   Bob  30
str(): Displays the structure of the data frame, including data types and a preview of each column.
str(df)
'data.frame':	3 obs. of  4 variables:
 $ Name   : chr  "Alice" "Bob" "Charlie"
 $ Age    : num  30 35 25
 $ Height : num  165 175 170
 $ Married: logi  TRUE FALSE TRUE
nrow() and ncol(): Return the number of rows and columns in a data frame.
nrow(df)
ncol(df)
dim(): Returns the dimensions of the data frame (number of rows and columns).
dim(df)
colnames() and rownames(): Get or set the column and row names of the data frame.
colnames(df)
rownames(df)
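The calls above only read the names. Setting them uses the same functions on the left-hand side of an assignment, as in this sketch with a hypothetical `scores` data frame.

```r
scores <- data.frame(a = c(90, 85), b = c(70, 95))

# Assign new column and row names
colnames(scores) <- c("Math", "Physics")
rownames(scores) <- c("Alice", "Bob")

# Names can then be used for indexing
scores["Bob", "Physics"]  # 95
```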
head() and tail(): Display the first few or last few rows of the data frame.
head(df) #Returns the first 6 rows (or fewer if the frame is smaller)
tail(df) #Returns the last 6 rows
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |
| 3 | Charlie | 25 | 170 | TRUE |

| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |
| 3 | Charlie | 25 | 170 | TRUE |
summary(): Provides summary statistics for each column in the data frame.
summary(df)
     Name                Age           Height       Married
 Length:3           Min.   :25.0   Min.   :165.0   Mode :logical
 Class :character   1st Qu.:27.5   1st Qu.:167.5   FALSE:1
 Mode  :character   Median :30.0   Median :170.0   TRUE :2
                    Mean   :30.0   Mean   :170.0
                    3rd Qu.:32.5   3rd Qu.:172.5
                    Max.   :35.0   Max.   :175.0
subset(): Subsets a data frame based on conditions.
#Keep the rows where Age is greater than 25
subset(df, Age > 25)
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |

#Keep the rows where Height is not 170
subset(df, Height != 170)
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |
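subset() can also combine several conditions and restrict the returned columns through its select argument. A minimal sketch, using a stand-in data frame named `people` (hypothetical) with the same values as above:

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Height = c(165, 175, 170)
)

# Rows where both conditions hold, keeping only Name and Age
older_and_tall <- subset(people, Age > 25 & Height >= 170, select = c(Name, Age))
print(older_and_tall)  # only Bob satisfies both conditions
```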
merge(): Merges two data frames by common columns or row names.
df1 <- data.frame(ID = c(1, 2), Name = c("Alice", "Bob"))
df2 <- data.frame(ID = c(1, 2), Age = c(25, 30))
merge(df1, df2, by = "ID")
ID | Name | Age |
---|---|---|
<dbl> | <chr> | <dbl> |
1 | Alice | 25 |
2 | Bob | 30 |
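In the example above every ID appears in both data frames. When the keys only partly overlap, the all, all.x, and all.y arguments of merge() control which unmatched rows are kept. The data frames `left` and `right` below are hypothetical.

```r
left  <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
right <- data.frame(ID = c(2, 3, 4), Age = c(35, 25, 40))

inner_rows <- merge(left, right, by = "ID")                # IDs present in both: 2, 3
outer_rows <- merge(left, right, by = "ID", all = TRUE)    # all IDs, NA where a value is missing
left_rows  <- merge(left, right, by = "ID", all.x = TRUE)  # every row of left: 1, 2, 3

print(outer_rows)
```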
rbind() and cbind(): Bind data frames by rows or columns.
df1 <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))
df2 <- data.frame(Name = c("Charlie", "David"), Age = c(35, 40))
rbind(df1, df2)
Name | Age |
---|---|
<chr> | <dbl> |
Alice | 25 |
Bob | 30 |
Charlie | 35 |
David | 40 |
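The example above covers rbind() only. A matching sketch for cbind(), which places data frames side by side, is shown below. The names `names_df`, `ages_df`, and `combined` are illustrative.

```r
names_df <- data.frame(Name = c("Alice", "Bob"))
ages_df  <- data.frame(Age = c(25, 30))

# Both inputs have two rows, so they can be bound column-wise
combined <- cbind(names_df, ages_df)
print(combined)
```

rbind() expects the data frames to have the same column names, while cbind() expects them to have the same number of rows.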
apply(): Applies a function to rows or columns of the data frame.
df[, -1]
Age | Height | Married |
---|---|---|
<dbl> | <dbl> | <lgl> |
30 | 165 | TRUE |
35 | 175 | FALSE |
25 | 170 | TRUE |
#Compute the mean of the Age column (MARGIN = 2 applies the function to each column)
apply(df['Age'], 2, mean)
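apply() converts the data frame to a matrix first, so it works best on columns of one type. The sketch below restricts the input to the numeric columns and shows both margins, with sapply() as a common alternative for column-wise work. The data frame `people` is a hypothetical stand-in.

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Height = c(165, 175, 170)
)
numeric_cols <- people[c("Age", "Height")]

# MARGIN = 1 applies the function to each row
row_sums <- apply(numeric_cols, 1, sum)

# sapply() iterates over columns without converting to a matrix
col_means <- sapply(numeric_cols, mean)
print(col_means)
```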
is.data.frame(): Checks if an object is a data frame.
is.data.frame(df)
order(): Returns the row indices in sorted order; indexing the data frame with them sorts it by one or more columns.
#Sorts the data frame by the Age column
df[order(df$Age), ]
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 3 | Charlie | 25 | 170 | TRUE |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |
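Sorting in descending order, or by more than one column, follows the same pattern. This is a small sketch with a hypothetical data frame `people` that includes a tie on Age.

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Dana"),
  Age = c(30, 35, 25, 30)
)

# Descending by Age
by_age_desc <- people[order(people$Age, decreasing = TRUE), ]

# Ascending by Age, with ties broken by Name
by_age_then_name <- people[order(people$Age, people$Name), ]
print(by_age_then_name)
```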
duplicated() and unique(): Find duplicate rows or return unique rows.
duplicated(df)
unique(df)
| | Name | Age | Height | Married |
|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <lgl> |
| 1 | Alice | 30 | 165 | TRUE |
| 2 | Bob | 35 | 175 | FALSE |
| 3 | Charlie | 25 | 170 | TRUE |
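Because df has no repeated rows, the calls above return it unchanged. The effect is easier to see on data that does contain a duplicate, as in this sketch with a hypothetical `visits` data frame.

```r
visits <- data.frame(
  Name = c("Alice", "Bob", "Alice"),
  Age = c(30, 35, 30)
)

dup_flags <- duplicated(visits)  # FALSE FALSE TRUE: row 3 repeats row 1
unique_visits <- unique(visits)  # keeps rows 1 and 2

# Remove duplicates based on a single column
first_per_name <- visits[!duplicated(visits$Name), ]
```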
transform(): Adds new columns or transforms existing columns.
df <- transform(df, AgeInMonths = Age * 12)
df
Name | Age | Height | Married | AgeInMonths |
---|---|---|---|---|
<chr> | <dbl> | <dbl> | <lgl> | <dbl> |
Alice | 30 | 165 | TRUE | 360 |
Bob | 35 | 175 | FALSE | 420 |
Charlie | 25 | 170 | TRUE | 300 |
Mutability refers to the ability to modify an object after it has been created. For data frames (and other data structures such as lists, matrices, and vectors), this means that the contents can be changed after creation, for example by adding, removing, or modifying elements.
#Add a new column
df$Weight <- c(65, 75, 85)
print(df)
     Name Age Height Married AgeInMonths Weight
1   Alice  30    165    TRUE         360     65
2     Bob  35    175   FALSE         420     75
3 Charlie  25    170    TRUE         300     85
#Remove a column by assigning NULL
df$Weight <- NULL
print(df)
     Name Age Height Married AgeInMonths
1   Alice  30    165    TRUE         360
2     Bob  35    175   FALSE         420
3 Charlie  25    170    TRUE         300
#Append a new row with rbind()
new_row <- data.frame(Name = "David", Age = 40, Height = 180, Married = FALSE, AgeInMonths = 480)
df <- rbind(df, new_row)
print(df)
     Name Age Height Married AgeInMonths
1   Alice  30    165    TRUE         360
2     Bob  35    175   FALSE         420
3 Charlie  25    170    TRUE         300
4   David  40    180   FALSE         480
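Individual values can be modified as well as whole columns and rows. The sketch below updates a single cell and also shows that a data frame assigned to a second name behaves as an independent copy, so the first object is unaffected. The names `original` and `modified` are illustrative.

```r
original <- data.frame(Name = c("Alice", "Bob"), Age = c(30, 35))

# Assigning to a new name gives an object that can be changed independently
modified <- original

# Update a single cell selected by a condition
modified$Age[modified$Name == "Bob"] <- 36

original$Age  # 30 35, unchanged
modified$Age  # 30 36
```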