6. Data Frame

This section shows how to create, index, inspect, and modify data frames in R.

A data frame in R is a table-like data structure and one of the most commonly used structures for data analysis. It stores data in a rectangular format of rows and columns, and each column can hold a different data type (numeric, character, logical, factor, etc.), much like a table or spreadsheet. Rows typically represent observations or records, while columns represent variables or features.

We'll learn about the following topics:

6.1. Creating Data Frames
6.2. Data Frame Indexing and Slicing
6.3. Built-in Data Frame Functions
6.4. Data Frame Properties

6.1. Creating Data Frames:

To create a data frame in R, you use the data.frame() function.

In [1]:

#Create a data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Height = c(165, 175, 170),
  Married = c(TRUE, FALSE, TRUE)
)

#Print the data frame
print(df)

     Name Age Height Married
1   Alice  30    165    TRUE
2     Bob  35    175   FALSE
3 Charlie  25    170    TRUE

To read a data frame from a CSV file in R, you can use the read.csv() function. Here’s the general syntax:

data_frame_name <- read.csv("path/to/your/file.csv", header = TRUE, sep = ",")

data_frame_name: The name you want to assign to your data frame.

"path/to/your/file.csv": The path to your CSV file.

header: Logical value indicating whether the first row of the file contains column names. TRUE is the default for read.csv(); set it to FALSE if the file has no header row.

sep: The character that separates the values in the file. The default for read.csv() is a comma (,); change it if your data uses a different separator (e.g., sep = "\t" for tab-separated values).
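To try read.csv() without an existing file, the following minimal sketch writes a small CSV to a temporary location and reads it back. The file contents and the variable names csv_path and people are made up for this illustration.

```r
# Write a two-row CSV to a temporary file (contents are illustrative only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Alice,30", "Bob,35"), csv_path)

# Read it back; the first line is treated as the header row
people <- read.csv(csv_path, header = TRUE, sep = ",")
print(people)

# Remove the temporary file
unlink(csv_path)
```

The same call works with a real path in place of csv_path.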

Matrix vs. Data Frame

Matrix	Data Frame
A two-dimensional rectangular arrangement of elements.	A table whose columns (fields) can each hold a different data type.
An m × n array whose elements all share one data type.	A list of vectors of equal length, one per column. It is a generalized form of a matrix.
Adding a row or column means building a new matrix with rbind() or cbind().	Columns can be added or removed by assignment, and rows can be appended with rbind().
All elements must be of the same data type.	Columns can be numeric, character, logical, factor, or another vector type.
A matrix is homogeneous.	A data frame is heterogeneous.
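The homogeneous versus heterogeneous distinction can be seen in a short sketch. The objects m and d below are hypothetical examples, not part of the data used elsewhere in this section.

```r
# A matrix holds a single type: mixing numbers and text
# coerces every element to character
m <- cbind(c(1, 2), c("a", "b"))
class(m[1, 1])   # "character"

# A data frame keeps a separate type for each column
d <- data.frame(x = c(1, 2), y = c("a", "b"))
class(d$x)       # "numeric"
sapply(d, class) # one class per column
```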

6.2. Data Frame Indexing and Slicing:

You can access individual columns, rows, or elements within a data frame using several different methods:

1. Access by Column Name:

In [2]:

#Access the "Age" column
df$Age

30
35
25

In [3]:

#Access the "Name" column with double brackets
df[["Name"]]

'Alice'
'Bob'
'Charlie'

In [4]:

#Access the second element of the "Name" column
df[["Name"]][2]

'Bob'

2. Access by Indexing:

In [5]:

#Access the first row
df[1, ]

A data.frame: 1 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE

In [6]:

#Access the first column
df[, 1]

'Alice'
'Bob'
'Charlie'
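Selecting a single column with df[, 1] returns a plain vector rather than a data frame. The sketch below, which rebuilds a small df so that it runs on its own, shows two ways to keep the result as a one-column data frame.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25)
)

# A single column selected with [row, column] drops to a vector
is.data.frame(df[, 1])                # FALSE

# drop = FALSE keeps the result as a one-column data frame
is.data.frame(df[, 1, drop = FALSE])  # TRUE

# Single brackets with a column name also return a data frame
is.data.frame(df["Name"])             # TRUE
```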

In R, you can use negative indexing to exclude specific elements from a data structure, such as a data frame.

In [7]:

#Exclude the first column
df[, -1]

A data.frame: 3 × 3
Age	Height	Married
<dbl>	<dbl>	<lgl>
30	165	TRUE
35	175	FALSE
25	170	TRUE

In [8]:

#Access a specific element (row 2, column 3)
df[2, 3]

175

In [9]:

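#Select rows 2 and 4 and columns 1 and 2
#Row 4 does not exist in df, so it is returned as a row of NA values
df[c(2,4), c(1,2)]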

A data.frame: 2 × 2
	Name	Age
	<chr>	<dbl>
2	Bob	35
NA	NA	NA

In [10]:

#Select rows 2 to 3 and columns 1 to 2
df[2:3, 1:2]

A data.frame: 2 × 2
	Name	Age
	<chr>	<dbl>
2	Bob	35
3	Charlie	25

In [11]:

#Keep only the rows where Age is greater than 25
df[df$Age > 25, ]

A data.frame: 2 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE
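Logical conditions can be combined with & (and) or | (or), and a column selection can be added in the same step. A minimal sketch, rebuilding df so that it runs on its own; the result names older_married and older_married_small are illustrative.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Height = c(165, 175, 170),
  Married = c(TRUE, FALSE, TRUE)
)

# Rows where Age is above 25 AND Married is TRUE
older_married <- df[df$Age > 25 & df$Married, ]
print(older_married)

# Same row filter, keeping only the Name and Age columns
older_married_small <- df[df$Age > 25 & df$Married, c("Name", "Age")]
print(older_married_small)
```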

6.3. Built-in Data Frame Functions:

Function	Description
as.data.frame()	Converts a list (or another object, such as a matrix) to a data frame.
str()	Displays the structure of the data frame, including data types and a preview of each column.
nrow() and ncol()	Return the number of rows and the number of columns in a data frame.
dim()	Returns the dimensions of the data frame (number of rows, then number of columns).
colnames() and rownames()	Get or set the column and row names of the data frame.
head() and tail()	Display the first few or last few rows of the data frame.
summary()	Provides summary statistics for each column in the data frame.
subset()	Subsets a data frame based on conditions.
merge()	Merges two data frames by common columns or row names.
rbind() and cbind()	Bind data frames by rows or by columns.
apply()	Applies a function to the rows or columns of the data frame.
is.data.frame()	Checks if an object is a data frame.
order()	Returns the row order that sorts a data frame by one or more columns; used as df[order(...), ].
duplicated() and unique()	Flag duplicate rows, or return the unique rows.
transform()	Adds new columns or transforms existing columns.

as.data.frame(): Converts a list (or another object, such as a matrix) to a data frame.

In [12]:

lst <- list(Name = c("Alice", "Bob"), Age = c(25, 30))

df_lst <- as.data.frame(lst)

print(df_lst)

   Name Age
1 Alice  25
2   Bob  30

str(): Displays the structure of the data frame, including data types and a preview of each column.

In [13]:

str(df)

'data.frame':	3 obs. of  4 variables:
 $ Name   : chr  "Alice" "Bob" "Charlie"
 $ Age    : num  30 35 25
 $ Height : num  165 175 170
 $ Married: logi  TRUE FALSE TRUE

nrow() and ncol(): Return the number of rows and the number of columns in a data frame.

In [14]:

nrow(df)
ncol(df)

3

4

dim(): Returns the dimensions of the data frame (number of rows and columns).

In [15]:

dim(df)

3
4

colnames() and rownames(): Get or set the column and row names of the data frame.

In [16]:

colnames(df)
rownames(df)

'Name'
'Age'
'Height'
'Married'

'1'
'2'
'3'
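The same two functions can also set names when used on the left side of an assignment. A minimal sketch on a copy, so the original stays unchanged; the new names AgeYears, a, and b are made up for the example.

```r
df <- data.frame(Name = c("Alice", "Bob"), Age = c(30, 35))

# Work on a copy so df keeps its original names
renamed <- df

# Rename the second column and assign custom row names
colnames(renamed)[2] <- "AgeYears"
rownames(renamed) <- c("a", "b")

print(renamed)
```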

head() and tail(): Display the first few or last few rows of the data frame.

In [17]:

head(df)  #Returns the first 6 rows (or fewer if the frame is smaller)
tail(df)  #Returns the last 6 rows

A data.frame: 3 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE
3	Charlie	25	170	TRUE

A data.frame: 3 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE
3	Charlie	25	170	TRUE
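Because df has only three rows, head() and tail() both return the whole data frame here. Passing the n argument limits the number of rows, as in this small sketch (df is rebuilt so the block runs on its own).

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25)
)

# First two rows
first_two <- head(df, n = 2)
print(first_two)

# Last row only
last_one <- tail(df, n = 1)
print(last_one)
```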

summary(): Provides summary statistics for each column in the data frame.

In [18]:

summary(df)

     Name                Age           Height       Married       
 Length:3           Min.   :25.0   Min.   :165.0   Mode :logical  
 Class :character   1st Qu.:27.5   1st Qu.:167.5   FALSE:1        
 Mode  :character   Median :30.0   Median :170.0   TRUE :2        
                    Mean   :30.0   Mean   :170.0                  
                    3rd Qu.:32.5   3rd Qu.:172.5                  
                    Max.   :35.0   Max.   :175.0

subset(): Subsets a data frame based on conditions.

In [19]:

#Keep rows where Age is greater than 25
subset(df, Age > 25)

A data.frame: 2 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE

In [20]:

#Keep rows where Height is not equal to 170
subset(df, Height != 170)

A data.frame: 2 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE
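subset() also accepts a select argument for choosing columns, which can be combined with the row condition. A minimal sketch, with df rebuilt so it runs on its own; the result names adults and no_height are illustrative.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Height = c(165, 175, 170),
  Married = c(TRUE, FALSE, TRUE)
)

# Rows with Age above 25, keeping only Name and Age
adults <- subset(df, Age > 25, select = c(Name, Age))
print(adults)

# All rows, dropping the Height column
no_height <- subset(df, select = -Height)
print(no_height)
```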

merge(): Merges two data frames by common columns or row names.

In [21]:

#Two data frames that share an ID column
df1 <- data.frame(ID = c(1, 2), Name = c("Alice", "Bob"))
df2 <- data.frame(ID = c(1, 2), Age = c(25, 30))
#Join them on ID
merge(df1, df2, by = "ID")

A data.frame: 2 × 3
ID	Name	Age
<dbl>	<chr>	<dbl>
1	Alice	25
2	Bob	30
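By default merge() keeps only the rows whose key appears in both data frames. The sketch below uses hypothetical data with non-matching IDs to show how the all and all.x arguments keep unmatched rows, filling the gaps with NA.

```r
left_df <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
right_df <- data.frame(ID = c(1, 2, 4), Age = c(25, 30, 40))

# Default: only IDs present in both (1 and 2)
inner <- merge(left_df, right_df, by = "ID")

# all = TRUE: keep every ID from either side (1, 2, 3, 4)
outer <- merge(left_df, right_df, by = "ID", all = TRUE)

# all.x = TRUE: keep every row of the first data frame (1, 2, 3)
left <- merge(left_df, right_df, by = "ID", all.x = TRUE)

print(outer)
```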

rbind() and cbind(): Bind data frames by rows or by columns.

In [22]:

#Two data frames with the same columns
df1 <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))
df2 <- data.frame(Name = c("Charlie", "David"), Age = c(35, 40))
#Stack the rows of df2 under df1
rbind(df1, df2)

A data.frame: 4 × 2
Name	Age
<chr>	<dbl>
Alice	25
Bob	30
Charlie	35
David	40
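cbind() works the same way for columns, provided both data frames have the same number of rows. A minimal sketch with made-up data; the names people, extra, and combined are illustrative.

```r
people <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))
extra <- data.frame(Height = c(165, 175))

# Place the columns of extra to the right of people
combined <- cbind(people, extra)
print(combined)
```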

apply(): Applies a function to rows or columns of the data frame.

In [23]:

#Show the data frame without the Name column
df[, -1]

A data.frame: 3 × 3
Age	Height	Married
<dbl>	<dbl>	<lgl>
30	165	TRUE
35	175	FALSE
25	170	TRUE

In [24]:

#Apply mean() to each column (MARGIN = 2) of the one-column data frame
apply(df['Age'], 2, mean)

Age: 30
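The second argument of apply() chooses the direction: 1 for rows, 2 for columns. Since apply() converts the data frame to a matrix first, it is safest to pass only columns of one type. The sketch below restricts the data to the numeric columns and also shows sapply(), a common alternative for per-column summaries; df is rebuilt so the block runs on its own.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Height = c(165, 175, 170)
)

numeric_part <- df[c("Age", "Height")]

# MARGIN = 2: one mean per column
col_means <- apply(numeric_part, 2, mean)
print(col_means)

# MARGIN = 1: one sum per row
row_sums <- apply(numeric_part, 1, sum)
print(row_sums)

# sapply() visits each column of the data frame in turn
col_means_alt <- sapply(numeric_part, mean)
print(col_means_alt)
```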

is.data.frame(): Checks if an object is a data frame.

In [25]:

is.data.frame(df)

TRUE

order(): Returns the row order that sorts a data frame by one or more columns; index the data frame with that order to sort it.

In [26]:

#Sorts the data frame by the Age column
df[order(df$Age), ]

A data.frame: 3 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
3	Charlie	25	170	TRUE
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE
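order() can also sort in descending order or by several columns at once. A minimal sketch, with df rebuilt so it runs on its own; the result names by_age_desc and by_married_age are illustrative.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25),
  Married = c(TRUE, FALSE, TRUE)
)

# Oldest first
by_age_desc <- df[order(df$Age, decreasing = TRUE), ]
print(by_age_desc)

# Sort by Married (FALSE before TRUE), then by Age within each group
by_married_age <- df[order(df$Married, df$Age), ]
print(by_married_age)
```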

duplicated() and unique(): duplicated() flags rows that repeat an earlier row; unique() returns the data frame with those repeats removed.

In [27]:

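#One logical value per row: TRUE if the row repeats an earlier one
duplicated(df)
#The data frame with repeated rows removed
unique(df)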

FALSE
FALSE
FALSE

A data.frame: 3 × 4
	Name	Age	Height	Married
	<chr>	<dbl>	<dbl>	<lgl>
1	Alice	30	165	TRUE
2	Bob	35	175	FALSE
3	Charlie	25	170	TRUE
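Since df has no repeated rows, every value above is FALSE and unique() returns the data frame unchanged. The sketch below uses a hypothetical data frame, dup_df, that does contain a repeated row.

```r
dup_df <- data.frame(
  Name = c("Alice", "Bob", "Alice"),
  Age = c(30, 35, 30)
)

# The third row repeats the first one
flags <- duplicated(dup_df)
print(flags)   # FALSE FALSE TRUE

# Keep only the first occurrence of each row
deduped <- unique(dup_df)
print(deduped)
```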

transform(): Adds new columns or transforms existing columns.

In [28]:

#Add an AgeInMonths column computed from Age
df <- transform(df, AgeInMonths = Age * 12)

In [29]:

df

A data.frame: 3 × 5
Name	Age	Height	Married	AgeInMonths
<chr>	<dbl>	<dbl>	<lgl>	<dbl>
Alice	30	165	TRUE	360
Bob	35	175	FALSE	420
Charlie	25	170	TRUE	300
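transform() can also overwrite an existing column by reusing its name. A minimal sketch that converts Height from centimetres to metres; the data frame heights and the unit interpretation are assumptions made for the example.

```r
heights <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Height = c(165, 175, 170)
)

# Reusing the column name replaces its values
heights_m <- transform(heights, Height = Height / 100)
print(heights_m)
```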

6.4. Data Frame Properties:

Mutability: Refers to the ability to modify an object after it has been created. For data frames (and other data structures such as lists, matrices, or vectors), it means the contents can be changed after creation, for example by adding, removing, or modifying elements.

Add a New Column:

In [30]:

df$Weight <- c(65, 75, 85)
print(df)

     Name Age Height Married AgeInMonths Weight
1   Alice  30    165    TRUE         360     65
2     Bob  35    175   FALSE         420     75
3 Charlie  25    170    TRUE         300     85

Remove a Column:

In [31]:

df$Weight <- NULL
print(df)

     Name Age Height Married AgeInMonths
1   Alice  30    165    TRUE         360
2     Bob  35    175   FALSE         420
3 Charlie  25    170    TRUE         300

Add a New Row:

In [32]:

new_row <- data.frame(Name = "David", Age = 40, Height = 180, Married = FALSE, AgeInMonths = 480)
df <- rbind(df, new_row)
print(df)

     Name Age Height Married AgeInMonths
1   Alice  30    165    TRUE         360
2     Bob  35    175   FALSE         420
3 Charlie  25    170    TRUE         300
4   David  40    180   FALSE         480
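Modifying a single value and removing a row follow the same indexing rules as in section 6.2. A minimal sketch on a small stand-alone data frame; the new age value is made up for the example.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 35, 25)
)

# Modify one value: set Bob's age to 36
df[df$Name == "Bob", "Age"] <- 36

# Remove the third row with negative indexing
df <- df[-3, ]

print(df)
```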