My data below background:

This is question&answer data
Each Section_ID have 3 respondents
I would like to divide each “Question” and “Answer” into 3 separate columns in order to see the consensus between 3
Under “Section_ID”, each respondents’ questions and answers are unique
Other:
Question’s data is equal to Answer’s data (Question=Answer)
Each Section_ID&Interview_date has the same value repeated

==================================================== What I would like to do:

Create new columns (divide each “Question” and “Answer” into 3 separate columns, original Section_ID and Interview_date can remain the same)
Assign Question&Answer’s values into different columns
I would like Section_ID and Interview_date aligned the longst value;
Add blank if the columns(Question&Answer) doesn’t have any more values
Thank you for reading!

Initial dataframe¶

In [1]:

import pandas as pd

dic = {'Question_ID': ["AD","AD","AD","HG","HG","JI","JI","JI","JI","OP","RT","ER","ER","ER","ER"],
    'Answer_ID': ['BB', 'BB', 'BB', 'GG', 'GG', 'UI', 'UI', 'UI', 'UI', 'LL', 'TT', 'RR', 'RR', 'RR', 'RR'],
    'Section_ID': ['X'] * 9 + ['Z'] * 6,
    'Person_Incharge': ['Y'] * 9 + ['Q'] * 6,}
df = pd.DataFrame(dic)
df
# dfx = df.set_index(["Section_ID", "Person_Incharge"]).unstack(["Question_ID","Answer_ID"])

Out[1]:

	Question_ID	Answer_ID	Section_ID	Person_Incharge
0	AD	BB	X	Y
1	AD	BB	X	Y
2	AD	BB	X	Y
3	HG	GG	X	Y
4	HG	GG	X	Y
5	JI	UI	X	Y
6	JI	UI	X	Y
7	JI	UI	X	Y
8	JI	UI	X	Y
9	OP	LL	Z	Q
10	RT	TT	Z	Q
11	ER	RR	Z	Q
12	ER	RR	Z	Q
13	ER	RR	Z	Q
14	ER	RR	Z	Q

New requirements¶

Transform the above dataframe according to the following requirements:

Each question has its own answer pair. The pair is one-to-one.
The pairing is fixed e.g. Question AD can only be paired with Question BB.
Each pair is paired with other pairs e.g. (AD, BB) can only be paired with (HG, GG) and (JI, UI)
Pairing between multiple pairs is not always available. For example, for (AD, BB), (HG, GG), and (JI, UI), sometimes only (AD, BB) is available while the other pairs are missing from the row. When this happens, the missing pairs should be filled with blank values.
The location of pairs are also fixed e.g. (AD, BB) can only be in Question1 and Answer1 while (ER, RR) can only be in Question3 and Answer 3.

We'll start by identifying the Q&A pair.

The format is {Question - Answer}.

AD - BB
HG - GG
JI - UI
OP - LL
RT - TT
ER - RR

Next, the pair between multiple pairs:

The format is {(Question - Answer) - (Question - Answer) - ...}

(AD - BB) - (HG - GG) - (JI - UI)
(OP - LL) - (RT - TT) - (ER - RR)

Put each Q&A pair into their own dataframes¶

In [2]:

# use groupby to group questions together

# grouped_questions = [{x: y} for x, y in df.groupby('Question_ID', as_index=False)]
grouped_questions = {}
for x, y in df.groupby('Question_ID'):
    grouped_questions[x] = y.reset_index(drop=True)

# example pair for question AD
grouped_questions['AD']

Out[2]:

	Question_ID	Answer_ID	Section_ID	Person_Incharge
0	AD	BB	X	Y
1	AD	BB	X	Y
2	AD	BB	X	Y

In [3]:

# we can either concat or merge depending on our use case
# we start by identifying which pairs the question id and its question number

question_1 = ['AD', 'OP']
question_2 = ['HG', 'RT']
question_3 = ['JI', 'ER']

# next, iterate by the length of the any one of the question list
# in this case, the iteration count will be 2

final_df = pd.DataFrame()
for index in range(len(question_1)):
    first_question = grouped_questions[question_1[index]]
    second_question = grouped_questions[question_2[index]]
    third_question = grouped_questions[question_3[index]]
    
    # merge first question with second question
    merged_df = first_question.merge(second_question, how='outer', left_index=True, right_index=True, suffixes=["_1", "_2"]) 
    
    # merge second question with third question
    merged_df = merged_df.merge(third_question, how='outer', left_index=True, right_index=True)
    merged_df = merged_df.rename({'Question_ID': 'Question_ID_3', 'Answer_ID': 'Answer_ID_3'}, axis=1)
    merged_df = merged_df.fillna("Blank")
    
    if final_df.shape[0] == 0:
        final_df = merged_df
    else:
        final_df = pd.concat([final_df, merged_df])
        
final_df.reset_index(drop=True)

Out[3]:

	Question_ID_1	Answer_ID_1	Section_ID_1	Person_Incharge_1	Question_ID_2	Answer_ID_2	Section_ID_2	Person_Incharge_2	Question_ID_3	Answer_ID_3	Section_ID	Person_Incharge
0	AD	BB	X	Y	HG	GG	X	Y	JI	UI	X	Y
1	AD	BB	X	Y	HG	GG	X	Y	JI	UI	X	Y
2	AD	BB	X	Y	Blank	Blank	Blank	Blank	JI	UI	X	Y
3	Blank	Blank	Blank	Blank	Blank	Blank	Blank	Blank	JI	UI	X	Y
4	OP	LL	Z	Q	RT	TT	Z	Q	ER	RR	Z	Q
5	Blank	Blank	Blank	Blank	Blank	Blank	Blank	Blank	ER	RR	Z	Q
6	Blank	Blank	Blank	Blank	Blank	Blank	Blank	Blank	ER	RR	Z	Q
7	Blank	Blank	Blank	Blank	Blank	Blank	Blank	Blank	ER	RR	Z	Q

Because we merged with Outer, we'll need to clean up the columns¶

In [4]:

for num in range(1, 3):
    final_df = final_df.drop(["Person_Incharge_{}".format(num), "Section_ID_{}".format(num)],axis=1)
final_df

Out[4]:

	Question_ID_1	Answer_ID_1	Question_ID_2	Answer_ID_2	Question_ID_3	Answer_ID_3	Section_ID	Person_Incharge
0	AD	BB	HG	GG	JI	UI	X	Y
1	AD	BB	HG	GG	JI	UI	X	Y
2	AD	BB	Blank	Blank	JI	UI	X	Y
3	Blank	Blank	Blank	Blank	JI	UI	X	Y
0	OP	LL	RT	TT	ER	RR	Z	Q
1	Blank	Blank	Blank	Blank	ER	RR	Z	Q
2	Blank	Blank	Blank	Blank	ER	RR	Z	Q
3	Blank	Blank	Blank	Blank	ER	RR	Z	Q

In [8]:

# to put section id and person incharge to the front

col = final_df.pop('Section_ID')
final_df.insert(0, col.name, col)

col = final_df.pop('Person_Incharge')
final_df.insert(1, col.name, col)

# also reset the index
final_df.reset_index(drop=True)

Out[8]:

	Section_ID	Person_Incharge	Question_ID_1	Answer_ID_1	Question_ID_2	Answer_ID_2	Question_ID_3	Answer_ID_3
0	X	Y	AD	BB	HG	GG	JI	UI
1	X	Y	AD	BB	HG	GG	JI	UI
2	X	Y	AD	BB	Blank	Blank	JI	UI
3	X	Y	Blank	Blank	Blank	Blank	JI	UI
4	Z	Q	OP	LL	RT	TT	ER	RR
5	Z	Q	Blank	Blank	Blank	Blank	ER	RR
6	Z	Q	Blank	Blank	Blank	Blank	ER	RR
7	Z	Q	Blank	Blank	Blank	Blank	ER	RR

DISREGARD: Older solutions¶

In [6]:

df_split_questions = df.copy()
question_ids = ['AD', 'HG', 'JI', 'OP', 'RT', 'ER']

for id in question_ids:
    df_split_questions['Question_ID_' + id] = df['Question_ID']==id
df_split_questions

Out[6]:

	Question_ID	Answer_ID	Section_ID	Person_Incharge	Question_ID_AD	Question_ID_HG	Question_ID_JI	Question_ID_OP	Question_ID_RT	Question_ID_ER
0	AD	BB	X	Y	True	False	False	False	False	False
1	AD	BB	X	Y	True	False	False	False	False	False
2	AD	BB	X	Y	True	False	False	False	False	False
3	HG	GG	X	Y	False	True	False	False	False	False
4	HG	GG	X	Y	False	True	False	False	False	False
5	JI	UI	X	Y	False	False	True	False	False	False
6	JI	UI	X	Y	False	False	True	False	False	False
7	JI	UI	X	Y	False	False	True	False	False	False
8	JI	UI	X	Y	False	False	True	False	False	False
9	OP	LL	Z	Q	False	False	False	True	False	False
10	RT	TT	Z	Q	False	False	False	False	True	False
11	ER	RR	Z	Q	False	False	False	False	False	True
12	ER	RR	Z	Q	False	False	False	False	False	True
13	ER	RR	Z	Q	False	False	False	False	False	True
14	ER	RR	Z	Q	False	False	False	False	False	True

In [7]:

df_split_answers = df_split_questions.copy()
answer_ids = ['BB', 'GG', 'UI', 'LL', 'RR']
for id in answer_ids:
    df_split_answers['Answer_ID_' + id] = df['Answer_ID']==id
df_split_answers.drop(['Question_ID', 'Answer_ID'], axis=1)

Out[7]:

	Section_ID	Person_Incharge	Question_ID_AD	Question_ID_HG	Question_ID_JI	Question_ID_OP	Question_ID_RT	Question_ID_ER	Answer_ID_BB	Answer_ID_GG	Answer_ID_UI	Answer_ID_LL	Answer_ID_RR
0	X	Y	True	False	False	False	False	False	True	False	False	False	False
1	X	Y	True	False	False	False	False	False	True	False	False	False	False
2	X	Y	True	False	False	False	False	False	True	False	False	False	False
3	X	Y	False	True	False	False	False	False	False	True	False	False	False
4	X	Y	False	True	False	False	False	False	False	True	False	False	False
5	X	Y	False	False	True	False	False	False	False	False	True	False	False
6	X	Y	False	False	True	False	False	False	False	False	True	False	False
7	X	Y	False	False	True	False	False	False	False	False	True	False	False
8	X	Y	False	False	True	False	False	False	False	False	True	False	False
9	Z	Q	False	False	False	True	False	False	False	False	False	True	False
10	Z	Q	False	False	False	False	True	False	False	False	False	False	False
11	Z	Q	False	False	False	False	False	True	False	False	False	False	True
12	Z	Q	False	False	False	False	False	True	False	False	False	False	True
13	Z	Q	False	False	False	False	False	True	False	False	False	False	True
14	Z	Q	False	False	False	False	False	True	False	False	False	False	True

	Question_ID	Answer_ID	Section_ID	Person_Incharge
0	AD	BB	X	Y
1	AD	BB	X	Y
2	AD	BB	X	Y
3	HG	GG	X	Y
4	HG	GG	X	Y
5	JI	UI	X	Y
6	JI	UI	X	Y
7	JI	UI	X	Y
8	JI	UI	X	Y
9	OP	LL	Z	Q
10	RT	TT	Z	Q
11	ER	RR	Z	Q
12	ER	RR	Z	Q
13	ER	RR	Z	Q
14	ER	RR	Z	Q

	Question_ID	Answer_ID	Section_ID	Person_Incharge
0	AD	BB	X	Y
1	AD	BB	X	Y
2	AD	BB	X	Y
3	HG	GG	X	Y
4	HG	GG	X	Y
5	JI	UI	X	Y
6	JI	UI	X	Y
7	JI	UI	X	Y
8	JI	UI	X	Y
9	OP	LL	Z	Q
10	RT	TT	Z	Q
11	ER	RR	Z	Q
12	ER	RR	Z	Q
13	ER	RR	Z	Q
14	ER	RR	Z	Q

	Question_ID	Answer_ID	Section_ID	Person_Incharge
0	AD	BB	X	Y
1	AD	BB	X	Y
2	AD	BB	X	Y
3	HG	GG	X	Y
4	HG	GG	X	Y
5	JI	UI	X	Y
6	JI	UI	X	Y
7	JI	UI	X	Y
8	JI	UI	X	Y
9	OP	LL	Z	Q
10	RT	TT	Z	Q
11	ER	RR	Z	Q
12	ER	RR	Z	Q
13	ER	RR	Z	Q
14	ER	RR	Z	Q