pdpatch – transformer

DataFrameTransformer

 DataFrameTransformer (transformer=None, input_cols=None,
                       output_cols=None, prev_step=None, append=False,
                       print_input_cols=False, print_output_cols=False,
                       print_out_df_cols=False)

Applies a transformer to a set of columns of a pandas DataFrame and returns a DataFrame.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})
X

	city	title	expert_rating	user_rating
0	London	His Last Bow	5	4
1	London	How Watson Learned the Trick	3	5
2	Paris	A Moveable Feast	4	4
3	Sallisaw	The Grapes of Wrath	5	3

OneHotEncoder expects a two-dimensional array as input, so we set input_cols to a list of columns. DataFrameTransformer uses the

enc_city = DataFrameTransformer(transformer=OneHotEncoder(dtype='int'),
                                input_cols=['city'],
                                append=True)
enc_city.fit_transform(X)

	city	title	expert_rating	user_rating	city_London	city_Paris	city_Sallisaw
0	London	His Last Bow	5	4	1	0	0
1	London	How Watson Learned the Trick	3	5	1	0	0
2	Paris	A Moveable Feast	4	4	0	1	0
3	Sallisaw	The Grapes of Wrath	5	3	0	0	1

CountVectorizer expects a one-dimensional array as input, so we set input_cols to a string, which retrieves a one-dimensional array from the input DataFrame.

enc_title = DataFrameTransformer(transformer=CountVectorizer(),
                                 input_cols='title',
                                 append=True)
enc_title.fit_transform(X)

	city	title	expert_rating	user_rating	bow	feast	grapes	his	how	last	learned	moveable	of	the	trick	watson	wrath
0	London	His Last Bow	5	4	1	0	0	1	0	1	0	0	0	0	0	0	0
1	London	How Watson Learned the Trick	3	5	0	0	0	0	1	0	1	0	0	1	1	1	0
2	Paris	A Moveable Feast	4	4	0	1	0	0	0	0	0	1	0	0	0	0	0
3	Sallisaw	The Grapes of Wrath	5	3	0	0	1	0	0	0	0	0	1	1	0	0	1

We can chain these two into one Pipeline.

pipe = Pipeline([('enc_city', enc_city), ('enc_title', enc_title)])
pipe.fit_transform(X)

	city	title	expert_rating	user_rating	city_London	city_Paris	city_Sallisaw	bow	feast	grapes	his	how	last	learned	moveable	of	the	trick	watson	wrath
0	London	His Last Bow	5	4	1	0	0	1	0	0	1	0	1	0	0	0	0	0	0	0
1	London	How Watson Learned the Trick	3	5	1	0	0	0	0	0	0	1	0	1	0	0	1	1	1	0
2	Paris	A Moveable Feast	4	4	0	1	0	0	1	0	0	0	0	0	1	0	0	0	0	0
3	Sallisaw	The Grapes of Wrath	5	3	0	0	1	0	0	1	0	0	0	0	0	1	1	0	0	1
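Chaining works because each step receives the full DataFrame produced by the previous step. The following sketch illustrates that behaviour with scikit-learn's FunctionTransformer standing in for DataFrameTransformer; the helper functions add_total and add_flag are hypothetical and exist only for this example.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

X = pd.DataFrame({'expert_rating': [5, 3, 4, 5],
                  'user_rating': [4, 5, 4, 3]})

def add_total(df):
    # Append a column while keeping all existing columns.
    return df.assign(total=df['expert_rating'] + df['user_rating'])

def add_flag(df):
    # Relies on the 'total' column appended by the previous step.
    return df.assign(high=df['total'] >= 9)

step_a = FunctionTransformer(add_total)
step_b = FunctionTransformer(add_flag)

pipe = Pipeline([('add_total', step_a), ('add_flag', step_b)])
piped = pipe.fit_transform(X)

# The pipeline result matches applying the steps one after the other.
manual = step_b.fit_transform(step_a.fit_transform(X))
print(piped.equals(manual))
```

If a step dropped the original columns instead of appending to them, a later step that selects one of those columns would fail, which is why both encoders above were created with append=True.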