import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
DataFrameTransformer
DataFrameTransformer (transformer=None, input_cols=None, output_cols=None, prev_step=None, append=False, print_input_cols=False, print_output_cols=False, print_out_df_cols=False)
Applies a transformer to a set of columns of a pandas DataFrame and returns a DataFrame.
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})
X
| | city | title | expert_rating | user_rating |
|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 |
| 1 | London | How Watson Learned the Trick | 3 | 5 |
| 2 | Paris | A Moveable Feast | 4 | 4 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 |
The OneHotEncoder expects a two-dimensional array as input, so we set input_cols to a list of columns. DataFrameTransformer uses the
enc_city = DataFrameTransformer(transformer=OneHotEncoder(dtype='int'),
                                input_cols=['city'],
                                append=True)
enc_city.fit_transform(X)
| | city | title | expert_rating | user_rating | city_London | city_Paris | city_Sallisaw |
|---|---|---|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 | 1 | 0 | 0 |
| 1 | London | How Watson Learned the Trick | 3 | 5 | 1 | 0 | 0 |
| 2 | Paris | A Moveable Feast | 4 | 4 | 0 | 1 | 0 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 | 0 | 0 | 1 |
CountVectorizer expects a one-dimensional array as input, so we set input_cols to a string that will retrieve a one-dimensional array from the input DataFrame.
enc_title = DataFrameTransformer(transformer=CountVectorizer(), input_cols='title', append=True)
enc_title.fit_transform(X)
| | city | title | expert_rating | user_rating | bow | feast | grapes | his | how | last | learned | moveable | of | the | trick | watson | wrath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | London | How Watson Learned the Trick | 3 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | Paris | A Moveable Feast | 4 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
We can chain these two into one Pipeline.
pipe = Pipeline([('enc_city', enc_city), ('enc_title', enc_title)])
pipe.fit_transform(X)
| | city | title | expert_rating | user_rating | city_London | city_Paris | city_Sallisaw | bow | feast | grapes | his | how | last | learned | moveable | of | the | trick | watson | wrath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | London | How Watson Learned the Trick | 3 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | Paris | A Moveable Feast | 4 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |