Outliers and Other Fragmented Works

Practice builds experience. A few weeks ago, I finished several very fragmented pieces of work, which need to be summarized. First is outlier detection. Not until I joined the House Prediction Competition did I realize how important outlier handling is! At first I recoded outliers (which I identified from histograms) to a high quantile. I gradually found that this is wrong: it piles up many samples at one special value, which does no good to prediction. I posted this question and read others’ tutorial notebooks. I found that a bivariate distribution plot (scatter plot) is helpful. Points that run against the main distribution should be dropped. This really improved my LB ranking.
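To make the two approaches concrete, here is a minimal sketch on toy data. The column names (`area`, `price`), the 0.80 quantile and the drop thresholds are all made up for illustration; they are not the values used in the competition.

```python
import pandas as pd

# Toy data: column names and values are hypothetical.
df = pd.DataFrame({
    "area":  [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 400.0, 450.0],
    "price": [100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 220.0, 240.0, 150.0, 900.0],
})

# Approach 1: recode everything above a high quantile to that quantile.
# No rows are lost, but the recoded rows all share one identical value.
cap = df["area"].quantile(0.80)
capped = df.assign(area=df["area"].clip(upper=cap))
n_at_cap = int((capped["area"] == cap).sum())

# Approach 2: drop points that run against the main trend of the scatter plot
# (here: very large area but low price). Thresholds are read off the plot by eye.
is_outlier = (df["area"] > 300) & (df["price"] < 300)
dropped = df[~is_outlier]

print(n_at_cap, len(capped), len(dropped))
```

In this toy case the capped frame keeps every row but stacks several rows on the cap value, while the dropped frame loses a row and leaves the remaining values untouched.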

However, there’s no free lunch: dropping samples reduces the sample size, whereas the first method does not lose any samples. This is a trade-off. Another way to tackle outliers is IsolationForest, but I haven’t tried it.
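For reference, a rough sketch of what trying IsolationForest could look like with scikit-learn. The data is synthetic, and the `contamination` value is an assumption that would need tuning on real data; this is not something I have validated on the competition set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 ordinary points plus one obviously extreme point.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([X, [[10.0, 10.0]]])

# contamination is the assumed share of outliers (a guess here).
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)  # -1 = flagged as outlier, 1 = inlier

# Keep only the rows labelled as inliers.
X_kept = X[labels == 1]
print(len(X), len(X_kept))
```

Like the scatter plot method, this removes rows, so the same sample size trade-off applies.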

A post by Nina Zumel on the Bayesian approach is really easy to understand. It introduces a simple framework for Bayesian inference. Although the data is simple, it reminds me of

Common statistical tests are linear models

written by Jonas Kristoffer Lindeløv. The simple framework could be generalized to regression analysis.
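As a small illustration of the idea in that title, the sketch below checks that a two-sample t-test (equal variances) gives the same t statistic as the slope of a linear model with a 0/1 group dummy. The data is simulated and the variable names are my own, not taken from either post.

```python
import numpy as np

# Simulated groups (hypothetical data).
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=40)
n1, n2 = len(a), len(b)

# Classical two-sample t statistic with pooled variance.
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_classic = (b.mean() - a.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Same test as a linear model: y = intercept + slope * group.
y = np.concatenate([a, b])
group = np.concatenate([np.zeros(n1), np.ones(n2)])
X = np.column_stack([np.ones_like(group), group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard error of the slope from the residual variance.
resid = y - X @ beta
s2 = resid @ resid / (len(y) - 2)
se_slope = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
t_lm = beta[1] / se_slope

print(np.isclose(t_classic, t_lm))
```

The slope equals the difference in group means, and its standard error is built from the same pooled variance, so the two statistics coincide.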

Some papers written by big-name professors do not teach you techniques, but improve your data IQ. The first one is Statistical Modeling: The Two Cultures by Leo Breiman. Others can be found in this article’s reference list. Those papers were a great help to my thesis.

After finishing a text classification task (a semi-internship), I found it’s important to make the columns of the train set and the test set correspond 1:1, or the result will be totally messed up. I wrote this:

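import pandas as pd

def consist_train_test(test):
    """Make the test set columns correspond to the train set columns."""
    # Start from the test index so that constant columns get the right length.
    new_df = pd.DataFrame(index=test.index)
    for i in train_col:  # train_col = train.columns, defined beforehand
        if i in test.columns:
            new_df[i] = test[i]
        else:
            new_df[i] = 0  # column missing from the test set: fill with zeros
    new_df.fillna(0, inplace=True)
    new_df = new_df[list(train_col)]  # order columns as in the train set
    return new_df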

This is actually a practical trick. Some algorithms will throw an error if you don’t do this. I solved a similar problem in my Shiny app.
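The same column alignment can probably be written more compactly with pandas’ `reindex`. This is a sketch, not what I used in the task; `align_to_train` and the toy frames are hypothetical names for illustration.

```python
import pandas as pd

def align_to_train(test, train_columns):
    """Return a copy of `test` with exactly `train_columns`, in that order."""
    # Columns missing from test are created and filled with 0;
    # columns that exist only in test are dropped.
    aligned = test.reindex(columns=train_columns, fill_value=0)
    # fill_value only covers newly created columns, so fill remaining NaNs too.
    return aligned.fillna(0)

# Toy frames (made-up data).
train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
test = pd.DataFrame({"c": [7.0, None], "a": [8, 9], "d": [0, 1]})

aligned = align_to_train(test, train.columns)
print(list(aligned.columns))
```

Passing the train columns as an argument, instead of reading a global, makes the helper easier to reuse across datasets.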