Yi's Knowledge Base

Naive Bayes

Wed, 03 Feb 2021 15:55:00 -0400

We talk about one of the simplest classification methods, the naive Bayes classifier, and its applications in text classification. It’s not really machine learning, as we only need a single pass through the data to compute the necessary values.
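
To illustrate the "single pass" point, here is a minimal sketch of a multinomial naive Bayes text classifier. The function names, the toy documents and the Laplace smoothing parameter `alpha` are assumptions made for illustration; they are not taken from the post.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(docs, labels):
    """Collect class counts and per-class word counts in one pass over the data."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text, class_counts, word_counts, vocab, alpha=1.0):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only
docs = ["free prize now", "meeting at noon", "free money prize", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]
model = train_naive_bayes(docs, labels)
print(predict("free prize", *model))
```

Training only accumulates counts, so no iterative optimization is involved; all the arithmetic happens at prediction time.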

Introduction to Bayesian Statistics

Wed, 13 Jan 2021 10:20:00 -0400

First lecture of the course, and a brief history of Bayesian statistics.

Introduction

Fri, 28 Aug 2020 19:05:11 -0400

We introduce some basic ideas of time series analysis and stochastic processes. Of particular importance are the concepts of stationarity and the autocovariance and sample autocovariance functions.

Matrices

Wed, 26 Aug 2020 15:14:34 -0400

Matrix algebra plays an important role in many areas of statistics, such as linear statistical models and multivariate analysis. In this chapter we introduce basic terminology and some basic matrix operations. We also introduce some basic types of matrices.

Estimation

Mon, 30 Sep 2019 13:46:57 -0400

In this chapter we introduce the concept of linear models. We use the ordinary least squares estimator to get unbiased estimates of the unknown parameters. $R^2$ is introduced as a measure of the goodness of fit, and the different types of sum of squares in a linear model are briefly discussed.

Basic Concepts

Wed, 25 Sep 2019 11:05:06 -0500

Introducing the concept of the probability of an event. Also covers set operations and the sample-point method.

Basic Concepts

Fri, 25 Jan 2019 22:50:34 -0500

A brief introduction to what we’re going to discuss in later chapters.

Conditional Probability

Thu, 26 Sep 2019 11:51:56 -0500

Introducing conditional probability and independence of events. Bayes’ rule comes in as well.
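
As a small worked example of Bayes' rule, the sketch below computes the probability of an event A given a positive signal B. The function name and the numbers (a 1% prior, a 95% hit rate and a 5% false positive rate) are hypothetical values chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) by Bayes' rule, with P(B) from the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: P(A) = 0.01, P(B | A) = 0.95, P(B | not A) = 0.05
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 4))
```

Even with a fairly accurate signal, the posterior stays well below one half here because the prior is small.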

Fundamentals of Nonparametric Methods

Fri, 25 Jan 2019 22:50:34 -0500

Some basic tools such as the permutation test and the binomial test. We also introduce order statistics and ranks, which will come in handy in later chapters.
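
The permutation test can be sketched as follows for two small samples, using the absolute difference in means as the test statistic. This is an illustrative sketch under the assumption of a two-sided test; the function name is hypothetical, and exact enumeration is only practical for small samples.

```python
from itertools import combinations


def permutation_test(x, y):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    total = sum(pooled)
    observed = abs(sum(x) / n - sum(y) / m)
    count = extreme = 0
    # Enumerate every way of assigning n of the pooled values to the first group
    for idx in combinations(range(n + m), n):
        group_sum = sum(pooled[i] for i in idx)
        diff = abs(group_sum / n - (total - group_sum) / m)
        count += 1
        if diff >= observed - 1e-12:  # small tolerance for floating point ties
            extreme += 1
    return extreme / count


p_value = permutation_test([1, 2, 3], [4, 5, 6])
print(p_value)
```

With three observations per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 2/20.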

Text Classification with Naive Bayes and NLTK

Tue, 09 Feb 2021 15:55:00 -0400

In the last post we talked about the theoretical side of naive Bayes in text classification. Here we will implement the model in Python, both from scratch and using existing packages. The corpus we use is a 26-line poem by T.S. Eliot. In each line a dummy string “ZZZ” or “XXX” has been inserted, representing the class of the line (“ZZZ” for class 0 and “XXX” for class 1).

    corpus = [
        "And indeed there will be time ZZZ",
        "For the yellow smoke that slides along the street XXX",
        "Rubbing its back upon the window-panes ZZZ",
        "There will be time, there will be time ZZZ",
        "To prepare a face to meet the faces that you meet XXX",
        "There will be time to murder and create ZZZ",
        "And time for all the works and days of hands ZZZ",
        "That lift and drop a question on your plate ZZZ",
        "Time for you and time for me ZZZ",
        "And time yet for a hundred indecisions XXX",
        "And for a hundred visions and revisions XXX",
        "Before the taking of a toast and tea ZZZ.

Frequentist Inference

Mon, 18 Jan 2021 10:20:00 -0400

A simple problem in the binomial setting solved under the frequentist view of statistics.

Linear Dependence and Independence

Mon, 31 Aug 2020 12:33:24 -0400

A short piece on linearly dependent and independent sets of vectors.

Autoregressive Series

Fri, 28 Aug 2020 20:31:05 -0400

We talk about autoregressive models of different orders, and introduce their mean, variance, ACF and PACF values. Their stationarity is also briefly discussed.

Definitions for Discrete Random Variables

Sun, 06 Oct 2019 10:46:18 -0400

The probability mass function, cumulative distribution function, expectation and variance for random variables.

Location Inference for Single Samples

Tue, 26 Mar 2019 21:12:45 -0500

The Wilcoxon signed rank test explained.

Moving Average Model

Fri, 04 Sep 2020 20:31:13 -0400

The mean, variance, ACF and PACF of moving average models. Instead of stationarity, a new property called invertibility is introduced.

Common Discrete Random Variables

Sun, 06 Oct 2019 10:46:18 -0400

We introduce the binomial (Bernoulli), geometric and Poisson probability distributions and their properties. The properties include their expectations, variances and moment generating functions.

Other Single Sample Inferences

Fri, 26 Apr 2019 23:45:36 -0500

Explore whether the sample is consistent with a specified distribution at the population level. Kolmogorov’s test, the Lilliefors test and the Shapiro-Wilk test are introduced, as well as tests for runs or trends.

ARMA Model

Sat, 12 Sep 2020 20:31:18 -0400

The mean, variance, ACF and PACF of ARMA models. The backshift operator is introduced, and the stationarity and invertibility of the general ARMA(p, q) model are discussed.

Gradient Descent and Linear Regression

Thu, 11 Feb 2021 15:55:00 -0400

We implement linear regression using gradient descent, a general optimization technique which in this case can find the global minimum.
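
A minimal sketch of the idea, assuming a single feature, a mean squared error loss and a fixed learning rate. The function name, the learning rate `lr`, the iteration count and the toy data are illustrative assumptions, not the post's implementation.

```python
def fit_line(xs, ys, lr=0.05, n_iter=5000):
    """Fit y = b0 + b1 * x by batch gradient descent on mean squared error."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(residual^2) with respect to b0 and b1
        grad_b0 = 2 * sum(residuals) / n
        grad_b1 = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Toy data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

Because the squared-error loss is convex in the coefficients, a small enough learning rate leads to the global minimum; too large a rate would make the updates diverge.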

Bayesian Inference for the Binomial Model

Mon, 25 Jan 2021 10:20:00 -0500

The general procedure for Bayesian analysis. We use two different prior models and compare the resulting posteriors (visually and mathematically).
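
As an illustration of comparing posteriors under two priors, the sketch below assumes conjugate Beta priors, for which the posterior is again a Beta distribution. The specific priors (Beta(1, 1) and Beta(10, 10)) and the data (7 successes in 10 trials) are hypothetical and not necessarily those used in the lecture.

```python
def beta_binomial_posterior(a, b, successes, trials):
    """Posterior Beta parameters for a Beta(a, b) prior and binomial data."""
    return a + successes, b + trials - successes


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: 7 successes in 10 trials
flat = beta_binomial_posterior(1, 1, successes=7, trials=10)
informative = beta_binomial_posterior(10, 10, successes=7, trials=10)
print(flat, beta_mean(*flat))
print(informative, beta_mean(*informative))
```

The more concentrated prior pulls the posterior mean further toward 0.5, which is the kind of difference a visual comparison of the two posteriors would show.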

Model Fitting and Forecasting

Mon, 14 Sep 2020 12:06:41 -0400

This model-building strategy consists of three steps: model specification (identification), model fitting, and model diagnostics.

Vector Space

Mon, 31 Aug 2020 13:13:34 -0400

We introduce some basic terminology - vector space, subspace, span, basis, and dimension. These concepts lay the foundation for future discussions on matrices and matrix properties.

Definitions for Continuous Random Variables

Wed, 25 Sep 2019 10:46:18 -0400

The probability density function, cumulative distribution function, expectation and variance for a continuous random variable.

Methods for Paired Samples

Mon, 29 Apr 2019 14:22:47 -0400

An obvious extension of the one-sample procedures.

Common Continuous Random Variables

Fri, 01 Nov 2019 10:46:18 -0400

The uniform distribution, normal distribution, exponential distribution and their properties.

Two Independent Samples

Thu, 02 May 2019 12:09:42 -0400

With two independent samples, we may ask about the centrality of the population distribution and see if there’s a shift. Wilcoxon-Mann-Whitney is here!

Basic Tests for Three or More Samples

Sat, 04 May 2019 12:09:42 -0400

Nonparametric analogues of the one-way classification ANOVA and the simplest two-way classifications, namely the Kruskal-Wallis test, the Jonckheere-Terpstra test, and the Friedman test.

Logistic Regression

Fri, 12 Feb 2021 15:55:00 -0400

In linear regression, the function learned is used to estimate the value of the target $y$ using values of input $x$. While it could be used for classification purposes by setting the target value to a distinct constant for each class, it’s a poor choice for this task. The target attribute takes on a finite number of values, yet the linear model produces a continuous range. For classification tasks, logistic regression is a better choice.
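
To make the contrast concrete, here is a minimal sketch of how logistic regression turns a linear score into a class probability. The weight `w`, the bias `b` and the 0.5 decision threshold are assumed values for illustration; fitting them from data is not shown.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Estimated P(y = 1 | x) for a single feature x."""
    return sigmoid(w * x + b)


# Hypothetical coefficients, for illustration only
probs = [predict_proba(x, w=2.0, b=-1.0) for x in (-3.0, 0.5, 4.0)]
labels = [int(p >= 0.5) for p in probs]
print(probs, labels)
```

Unlike the raw linear output, the predicted values stay between 0 and 1, so they can be read as class probabilities and thresholded.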

Bayesian Inference for the Poisson Model

Mon, 08 Feb 2021 10:20:00 -0500

This lecture discusses Bayesian inference for the Poisson model, including conjugate prior specification, a different way to specify a “non-informative” prior, and relevant posterior summaries.

Multivariate Probability Distributions

Wed, 06 Nov 2019 09:57:16 -0400

Joint probability distributions of two or more random variables defined on the same sample space. Also covers independence, conditional expectation and total expectation.

Mean Trend

Wed, 30 Sep 2020 11:30:42 -0400

We introduce detrending and differencing, two methods that aim to remove the mean trends in time series.
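
Differencing is simple enough to sketch in a few lines. The helper below is a hypothetical illustration (the name and the `lag` parameter are assumptions); detrending by regression is not shown.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: x[t] - x[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


linear_trend = [1, 3, 5, 7, 9]
print(difference(linear_trend))              # → [2, 2, 2, 2]
print(difference(difference([1, 4, 9, 16, 25])))  # → [2, 2, 2]
```

One round of differencing removes a linear trend, and a second round removes a quadratic one, at the cost of one observation per round.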

Definitions in Arbitrary Linear Space

Mon, 14 Sep 2020 20:28:58 -0400

This chapter provides an introduction to some fundamental geometrical ideas and results. We start by giving definitions for norm, distance, angle, inner product and orthogonality. The Cauchy-Schwarz inequality comes in useful in many settings.

Correlation and Concordance

Sun, 05 May 2019 10:46:18 -0400

Measures for the strength of relationships between variables (two or more). The Spearman rank correlation coefficient, Kendall’s tau and Kendall’s W are introduced.

ARIMA Models

Mon, 05 Oct 2020 11:31:11 -0400

Combining differencing with ARMA models gives ARIMA. The procedures of estimation, diagnosis and forecasting are very similar to those for ARMA models.

Projection

Mon, 21 Sep 2020 20:38:19 -0400

Geometrically speaking, what is the projection of a vector onto another vector, and the projection of a vector onto a subspace?

Categorical Data

Mon, 06 May 2019 10:46:18 -0400

Dealing with contingency tables. Fisher’s exact test comes back, together with the chi-squared test and the likelihood-ratio test. We also talk about testing goodness-of-fit.

Unit Root Test

Wed, 07 Oct 2020 16:42:23 -0400

A test that helps us determine whether differencing is needed or not. We also talk about over-differencing (don’t do it!) and model selection (AIC/BIC and MAPE).

Orthogonalization

Mon, 28 Sep 2020 21:55:43 -0400

Introducing the Gram-Schmidt process, a method for constructing an orthogonal basis given a non-orthogonal basis.
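
A minimal sketch of the process in plain Python, assuming the input vectors are linearly independent (otherwise a zero vector appears and the division fails). The function names are hypothetical, and the output is orthogonal but not normalized.

```python
def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gram_schmidt(vectors):
    """Turn a basis into an orthogonal basis spanning the same space."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            # Subtract the projection of the current remainder onto u
            coeff = dot(w, u) / dot(u, u)
            w = [wi - coeff * ui for wi, ui in zip(w, u)]
        basis.append(w)
    return basis


orthogonal = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
for vec in orthogonal:
    print(vec)
```

Dividing each output vector by its norm would give an orthonormal basis.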

Variability of Nonstationary Time Series

Fri, 09 Oct 2020 11:30:59 -0400

Using the Box-Cox power transformation to stabilize the variance. At the end of this section, the standard procedure for fitting an ARIMA model is discussed.

Monte Carlo Sampling

Mon, 15 Feb 2021 10:20:00 -0500

This lecture discusses Monte Carlo approximations of the posterior distribution and summaries from it. While this might not seem entirely useful now, this underlies some of the key computational methods used for Bayesian inference that we will discuss further.
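
The idea can be sketched with a posterior that is easy to sample from directly. The example assumes a Beta(8, 4) posterior (a hypothetical choice) and approximates its mean, a 95% interval and a tail probability from random draws; the function name, number of draws and seed are illustrative.

```python
import random


def posterior_summaries(a, b, n_draws=100_000, seed=1):
    """Monte Carlo approximations of summaries of a Beta(a, b) posterior."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    mean = sum(draws) / n_draws
    # Empirical 2.5% and 97.5% quantiles give an approximate 95% interval
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws)]
    prob_gt_half = sum(d > 0.5 for d in draws) / n_draws
    return mean, (lower, upper), prob_gt_half


mean, interval, prob = posterior_summaries(8, 4)
print(mean, interval, prob)
```

The approximate mean can be checked against the exact value 8 / 12; the same recipe applies unchanged to summaries that have no closed form.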

Seasonal Time Series

Tue, 13 Oct 2020 23:36:46 -0400

We introduce seasonal differencing, seasonal ARMA models, and combine them to get SARIMA models.

Linear Space of Matrices

Wed, 30 Sep 2020 13:23:18 -0400

The column space, row space and rank of a matrix and their properties.

Functions of Random Variables

Sun, 08 Dec 2019 09:57:16 -0400

Finding the distribution of a real-valued function of multiple random variables. We cover the method of distribution functions, the method of transformations and the method of moment generating functions.

Bootstrap

Mon, 06 May 2019 10:46:18 -0400

The procedure and applications of the nonparametric bootstrap.
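
As a minimal sketch of the procedure, the code below resamples the data with replacement and uses the spread of the recomputed statistic as a standard error estimate. The function name, the number of replicates, the seed and the toy data are assumptions for illustration.

```python
import random
import statistics


def bootstrap_se(data, stat=statistics.mean, n_boot=5000, seed=0):
    """Bootstrap estimate of the standard error of a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate recomputes the statistic on a resample drawn with replacement
    replicates = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


# Hypothetical sample, for illustration only
data = [2, 4, 4, 5, 7, 9, 10, 12]
se = bootstrap_se(data)
print(se)
```

Passing a different function as `stat`, for example `statistics.median`, gives a standard error for statistics without a simple formula.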

Density Estimation

Mon, 06 May 2019 10:46:18 -0400

Wanna know more about histograms and density plots?

Modern Nonparametric Regression

Wed, 08 May 2019 10:46:18 -0400

LOWESS, penalized least squares and the cubic spline.

Bayesian Inference for the Normal Model

Mon, 22 Feb 2021 10:20:00 -0500

The normal distribution has two parameters, but we focus on the one-parameter setting in this lecture. We also introduce the posterior predictive check as a way to assess model fit, and briefly discuss the issue with improper prior distributions.

Decomposition and Smoothing Methods

Mon, 02 Nov 2020 11:12:20 -0500

Decomposition procedures to extract trend, seasonal and other components from a time series. Smoothing techniques like moving average and Lowess are often used, and exponential smoothing (Holt-Winters) is another powerful tool.

Matrix Trace

Wed, 07 Oct 2020 13:07:16 -0400

Such a simple concept with so many properties and applications!

Sampling Distribution and Limit Theorems

Sat, 28 Dec 2019 09:57:16 -0400

We observe a random sample from a probability distribution of interest and want to estimate its properties. The CLT also comes into play.

The Normal Model in a Two Parameter Setting

Mon, 15 Mar 2021 10:20:00 -0500

This lecture discusses Bayesian inference of the normal model, particularly the case where we are interested in joint posterior inference of the mean and variance simultaneously. We discuss approaches to prior specification, and introduce the Gibbs sampler as a way to generate posterior samples if full conditional distributions of the parameters are available in closed-form.

Spectral Analysis

Tue, 17 Nov 2020 15:12:15 -0500

We talk about a method that helps us find the periodicity of a time series – the spectral density.

Matrix Inverse

Mon, 12 Oct 2020 12:42:03 -0400

…for a nonsingular matrix. We talk about left and right inverses, the matrix inverse and orthogonal matrices.

Brief Review Before STAT 6520

Wed, 08 Jan 2020 09:57:16 -0400

A brief review of probability theory and statistics we’ve learnt so far.

Metropolis-Hastings Algorithms

Mon, 22 Mar 2021 10:20:00 -0500

This lecture discusses the Metropolis and Metropolis-Hastings algorithms, two more tools for sampling from the posterior distribution when we do not have it in closed form. These are used when we are unable to obtain full conditional distributions. MCMC for the win!
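
A minimal random-walk Metropolis sketch, assuming a symmetric normal proposal and a target known only up to a constant through its log density. The function name, step size, seed and the standard normal target are illustrative assumptions rather than the lecture's example.

```python
import math
import random


def metropolis(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, samples = start, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples


# Unnormalized log density of a standard normal
samples = metropolis(lambda t: -0.5 * t * t, start=0.0, n_samples=20_000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

Only ratios of the target density are needed, which is why the normalizing constant of the posterior never has to be computed. With an asymmetric proposal, the Metropolis-Hastings version would add the proposal density ratio to `log_ratio`.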

Conditional Heteroscedastic Models

Mon, 23 Nov 2020 18:22:39 -0500

Introducing volatility to our time series models. The properties and building procedures of ARCH and GARCH models are discussed.

Generalized Inverse

Wed, 21 Oct 2020 12:45:56 -0400

The generalized matrix inverse that applies to any $m \times n$ matrix.

Bias and Variance

Sat, 25 Jan 2020 10:46:18 -0400

The bias, variance and mean squared error of an estimator. The efficiency is used to compare two estimators.

Consistency

Mon, 27 Jan 2020 10:46:18 -0400

Introducing consistency, a concept about the convergence of estimators. We move from the convergence of non-random number sequences to convergence in probability, and then to the consistency of estimators and its properties.

The Method of Moments

Tue, 28 Jan 2020 10:46:18 -0400

A fairly simple method of constructing estimators that’s not often used now.

Hierarchical Models

Mon, 29 Mar 2021 10:20:00 -0500

This model is useful for accommodating data that are grouped or have multiple levels, as its main feature is the addition of a between-group layer which relates groups to each other. The presence of this layer forces group-level parameters to be more similar to each other, displaying the important properties of partial pooling and shrinkage.

Projection Matrix

Fri, 23 Oct 2020 13:21:22 -0400

We introduce idempotent matrices and the projection matrix. Both are very important concepts in statistical analyses such as linear regression.

Maximum Likelihood Estimator

Wed, 29 Jan 2020 10:46:18 -0400

Under parametric family distributions, there’s a much better way of constructing estimators - the maximum likelihood estimator.

Sufficiency

Thu, 30 Jan 2020 10:46:18 -0400

Introducing sufficient statistics for the inference of parameters. The factorization theorem comes in handy!

Optimal Unbiased Estimator

Sun, 02 Feb 2020 10:46:18 -0400

Introducing the Minimum Variance Unbiased Estimator and the procedure of deriving it.

Bayesian Linear Regression

Mon, 05 Apr 2021 10:20:00 -0500

The main difference with traditional approaches is in the specification of prior distributions for the regression parameters, which relate covariates to a continuous response variable. However, the Bayesian approach also provides a fairly intuitive way to add random effects (such as a random intercept or random slope), which results in what is traditionally known as a linear mixed model.

Determinant

Mon, 26 Oct 2020 13:25:26 -0400

The determinant is a very important concept for square matrices, and its properties are key to various other notions such as block matrices and matrix inverses.

Confidence Intervals

Sat, 08 Feb 2020 09:57:16 -0400

Confidence intervals and methods of constructing them.

Penalized Linear Regression and Model Selection

Mon, 12 Apr 2021 10:20:00 -0500

This lecture covers some Bayesian connections to penalized regression methods such as ridge regression and the LASSO. Further discussion of the posterior predictive distribution, as well as a model selection criterion (DIC), is included.

Quadratic Form

Wed, 04 Nov 2020 16:45:31 -0500

This long post covers the quadratic form and the positive definiteness of matrices. The decomposition of symmetric matrices is slightly touched on, and the entire post is mainly to prepare for the next chapter – eigenvalues and eigenvectors.

Statistical Decision

Wed, 01 Apr 2020 16:55:50 -0400

Up till now we’ve made the assumption that the data is generated from a statistical model controlled by some parameter(s). We used estimation to determine a point or a range of possible values of parameters based on the sample. On the other hand, the goal of data analysis is often to help make decisions, which is not directly addressed by estimation. Drug approval example: suppose a new drug can be approved only with a $\geq 90\%$ effective rate.

Statistical Test

Wed, 01 Apr 2020 16:55:50 -0400

Here we introduce the elements of a statistical test, namely null and alternative hypotheses, test statistic, rejection region, and type I and type II errors. We then proceed to large-sample Z-tests and some small-sample tests derived from the small sample CIs.

p-values

Thu, 09 Apr 2020 18:26:34 -0400

Introducing the definition of p-values, and why they are important in statistical tests.

Optimal Tests

Tue, 14 Apr 2020 18:26:34 -0400

Briefly introducing the optimality of a statistical test and showing why it’s a difficult problem to solve.

Likelihood Ratio Test

Sat, 18 Apr 2020 22:15:07 -0400

In the previous section, we considered the situations where we (1) test $H_0$: $\theta = \theta_0$ vs. $H_a$: $\theta = \theta_a$ using the rejection rule $\frac{L(\theta_0)}{L(\theta_a)} < k_\alpha$, and (2) test $H_0$: $\theta = \theta_0$ vs. $H_a$: $\theta \in \Theta_a$ (typically one-sided) using the rejection rule $\frac{L(\theta_0)}{L(\theta_a)} < k_\alpha$, if the rule does not depend on the choice of $\theta_a \in \Theta_a$. Beyond these situations, there are many other cases, such as: what if $H_0: \theta \in \Theta_0$ is composite?

Bayesian Generalized Linear Models

Mon, 19 Apr 2021 10:20:00 -0500

This lecture discusses a simple logistic regression model for predicting a binary variable. GLMs are necessary when the response variable cannot be modeled appropriately by a normal distribution, and use a link function to connect parameters of the response distribution to the covariates.

Eigenvalues and Eigenvectors

Wed, 18 Nov 2020 18:45:52 -0500

Probably the most important lecture in this course – we start from the calculation of eigenvalues and eigenvectors, and move on to related topics such as the eigendecomposition, singular value decomposition, and the Moore-Penrose inverse.

Linear Models

Tue, 21 Apr 2020 13:05:20 -0400

So far we’ve finished the main material of this course - estimation and hypothesis testing. The starting point of all statistical analyses is really modeling. In other words, we assume that our data are generated by some random mechanism; specifically, we’ve been focusing on i.i.d. samples from a fixed population distribution. Although this assumption can be regarded as reasonable for many applications, in practice there are other scenarios where this doesn’t make sense, e.

A Bayesian Perspective on Missing Data Imputation

Mon, 26 Apr 2021 10:20:00 -0500

This lecture discusses some approaches to handling missing data, primarily when missingness occurs completely randomly. We discuss a procedure, MICE, which uses Gibbs sampling to create multiple “copies” of filled-in datasets.

Are we there yet? A machine learning architecture to predict organotropic metastases

Wed, 24 Nov 2021 00:00:00 +0000

Cancer metastasis into distant organs is an evolutionarily selective process. A better understanding of the driving forces endowing proliferative plasticity of tumor seeds in distant soils is required to develop and adapt better treatment systems for this lethal stage of the disease. To this end, we aimed to utilize transcript expression profiling features to predict the site-specific metastases of primary tumors and second, to identify the determinants of tissue specific progression.

Molecular identification of protein kinase C beta in Alzheimer's disease

Sun, 15 Nov 2020 00:00:00 +0000

The purpose of this study was to investigate the potential roles of protein kinase C beta (PRKCB) in the pathogenesis of Alzheimer’s disease (AD). We identified 2,254 differentially expressed genes from 19,245 background genes in AD versus control as well as in the PRKCB-low versus PRKCB-high group. Five co-expression modules were constructed by weighted gene correlation network analysis. Among them, the 1,222 genes of the turquoise module had the strongest relation to AD and to low PRKCB expression, and were enriched in apoptosis, axon guidance, gap junction, Fc gamma receptor (FcγR)-mediated phagocytosis, mitogen-activated protein kinase (MAPK) and vascular endothelial growth factor (VEGF) signaling pathways.

About

Tue, 30 Jun 2020 00:00:00 +0000

    {
      "name": "Yi",
      "job": {
        "PhD student": "bioinformatics"
      },
      "education": {
        "Master's": ["Statistics", "University of Georgia"],
        "Bachelor's": ["Biology", "China Agricultural University"]
      },
      "interests": [
        "cancer", "systems biology", "metabolic reprogramming",
        "NLP", "data visualization", "graph neural networks"
      ],
      "skills": ["R", "Python", "Linux", "Docker"]
    }

Savage’s approach to research via Mosteller (Hamada and Sitter, 2004):

Co-expression based cancer staging and application

Tue, 30 Jun 2020 00:00:00 +0000

A novel method is developed for predicting the stage of a cancer tissue based on the consistency level between the co-expression patterns in the given sample and samples in a specific stage. The basis for the prediction method is that cancer samples of the same stage share common functionalities as reflected by the co-expression patterns, which are distinct from samples in the other stages. Test results reveal that our prediction results are as good as, or potentially better than, the stages manually annotated by cancer pathologists.

Metabolic Reprogramming in Cancer: the bridge that connects intracellular stresses and cancer behaviors

Thu, 30 Apr 2020 00:00:00 +0000

We outline in this perspective a novel framework for cancer study from the angle of stress-induced metabolic reprogramming. The driving question is: what may dictate the same or highly similar evolutionary trajectory across different cancers, consisting of cell proliferation, drug resistance, migration and metastasis? We have observed that cancer and cancer-forming cells are under a persistent intracellular alkaline stress, due to chronic inflammation and local iron overload. A wide range of reprogrammed metabolisms (RMs) are induced to keep the intracellular pH within a livable range for survival.

Install Dependencies for Puppeteer on Manjaro Linux

Mon, 13 Apr 2020 21:31:56 -0400

Elucidation of Functional Roles of Sialic Acids in Cancer Migration

Tue, 31 Mar 2020 00:00:00 +0000

Sialic acids (SA), negatively charged nine-carbon sugars, have been implicated in cancer metastasis since the 1960s, but their detailed functional roles remain elusive. We present a computational analysis of transcriptomic data of cancer vs. control tissues of eight types in TCGA, aiming to elucidate the possible reason for the increased production and utilization of SAs in cancer and their possible driving roles in cancer migration. Our analyses have revealed for all cancer types:

Automatic and Interpretable Model for Periodontitis Diagnosis in Panoramic Radiographs

Sat, 14 Mar 2020 00:00:00 +0000

Periodontitis is a prevalent and irreversible chronic inflammatory disease in both developed and developing countries, and affects about 20%-50% of the global population. A tool for automatically diagnosing periodontitis is in high demand for screening at-risk people, and early detection could prevent the onset of tooth loss, especially in local communities and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models, so proposing interpretable machine learning models is crucial for disease diagnosis.

Neural Functions Play Different Roles in Triple Negative Breast Cancer (TNBC) and non-TNBC

Thu, 20 Feb 2020 00:00:00 +0000

Triple negative breast cancer (TNBC) represents the most malignant subtype of breast cancer, and yet our understanding about its unique biology remains elusive. We have conducted a comparative computational analysis of transcriptomic data of TNBC and non-TNBC (NTNBC) tissue samples from the TCGA database, focused on genes involved in neural functions. Our main discoveries are: While both subtypes involve neural functions, TNBC has substantially more up-regulated neural genes than NTNBC, suggesting that TNBC is more complex than NTNBC; Non-neural functions related to cell-microenvironment interactions and intracellular damage processing are key inducers of the neural genes in both TNBC and NTNBC, but the inducer-responder relationships are different in the two cancer subtypes; Key neural functions such as neural crest formation are predicted to enhance adaptive immunity in TNBC while glia development, along with a few other neural functions, induce both innate and adaptive immunity in NTNBC.

Metabolic Reprogramming in Cancer is Induced to Increase Proton Production

Mon, 13 Jan 2020 00:00:00 +0000

Considerable metabolic reprogramming has been observed in a conserved manner across multiple cancer types, but their true causes remain elusive. We present an analysis of around 50 such reprogrammed metabolisms (RMs) including the Warburg effect, nucleotide de novo synthesis and sialic acid biosynthesis in cancer. Analyses of the biochemical reactions conducted by these RMs, coupled with gene expression data of their catalyzing enzymes, in 7,011 tissues of 14 cancer types, revealed that all RMs produce more H+ than their original metabolisms.

Transcription regulation by DNA methylation under stressful conditions in human cancer

Thu, 23 Nov 2017 00:00:00 +0000

We aim to address one question: do cancer vs. normal tissue cells execute their transcription regulation essentially the same or differently, and why? We utilized an integrated computational study of cancer epigenomes and transcriptomes of 10 cancer types, by using penalized linear regression models to evaluate the regulatory effects of DNA methylations on gene expressions.