Title: | Geometric Density Estimation |
---|---|
Description: | Provides the hybrid Bayesian method Geometric Density Estimation (GEODE). The method both reduces the dimension of the data, by selecting its principal axes, and performs posterior inference. It is fully described in the paper "Scalable Geometric Density Estimation" by Y. Wang, A. Canale, D. Dunson (2016) <http://proceedings.mlr.press/v51/wang16e.pdf>. |
Authors: | Lorenzo Rimella |
Maintainer: | Lorenzo Rimella <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.0 |
Built: | 2024-11-05 03:25:40 UTC |
Source: | https://github.com/lorenzorimella/rgeode |
The decreasing function for the adaptive pruning.
p(t, c0, c1)
t | int |
c0 | double |
c1 | double |
p returns the threshold of interest:
p(t) | double |
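For intuition, the snippet below simply evaluates the threshold over a grid of iterations with the default constants used by rgeode (c0 = -1, c1 = -0.005); it assumes nothing about the functional form of p beyond the documented signature, and uses sapply in case p is not vectorised over t.

# Plot the pruning threshold over 2000 Gibbs iterations (defaults of rgeode)
t_grid <- 1:2000
thresholds <- sapply(t_grid, function(t) p(t, c0 = -1, c1 = -0.005))
plot(t_grid, thresholds, type = 'l', xlab = 'iteration t', ylab = 'p(t)')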
L. Rimella, [email protected]
[1] Y. Wang, A. Canale, D. Dunson.
"Scalable Geometric Density Estimation" (2016).
(available at https://arxiv.org/abs/1410.7692).
The implementation of rgammatr is inspired by the MATLAB
implementation of rexptrunc by Ye Wang.
t  = 10
c0 = -1
c1 = 10
p(t, c0, c1)
Compute the near-optimal low-rank singular value decomposition (SVD) of a rectangular matrix. The algorithm follows a randomized approach.
randSVD(A, k = NULL, l = NULL, its = 2, sdist = "unif")
A | array_like |
k | int, optional |
l | int, optional |
its | int, optional |
sdist | str |
Randomized SVD (randSVD) is a fast algorithm to compute the approximate low-rank SVD of
a rectangular matrix using a probabilistic approach.
Given the decided rank k, randSVD factors the input matrix A as
A = U S V', which is the typical SVD form. Precisely, the columns of
U are orthonormal, as are the columns of V, the entries of S are all nonnegative,
and the only nonzero entries of S appear in non-increasing order on its diagonal. The
dimensions are: U is m x k, V is n x k, and S is k x k, when A is m x n.
Increasing k or the oversampling parameter l improves the accuracy of the approximation USV' to A.
The parameter its specifies the number of normalized power iterations
(subspace iterations) used to reduce the approximation error. This is recommended
if the singular values decay slowly. In practice 1 or 2 iterations
achieve good results; however, computing power iterations increases the
computational time. The number of power iterations is set to 2 by default.
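As an informal check of these properties, the sketch below assumes the package is loaded and that randSVD(A, k) returns exactly k singular triplets in the components d, u, v listed under Value below; it verifies the orthonormality of the factors and the relative approximation error.

set.seed(1)
A <- matrix(rnorm(500 * 200), 500, 200)   # a 500 x 200 test matrix
out <- randSVD(A, k = 20)
# columns of u and v should be approximately orthonormal
max(abs(crossprod(out$u) - diag(20)))
max(abs(crossprod(out$v) - diag(20)))
# relative Frobenius-norm error of the rank-20 reconstruction U S V'
A20 <- out$u %*% diag(out$d) %*% t(out$v)
norm(A - A20, type = 'F') / norm(A, type = 'F')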
randSVD returns a list containing the following three components:
d | array_like |
u | matrix |
v | matrix |
The singular vectors are not unique and only defined up to sign (a constant of modulus one in the complex case). If a left singular vector has its sign changed, changing the sign of the corresponding right vector gives an equivalent decomposition.
L. Rimella, [email protected]
[1] N. Halko, P. Martinsson, and J. Tropp.
"Finding structure with randomness: probabilistic
algorithms for constructing approximate matrix
decompositions" (2009).
(available at arXiv http://arxiv.org/abs/0909.4061).
[2] S. Voronin and P. Martinsson.
"RSVDPACK: Subroutines for computing partial singular value
decompositions via randomized sampling on single core, multi core,
and GPU architectures" (2015).
(available at arXiv http://arxiv.org/abs/1502.05366).
[3] N. Benjamin Erichson.
"Randomized Singular Value Decomposition (rsvd): R package" (2016).
(available on CRAN).
[4] V. Rokhlin, A. Szlam, M. Tygert.
"A randomized algorithm for principal component analysis" (2009).
(available at https://arxiv.org/abs/0809.2274).
The implementation of randSVD is inspired by the MATLAB implementation
of RandPCA by M. Tygert.
# Simulate a general matrix with 1000 rows and 1000 columns
vy <- rnorm(1000 * 1000, 0, 1)
y  <- matrix(vy, 1000, 1000, byrow = TRUE)

# Compute the randSVD for the first hundred components of the matrix y
# and measure the time
start.time <- Sys.time()
prova1 <- randSVD(y, k = 100)
Sys.time() - start.time

# Compare with a classical SVD
start.time <- Sys.time()
prova2 <- svd(y)
Sys.time() - start.time
Simulate random numbers from a truncated Exponential distribution.
rexptr(n = 1, lambda = 1, range = NULL)
n | int, optional |
lambda | double, optional |
range | array_like, optional |
It provides a way to simulate from a truncated Exponential
distribution with a given parameter lambda and a given truncation range.
This is used during the posterior sampling in the Gibbs sampler.
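For intuition only, a minimal inverse-CDF sketch of the same idea (not the package's internal code; the name rexptr_sketch and the bounds a, b are illustrative) is:

rexptr_sketch <- function(n, lambda, a, b) {
  # CDF values at the truncation bounds; b may be Inf
  Fa <- pexp(a, rate = lambda)
  Fb <- pexp(b, rate = lambda)
  # uniform draws on (Fa, Fb) mapped back through the Exponential quantile function
  qexp(runif(n, Fa, Fb), rate = lambda)
}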
rexptr returns the simulated values of the truncated Exponential distribution:
u | double |
L. Rimella, [email protected]
[1] Y. Wang, A. Canale, D. Dunson.
"Scalable Geometric Density Estimation" (2016).
The implementation of rexptr is inspired by the MATLAB
implementation of rexptrunc by Ye Wang.
# Simulate a truncated Exponential with parameter 0.5 on the range [1, Inf).

# Set the range:
range <- c(1, Inf)

# Simulate the truncated Exponential
set.seed(123)
vars1 <- rexptr(1000, 0.5, range)

# Look at the histogram
hist(vars1, freq = FALSE, ylim = c(0, 2), xlim = c(0, 5))
lines(density(vars1))

# Compare with a non-truncated Exponential
set.seed(123)
vars2 <- rexp(1000, 0.5)

# Compare the two results
lines(density(vars2), col = 'red')

# Observation: simulating without range is equivalent to simulating from
# rexp(1000, 0.5)
Simulate random numbers from a truncated Gamma distribution.
rgammatr(n = 1, A = 0, B = 1, range = NULL)
n | int, optional |
A | double, optional |
B | double, optional |
range | array_like, optional |
It provides a way to simulate from a truncated Gamma distribution with
given parameters A and B and a given truncation range. This is used
during the posterior sampling in the Gibbs sampler.
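As with the truncated Exponential, an illustrative inverse-CDF sketch (assuming A and B play the roles of shape and rate, as the comparison with rgamma in the Examples suggests; rgammatr_sketch and the bounds a, b are hypothetical names) is:

rgammatr_sketch <- function(n, shape, rate, a, b) {
  # CDF values at the truncation bounds
  Fa <- pgamma(a, shape = shape, rate = rate)
  Fb <- pgamma(b, shape = shape, rate = rate)
  # uniform draws on (Fa, Fb) mapped back through the Gamma quantile function
  qgamma(runif(n, Fa, Fb), shape = shape, rate = rate)
}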
rgammatr returns the simulated values of the truncated Gamma distribution:
u | double |
L. Rimella, [email protected]
[1] Y. Wang, A. Canale, D. Dunson.
"Scalable Geometric Density Estimation" (2016).
The implementation of rgammatr is inspired by the MATLAB
implementation of gamrndtruncated by Ye Wang.
# Simulate a truncated Gamma with parameters 1, 2 on the range [1, 5].

# Set the range:
range <- c(1, 5)

# Simulate the truncated Gamma
set.seed(123)
vars1 <- rgammatr(1000, 1, 2, range)

# Look at the histogram
hist(vars1, freq = FALSE, ylim = c(0, 2), xlim = c(0, 5))
lines(density(vars1))

# Compare with a non-truncated Gamma
set.seed(123)
vars2 <- rgamma(1000, 1, 2)

# Compare the two results
lines(density(vars2), col = 'red')

# Observation: simulating without range is equivalent to simulating from
# rgamma(1000, 1, 2)
GEODE selects the principal directions of the data and performs inference. Moreover, it is also able to handle missing data.
rgeode(Y, d = 6, burn = 1000, its = 2000, tol = 0.01, atau = 1/20, asigma = 1/2, bsigma = 1/2, starttime = NULL, stoptime = NULL, fast = TRUE, c0 = -1, c1 = -0.005)
Y | array_like |
d | int, optional |
burn | int, optional |
its | int, optional |
tol | double, optional |
atau | double, optional |
asigma | double, optional |
bsigma | double, optional |
starttime | int, optional |
stoptime | int, optional |
fast | bool, optional |
c0 | double, optional |
c1 | double, optional |
GEOmetric Density Estimation (rgeode) is a fast algorithm performing inference on normally distributed data. It is essentially divided into two principal steps:
Selection of the principal axes of the data.
An adaptive Gibbs sampler that draws samples from the full conditional posteriors of the parameters of interest, which enables us to perform inference.
It takes several quantities as input: a rectangular n x D
matrix Y, on which we run a fast rank-d SVD; the conservative
upper bound d of the true dimension of our data; and a set of
tuning parameters. We remark that the choice of the conservative
upper bound d must be such that it is at least as large as the
real (intrinsic) dimension of the data and no larger than D.
rgeode returns a list containing the following components:
InD | array_like |
u | matrix |
tau | matrix |
sigmaS | array_like |
W | matrix |
Miss | list |
The component related to the missing data is filled only when the data contain missing values.
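A minimal sketch of how this output might be inspected follows; it assumes a numeric data matrix y with observations in rows, such as the one simulated in the Examples below, and that InD indexes the retained dimensions and sigmaS stores the posterior draws of sigma^2, as the component names and the commented plots in the Examples suggest.

fit <- rgeode(Y = y, d = 10)
fit$InD                  # (assumed) indices of the principal axes kept after pruning
length(fit$InD)          # number of dimensions judged relevant
plot(fit$sigmaS, type = 'l', xlab = 'iteration', ylab = 'sigma^2')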
L. Rimella, [email protected]
[1] Y. Wang, A. Canale, D. Dunson.
"Scalable Geometric Density Estimation" (2016).
library(MASS)
library(RGeode)

####################################################################
# WITHOUT MISSING DATA
####################################################################
# Define the dataset
D      = 200
n      = 500
d      = 10
d_true = 3

set.seed(321)
mu_true    = runif(d_true, -3, 10)
Sigma_true = matrix(0, d_true, d_true)
diag(Sigma_true) = c(runif(d_true, 10, 100))
W_true     = svd(matrix(rnorm(D * d_true, 0, 1), d_true, D))$v
sigma_true = abs(runif(1, 0, 1))

mu = W_true %*% mu_true
C  = W_true %*% Sigma_true %*% t(W_true) + sigma_true * diag(D)
y  = mvrnorm(n, mu, C)

################################
# GEODE: Without missing data
################################
start.time <- Sys.time()
GEODE = rgeode(Y = y, d)
Sys.time() - start.time

# SIGMAS
#plot(seq(110, 3000, by = 1), GEODE$sigmaS[110:3000], ty = 'l', col = 2,
#     xlab = 'Iteration', ylab = 'sigma^2', main = 'Simulation of sigma^2')
#abline(v = 800, lwd = 2, col = 'blue')
#legend('bottomright', c('Posterior of sigma^2', 'Stopping time'),
#       lwd = c(1, 2), col = c(2, 4), cex = 0.55, border = 'black', box.lwd = 3)

####################################################################
# WITH MISSING DATA
####################################################################
###########################
# Insert NaN
n_m = 5   # number of data vectors containing missing features
d_m = 1   # number of missing features

data_miss = sample(seq(1, n), n_m)

features = sample(seq(1, D), d_m)
for(i in 2:n_m)
{
  features = rbind(features, sample(seq(1, D), d_m))
}

for(i in 1:length(data_miss))
{
  if(i == length(data_miss))
  {
    y[data_miss[i], features[i,][-1]] = NaN
  }
  else
  {
    y[data_miss[i], features[i,]] = NaN
  }
}

################################
# GEODE: With missing data
################################
set.seed(321)
start.time <- Sys.time()
GEODE = rgeode(Y = y, d)
Sys.time() - start.time

# SIGMAS
#plot(seq(110, 3000, by = 1), GEODE$sigmaS[110:3000], ty = 'l', col = 2,
#     xlab = 'Iteration', ylab = 'sigma^2', main = 'Simulation of sigma^2')
#abline(v = 800, lwd = 2, col = 'blue')
#legend('bottomright', c('Posterior of sigma^2', 'Stopping time'),
#       lwd = c(1, 2), col = c(2, 4), cex = 0.55, border = 'black', box.lwd = 3)
####################################################################
####################################################################