Predict a vector of values with constraints?
I am aware of a variety of methods for simultaneously predicting multiple outcomes, sometimes known as multivariate regression/analysis. However, my situation is a little more special: I am trying to predict a vector of values where each element ranges from 0 to 1 and the elements must sum to 1. A typical example would be population fractions of mutually exclusive groups.
The simplest approach I can think of is to use OLS and then rescale the predictions so that they do not violate this structure. Here's an example using R and US state population data.
> #packages
> suppressPackageStartupMessages(library(tidyverse))
> suppressPackageStartupMessages(library(poliscidata))
> suppressPackageStartupMessages(library(rms))
> suppressPackageStartupMessages(library(magrittr))
> d = states
> #recode
> d$black = d$blkpct10 / 100
> d$hispanic = d$hispanic10 / 100
> d$white = 1 - d$black - d$hispanic
> #test sums
> rowSums(d %>% select(black, hispanic, white))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
46 47 48 49 50
1 1 1 1 1
> #regression
> #using 4 variables
> #OLS
> (ols_black = ols(black ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model
ols(formula = black ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 4.81 R2 0.092
sigma0.0958 d.f. 4 R2 adj 0.011
d.f. 45 Pr(> chi2) 0.3068 g 0.033
Residuals
Min 1Q Median 3Q Max
-0.11768 -0.06635 -0.03056 0.05120 0.25075
Coef S.E. t Pr(>|t|)
Intercept 0.2077 0.1505 1.38 0.1744
abortion_rank12 -0.0016 0.0013 -1.28 0.2063
ba_or_more 0.0000 0.0042 0.00 0.9970
cig_tax12 -0.0209 0.0211 -0.99 0.3272
conserv_advantage -0.0014 0.0024 -0.56 0.5803
> (ols_hispanic = ols(hispanic ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model
ols(formula = hispanic ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 3.11 R2 0.060
sigma0.1010 d.f. 4 R2 adj -0.023
d.f. 45 Pr(> chi2) 0.5400 g 0.028
Residuals
Min 1Q Median 3Q Max
-0.13098 -0.05541 -0.02833 0.02041 0.34087
Coef S.E. t Pr(>|t|)
Intercept 0.1112 0.1586 0.70 0.4869
abortion_rank12 0.0011 0.0013 0.87 0.3904
ba_or_more 0.0001 0.0044 0.01 0.9899
cig_tax12 -0.0069 0.0222 -0.31 0.7558
conserv_advantage -0.0014 0.0026 -0.56 0.5764
> (ols_white = ols(white ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model
ols(formula = white ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 1.29 R2 0.025
sigma0.1344 d.f. 4 R2 adj -0.061
d.f. 45 Pr(> chi2) 0.8635 g 0.024
Residuals
Min 1Q Median 3Q Max
-0.29197 -0.11366 0.02988 0.08802 0.19163
Coef S.E. t Pr(>|t|)
Intercept 0.6811 0.2111 3.23 0.0023
abortion_rank12 0.0005 0.0018 0.26 0.7938
ba_or_more -0.0001 0.0059 -0.01 0.9903
cig_tax12 0.0278 0.0296 0.94 0.3517
conserv_advantage 0.0028 0.0034 0.82 0.4167
> #get predictions
> d$ols_black = predict(ols_black)
> d$ols_hispanic = predict(ols_hispanic)
> d$ols_white = predict(ols_white)
> #inspect predictions
> d %>% select(starts_with("ols")) %>% mutate(sum = rowSums(.))
ols_black ols_hispanic ols_white sum
1 0.08129250 0.10802760 0.8106799 1
2 0.11823321 0.08031043 0.8014564 1
3 0.14136889 0.07023172 0.7883994 1
4 0.13189772 0.07624972 0.7918526 1
5 0.10280396 0.15374533 0.7434507 1
6 0.12930893 0.11325306 0.7574380 1
7 0.06261715 0.13842026 0.7989626 1
8 0.11806119 0.12687147 0.7550673 1
9 0.11508647 0.10817106 0.7767425 1
10 0.15097533 0.08315063 0.7658740 1
11 0.06825851 0.13257364 0.7991678 1
12 0.09290831 0.11626503 0.7908267 1
13 0.12003177 0.09022122 0.7897470 1
14 0.09482377 0.12513629 0.7800399 1
15 0.14432113 0.07970917 0.7759697 1
16 0.14472842 0.08870156 0.7665700 1
17 0.14109725 0.09872039 0.7601824 1
18 0.15769332 0.06708056 0.7752261 1
19 0.09469234 0.14480432 0.7605033 1
20 0.09043738 0.14094802 0.7686146 1
21 0.10129247 0.11791403 0.7807935 1
22 0.12064080 0.10132318 0.7780360 1
23 0.11602388 0.11910634 0.7648698 1
24 0.16377700 0.09077838 0.7454446 1
25 0.12524587 0.07730750 0.7974466 1
26 0.07060243 0.10921933 0.8201782 1
27 0.12703322 0.11021081 0.7627560 1
28 0.13367696 0.07430030 0.7920227 1
29 0.15029178 0.07798233 0.7717259 1
30 0.10685246 0.12199943 0.7711481 1
31 0.07058313 0.13901865 0.7903982 1
32 0.09142246 0.12212662 0.7864509 1
33 0.10757628 0.12890212 0.7635216 1
34 0.03493501 0.13189804 0.8331670 1
35 0.13176324 0.09598102 0.7722557 1
36 0.14334473 0.06494498 0.7917103 1
37 0.10788090 0.14958898 0.7425301 1
38 0.14930675 0.08299583 0.7676974 1
39 0.09010154 0.12273524 0.7871632 1
40 0.12960586 0.09186749 0.7785267 1
41 0.12586666 0.08545994 0.7886734 1
42 0.12252536 0.09645709 0.7810176 1
43 0.12473591 0.08529285 0.7899712 1
44 0.09099896 0.08246437 0.8265367 1
45 0.16097493 0.09583502 0.7431901 1
46 0.07564659 0.14598479 0.7783686 1
47 0.05854270 0.14244206 0.7990152 1
48 0.09670796 0.09239404 0.8108980 1
49 0.10933095 0.10965351 0.7810155 1
50 0.09307565 0.09622426 0.8107001 1
> #fit accuracies
> d %>% select(starts_with("ols"), black, hispanic, white) %>% cor() %>% .[1:3, 4:6]
black hispanic white
ols_black 0.3029909 -0.17497676 -0.08992821
ols_hispanic -0.2159748 0.24547482 -0.02824165
ols_white -0.1708902 -0.04347992 0.15944403
> #multivariate OLS
> (ols_joint = ols(cbind(black, hispanic, white) ~ abortion_rank12 + ba_or_more, data = d))
Linear Regression Model
ols(formula = cbind(black, hispanic, white) ~ abortion_rank12 +
ba_or_more, data = d)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 150 LR chi2 342.38 R2 0.898
sigma0.1911 d.f. 8 R2 adj 0.892
d.f. 141 Pr(> chi2) 0.0000 g 0.314
Residuals
black hispanic white
Min -0.12063 -0.12726 -0.28038
1Q -0.06388 -0.05599 -0.10255
Median -0.03327 -0.02423 0.04949
3Q 0.05821 0.01528 0.09565
Max 0.24543 0.34242 0.18706
Coef S.E. t Pr(>|t|)
[1,] 0.1552 0.1636 0.95 0.3442
[2,] -0.0018 0.0022 -0.82 0.4133
[3,] 0.0001 0.0067 0.02 0.9868
[4,] 0.0384 0.1636 0.23 0.8146
[5,] 0.0013 0.0022 0.62 0.5392
[6,] 0.0012 0.0067 0.18 0.8549
[7,] 0.8063 0.1636 4.93 <0.0001
[8,] 0.0004 0.0022 0.21 0.8378
[9,] -0.0013 0.0067 -0.20 0.8419
> #predictions
> d$joint_ols_black = predict(ols_joint)[, 1]
> d$joint_ols_hispanic = predict(ols_joint)[, 2]
> d$joint_ols_white = predict(ols_joint)[, 3]
> #inspect predictions
> d %>% select(starts_with("joint_ols")) %>% mutate(sum = rowSums(.))
joint_ols_black joint_ols_hispanic joint_ols_white sum
1 0.09555303 0.11815034 0.7862966 1
2 0.12188779 0.09234947 0.7857627 1
3 0.15017948 0.06705244 0.7827681 1
4 0.14913609 0.07664226 0.7742216 1
5 0.07086325 0.14100841 0.7881283 1
6 0.11448727 0.11617229 0.7693404 1
7 0.07865753 0.14265444 0.7786880 1
8 0.10473606 0.11402244 0.7812415 1
9 0.11151654 0.10446698 0.7840165 1
10 0.14218850 0.08435132 0.7734602 1
11 0.08335854 0.13124113 0.7854003 1
12 0.09180628 0.11898908 0.7892046 1
13 0.11851983 0.09737339 0.7841068 1
14 0.09420884 0.12441664 0.7813745 1
15 0.14521110 0.07551151 0.7792774 1
16 0.13883168 0.08949833 0.7716700 1
17 0.12714583 0.08709083 0.7857633 1
18 0.15582745 0.06610206 0.7780705 1
19 0.08789627 0.13914201 0.7729617 1
20 0.08224830 0.14009239 0.7776593 1
21 0.10274571 0.11314933 0.7841050 1
22 0.12575708 0.09286476 0.7813782 1
23 0.10862763 0.11478390 0.7765885 1
24 0.14372208 0.08017760 0.7761003 1
25 0.13056949 0.08268238 0.7867481 1
26 0.08490326 0.12719051 0.7879062 1
27 0.10986041 0.10728667 0.7828529 1
28 0.13662966 0.08628645 0.7770839 1
29 0.14754681 0.08020051 0.7722527 1
30 0.10152407 0.12076966 0.7777063 1
31 0.07674517 0.14264299 0.7806118 1
32 0.09003875 0.12057784 0.7893834 1
33 0.08785901 0.11761214 0.7945288 1
34 0.07293158 0.14274317 0.7843252 1
35 0.12928101 0.08956415 0.7811548 1
36 0.15418246 0.06904484 0.7767727 1
37 0.07973434 0.13343390 0.7868318 1
38 0.15280485 0.07494186 0.7722533 1
39 0.10672641 0.11489554 0.7783781 1
40 0.12393384 0.09383805 0.7822281 1
41 0.13476186 0.08676737 0.7784708 1
42 0.11662975 0.09760812 0.7857621 1
43 0.13301661 0.08860231 0.7783811 1
44 0.11545267 0.10572082 0.7788265 1
45 0.14112283 0.09369495 0.7651822 1
46 0.07479938 0.14226225 0.7829384 1
47 0.06919598 0.14370501 0.7870990 1
48 0.12051018 0.09824650 0.7812433 1
49 0.09809657 0.10401752 0.7978859 1
50 0.09703090 0.11336115 0.7896079 1
> #accuracies
> d %>% select(starts_with("joint_ols"), black, hispanic, white) %>% cor() %>% .[1:3, 4:6]
black hispanic white
joint_ols_black 0.2679728 -0.22508994 -0.02574528
joint_ols_hispanic -0.2605606 0.23149306 0.01537504
joint_ols_white -0.1416356 0.07306988 0.04870974
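(For reference, the rescaling I mention at the start would just be clipping to [0, 1] and renormalizing each row; a minimal sketch using the prediction columns created above, with a helper name of my own:)
#clip each prediction to [0, 1], then renormalize each row so it sums to 1
rescale_preds = function(pred) {
  clipped = pmin(pmax(pred, 0), 1)
  clipped / rowSums(clipped)
}
#e.g. rescale_preds(as.matrix(d[c("ols_black", "ols_hispanic", "ols_white")]))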
Thus, as far as I understand, OLS does not actually know that the values must lie between 0 and 1 and sum to 1, yet the predictions happen to respect this anyway, both when doing the regressions one by one and in the joint OLS (i.e. multivariate regression).
My questions are:
- How does one generally model outcomes that have a specified range, such as 0-1? This is somewhat different from modeling binary data: there the predictions must also lie within 0-1 (after transforming from logits), but here the training data themselves are continuous proportions rather than binary.
- In general, how does one enforce outcome constraints such as the sum-to-1 constraint in the above case?
- How does the OLS in the above code manage not to output values that violate these constraints?
I imagine the above issues can be handled with explicitly Bayesian approaches in which the priors reflect the outcome ranges, though I'm not sure how one would handle the sum constraint. Is there a frequentist approach as well?
ETA
Stephan Kolassa points to this prior question, which also concerns proportional data, also called compositional data. However, it differs from the present question in two ways: that question concerns time series data, whereas this one does not, and I'm explicitly interested in modeling the proportional data directly, whereas the approach in the answer there is to transform the data. I am essentially looking for links to R packages (or maybe Python) that enable modeling proportional data and making predictions on the same scale.
multivariate-analysis least-squares multivariate-regression constrained-regression compositional-data
Comments:
– Stephan Kolassa (yesterday): Possible duplicate of Modelling Time Series of percentages.
– Stephan Kolassa (yesterday): The paper I cite in my answer to the proposed duplicate is concerned with time series forecasting, but the transformation it uses applies to compositional data more generally.
– Deleet (yesterday): Thanks, Stephan. That's not entirely what I was looking for, but good to know it's called compositional data. I'll be googling around for that term.
– Stephan Kolassa (yesterday): I do not think you will be successful in finding a way to model the original proportional data while having the predictions respect the compositional constraints. Could you explain why you would prefer to model the original proportions? Modeling transformed data is very common, e.g. in logistic regression. Yes, interpretation becomes more difficult, but it's better to have a model that is hard to interpret than no model at all...
– Stephan Kolassa (yesterday): Well, that paper is explicitly about forecasting, and of course on the original scale. You need to back-transform the predictions from the transformed scale. (You may need some bias correction in the back-transformation, similar to what you need with Box-Cox transformations, and I don't recall whether that paper goes into that.)
2 Answers
In response to your questions about the OLS results: in general, OLS is not appropriate for truncated data, because it does not adjust the coefficient estimates to account for the effect of truncation, resulting in biased coefficients. It is similar to estimating a logistic regression using OLS. But just as with logistic regression, a linear model will sometimes fit just as well, or almost as well, as a non-linear model. (That's why the linear probability model, which models continuous probability values between 0 and 1 using OLS, is sometimes used by practitioners instead of logistic regression.) The fact that you obtained "appropriate" predicted values is simply due to your data allowing a good linear approximation of the data-generating process.
You'll probably find good examples of modeling compositional data elsewhere, but for reference, since you asked for general information and are using R, the censReg package allows estimation of a censored regression model in which the lower and upper limits of the dependent variable can be any numbers. This is a generalization of the standard Tobit model, whose dependent variable is left-censored at zero:
https://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf
The model will be estimated using Maximum Likelihood, as is usually done for censored/truncated/limited dependent variable models, not using OLS.
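For instance, a minimal sketch for one of the outcomes in your example might look like this (using your d and predictors; I have not run it on your data, so treat it as an illustration of the interface rather than a tested result):
#sketch only: censored regression with a lower bound of 0 and an upper bound of 1
library(censReg)
cens_black = censReg(black ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage,
                     left = 0, right = 1, data = d)
summary(cens_black)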
Comments:
– Deleet (yesterday): But the values are not approximate in the above case, which is what's so odd. They sum to exactly 1 (to 7 digits), and I don't see why that would be the case. In fact, when I wrote the example code, I wanted it to illustrate the problem with the linear probability model approach, as you also allude to.
– AlexK (yesterday): Also, if you decide to use separate models, I recommend you check out Seemingly Unrelated Regression. It is appropriate (and not equivalent to OLS) when your separate regressions have correlated error terms. With this method, the equations are treated as independent from one another, except that the errors are modeled as jointly normally distributed, and all equations are estimated simultaneously.
– nanoman (yesterday): Note: "SUR is in fact equivalent to OLS ... when each equation contains exactly the same set of regressors on the right-hand side."
– AlexK (yesterday): @nanoman, sure, with one small caveat, as described in the comment to this answer.
First, your OLS models automatically respect the sum-to-1 constraint (up to roundoff error) because both the models and the constraint are linear in the observations. That is, the sum of the predictions for each input is equivalent to the prediction of an OLS model fit to the sum of observations (a constant of 1). What is not guaranteed is that the OLS predictions will be between 0 and 1, although they happen to be for your data.
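To spell out that argument (my own addition, not in the original wording): with design matrix $X$ containing an intercept column and hat matrix $H = X(X^\top X)^{-1}X^\top$, each vector of fitted values is $\hat y_k = H y_k$, so
$$\sum_k \hat y_k = H \sum_k y_k = H\,\mathbf{1} = \mathbf{1},$$
because the all-ones vector lies in the column space of $X$. Equivalently, the summed coefficient vector $\sum_k \hat\beta_k$ is the OLS fit to the constant outcome 1, i.e. intercept 1 and all slopes 0, so the predictions also sum to 1 for any new input.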
The multinomial logit model provides one natural way to transform between unconstrained linear-regression predictions and "probabilities" (positive fractions that sum to 1). This is the transformation $y_{it} \mapsto \log(y_{it}/y_{0t})$ in the linked answer (where here you would take $t$ as a sample index, not time).
To make a proper fit systematically, you would need to decide on your error model, which determines the likelihood function. For example, do you imagine that the model can precisely predict probabilities from which the individuals are sampled, and the only error in the population fractions comes from the multinomial distribution based on the population size? If so, you could directly use the likelihood from the multinomial logit model, casting your observations as counts of individuals. But this is implausible for the US state example, because factors not in the model have a much greater effect than the tiny multinomial fluctuations in a state population of millions.
More likely, you just want a reasonable error model with a tractable likelihood, which could be a Gaussian error with unknown variance on the linear predictor functions (log-odds-ratios) of the multinomial logit -- i.e., OLS applied to the transformed observations.
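A rough sketch of that last option on your data, with white as the reference category (my own illustration, using plain lm in place of a full multinomial likelihood):
#OLS on additive log-ratios, then a softmax-style back-transform
d$alr_black    = log(d$black / d$white)
d$alr_hispanic = log(d$hispanic / d$white)
m_black    = lm(alr_black ~ abortion_rank12 + ba_or_more, data = d)
m_hispanic = lm(alr_hispanic ~ abortion_rank12 + ba_or_more, data = d)
#exponentiate the predicted log-ratios and renormalize; white contributes exp(0) = 1
num   = cbind(black = exp(predict(m_black)), hispanic = exp(predict(m_hispanic)), white = 1)
preds = num / rowSums(num)   #each row is strictly positive and sums to 1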
Comments on the question:
– Stephan Kolassa: Possible duplicate of Modelling Time Series of percentages.
– Stephan Kolassa: The paper I cite in my answer to the proposed duplicate is concerned with time series forecasting, but the transformation it uses applies to compositional data more generally.
– Deleet: Thanks, Stephan. That's not entirely what I was looking for, but good to know it's called compositional data. I'll be googling around for that term.
– Stephan Kolassa: I do not think you will succeed in finding a way to model the original proportions directly while having the predictions respect the compositional constraints. Could you explain why you would prefer to model the original proportions? Modelling transformed data is very common, e.g., in logistic regression. Yes, interpretation becomes more difficult, but it's better to have a model that is hard to interpret than no model at all...
– Stephan Kolassa: Well, that paper is explicitly about forecasting, and of course on the original scale. You need to back-transform the predictions from the transformed scale. (You may need some bias correction in the back-transformation, similar to what is needed for Box-Cox transformations; I don't recall whether that paper goes into it.)
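A self-contained sketch of that back-transformation, assuming predictions were made on an additive log-ratio scale with the last category as the reference (the numbers are made up); note that it omits any bias correction:

    # map predicted log-ratios (reference category appended last) back to the simplex
    inv_alr <- function(z) {
      e <- cbind(exp(z), 1)   # add the reference category
      e / rowSums(e)          # each row is positive and sums to 1
    }

    # hypothetical predicted log-ratios for two new observations
    z_new <- rbind(c(0.3, -1.2),
                   c(-0.5, 0.4))
    inv_alr(z_new)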