Predict a vector of values with constraints?



I am aware of a variety of methods for simultaneously predicting multiple outcomes, sometimes known as multivariate regression/analysis. However, my situation is a bit more specific: I am trying to predict a vector of values where each component lies between 0 and 1 and the components must sum to 1. A typical example would be the population fractions of mutually exclusive groups.



The simplest approach I can think of is to use OLS and then rescale the predictions so that they do not violate the data structure.
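A minimal sketch of what that rescaling step could look like (purely illustrative; `preds` is a hypothetical n x k matrix holding the per-group fitted values, which the transcript below does not actually construct):

# Hypothetical sketch: clip raw predictions to [0, 1], then renormalize
# each row so the groups sum to 1. Rows that clip to all zeros would need
# separate handling.
rescale_preds = function(preds) {
  preds = pmax(pmin(preds, 1), 0)  # enforce the 0-1 range
  preds / rowSums(preds)           # enforce the sum-to-1 constraint
}

Here's an example using R and US state population data; note that it fits the raw OLS models and does not actually apply this rescaling step.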



> #packages
> suppressPackageStartupMessages(library(tidyverse))

> suppressPackageStartupMessages(library(poliscidata))

> suppressPackageStartupMessages(library(rms))

> suppressPackageStartupMessages(library(magrittr))

> d = states

> #recode
> d$black = d$blkpct10 / 100

> d$hispanic = d$hispanic10 / 100

> d$white = 1 - d$black - d$hispanic

> #test sums
> rowSums(d %>% select(black, hispanic, white))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
46 47 48 49 50
1 1 1 1 1

> #regression
> #using 4 variables
> #OLS
> (ols_black = ols(black ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model

ols(formula = black ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 4.81 R2 0.092
sigma 0.0958 d.f. 4 R2 adj 0.011
d.f. 45 Pr(> chi2) 0.3068 g 0.033

Residuals

Min 1Q Median 3Q Max
-0.11768 -0.06635 -0.03056 0.05120 0.25075


Coef S.E. t Pr(>|t|)
Intercept 0.2077 0.1505 1.38 0.1744
abortion_rank12 -0.0016 0.0013 -1.28 0.2063
ba_or_more 0.0000 0.0042 0.00 0.9970
cig_tax12 -0.0209 0.0211 -0.99 0.3272
conserv_advantage -0.0014 0.0024 -0.56 0.5803


> (ols_hispanic = ols(hispanic ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model

ols(formula = hispanic ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 3.11 R2 0.060
sigma 0.1010 d.f. 4 R2 adj -0.023
d.f. 45 Pr(> chi2) 0.5400 g 0.028

Residuals

Min 1Q Median 3Q Max
-0.13098 -0.05541 -0.02833 0.02041 0.34087


Coef S.E. t Pr(>|t|)
Intercept 0.1112 0.1586 0.70 0.4869
abortion_rank12 0.0011 0.0013 0.87 0.3904
ba_or_more 0.0001 0.0044 0.01 0.9899
cig_tax12 -0.0069 0.0222 -0.31 0.7558
conserv_advantage -0.0014 0.0026 -0.56 0.5764


> (ols_white = ols(white ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model

ols(formula = white ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 1.29 R2 0.025
sigma 0.1344 d.f. 4 R2 adj -0.061
d.f. 45 Pr(> chi2) 0.8635 g 0.024

Residuals

Min 1Q Median 3Q Max
-0.29197 -0.11366 0.02988 0.08802 0.19163


Coef S.E. t Pr(>|t|)
Intercept 0.6811 0.2111 3.23 0.0023
abortion_rank12 0.0005 0.0018 0.26 0.7938
ba_or_more -0.0001 0.0059 -0.01 0.9903
cig_tax12 0.0278 0.0296 0.94 0.3517
conserv_advantage 0.0028 0.0034 0.82 0.4167


> #get predictions
> d$ols_black = predict(ols_black)

> d$ols_hispanic = predict(ols_hispanic)

> d$ols_white = predict(ols_white)

> #inspect predictions
> d %>% select(starts_with("ols")) %>% mutate(sum = rowSums(.))
ols_black ols_hispanic ols_white sum
1 0.08129250 0.10802760 0.8106799 1
2 0.11823321 0.08031043 0.8014564 1
3 0.14136889 0.07023172 0.7883994 1
4 0.13189772 0.07624972 0.7918526 1
5 0.10280396 0.15374533 0.7434507 1
6 0.12930893 0.11325306 0.7574380 1
7 0.06261715 0.13842026 0.7989626 1
8 0.11806119 0.12687147 0.7550673 1
9 0.11508647 0.10817106 0.7767425 1
10 0.15097533 0.08315063 0.7658740 1
11 0.06825851 0.13257364 0.7991678 1
12 0.09290831 0.11626503 0.7908267 1
13 0.12003177 0.09022122 0.7897470 1
14 0.09482377 0.12513629 0.7800399 1
15 0.14432113 0.07970917 0.7759697 1
16 0.14472842 0.08870156 0.7665700 1
17 0.14109725 0.09872039 0.7601824 1
18 0.15769332 0.06708056 0.7752261 1
19 0.09469234 0.14480432 0.7605033 1
20 0.09043738 0.14094802 0.7686146 1
21 0.10129247 0.11791403 0.7807935 1
22 0.12064080 0.10132318 0.7780360 1
23 0.11602388 0.11910634 0.7648698 1
24 0.16377700 0.09077838 0.7454446 1
25 0.12524587 0.07730750 0.7974466 1
26 0.07060243 0.10921933 0.8201782 1
27 0.12703322 0.11021081 0.7627560 1
28 0.13367696 0.07430030 0.7920227 1
29 0.15029178 0.07798233 0.7717259 1
30 0.10685246 0.12199943 0.7711481 1
31 0.07058313 0.13901865 0.7903982 1
32 0.09142246 0.12212662 0.7864509 1
33 0.10757628 0.12890212 0.7635216 1
34 0.03493501 0.13189804 0.8331670 1
35 0.13176324 0.09598102 0.7722557 1
36 0.14334473 0.06494498 0.7917103 1
37 0.10788090 0.14958898 0.7425301 1
38 0.14930675 0.08299583 0.7676974 1
39 0.09010154 0.12273524 0.7871632 1
40 0.12960586 0.09186749 0.7785267 1
41 0.12586666 0.08545994 0.7886734 1
42 0.12252536 0.09645709 0.7810176 1
43 0.12473591 0.08529285 0.7899712 1
44 0.09099896 0.08246437 0.8265367 1
45 0.16097493 0.09583502 0.7431901 1
46 0.07564659 0.14598479 0.7783686 1
47 0.05854270 0.14244206 0.7990152 1
48 0.09670796 0.09239404 0.8108980 1
49 0.10933095 0.10965351 0.7810155 1
50 0.09307565 0.09622426 0.8107001 1

> #fit accuracies
> d %>% select(starts_with("ols"), black, hispanic, white) %>% cor() %>% .[1:3, 4:6]
black hispanic white
ols_black 0.3029909 -0.17497676 -0.08992821
ols_hispanic -0.2159748 0.24547482 -0.02824165
ols_white -0.1708902 -0.04347992 0.15944403

> #multivariate OLS
> (ols_joint = ols(cbind(black, hispanic, white) ~ abortion_rank12 + ba_or_more, data = d))
Linear Regression Model

ols(formula = cbind(black, hispanic, white) ~ abortion_rank12 +
ba_or_more, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 150 LR chi2 342.38 R2 0.898
sigma 0.1911 d.f. 8 R2 adj 0.892
d.f. 141 Pr(> chi2) 0.0000 g 0.314

Residuals

black hispanic white
Min -0.12063 -0.12726 -0.28038
1Q -0.06388 -0.05599 -0.10255
Median -0.03327 -0.02423 0.04949
3Q 0.05821 0.01528 0.09565
Max 0.24543 0.34242 0.18706


Coef S.E. t Pr(>|t|)
[1,] 0.1552 0.1636 0.95 0.3442
[2,] -0.0018 0.0022 -0.82 0.4133
[3,] 0.0001 0.0067 0.02 0.9868
[4,] 0.0384 0.1636 0.23 0.8146
[5,] 0.0013 0.0022 0.62 0.5392
[6,] 0.0012 0.0067 0.18 0.8549
[7,] 0.8063 0.1636 4.93 <0.0001
[8,] 0.0004 0.0022 0.21 0.8378
[9,] -0.0013 0.0067 -0.20 0.8419


> #predictions
> d$joint_ols_black = predict(ols_joint)[, 1]

> d$joint_ols_hispanic = predict(ols_joint)[, 2]

> d$joint_ols_white = predict(ols_joint)[, 3]

> #inspect predictions
> d %>% select(starts_with("joint_ols")) %>% mutate(sum = rowSums(.))
joint_ols_black joint_ols_hispanic joint_ols_white sum
1 0.09555303 0.11815034 0.7862966 1
2 0.12188779 0.09234947 0.7857627 1
3 0.15017948 0.06705244 0.7827681 1
4 0.14913609 0.07664226 0.7742216 1
5 0.07086325 0.14100841 0.7881283 1
6 0.11448727 0.11617229 0.7693404 1
7 0.07865753 0.14265444 0.7786880 1
8 0.10473606 0.11402244 0.7812415 1
9 0.11151654 0.10446698 0.7840165 1
10 0.14218850 0.08435132 0.7734602 1
11 0.08335854 0.13124113 0.7854003 1
12 0.09180628 0.11898908 0.7892046 1
13 0.11851983 0.09737339 0.7841068 1
14 0.09420884 0.12441664 0.7813745 1
15 0.14521110 0.07551151 0.7792774 1
16 0.13883168 0.08949833 0.7716700 1
17 0.12714583 0.08709083 0.7857633 1
18 0.15582745 0.06610206 0.7780705 1
19 0.08789627 0.13914201 0.7729617 1
20 0.08224830 0.14009239 0.7776593 1
21 0.10274571 0.11314933 0.7841050 1
22 0.12575708 0.09286476 0.7813782 1
23 0.10862763 0.11478390 0.7765885 1
24 0.14372208 0.08017760 0.7761003 1
25 0.13056949 0.08268238 0.7867481 1
26 0.08490326 0.12719051 0.7879062 1
27 0.10986041 0.10728667 0.7828529 1
28 0.13662966 0.08628645 0.7770839 1
29 0.14754681 0.08020051 0.7722527 1
30 0.10152407 0.12076966 0.7777063 1
31 0.07674517 0.14264299 0.7806118 1
32 0.09003875 0.12057784 0.7893834 1
33 0.08785901 0.11761214 0.7945288 1
34 0.07293158 0.14274317 0.7843252 1
35 0.12928101 0.08956415 0.7811548 1
36 0.15418246 0.06904484 0.7767727 1
37 0.07973434 0.13343390 0.7868318 1
38 0.15280485 0.07494186 0.7722533 1
39 0.10672641 0.11489554 0.7783781 1
40 0.12393384 0.09383805 0.7822281 1
41 0.13476186 0.08676737 0.7784708 1
42 0.11662975 0.09760812 0.7857621 1
43 0.13301661 0.08860231 0.7783811 1
44 0.11545267 0.10572082 0.7788265 1
45 0.14112283 0.09369495 0.7651822 1
46 0.07479938 0.14226225 0.7829384 1
47 0.06919598 0.14370501 0.7870990 1
48 0.12051018 0.09824650 0.7812433 1
49 0.09809657 0.10401752 0.7978859 1
50 0.09703090 0.11336115 0.7896079 1

> #accuracies
> d %>% select(starts_with("joint_ols"), black, hispanic, white) %>% cor() %>% .[1:3, 4:6]
black hispanic white
joint_ols_black 0.2679728 -0.22508994 -0.02574528
joint_ols_hispanic -0.2605606 0.23149306 0.01537504
joint_ols_white -0.1416356 0.07306988 0.04870974


Thus, as far as I understand, OLS does not actually know that the values must lie between 0 and 1 and sum to 1, yet the predictions happen to satisfy these constraints, both when the regressions are run one by one and in the joint OLS (i.e., multivariate regression).



My questions are:



  1. How does one generally model outcomes that have a specified range, such as 0-1? This differs from modeling binary data: there, the predictions must lie within 0-1 (after transforming from logits), but here the training data are themselves continuous rather than binary.

  2. In general, how does one enforce outcome constraints such as the sum-to-1 constraint in the case above?

  3. Why does the OLS code above never output inappropriate values, given that it knows nothing about the constraints?

I imagine the above issues can be approached with explicitly Bayesian methods where the priors reflect the outcome ranges, though I am not sure how one would handle the sum constraint. Is there a frequentist approach as well?



ETA



Stephan Kolassa points to this prior question, which also concerns proportional data, also called compositional data. However, it differs from the present question in two ways: that question concerns time series data, whereas this one does not, and I am explicitly interested in modeling the proportional data directly, whereas the approach in the answer to that question is to transform the data. Essentially, I am looking for pointers to R (or perhaps Python) packages that allow modeling proportional data and producing predictions on the same scale.











Comments:

  • Possible duplicate of Modelling Time Series of percentages – Stephan Kolassa

  • The paper I cite in my answer to the proposed duplicate is concerned with time series forecasting, but the transformation it uses applies to compositional data more generally. – Stephan Kolassa

  • Thanks, Stephan. That's not entirely what I was looking for, but good to know it's called compositional data. I'll be googling around for that term. – Deleet

  • I do not think you will be successful in finding a way to model the original proportional data while having the predictions respect the compositional constraints. Could you explain why you would prefer to model the original proportions? Modeling transformed data is very common, e.g., in logistic regression. Yes, interpretation becomes more difficult, but it's better to have a model that is hard to interpret than no model at all. – Stephan Kolassa

  • Well, that paper is explicitly about forecasting, and of course on the original scale; you need to back-transform the predictions from the transformed scale. (You may need some bias correction in the back-transformation, similar to what you need with Box-Cox transformations, and I don't recall whether the paper goes into that.) – Stephan Kolassa
2 Answers

Regarding the OLS results you show and ask about: in general, OLS is not appropriate for truncated data, because it does not adjust the coefficient estimates to account for the effect of the truncation, which results in biased coefficients. It is similar to estimating a logistic regression using OLS. But just as with logistic regression, a linear model will sometimes fit just as well, or almost as well, as a non-linear model. (That is why there is something called the linear probability model, which models continuous probability values between 0 and 1 using OLS and is sometimes used by practitioners instead of logistic regression.) The fact that you obtained "appropriate" predicted values simply means that your data allow a good linear approximation of the data-generating process.



You will probably find good examples of modeling compositional data elsewhere, but for reference, since you asked for general information and are using R: the censReg package estimates censored regression models in which the dependent variable can have both a lower and an upper limit at arbitrary values, a generalization of the standard Tobit model, whose dependent variable is left-censored at zero:



https://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf



The model will be estimated using Maximum Likelihood, as is usually done for censored/truncated/limited dependent variable models, not using OLS.
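For illustration, a hedged sketch of what such a fit could look like for one of the shares, reusing the data frame `d` from the question (this is not a tested analysis, and it only addresses the 0-1 range of a single share, not the sum-to-1 constraint across shares):

# Sketch: censored regression of one share, bounded at 0 and 1, fit by ML.
library(censReg)
fit_black = censReg(black ~ abortion_rank12 + ba_or_more + cig_tax12 +
                      conserv_advantage,
                    left = 0, right = 1, data = d)
summary(fit_black)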


















  • But the values are not approximate in the above case; that's what's so odd. They sum exactly to 1 (to 7 digits), but I don't see why that would be the case. In fact, when I wrote the example code, I wanted it to illustrate the problem with using the linear probability model approach, as you also allude to. – Deleet, yesterday

  • Also, if you decide to use separate models, I recommend you check out Seemingly Unrelated Regression (SUR). It is appropriate (and not equivalent to OLS) when your separate regressions have correlated error terms. With this method, the equations are specified separately, but the errors are modeled as jointly normally distributed, and all equations are estimated simultaneously. – AlexK, yesterday

  • Note: "SUR is in fact equivalent to OLS ... when each equation contains exactly the same set of regressors on the right-hand-side." – nanoman, yesterday

  • @nanoman, sure, with one small caveat, as described in the comment to this answer. – AlexK, yesterday
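Regarding the Seemingly Unrelated Regression suggestion in the comments above, here is a minimal sketch using the systemfit package; the data are simulated and the equation and variable names are illustrative assumptions. Note, per nanoman's comment, that SUR only differs from equation-by-equation OLS when the equations do not all share exactly the same regressors.

    library(systemfit)   # assumes the package is installed
    set.seed(1)
    n   <- 100
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
    # correlated errors across equations, which is the situation SUR exploits
    e <- MASS::mvrnorm(n, mu = c(0, 0),
                       Sigma = matrix(c(1, 0.8, 0.8, 1) * 0.01, 2, 2))
    dat$y1 <- 0.3 + 0.5 * dat$x1 + e[, 1]
    dat$y2 <- 0.2 - 0.4 * dat$x2 + e[, 2]
    # different regressors per equation, so SUR is not equivalent to per-equation OLS here
    eqs <- list(part1 = y1 ~ x1, part2 = y2 ~ x2 + x3)
    fit <- systemfit(eqs, method = "SUR", data = dat)
    summary(fit)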



















First, your OLS models automatically respect the sum-to-1 constraint (up to roundoff error) because both the models and the constraint are linear in the observations. That is, the sum of the predictions for each input is equivalent to the prediction of an OLS model fit to the sum of the observations, which is a constant of 1; since the design includes an intercept, OLS reproduces that constant exactly. What is not guaranteed is that the OLS predictions will be between 0 and 1, although they happen to be for your data.
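A quick numerical check of this point, as a sketch on simulated data (none of these names come from the question): fitting one OLS model per component, all with the same regressors and an intercept, gives predictions whose row sums equal 1 up to roundoff, while the individual predictions are not confined to [0, 1].

    set.seed(1)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    raw <- cbind(exp(0.8 * x1), exp(-0.6 * x2), exp(rnorm(n, sd = 0.3)))
    Y   <- raw / rowSums(raw)                     # 3-part composition, rows sum to 1
    fits <- lapply(1:3, function(j) lm(Y[, j] ~ x1 + x2))
    pred <- sapply(fits, predict)
    range(rowSums(pred) - 1)   # zero up to roundoff
    range(pred)                # not guaranteed to lie within [0, 1]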



The multinomial logit model provides one natural way to transform between unconstrained linear-regression predictions and "probabilities" (positive fractions that sum to 1). This is the transformation $y_{it} \mapsto \log(y_{it}/y_{0t})$ in the linked answer (where here you would take $t$ as a sample index, not time).



To make a proper fit systematically, you would need to decide on your error model, which determines the likelihood function. For example, do you imagine that the model can precisely predict probabilities from which the individuals are sampled, and the only error in the population fractions comes from the multinomial distribution based on the population size? If so, you could directly use the likelihood from the multinomial logit model, casting your observations as counts of individuals. But this is implausible for the US state example, because factors not in the model have a much greater effect than the tiny multinomial fluctuations in a state population of millions.



More likely, you just want a reasonable error model with a tractable likelihood, which could be a Gaussian error with unknown variance on the linear predictor functions (log-odds-ratios) of the multinomial logit -- i.e., OLS applied to the transformed observations.
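A minimal sketch of this log-ratio route on simulated data (illustrative names, not from the original answer): regress the log-odds-ratios against a reference component by OLS, then back-transform with the softmax so the predictions are positive and sum to 1 by construction.

    set.seed(1)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    raw <- cbind(exp(0.8 * x1), exp(-0.6 * x2), exp(rnorm(n, sd = 0.3)))
    Y   <- raw / rowSums(raw)                      # 3-part composition
    Z   <- log(Y[, 2:3] / Y[, 1])                  # log-ratios vs. reference component 1
    zfit  <- lapply(1:2, function(j) lm(Z[, j] ~ x1 + x2))
    zpred <- cbind(0, sapply(zfit, predict))       # reference component has log-ratio 0
    prob  <- exp(zpred) / rowSums(exp(zpred))      # softmax back-transform
    range(rowSums(prob))   # 1 up to roundoff
    range(prob)            # strictly inside (0, 1)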






– nanoman, answered yesterday