Predict a vector of values with constraints?



I am aware of a variety of methods for simultaneously predicting multiple outcomes, sometimes known as multivariate regression/analysis. However, my situation is a bit more specialized: I am trying to predict a vector of values where each component ranges from 0 to 1 and the components must sum to 1. A typical example would be population fractions of mutually exclusive groups.



The simplest approach I can think of is to use OLS and rescale the predictions so they do not violate the data structure. Here's an example using R and US state population data.



> #packages
> suppressPackageStartupMessages(library(tidyverse))

> suppressPackageStartupMessages(library(poliscidata))

> suppressPackageStartupMessages(library(rms))

> suppressPackageStartupMessages(library(magrittr))

> d = states

> #recode
> d$black = d$blkpct10 / 100

> d$hispanic = d$hispanic10 / 100

> d$white = 1 - d$black - d$hispanic

> #test sums
> rowSums(d %>% select(black, hispanic, white))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
46 47 48 49 50
1 1 1 1 1

> #regression
> #using 4 variables
> #OLS
> (ols_black = ols(black ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model

ols(formula = black ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 4.81 R2 0.092
sigma 0.0958 d.f. 4 R2 adj 0.011
d.f. 45 Pr(> chi2) 0.3068 g 0.033

Residuals

Min 1Q Median 3Q Max
-0.11768 -0.06635 -0.03056 0.05120 0.25075


Coef S.E. t Pr(>|t|)
Intercept 0.2077 0.1505 1.38 0.1744
abortion_rank12 -0.0016 0.0013 -1.28 0.2063
ba_or_more 0.0000 0.0042 0.00 0.9970
cig_tax12 -0.0209 0.0211 -0.99 0.3272
conserv_advantage -0.0014 0.0024 -0.56 0.5803


> (ols_hispanic = ols(hispanic ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model

ols(formula = hispanic ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 3.11 R2 0.060
sigma 0.1010 d.f. 4 R2 adj -0.023
d.f. 45 Pr(> chi2) 0.5400 g 0.028

Residuals

Min 1Q Median 3Q Max
-0.13098 -0.05541 -0.02833 0.02041 0.34087


Coef S.E. t Pr(>|t|)
Intercept 0.1112 0.1586 0.70 0.4869
abortion_rank12 0.0011 0.0013 0.87 0.3904
ba_or_more 0.0001 0.0044 0.01 0.9899
cig_tax12 -0.0069 0.0222 -0.31 0.7558
conserv_advantage -0.0014 0.0026 -0.56 0.5764


> (ols_white = ols(white ~ abortion_rank12 + ba_or_more + cig_tax12 + conserv_advantage, data = d))
Linear Regression Model

ols(formula = white ~ abortion_rank12 + ba_or_more + cig_tax12 +
conserv_advantage, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 50 LR chi2 1.29 R2 0.025
sigma 0.1344 d.f. 4 R2 adj -0.061
d.f. 45 Pr(> chi2) 0.8635 g 0.024

Residuals

Min 1Q Median 3Q Max
-0.29197 -0.11366 0.02988 0.08802 0.19163


Coef S.E. t Pr(>|t|)
Intercept 0.6811 0.2111 3.23 0.0023
abortion_rank12 0.0005 0.0018 0.26 0.7938
ba_or_more -0.0001 0.0059 -0.01 0.9903
cig_tax12 0.0278 0.0296 0.94 0.3517
conserv_advantage 0.0028 0.0034 0.82 0.4167


> #get predictions
> d$ols_black = predict(ols_black)

> d$ols_hispanic = predict(ols_hispanic)

> d$ols_white = predict(ols_white)

> #inspect predictions
> d %>% select(starts_with("ols")) %>% mutate(sum = rowSums(.))
ols_black ols_hispanic ols_white sum
1 0.08129250 0.10802760 0.8106799 1
2 0.11823321 0.08031043 0.8014564 1
3 0.14136889 0.07023172 0.7883994 1
4 0.13189772 0.07624972 0.7918526 1
5 0.10280396 0.15374533 0.7434507 1
6 0.12930893 0.11325306 0.7574380 1
7 0.06261715 0.13842026 0.7989626 1
8 0.11806119 0.12687147 0.7550673 1
9 0.11508647 0.10817106 0.7767425 1
10 0.15097533 0.08315063 0.7658740 1
11 0.06825851 0.13257364 0.7991678 1
12 0.09290831 0.11626503 0.7908267 1
13 0.12003177 0.09022122 0.7897470 1
14 0.09482377 0.12513629 0.7800399 1
15 0.14432113 0.07970917 0.7759697 1
16 0.14472842 0.08870156 0.7665700 1
17 0.14109725 0.09872039 0.7601824 1
18 0.15769332 0.06708056 0.7752261 1
19 0.09469234 0.14480432 0.7605033 1
20 0.09043738 0.14094802 0.7686146 1
21 0.10129247 0.11791403 0.7807935 1
22 0.12064080 0.10132318 0.7780360 1
23 0.11602388 0.11910634 0.7648698 1
24 0.16377700 0.09077838 0.7454446 1
25 0.12524587 0.07730750 0.7974466 1
26 0.07060243 0.10921933 0.8201782 1
27 0.12703322 0.11021081 0.7627560 1
28 0.13367696 0.07430030 0.7920227 1
29 0.15029178 0.07798233 0.7717259 1
30 0.10685246 0.12199943 0.7711481 1
31 0.07058313 0.13901865 0.7903982 1
32 0.09142246 0.12212662 0.7864509 1
33 0.10757628 0.12890212 0.7635216 1
34 0.03493501 0.13189804 0.8331670 1
35 0.13176324 0.09598102 0.7722557 1
36 0.14334473 0.06494498 0.7917103 1
37 0.10788090 0.14958898 0.7425301 1
38 0.14930675 0.08299583 0.7676974 1
39 0.09010154 0.12273524 0.7871632 1
40 0.12960586 0.09186749 0.7785267 1
41 0.12586666 0.08545994 0.7886734 1
42 0.12252536 0.09645709 0.7810176 1
43 0.12473591 0.08529285 0.7899712 1
44 0.09099896 0.08246437 0.8265367 1
45 0.16097493 0.09583502 0.7431901 1
46 0.07564659 0.14598479 0.7783686 1
47 0.05854270 0.14244206 0.7990152 1
48 0.09670796 0.09239404 0.8108980 1
49 0.10933095 0.10965351 0.7810155 1
50 0.09307565 0.09622426 0.8107001 1

> #fit accuracies
> d %>% select(starts_with("ols"), black, hispanic, white) %>% cor() %>% .[1:3, 4:6]
black hispanic white
ols_black 0.3029909 -0.17497676 -0.08992821
ols_hispanic -0.2159748 0.24547482 -0.02824165
ols_white -0.1708902 -0.04347992 0.15944403

> #multivariate OLS
> (ols_joint = ols(cbind(black, hispanic, white) ~ abortion_rank12 + ba_or_more, data = d))
Linear Regression Model

ols(formula = cbind(black, hispanic, white) ~ abortion_rank12 +
ba_or_more, data = d)

Model Likelihood Discrimination
Ratio Test Indexes
Obs 150 LR chi2 342.38 R2 0.898
sigma 0.1911 d.f. 8 R2 adj 0.892
d.f. 141 Pr(> chi2) 0.0000 g 0.314

Residuals

black hispanic white
Min -0.12063 -0.12726 -0.28038
1Q -0.06388 -0.05599 -0.10255
Median -0.03327 -0.02423 0.04949
3Q 0.05821 0.01528 0.09565
Max 0.24543 0.34242 0.18706


Coef S.E. t Pr(>|t|)
[1,] 0.1552 0.1636 0.95 0.3442
[2,] -0.0018 0.0022 -0.82 0.4133
[3,] 0.0001 0.0067 0.02 0.9868
[4,] 0.0384 0.1636 0.23 0.8146
[5,] 0.0013 0.0022 0.62 0.5392
[6,] 0.0012 0.0067 0.18 0.8549
[7,] 0.8063 0.1636 4.93 <0.0001
[8,] 0.0004 0.0022 0.21 0.8378
[9,] -0.0013 0.0067 -0.20 0.8419


> #predictions
> d$joint_ols_black = predict(ols_joint)[, 1]

> d$joint_ols_hispanic = predict(ols_joint)[, 2]

> d$joint_ols_white = predict(ols_joint)[, 3]

> #inspect predictions
> d %>% select(starts_with("joint_ols")) %>% mutate(sum = rowSums(.))
joint_ols_black joint_ols_hispanic joint_ols_white sum
1 0.09555303 0.11815034 0.7862966 1
2 0.12188779 0.09234947 0.7857627 1
3 0.15017948 0.06705244 0.7827681 1
4 0.14913609 0.07664226 0.7742216 1
5 0.07086325 0.14100841 0.7881283 1
6 0.11448727 0.11617229 0.7693404 1
7 0.07865753 0.14265444 0.7786880 1
8 0.10473606 0.11402244 0.7812415 1
9 0.11151654 0.10446698 0.7840165 1
10 0.14218850 0.08435132 0.7734602 1
11 0.08335854 0.13124113 0.7854003 1
12 0.09180628 0.11898908 0.7892046 1
13 0.11851983 0.09737339 0.7841068 1
14 0.09420884 0.12441664 0.7813745 1
15 0.14521110 0.07551151 0.7792774 1
16 0.13883168 0.08949833 0.7716700 1
17 0.12714583 0.08709083 0.7857633 1
18 0.15582745 0.06610206 0.7780705 1
19 0.08789627 0.13914201 0.7729617 1
20 0.08224830 0.14009239 0.7776593 1
21 0.10274571 0.11314933 0.7841050 1
22 0.12575708 0.09286476 0.7813782 1
23 0.10862763 0.11478390 0.7765885 1
24 0.14372208 0.08017760 0.7761003 1
25 0.13056949 0.08268238 0.7867481 1
26 0.08490326 0.12719051 0.7879062 1
27 0.10986041 0.10728667 0.7828529 1
28 0.13662966 0.08628645 0.7770839 1
29 0.14754681 0.08020051 0.7722527 1
30 0.10152407 0.12076966 0.7777063 1
31 0.07674517 0.14264299 0.7806118 1
32 0.09003875 0.12057784 0.7893834 1
33 0.08785901 0.11761214 0.7945288 1
34 0.07293158 0.14274317 0.7843252 1
35 0.12928101 0.08956415 0.7811548 1
36 0.15418246 0.06904484 0.7767727 1
37 0.07973434 0.13343390 0.7868318 1
38 0.15280485 0.07494186 0.7722533 1
39 0.10672641 0.11489554 0.7783781 1
40 0.12393384 0.09383805 0.7822281 1
41 0.13476186 0.08676737 0.7784708 1
42 0.11662975 0.09760812 0.7857621 1
43 0.13301661 0.08860231 0.7783811 1
44 0.11545267 0.10572082 0.7788265 1
45 0.14112283 0.09369495 0.7651822 1
46 0.07479938 0.14226225 0.7829384 1
47 0.06919598 0.14370501 0.7870990 1
48 0.12051018 0.09824650 0.7812433 1
49 0.09809657 0.10401752 0.7978859 1
50 0.09703090 0.11336115 0.7896079 1

> #accuracies
> d %>% select(starts_with("joint_ols"), black, hispanic, white) %>% cor() %>% .[1:3, 4:6]
black hispanic white
joint_ols_black 0.2679728 -0.22508994 -0.02574528
joint_ols_hispanic -0.2605606 0.23149306 0.01537504
joint_ols_white -0.1416356 0.07306988 0.04870974


Thus, as far as I understand, OLS does not actually know that the values must lie in 0-1 and sum to 1, yet the predictions happen to satisfy these constraints, both when running the regressions one by one and in the joint OLS (i.e., multivariate regression).
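(For reference, the post-hoc rescaling I mentioned at the start would be something like the following hypothetical fix, had the raw predictions fallen outside the allowed range:)

# hypothetical post-hoc fix: clip negative predictions to 0,
# then renormalize each row so the three shares sum to 1
preds = d %>% select(starts_with("ols")) %>% as.matrix()
preds = pmax(preds, 0)
preds = preds / rowSums(preds)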



My questions are:



  1. How does one generally model outcomes that have a specified range, such as 0-1? This differs somewhat from modeling binary data: there the predictions must lie within 0-1 (once transformed from logits), but here the training data themselves are not binary.

  2. In general, how does one enforce outcome constraints such as sum = 1, as in the case above?

  3. How does the OLS in the above code "know" not to output inappropriate values?

I imagine these issues can be approached with explicitly Bayesian methods where the priors reflect the outcome ranges, though I am not sure how one would handle the sum constraint. Is there a frequentist approach as well?



ETA



Stephan Kolassa points to this prior question, which also concerns proportional data, also called compositional data. However, it differs from the present question in that it concerns time series, whereas mine does not, and in that I'm explicitly interested in modeling the proportions directly, whereas the approach in the answer there is to transform the data. Essentially, I am looking for links to R (or perhaps Python) packages that enable modeling proportional data and making predictions on the same scale.
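For instance, Dirichlet regression looks like it might be the kind of thing I am after; a sketch with the DirichletReg package (assuming I am reading its interface correctly):

# sketch: Dirichlet regression models the composition directly on the
# simplex, so fitted values lie in (0, 1) and sum to 1 by construction
library(DirichletReg)
d$Y = DR_data(d[, c("black", "hispanic", "white")])  # wrap the composition
fit = DirichletReg(Y ~ abortion_rank12 + ba_or_more, data = d)
head(predict(fit))  # rows are predicted proportions summing to 1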










multivariate-analysis least-squares multivariate-regression constrained-regression compositional-data






  • Possible duplicate of Modelling Time Series of percentages – Stephan Kolassa, yesterday

  • The paper I cite in my answer to the proposed duplicate is concerned with time series forecasting, but the transformation it uses applies to compositional data more generally. – Stephan Kolassa, yesterday

  • Thanks, Stephan. That's not entirely what I was looking for, but it is good to know it's called compositional data. I'll be googling around for that term. – Deleet, yesterday

  • I do not think you will be successful in finding a way to model the original proportional data while having the predictions respect the compositional constraints. Could you explain why you would prefer to model the original proportions? Modeling transformed data is very common, e.g., in logistic regression. Yes, interpretation becomes more difficult, but it's better to have a model that is hard to interpret than no model at all... – Stephan Kolassa, yesterday

  • Well, that paper is explicitly about forecasting, and of course on the original scale. You need to back-transform the predictions from the transformed scale. (You may need some bias correction in the back-transformation, similar to what you need with Box-Cox transformations; I don't recall whether that paper goes into that.) – Stephan Kolassa, yesterday












2 Answers

In response to your questions about the OLS results: in general, OLS is not appropriate for truncated data, because it will not adjust the coefficient estimates to account for the effect of truncation, resulting in biased coefficients. It is similar to estimating a logistic regression using OLS. But just as with logistic regression, a linear model will sometimes fit nearly as well as a non-linear one. (That is why there is something called the linear probability model, which models continuous probability values between 0 and 1 using OLS and is sometimes used by practitioners instead of logistic regression.) The fact that you obtained "appropriate" predicted values simply means your data allow a good linear approximation of the data-generating process.



You'll probably find good examples of modeling compositional data, but for reference, since you asked for general information and are using R: the censReg package allows estimation of a censored regression model in which the dependent variable can have both lower and upper limits at arbitrary values, a generalization of the standard Tobit model, where the dependent variable is left-censored at zero:



https://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf



The model is estimated by maximum likelihood, as is usual for censored/truncated/limited-dependent-variable models, not by OLS.
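As a rough illustration (my sketch, not taken from the vignette; it assumes censReg's finite left/right limits as described in its documentation):

# sketch: Tobit-style model for one share, censored at 0 and 1,
# fit by maximum likelihood (assumes d from the question above)
library(censReg)
tobit_black = censReg(black ~ abortion_rank12 + ba_or_more + cig_tax12 +
                        conserv_advantage,
                      left = 0, right = 1, data = d)
summary(tobit_black)

Note that this enforces the 0-1 range for each outcome separately; it does not by itself impose the sum-to-1 constraint across the three shares.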







  • But the values are not approximate in the above case; that's what's so odd. They sum to 1 exactly (to 7 digits), but I don't see why that would be the case. In fact, when I wrote the example code, I wanted it to illustrate the problem with using the linear probability model approach, as you also allude to.
    – Deleet
    yesterday
  • Also, if you decide to use separate models, I recommend you check out Seemingly Unrelated Regression (see the sketch after these comments). It's appropriate (and not equivalent to OLS) when your separate regressions have correlated error terms. With this method, the equations are treated as independent of one another, except that the errors are modeled as jointly normally distributed; all equations are estimated simultaneously.
    – AlexK
    yesterday
  • Note: "SUR is in fact equivalent to OLS ... when each equation contains exactly the same set of regressors on the right-hand-side."
    – nanoman
    yesterday
  • @nanoman, sure, with one small caveat, as described in the comment to this answer.
    – AlexK
    yesterday
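A minimal SUR sketch in R, assuming the systemfit package; the data frame df and the share1/share2/x1/x2/x3 variables are hypothetical, with deliberately different regressors so that SUR does not reduce to OLS:

    library(systemfit)

    eq1 <- share1 ~ x1 + x2
    eq2 <- share2 ~ x1 + x3   # different regressors, so SUR differs from OLS
    fit <- systemfit(list(first = eq1, second = eq2), method = "SUR", data = df)
    summary(fit)              # equations estimated jointly, with correlated errors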


















First, your OLS models automatically respect the sum-to-1 constraint (up to roundoff error) because both the models and the constraint are linear in the observations. That is, the sum of the predictions for each input is equivalent to the prediction of an OLS model fit to the sum of observations (a constant of 1). What is not guaranteed is that the OLS predictions will be between 0 and 1, although they happen to be for your data.
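A quick simulated check of this point (a sketch with made-up shares; all names are arbitrary):

    set.seed(1)
    n <- 100
    x <- rnorm(n)
    p1 <- plogis(x); p2 <- 0.6 * (1 - p1); p3 <- 1 - p1 - p2   # shares sum to 1
    fits  <- lapply(list(p1, p2, p3), function(y) lm(y ~ x))   # one OLS fit per share
    preds <- sapply(fits, predict)
    range(rowSums(preds))   # equal to 1 up to roundoff error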



The multinomial logit model provides one natural way to transform between unconstrained linear-regression predictions and "probabilities" (positive fractions that sum to 1). This is the transformation $y_{it} \mapsto \log(y_{it}/y_{0t})$ in the linked answer (where here you would take $t$ as a sample index, not time).
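Concretely (a sketch, assuming a hypothetical matrix Y of strictly positive shares with one row per sample, K columns, and a predictor x):

    alr  <- log(Y[, -1, drop = FALSE] / Y[, 1])   # K-1 log-ratios vs. the baseline category
    fits <- lapply(seq_len(ncol(alr)), function(j) lm(alr[, j] ~ x))
    eta  <- cbind(0, sapply(fits, predict))       # baseline category has linear predictor 0
    phat <- exp(eta) / rowSums(exp(eta))          # positive, and each row sums to 1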



To make a proper fit systematically, you would need to decide on your error model, which determines the likelihood function. For example, do you imagine that the model can precisely predict probabilities from which the individuals are sampled, and the only error in the population fractions comes from the multinomial distribution based on the population size? If so, you could directly use the likelihood from the multinomial logit model, casting your observations as counts of individuals. But this is implausible for the US state example, because factors not in the model have a much greater effect than the tiny multinomial fluctuations in a state population of millions.
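If you did want the multinomial likelihood, one sketch (the count columns cnt1..cnt3 and data frame df are hypothetical; nnet::multinom accepts a matrix of class counts as the response):

    library(nnet)

    fit <- multinom(cbind(cnt1, cnt2, cnt3) ~ x, data = df)
    predict(fit, type = "probs")   # predicted fractions, positive and summing to 1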



More likely, you just want a reasonable error model with a tractable likelihood, which could be a Gaussian error with unknown variance on the linear predictor functions (log-odds-ratios) of the multinomial logit -- i.e., OLS applied to the transformed observations.






answered yesterday by nanoman
