


GroupBy operation using an entire dataframe to group values

















I have 2 dataframes like this...



import numpy as np
import pandas as pd

np.random.seed(0)
a = pd.DataFrame(np.random.randn(20,3))
b = pd.DataFrame(np.random.randint(1,5,size=(20,3)))


I'd like to find the average of the values in a for each of the 4 groups in b (each cell of b gives the group of the cell at the same position in a).



This...



a[b==1].sum().sum() / a[b==1].count().sum()


...works for doing one group at a time, but I was wondering if anyone could think of a cleaner method.



My expected result is



1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64


Thanks.

















python pandas group-by pandas-groupby














edited Jun 6 at 16:21 by cs95










asked Jun 6 at 15:03 by MJS












  • Can you please post some expected results? Right now I assume you need 4 values

    – BogdanC
    Jun 6 at 15:09





























2 Answers






































You can stack both DataFrames and then group one stacked Series by the other:

a.stack().groupby(b.stack()).mean()
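To see why this works, here is a minimal sketch on a hand-checkable 2×2 example. The small frames below are made up for illustration and are not the asker's data. stack() turns each frame into a Series indexed by (row, column), so the values and their labels share one index and groupby can pair them up.

```python
import pandas as pd

# Illustrative data only: values in `a`, group labels in `b` (same shape).
a = pd.DataFrame([[1.0, 2.0], [3.0, 5.0]])
b = pd.DataFrame([[1, 2], [2, 1]])

values = a.stack()   # Series indexed by (row, column): 1.0, 2.0, 3.0, 5.0
labels = b.stack()   # same index, holding the labels:   1,   2,   2,   1

# Group the stacked values by the stacked labels and average each group.
means = values.groupby(labels).mean()
print(means)
# group 1 -> (1.0 + 5.0) / 2 = 3.0
# group 2 -> (2.0 + 3.0) / 2 = 2.5
```

Both frames need the same shape and the same row and column labels for the pairing to be meaningful.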
















answered Jun 6 at 15:05 by WeNYoBen






































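If you want a fast NumPy solution, use np.unique and np.bincount:

# Flatten both frames so values and group labels line up position by position
c, d = (df.to_numpy().ravel() for df in [a, b])
# u: sorted unique labels, i: position in u of each element's label, cnt: group sizes
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)

# Per-group sums (bincount weighted by c) divided by per-group counts
np.bincount(i, c) / cnt
# array([-0.0887145 , -0.34004319, -0.04559595,  0.58213553])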
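The intermediate arrays are easier to follow on a tiny input. The sketch below uses made-up 1-D arrays (not the asker's data) and shows what np.unique returns and how np.bincount turns the inverse indices into per-group sums.

```python
import numpy as np

# Illustrative data only: five values and the group label of each one.
c = np.array([1.0, 2.0, 3.0, 5.0, 7.0])
d = np.array([10, 20, 20, 10, 20])

u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
# u   -> [10, 20]          sorted unique labels
# i   -> [0, 1, 1, 0, 1]   position of each element's label within u
# cnt -> [2, 3]            number of elements per label

# bincount with weights adds up the weights that fall into each bin.
sums = np.bincount(i, weights=c)   # [1 + 5, 2 + 3 + 7] = [6.0, 12.0]
means = sums / cnt                 # [3.0, 4.0]
print(dict(zip(u.tolist(), means.tolist())))
```

Because i holds compact codes 0..k-1 rather than the raw labels, the bins always line up with u and cnt.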


To construct a Series, use

pd.Series(np.bincount(i, c) / cnt, index=u)

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64


For comparison, the stack approach returns the same result:

a.stack().groupby(b.stack()).mean()

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64



Timings:

%timeit a.stack().groupby(b.stack()).mean()
5.16 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
c, d = (a_.to_numpy().ravel() for a_ in [a, b])
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
np.bincount(i, c) / cnt
113 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
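The %timeit magics above only run inside IPython. As a rough sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The function names with_stack and with_bincount and the loop count n are my own choices, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
a = pd.DataFrame(np.random.randn(20, 3))
b = pd.DataFrame(np.random.randint(1, 5, size=(20, 3)))

def with_stack():
    # pandas approach: stack both frames, group values by labels
    return a.stack().groupby(b.stack()).mean()

def with_bincount():
    # NumPy approach: flatten, encode labels, weighted bincount / counts
    c, d = (df.to_numpy().ravel() for df in (a, b))
    u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
    return np.bincount(i, c) / cnt

n = 100  # calls per measurement (hypothetical choice, kept small)
t_stack = timeit.timeit(with_stack, number=n) / n
t_bincount = timeit.timeit(with_bincount, number=n) / n
print(f"stack/groupby:   {t_stack * 1e6:.0f} µs per call")
print(f"unique/bincount: {t_bincount * 1e6:.0f} µs per call")
```

It is worth checking that both functions return the same means before comparing their speed.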














edited Jun 6 at 16:07
answered Jun 6 at 15:13 by cs95







  • Great answer, worth noting that this will fail if you don't have every group from 1-n present. I think a fix would be something like f = np.ones(u.max()), and then f[u-1] = c to divide by that instead

    – user3483203
    Jun 6 at 15:24

  • @user3483203 That's true. In that case we'd have to call bincount on pd.factorize(b.values.ravel())[0] and proceed as planned!

    – cs95
    Jun 6 at 15:25

  • You can safeguard with return_inverse... u, i, c = np.unique(b.to_numpy(), return_inverse=True, return_counts=True); pd.Series(np.bincount(i, a.to_numpy().ravel()) / c, u)

    – piRSquared
    Jun 6 at 15:57











