
Application Performance on the MIT Alewife Multiprocessor

Frederic T. Chong and John D. Kubiatowicz

ftchong@lcs.mit.edu, kubitron@lcs.mit.edu

MIT Laboratory for Computer Science

This paper reports on the performance of several applications on the Alewife machine, focusing on emerging applications and evolving architectural mechanisms. We show that low-latency miss-handling mechanisms for both local and remote accesses, such as those in Alewife, coupled with careful data placement in the application, make these emerging applications viable candidates for shared-memory parallel processing. In fact, we discover that efficient shared memory is an excellent communication mechanism for fine-grain applications (even in the absence of data re-use) that have long been considered message-passing applications. Not surprisingly, we find that Alewife mechanisms perform well on traditional coarse-grain applications. We confirm that hardware support for limited sharing is adequate for a broad range of applications, even on large numbers of processors. We also observe that modeling local cache miss behavior is important for machines, such as Alewife, where remote misses have become more competitive. To account for the effect of local misses, we introduce two novel performance metrics, which provide more revealing results than previously proposed metrics. We conclude that the fine-grained applications can take advantage of Alewife's high integration and efficiency to achieve a new level of performance on scalable shared memory machines.

Table 1 lists the applications, a short description of the problem each of the programs solves, and the input parameters. MP3D, BARNES, LOCUS, CHOL, and WATER are from the SPLASH suite [SWG92]. APPBT and MG are part of the NAS parallel benchmarks [Bai94]. The rest of the applications are engineering-type kernels from the University of Rochester, MIT, and Berkeley.

Remote cache misses from a processor to another processor's cache or memory are an integral part of every multiprocessor application study. It turns out that local cache misses are also very important on the Alewife machine because remote misses are only about five times as expensive as a local miss.

$$\text{remote accesses} + \frac{\text{local accesses}}{5}$$

In essence, this formula assumes that five local misses are equivalent to a remote miss, and accounts for the overhead of local misses in order to compute the overall effect of cache misses. This metric, which we call the weighted miss ratio, is more indicative of application performance than local and remote miss ratios in isolation or combined without weighting.
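As a minimal sketch of this weighting, the Python below scales local counts down by a factor of five before combining them with remote counts. The function names, argument names, and sample counts are hypothetical and do not come from the paper. Applying the same weighting to both misses and total accesses is also an assumption; only the factor of 5 (the approximate cost of a remote miss relative to a local miss) is taken from the text.

```python
# Approximate cost of a remote miss relative to a local miss, per the text.
REMOTE_TO_LOCAL_COST = 5.0


def weighted_total(remote: float, local: float,
                   cost: float = REMOTE_TO_LOCAL_COST) -> float:
    """Combine remote and local counts, scaling local counts down by `cost`."""
    return remote + local / cost


def weighted_miss_ratio(remote_misses: float, remote_accesses: float,
                        local_misses: float, local_accesses: float,
                        cost: float = REMOTE_TO_LOCAL_COST) -> float:
    """Weighted misses over weighted accesses (assumed form of the metric)."""
    total = weighted_total(remote_accesses, local_accesses, cost)
    if total == 0:
        return 0.0  # no accesses recorded, so report no misses
    return weighted_total(remote_misses, local_misses, cost) / total


# Hypothetical counts: 20 of 100 remote accesses miss, 50 of 1000 local ones.
ratio = weighted_miss_ratio(20, 100, 50, 1000)
unweighted = (20 + 50) / (100 + 1000)
print(f"weighted: {ratio:.3f}, unweighted: {unweighted:.3f}")
```

With these made-up counts, the unweighted combination is dominated by the large number of cheap local accesses, while the weighted form gives the costlier remote misses proportionally more influence.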

Overall, we find that EM3D, ICCG, and MP3D have lower hit ratios than the other applications. This partly explains the lower utilization on these applications. However, low hit ratios only lead to poor processor utilization when there is little actual computation between memory references. Our full paper analyzes the amount of computation in between cache misses. In combination with cache hit ratios, these two metrics allow us to determine the effective granularity of the applications.
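The interaction between hit ratio and computation per memory reference can be illustrated with a deliberately simple model. The sketch below assumes a processor that stalls for the full miss penalty on every miss; this blocking model, the function name, and all numbers (including the 40-cycle penalty) are hypothetical illustrations, not measurements or methods from the paper.

```python
def utilization(compute_cycles_per_access: float, hit_ratio: float,
                miss_penalty_cycles: float) -> float:
    """Fraction of time spent computing under a simple blocking-miss model.

    Each memory access is preceded by `compute_cycles_per_access` cycles of
    useful work and stalls for `miss_penalty_cycles` with probability
    (1 - hit_ratio).
    """
    if compute_cycles_per_access <= 0:
        raise ValueError("compute_cycles_per_access must be positive")
    expected_stall = (1.0 - hit_ratio) * miss_penalty_cycles
    return compute_cycles_per_access / (compute_cycles_per_access + expected_stall)


# Same (low) hit ratio, very different amounts of work between accesses.
fine_grain = utilization(2, hit_ratio=0.5, miss_penalty_cycles=40)
coarse_grain = utilization(200, hit_ratio=0.5, miss_penalty_cycles=40)
print(f"fine-grain: {fine_grain:.2f}, coarse-grain: {coarse_grain:.2f}")
```

Under these assumptions, an identical hit ratio yields poor utilization only for the case with little computation between accesses, which is why hit ratio alone does not determine effective granularity.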

The full paper and related documents are available from http://www.ai.mit.edu/people/ftchong/

References

[Bai94] D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, March 1994.

[SWG92] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, 20(1):5–44, March 1992.

| Program | Description | Input |
|---|---|---|
| MP3D | Simulates rarefied fluid flow | 18000 particles, 6 iterations |
| BARNES | Simulates movement of bodies under gravitational forces | |
| | | 3817 wires |
| CHOL | Cholesky factorization of a sparse matrix | (order 3948, 56934 floats) |
| WATER | Simulates movement of water molecules | |
| MG | 3D Poisson solver using multi-grid techniques | |
| | | 20000 nodes, 20% remote neighbors |
| GAUSS | Unblocked Gaussian elimination | |
| | | 80000 floats |
| CGRID | Straightforward 2D successive over-relaxation | |
| | | K integers |
| ICCG | Preconditioned conjugate gradient sparse solver | (order 11948, 149090 doubles) |
| | | 20 floats |

Table 1: Applications and Kernels

[Figure 1 plots, four panels: hit ratio for local accesses; hit ratio for remote accesses; ratio of remote to local misses; hit ratio for the weighted total of accesses. Each panel shows em3d, iccg, gauss, mp3d, mmp3d, cgrid, mg, locus, chol, appbt, fft, barnes, msort, and water.]

Figure 1: Cache hit ratios of applications sorted by average. Bars are for 1, 2, 4, 8, 16, and 32 processors for each application. ICCG is missing a 1-processor bar because its data set does not fit on a single Alewife processor.
