Application Performance on the MIT Alewife Multiprocessor
Frederic T. Chong and John D. Kubiatowicz
ftchong@lcs.mit.edu, kubitron@lcs.mit.edu
MIT Laboratory for Computer Science
This paper reports on the performance of several applications on the Alewife machine, focusing on emerging applications and evolving architectural mechanisms. We show that low-latency miss-handling mechanisms for both local and remote accesses, such as those in Alewife, coupled with careful data placement in the application make these emerging applications viable candidates for shared-memory parallel processing. In fact, we discover that efficient shared memory is an excellent communication mechanism for fine-grain applications (even in the absence of data reuse) that have long been considered message-passing applications. Not surprisingly, we find that Alewife mechanisms perform well on traditional coarse-grain applications. We confirm that hardware support for limited sharing is adequate for a broad range of applications, even on large numbers of processors. We also observe that modeling local cache miss behavior is important for machines, such as Alewife, where remote misses have become more competitive. To account for the effect of local misses, we introduce two novel performance metrics, which provide more revealing results than previously proposed metrics. We conclude that the fine-grained applications can take advantage of Alewife’s high integration and efficiency to achieve a new level of performance on scalable shared memory machines.
Table 1 lists the applications, a short description of the problem each of the programs solves, and the input parameters. MP3D, BARNES, LOCUS, CHOL, and WATER are from the SPLASH suite [SWG92]. APPBT and MG are part of the NAS parallel benchmarks [Bai94]. The rest of the applications are engineering-type kernels from the University of Rochester, MIT, and Berkeley.
Remote cache misses from a processor to another processor's cache or memory are an integral part of every multiprocessor application study. It turns out that local cache misses are also very important on the Alewife machine because a remote miss is only about five times as expensive as a local miss.
\[
\text{remote accesses} + \frac{\text{local accesses}}{5}
\]
In essence, this formula assumes that five local misses are equivalent to one remote miss, and accounts for the overhead of local misses in order to compute the overall effect of cache misses. This metric, which we call the weighted miss ratio, is more indicative of application performance than local and remote miss ratios in isolation or combined without weighting.
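As a rough illustration of the weighting described above, the following Python sketch computes a weighted miss ratio from miss and access counts. Only the 5:1 remote-to-local cost ratio comes from the text; the function name, the argument names, and the exact form of the ratio (weighted misses divided by weighted accesses, with local terms scaled by 1/5) are assumptions made for this example, and the counts in the usage line are invented.

```python
# From the text: a remote miss costs about five times as much as a local miss.
REMOTE_TO_LOCAL_COST = 5.0


def weighted_miss_ratio(local_misses, local_accesses,
                        remote_misses, remote_accesses,
                        cost_ratio=REMOTE_TO_LOCAL_COST):
    """Hypothetical weighted miss ratio.

    Assumption: local terms are scaled by 1/cost_ratio so that
    `cost_ratio` local misses count as much as one remote miss.
    """
    weighted_misses = remote_misses + local_misses / cost_ratio
    weighted_accesses = remote_accesses + local_accesses / cost_ratio
    if weighted_accesses == 0:
        return 0.0  # no accesses recorded, so report no misses
    return weighted_misses / weighted_accesses


# Invented counts, for illustration only.
ratio = weighted_miss_ratio(local_misses=500, local_accesses=10_000,
                            remote_misses=100, remote_accesses=1_000)
print(f"weighted miss ratio: {ratio:.3f}")
```

Setting `cost_ratio` to 1 gives the unweighted combined miss ratio, which makes it easy to compare the two on the same counts.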
Overall, we find that EM3D, ICCG, and MP3D have lower hit ratios than the other applications. This partly explains the lower utilization on these applications. However, low hit ratios only lead to poor processor utilization when there is little actual computation between memory references. Our full paper analyzes the amount of computation in between cache misses. Taken together, the computation between misses and the cache hit ratios allow us to determine the effective granularity of the applications.
The full paper and related documents are available from http://www.ai.mit.edu/people/ftchong/
References
[Bai94] D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, March 1994.
[SWG92] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, 20(1):5–44, March 1992.
Program   Description                                                Input
MP3D      Simulates rarefied fluid flow                              18000 particles, 6 iterations
BARNES    Simulates movement of bodies under gravitational forces
CHOL      Cholesky factorization of a sparse matrix                  order 3948, 56934 floats
WATER     Simulates movement of water molecules
MG        3D Poisson solver using multi-grid techniques
GAUSS     Unblocked Gaussian elimination
CGRID     Straightforward 2D successive over-relaxation
ICCG      Preconditioned conjugate gradient sparse solver            order 11948, 149090 doubles

Input entries whose rows are not recoverable: 3817 wires; 20; 20000 nodes, 20% remote neighbors; 80000 floats; K integers; 20; 20 floats

Table 1: Applications and Kernels
[Figure 1: four bar-chart panels. "Local Accesses", "Remote Accesses", and "Weighted Total of Accesses" plot hit ratio (0.0 to 1.0); "Ratio of Remote to Local Misses" plots a ratio. Applications shown: em3d, iccg, gauss, mp3d, mmp3d, cgrid, mg, locus, chol, appbt, fft, barnes, msort, water.]
Figure 1: Cache hit ratios of applications sorted by average. Bars are for 1, 2, 4, 8, 16, and 32 processors for each application. ICCG is missing a 1-processor bar because its data set does not fit on a single Alewife processor.