PRAM
Official slides extracted from a PDF. The text was extracted automatically via pdftotext -layout; for image- or diagram-based pages, check the original PDF.
Metadata
| Field | Value |
|---|---|
| Source | PDA06_PRAM_MNG.pdf |
| File | PDA06_PRAM_MNG.pdf |
| Pages | 106 |
| Page size | 595.276 x 841.89 pts (A4) |
Usage
- Use as the official reference for definitions, algorithms, terminology, and complexities.
- If a page is based mainly on an image, cite the page and check the original PDF.
- For exam history, always prefer knowledge/exams/**/term-*.md; the slides serve as an explanatory source.
Text by page
Page 1
PRL06 - MNG
Model 2019
Parallel and Distributed Algorithms
Part 6: PRAM, Prefix Sums
Post 19/20
Summary materials
Petr Hanáček
Ver 0.1
Page 2
Parallel and Distributed Algorithms
Upd 2005, Upd 2007
Upd 2008/9, Post 2009/10
Post 2010/11
Post 19 MNGprep
Koro version
PDA 6
PRAM, Prefix Sums
PRL 1
Textbook
• [Reif] Reif, J: Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993, ISBN 1-55860-135-X
• Chapters:
  – 20 The Complexity of Computation on the Parallel Random Access Machine
    » Pages 2-6 are of interest to us
  – 1 Prefix Sums and Their Applications
    » Pages 2-13, 17-20, and 23-24 are of interest to us
PRL 2
Petr Hanáček. Placing this document on any server (even a non-public one) is prohibited.
Page 3
Model
A model is an interface that separates the application (high level) from the architecture (low level).
(Diagram: the Application on top; the MODEL in the middle provides operations to the application and requires an implementation; the Architecture below.)
PRL 3
PRAM
• Parallel Random Access Machine
  A synchronous model of parallel computation; processors communicate via shared memory.
  It consists of p RAM processors.
  Processor: additive (logical) operations, [multiplicative operations], conditional jumps based on a comparison, [addressing]. (Sample instructions: MOV 1,4; ADD 3,2; CMPJ 1,4,8)
  Memory (a set of registers): unbounded number of cells; the [unbounded word length] variant is not very suitable and is no longer used.
PRL 4
Page 4
PRAM
It is an alternative model to the parallel Turing machine.
• All RAMs are driven by one common program.
(Diagram: common control driving RAM1..RAMn, each with a local memory LM1..LMn, all connected to a common shared memory.)
PRL 5
• Definition
  – A PRAM is a synchronous model of parallel computation in which processors communicate via shared memory. It consists of p processors P1...Pp, each of which is a RAM.
  – The computation proceeds synchronously, step by step. One step:
      read from shared memory
      local operation
      write to shared memory
  – Processors may perform different operations during a step and may use their index (the processor's unique number).
• Theorem
  – Any problem solvable by a PRAM with p processors in t steps is also solvable with p' <= p processors in O(t·p/p') steps.
• Proof
  – The original p processors are partitioned into p' groups of size at most ⌈p/p'⌉. Each of the p' processors of the simulating machine handles one group. To simulate one step of the original machine, each of the p' processors simulates the read phase of the processors in its group, then their local phase, and finally their write phase.
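The group simulation in the proof can be sketched sequentially; the three phase callbacks and their signatures below are illustrative assumptions, not part of the slides:

```python
# Sketch: simulate one step of a p-processor PRAM on p' <= p processors.
# Each simulating processor handles ceil(p/p') original processors, one
# phase at a time, so one original step costs O(p/p') simulated sub-steps.
import math

def simulate_step(p, p_prime, read_phase, local_phase, write_phase, shared):
    group = math.ceil(p / p_prime)   # original processors per simulator
    # All reads first, then all local operations, then all writes, so the
    # simulation preserves the synchronous phase order of the PRAM step.
    reads = [read_phase(i, shared) for i in range(p)]
    results = [local_phase(i, reads[i]) for i in range(p)]
    for i in range(p):
        write_phase(i, results[i], shared)
    return group                     # sub-steps per simulating processor

shared = [0] * 8
steps = simulate_step(
    8, 2,
    read_phase=lambda i, mem: mem[i],
    local_phase=lambda i, v: v + i,
    write_phase=lambda i, v, mem: mem.__setitem__(i, v),
    shared=shared,
)
assert shared == list(range(8))
assert steps == 4                    # each of the 2 simulators does 4 sub-steps
```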
PRL 6
Page 5
Restrictions on shared-memory access
Is concurrent reading allowed? Is concurrent writing allowed?
• Four different memory-access architectures:
  » EREW - exclusive read, exclusive write
  » ERCW - exclusive read, concurrent write; has no justification and is not used
  » CREW - concurrent read, exclusive write
  » CRCW - concurrent read, concurrent write; supported by only some architectures, technically harder to implement
• For CRCW, the resolution of write conflicts must be specified:
  COMMON - all the written values must be identical
  ARBITRARY - the written values may differ; an arbitrary one of them is written
  PRIORITY - the processors have fixed priorities; the value written by the highest-priority processor is stored
  – The relation A >= B: an algorithm that runs on architecture B also runs unchanged on A. (A is at least as tolerant of write conflicts as B.)
• Then:
  – PRIORITY >= ARBITRARY >= COMMON >= CREW PRAM >= EREW PRAM
PRL 7
FURTHER ALGORITHMS
PRL 8
Page 6
Broadcast
A value stored in memory is to be distributed among N processors.
  For CREW and CRCW PRAM there is a trivial constant-time solution.
  For EREW, concurrent reading must be simulated.
• Operation
  – P1 reads D and makes it available to P2.
  – P1 and P2 make it available in parallel to P3 and P4.
  – P1, P2, P3, and P4 make it available in parallel to P5, P6, P7, and P8.
  – ...
Algorithm
D - the value to be distributed among N processors
A[1..N] - an array of length N in shared memory
procedure BROADCAST(D, N, A)
(1) A[1] = D;
(2) for i = 0 to (log N - 1) do
      for j = 2^i + 1 to 2^(i+1) do in parallel
        A[j] = A[j - 2^i]
      endfor
    endfor
PRL 9
(Figure: panels (a)-(d) show the value propagating through processors P1..P8, doubling the number of informed processors each round.)
Analysis
SEQ: t(n) = O(n)
EREW: t(n) = O(log n)
CREW, CRCW: t(n) = O(1)
PRL 10
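The doubling scheme above can be sketched in Python as a sequential stand-in for the EREW rounds; in round i, every cell filled so far serves exactly one reader, so no concurrent read is needed:

```python
# EREW broadcast: in round i, cells 2^i .. 2^(i+1)-1 (0-based) are each
# filled by copying from a distinct already-filled cell.
import math

def broadcast(d, n):
    a = [None] * n
    a[0] = d                            # A[1] = D (0-based here)
    for i in range(math.ceil(math.log2(n))):
        for j in range(2**i, min(2**(i+1), n)):   # in parallel on a PRAM
            a[j] = a[j - 2**i]          # each source cell has one reader
    return a

assert broadcast(42, 8) == [42] * 8
```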
Page 7
Prefix Sums
All-prefix-sums, allsums, scan
One of the basic building blocks of parallel algorithms
• Definition
  – The prefix-sum operation takes as input a binary associative operator ⊕ and an ordered sequence of elements
    [a0, a1, ..., an-1]
    and returns the sequence
    [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-1)]
• For example, if ⊕ is addition and the input sequence is
    [3 1 7 0 4 1 6 3]
• then the prefix-sum result is
    [3 4 11 11 15 16 22 25].
PRL 11
Prefix Sums
• Some uses:
  Evaluating polynomials
  Adding binary numbers in hardware
  Lexical comparison of strings
  Lexical analysis
  Implementing radix sort and quicksort
  Deleting marked elements from an array
  Searching for regular expressions (grep)
  Implementing some tree operations
  Labeling components in two-dimensional images
Some other names
  In the MPI library: MPI_scan
  In MATLAB: y = cumsum(x)
PRL 12
Page 8
Any binary associative operator ⊕ can be used.
Associativity: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  Input: numbers - Sum (+), Product (*), Max, Min
  Input: bits (Boolean) - AND, OR, XOR
  Input: strings - Concatenation (identity "")
  Input: matrices - MatMul
PRL 13
Sequential solution
Sequential algorithm
procedure allsums(Out, In)
  sum = In[0]
  Out[0] = sum
  for i = 1 to length-1 do
    sum = sum + In[i]
    Out[i] = sum
  endfor
• Time complexity is t(n) = O(n)
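A minimal sequential reference implementation, accumulating left to right and matching the slide's example:

```python
# Sequential inclusive scan (prefix sums) in O(n); the baseline against
# which the parallel versions are compared.
def allsums(xs):
    out, total = [], 0
    for x in xs:
        total += x          # running sum of the prefix ending at x
        out.append(total)
    return out

assert allsums([3, 1, 7, 0, 4, 1, 6, 3]) == [3, 4, 11, 11, 15, 16, 22, 25]
```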
PRL 14
Page 9
Scan, prescan, reduce
• Definition: Inclusive scan
  – The scan operation is the prefix sum.
• Definition: Exclusive scan
  – The prescan operation takes as input a binary associative operator ⊕, a neutral element I, and a vector [a0, ..., an-1], and returns the vector
    [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]
• Definition
  – The reduce operation has the same input as scan and returns the single value a0 ⊕ a1 ⊕ ... ⊕ an-1.
SCAN (ALLSUMS): a0, a0 + a1, ..., a0 + a1 + ... + an-1
PRESCAN:        I, a0, a0 + a1, ..., a0 + a1 + ... + an-2
REDUCE:         a0 + a1 + ... + an-1
PRL 15
Parallel prefix sums - Reduce
Reduce
Reduce can be computed by a tree of processors, provided that the operator is associative (it need not be commutative).
Algorithm
for j = 0 to log n - 1 do
  for i = 0 to n - 1 step 2^(j+1) do in parallel
    a[i + 2^(j+1) - 1] = a[i + 2^j - 1] + a[i + 2^(j+1) - 1]
  end for
end for
PRL 16
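The up-sweep above can be checked with a sequential sketch; the inner loop models what the processors do in parallel, and n is assumed to be a power of two:

```python
# Tree reduce (up-sweep) on an array of length n = 2^k: in round j the
# processors combine pairs 2^(j+1) apart; the total ends up in a[n-1].
def reduce_tree(a):
    a = list(a)
    n = len(a)                         # assumed a power of two
    j = 1                              # j plays the role of 2^round
    while j < n:
        for i in range(0, n, 2 * j):   # in parallel on a PRAM
            a[i + 2*j - 1] = a[i + j - 1] + a[i + 2*j - 1]
        j *= 2
    return a[n - 1]

assert reduce_tree([3, 1, 7, 0, 4, 1, 6, 3]) == 25
```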
Page 10
PRL 17
Reduce
• Discussion
  – the tree has height log n; one processor is needed per pair of elements
  – t(n) = O(log n), c(n) = O(n·log n)
  – p(n) = n/2, which is not optimal
• if we have fewer processors, N < n, each processor performs a sequential reduce on its own chunk of the sequence, of length n/N, and the partial results are combined by the tree reduce
PRL 18
Page 11
+-reduce for n > p
• Algorithm
  for each processor i do in parallel
    sum[i] = a[(n/N)·i]
    for j = 1 to n/N - 1 do
      sum[i] = sum[i] + a[(n/N)·i + j]
  result = +-reduce(sum)
• Then
  – t(n) = n/N + log N = O(n/N + log N)
  – if log N < n/N, then
    t(n) = O(n/N) and
    c(n) = O(n/N)·N = O(n), which is optimal
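A sketch of the two-phase reduce; the tree phase is replaced here by a plain loop over the N partial sums, and the rounding of chunk sizes is an implementation detail not fixed by the slide:

```python
# +-reduce with fewer processors than elements: each of the N processors
# first sums its own chunk of length ~n/N sequentially, then the N partial
# sums are combined, giving t(n) = O(n/N + log N) on a PRAM.
def reduce_few_processors(a, N):
    n = len(a)
    chunk = (n + N - 1) // N           # ceil(n/N), last chunk may be short
    partial = [sum(a[i*chunk:(i+1)*chunk]) for i in range(N)]  # parallel phase
    total = 0
    for s in partial:                  # stands in for the O(log N) tree reduce
        total += s
    return total

assert reduce_few_processors(list(range(16)), 4) == 120
```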
PRL 19
Prescan and scan
We present prescan; scan is obtained by shifting the result left and appending the reduce value.
• Algorithm
  – (i) UpSweep: an algorithm identical to reduce, but each node remembers its partial sum
  – (ii) DownSweep
    » the root is assigned the neutral element I
    » then log n steps are performed (one per level), starting at the root and moving toward the leaves; in each step the processors of that level work in parallel:
    » a node gives its right son its own value ⊕ the left son's value, and gives its left son its own value
(Figure: +prescan on the tree; only the node values 4 and 11 survived extraction.)
PRL 20
Page 12
+-prescan on the PRAM architecture
procedure down-sweep(a)
  a[n-1] = 0
  for d = (log n) - 1 downto 0 do
    for i = 0 to n-1 step 2^(d+1) do in parallel
      t = a[i + 2^d - 1]
      a[i + 2^d - 1] = a[i + 2^(d+1) - 1]          // left child = parent
      a[i + 2^(d+1) - 1] = t + a[i + 2^(d+1) - 1]  // right child = parent + left subtree sum
    end for
  end for
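Combining the up-sweep with the down-sweep above gives the full exclusive scan; a Python sketch for power-of-two n, with plain loops standing in for the parallel rounds:

```python
# +-prescan (exclusive scan): up-sweep as in reduce, then the down-sweep
# from the slide; n is assumed to be a power of two.
def prescan(a):
    a = list(a)
    n = len(a)
    d = 1
    while d < n:                       # up-sweep: build the sum tree
        for i in range(0, n, 2 * d):
            a[i + 2*d - 1] = a[i + d - 1] + a[i + 2*d - 1]
        d *= 2
    a[n - 1] = 0                       # neutral element at the root
    d = n // 2
    while d >= 1:                      # down-sweep: push prefixes down
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2*d - 1]         # left child = parent
            a[i + 2*d - 1] = t + a[i + 2*d - 1]   # right child = parent + left
        d //= 2
    return a

assert prescan([3, 1, 7, 0, 4, 1, 6, 3]) == [0, 3, 4, 11, 11, 15, 16, 22]
```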
PRL 21
• Theorem: after the down-sweep completes, every node contains the sum of the values of all the leaves that precede it.
• Proof by induction:
  no leaves precede the root, so its correct value is the neutral element
  the left son is preceded by the same nodes as the node itself; assuming the father has the correct value, it suffices to pass it to the left son
  the right son is preceded by the leaves A under the left subtree (and whatever precedes the father), so its value is the father's value ⊕ the left son's value
(Figure: node V with sons L[V] and R[V] over leaf sets A and B; the root holds 0.)
Page 13
• Analysis
  • The complexity is the same as for reduce.
  • Improving the cost:
    – t(n) = O(n/N)
    – c(n) = O(n), assuming log N < n/N, which is optimal
PRL 23
Prefix sums on a mesh
• Processes n elements in √n steps
PRL 24
Page 14
Prefix sums on a hypercube
• Processes n elements in log n steps
PRL 25
Prefix sums on AVX
PRL 26
Credits: https://www.wikiwand.com/en/Prefix_sum
Page 15
Packing problem
• We have k input items placed in an array of n positions, k < n.
• The goal is to produce an output array in which the items occupy the first k positions.
• Algorithm
  – Compute an array of binary flags: 1 = item present, 0 = not present.
  – Compute the +-scan of this flag array.
  – Move the items to their correct positions.
PRL 27
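The three steps can be sketched as follows; the exclusive +-scan of the flags directly yields each surviving item's output index:

```python
# Packing via +-scan: flags mark present items; the exclusive scan of the
# flags gives each present item its position in the packed output.
def pack(items, present):
    pos, count = [], 0
    for f in present:                  # exclusive +-scan of the flag array
        pos.append(count)
        count += f
    out = [None] * count
    for i, f in enumerate(present):    # in parallel on a PRAM
        if f:
            out[pos[i]] = items[i]
    return out

assert pack(['a', 'b', 'c', 'd', 'e'], [0, 1, 0, 1, 1]) == ['b', 'd', 'e']
```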
Packing problem – AVX
• Instruction
• VCOMPRESSPS ymm1/m256{k1}{z}, ymm2
PRL 28
Page 16
Visibility
• Problem
  – Given a terrain in the form of a matrix of altitudes and an observation point X (the observer's location), determine which points along a ray starting at X are visible.
• Solution
  – A point is visible if no point between it and the observer has a larger vertical angle.
  – (i) build the vector of altitudes of the points along the viewing ray
  – (ii) convert the altitude vector to a vector of angles
  – (iii) compute the vector of maximum angles with max_prescan; to decide the visibility of a point, it suffices to determine its angle and compare it with the maximum.
• Analysis
  – t(n) = O(n/N + log N) on an EREW PRAM
PRL 29
procedure LINE_OF_SIGHT
  for each index i > 0 do in parallel
    angle[i] = arctan ((alt[i] - alt[0]) / i)
  max_prev_angle = max_prescan(angle)
  for each index i do in parallel
    if (angle[i] > max_prev_angle[i]) then result[i] = visible
    else result[i] = invisible endif
  endfor
PRL 30
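A sequential sketch of the same computation; atan2 is used for the vertical angle, the max-prescan is fused into one pass, and treating the observer's own cell as visible is an assumption of this sketch:

```python
# Line-of-sight check: a point is visible iff its vertical angle exceeds
# the maximum angle of every point between it and the observer.
import math

def line_of_sight(alt):
    angles = [-math.inf] + [math.atan2(alt[i] - alt[0], i)
                            for i in range(1, len(alt))]
    visible, running_max = [], -math.inf
    for a in angles:                   # max-prescan fused with the compare
        visible.append(a > running_max or running_max == -math.inf)
        running_max = max(running_max, a)
    return visible

# observer at altitude 0; the hill at index 2 hides the lower point at 3
assert line_of_sight([0, 1, 5, 2, 20]) == [True, True, True, False, True]
```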
Page 17
Carry Look Ahead Parallel Binary Adder
• Input: two n-bit binary numbers X = x(n-1)..x(0) and Y = y(n-1)..y(0)
• Goal: compute Z = z(n-1)..z(0) = X + Y in log n steps
• We must precompute all the carry bits c(n-1)..c(0) so that we can directly compute zi = xi ⊕ yi ⊕ ci for i = 0..n-1
(Figure: a chain of full adders with inputs xi, yi, carries c_in = c0 up to c_out = cn, and sum bits s0..s(n-1).)
PRL 31
Carry Look Ahead Parallel Binary Adder
• Algorithm:
  – Compute the array D = d(n-1)...d(0), where di ∈ {propagate, stop, generate}
  – for i = 0 to n-1 do in parallel
      if (xi = 1) and (yi = 1) then di = g
      else if (xi = 0) and (yi = 0) then di = s
      else di = p
    endfor
  – Compute the ⊗-scan of the array D, which yields all the carry bits in logarithmic time.
PRL 32
Page 18
The ⊗ operator (sometimes written ¢)
• p = p' and p''
• g = g'' or (g' and p'')
• A carry is generated if:
  – the left block generates, or
  – the right block generates and the left block propagates
PRL 33
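The operator and its use in an adder can be sketched as follows; the bit lists are little-endian, and the ⊗-scan is done by a plain left-to-right pass here, which a PRAM would replace with the logarithmic parallel scan:

```python
# The carry operator over {s, p, g}: a combined block's status is the
# upper half's status, unless the upper half propagates, in which case
# the lower half decides. A scan of D yields all carry bits at once.
def carry_op(low, high):
    return low if high == 'p' else high

def add_bits(x, y):                    # x, y: little-endian bit lists
    d = ['g' if a & b else 's' if not (a | b) else 'p'
         for a, b in zip(x, y)]
    carries, status = [], 's'          # sequential stand-in for the scan
    for di in d:
        carries.append(1 if status == 'g' else 0)  # carry into bit i
        status = carry_op(status, di)
    return [a ^ b ^ c for a, b, c in zip(x, y, carries)] + \
           [1 if status == 'g' else 0]             # carry-out bit

# 6 + 7 = 13: little-endian 011 + 111 -> 1011 (i.e. [1, 0, 1, 1])
assert add_bits([0, 1, 1], [1, 1, 1]) == [1, 0, 1, 1]
```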
Radix sort
• Bit-wise radix sort (radix = 2)
  – In each step, one bit of the key is considered; the split operation moves the elements with a zero bit to the beginning of the sorted array and those with a one bit to the end.
procedure SPLIT_RADIX_SORT(A, number_of_bits)
  for i = 0 to (number_of_bits - 1) do
    A = split(A, A<i>)
PRL 34
Page 19
Radix sort - the split operation
• How to implement split? The sequential complexity is O(n).
• Idea
  – for each element, determine its correct position and move it there in constant time (EREW)
• Procedure
  for elements with a zero bit, the position is obtained by a +-prescan of the inverted bit array
  for elements with a one bit, perform a +-scan on the reversed bit array (i.e., from the end) and subtract the result from n.
Analysis
  split has the same complexity as scan
  Radix sort:
  t(n) = O(n/N + log N)·O(log n) = O((n/N)·log n + log n·log N)
PRL 35
The split operation
procedure split(A, Flags)
  I-down = +-prescan(not(Flags))
  I-up = n - +-scan(reverse-order(Flags))
  for i = 0 to n-1 do in parallel
    if (Flags[i]) Index[i] = I-up[i]
    else Index[i] = I-down[i] endif
  endfor
  result = permute(A, Index)
PRL 36
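A sketch of split and the resulting radix sort; the two index computations follow the slide's prescan and reversed-scan recipe, written as sequential loops:

```python
# split: stable partition by one key bit; positions are computed with
# scans exactly as on the slide (+-prescan of the negated flags for the
# 0s, n minus the reversed +-scan for the 1s), so the whole radix sort
# costs O(number_of_bits) scan operations.
def split(a, flags):
    n = len(a)
    i_down, c = [], 0
    for f in flags:                    # +-prescan of not(Flags)
        i_down.append(c)
        c += 1 - f
    i_up, c = [0] * n, 0
    for i in range(n - 1, -1, -1):     # n - +-scan of reversed Flags
        c += flags[i]
        i_up[i] = n - c
    index = [i_up[i] if flags[i] else i_down[i] for i in range(n)]
    out = [None] * n
    for i in range(n):                 # permute
        out[index[i]] = a[i]
    return out

def radix_sort(a, number_of_bits):
    for b in range(number_of_bits):
        a = split(a, [(x >> b) & 1 for x in a])
    return a

assert radix_sort([5, 7, 3, 1, 4, 2, 7, 2], 3) == [1, 2, 2, 3, 4, 5, 7, 7]
```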
Page 20
Segmented scan
• Definition
  – The segmented prefix sum is an operation whose input is a binary associative operator ⊕, an ordered sequence of elements
    [a0, a1, ..., an-1]
    and an ordered sequence of flags
    [f0, f1, ..., fn-1]
    and which returns the sequence
    [s0, s1, ..., sn-1]
    where s contains the prefix sums over the individual segments; a segment boundary is marked by the value 1 in the flag array f.
• Example:
  – a = [5 1 3 4 3 9 2 6]
  – f = [1 0 1 0 0 0 1 0]
  – segmented +_SCAN = [5 6 3 7 10 19 2 8]
  – segmented max_SCAN = [5 5 3 4 4 9 2 6]
PRL 37
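A sequential sketch matching the definition and both example rows; a flag value of 1 restarts the accumulator:

```python
# Segmented inclusive scan: the running accumulation restarts wherever
# the flag array marks the start of a new segment.
def segmented_scan(a, f, op=lambda x, y: x + y):
    out, acc = [], None
    for ai, fi in zip(a, f):           # sequential stand-in for the PRAM scan
        acc = ai if fi or acc is None else op(acc, ai)
        out.append(acc)
    return out

a = [5, 1, 3, 4, 3, 9, 2, 6]
f = [1, 0, 1, 0, 0, 0, 1, 0]
assert segmented_scan(a, f) == [5, 6, 3, 7, 10, 19, 2, 8]
assert segmented_scan(a, f, max) == [5, 5, 3, 4, 4, 9, 2, 6]
```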
Quicksort
• One element is chosen as the pivot (the median, a random element, or the first one); the elements are partitioned into 3 groups (smaller than, equal to, and larger than the pivot), and quicksort is called recursively on each group.
• A segmented scan is used, so that each group ends up in its own segment.
• Algorithm
  – (1) check whether the elements are already sorted. Each processor checks whether the preceding processor holds a smaller or equal value; an and-reduce is applied to the results.
  – (2) in each segment, find the pivot and hand it to the other processors of the segment. If the first element is chosen as the pivot, a segmented copy-scan can be used, where the binary operator copy returns the first of its two parameters:
      copy(a, b) = a
    » This spreads the pivot across the whole segment (the pivot can also be chosen differently).
  – (3) in each segment, compare the elements with the pivot and divide the segment into 3 parts (=, <, >). After the comparison, a modified split from radix sort is used.
  – (4) within each segment, insert additional flags that divide it into 3 segments. Each processor looks at the preceding element to recognize whether it is at the start of a segment.
  – (5) go to step (1)
PRL 38
Page 21
key     6.4  9.2  3.4  1.6  8.7  4.1  9.2  3.4
flags   1    0    0    0    0    0    0    0
pivots  6.4  6.4  6.4  6.4  6.4  6.4  6.4  6.4
F       =    >    <    <    >    <    >    <
split   3.4  1.6  4.1  3.4  6.4  9.2  8.7  9.2
flags   1    0    0    0    1    1    0    0
pivots  3.4  3.4  3.4  3.4  6.4  9.2  9.2  9.2
• Analysis
  – Each iteration performs a constant number of scan operations.
  – With a suitable choice of pivots, the algorithm terminates after O(log n) iterations, so the complexity is:
  – t(n) = O(n/N + log N)·O(log n) = O((n/N)·log n + log N·log n)
  – c(n) = O(n·log n + N·log n·log N)
    which is optimal for sufficiently small N
PRL 39
END
PRL 40
Page 22
21 The Complexity of Computation on the Parallel Random Access Machine
Faith E. Fich
Department of Computer Science
University of Toronto
10 King's College Road
Toronto, Ontario
CANADA M5S 1A4
Page 23
2 Chapter 21. The Complexity of Computation on the PRAM
This chapter discusses the Parallel Random Access Machine (PRAM), various
parameters of this model, under what circumstances these parameters affect
its computational power, and how they do so. It also surveys techniques that
have been used to obtain lower bounds for specific problems on PRAMs.
21.1 The PRAM Model
The PRAM is a synchronous model of parallel computation in which
processors communicate via a shared memory. It consists of m shared memory
cells M1, ..., Mm and p processors P1, ..., Pp. Each processor is a random
access machine (RAM) with a private local memory. During every step of a
computation, a processor may read from one shared memory cell, perform a
local operation, and write to one shared memory cell. Different processors
may execute different operations during a step and may make reference to
their own index. Reads, local operations, and writes are viewed as occurring
during three separate phases. This simplifies analysis and only changes the
running time of an algorithm by a constant factor.
An input x consists of n values x1, ..., xn. At the beginning of the
computation, these values are located in shared memory cells M1, ..., Mn,
respectively, provided n ≤ m. If there are n' ≤ m output values, they appear
in the first n' shared memory cells at the end of the computation. When
the number of shared memory cells is too small, the input values can be
distributed approximately equally among the processors. Another possibility
is to have a separate, read-only memory containing the input values [VW85,
LY89, FLRY89]. In this case, each processor is also allowed to read from one
of the n read-only memory cells during each computation step.
EXERCISE 21.1
Explain why it doesn't matter whether the input is initially located in
the shared memory, the processors' local memories, or a separate read-only
memory, provided the amount of shared memory is large enough to
contain the input (i.e. m ≥ n).
If the amount of shared memory m is small, then where the input is
located is more important. For example, determining whether the input
contains two consecutive variables with value 1 requires Ω(n/m) steps when
the input is distributed among the processors' local memories (Exercise 21.43).
Page 24
However, when the input is in a separate read-only memory, this computation
can be done significantly faster (Exercise 21.9).
Limiting the amount of shared memory in a PRAM corresponds to
restricting the amount of information that can be communicated between
processors in one step. For example, a set of processors connected to one another
by a single bus may be viewed as a PRAM with one shared memory cell.
THEOREM 21.1
Any problem that can be solved by a PRAM using p processors in t steps
can also be solved using p' ≤ p processors in O(tp/p') steps.
PROOF
The original p processors are partitioned into p' groups of size at most
⌈p/p'⌉. Each of the p' processors in the simulating machine is associated
with one of these groups. To simulate one step of the original computation,
each of the p' processors sequentially simulates the read and local
computation phases performed by all the processors in its group and
then sequentially simulates their write phases.
EXERCISE 21.2
[ACF90] Give an example of a problem that can be solved in one step on
a PRAM with p processors, but requires 2⌈p/p'⌉ − 1 steps on a PRAM
with p' processors for 1 ≤ p' < p.
EXERCISE 21.3
Prove that any problem that can be solved by a PRAM using m shared
memory cells and p processors in t steps can also be solved using m' ≤ m
shared memory cells and max{p, m'} processors in O(tm/m') steps.
EXERCISE 21.4
[Goo89] A forking PRAM is a PRAM in which a new processor is created
when an existing processor executes a fork operation. In addition to
creating the new processor, this operation specifies the task that the
new processor is to perform (starting at the next time step). Initially
there is one processor.
Prove that if a forking PRAM algorithm uses t steps and p processors,
then this algorithm can be implemented on a PRAM using O(t log p)
steps and O(p/log p) processors. Explain explicitly how work is allocated
to the PRAM processors.
Page 25
One consequence of Theorem 21.1 is that the processor-time product
p·t of any algorithm solving a problem is at least a constant factor times
the sequential time complexity of the problem. Thus, a lower bound on the
sequential time complexity gives a lower bound on the number of processors
needed for a PRAM to solve the problem in a given amount of time, as well
as a lower bound on the amount of time needed for a PRAM to solve the
problem using a given number of processors.
A PRAM algorithm is optimal if it is not possible to simultaneously
improve both the time and the number of processors by more than a constant
factor. If a PRAM algorithm solves a problem in t steps using p processors,
where p·t is within a constant factor of the sequential time complexity of
the problem, then it is called efficient [KRS90]. All efficient algorithms are
optimal, but not all optimal algorithms are efficient. For many problems, it
is interesting and important to find the fastest possible efficient algorithm.
Some people call a parallel algorithm optimal if its processor-time product
is within a constant factor of the running time of the fastest known
sequential algorithm solving the same problem. Unless there is a matching
sequential lower bound for the problem, difficulties may arise with this
definition. Specifically, if a faster sequential algorithm is found, suddenly, the
parallel algorithm is no longer optimal. Furthermore, this definition of
optimality may preclude calling an algorithm optimal even when it is provably
impossible to simultaneously improve both the time and the number of
processors. For example, PARITY has linear sequential time complexity, but an
exponential number of processors are required by any PRAM that computes
it in constant time (Theorem 21.33).
The PRAM is a natural generalization of the unit cost RAM [CR73,
AHU74], a commonly used model of sequential computation. Numerous
parallel algorithms have been designed for the PRAM because it is a simple,
precise model in which parallel algorithms can be expressed using a high level
language. The PRAM often corresponds to the programmer's view of
parallel computation, ignoring lower level architectural details such as memory
organization, routing, memory contention, and synchronization. In essence,
PRAM programmers do not have to be concerned with how communication
is accomplished, but only what is to be communicated where. PRAMs can be
simulated by more realistic models [MV84, UW87, Ran87, AHMP87, HP89]
and some real parallel computers [Sch80, Smi90]. Thus, programs written for
the PRAM can be compiled into programs for these machines. Although some
of the assumptions (such as constant time access to shared memory) are not
Page 26
generally valid, the performance of algorithms on the PRAM can be a good
predictor of their relative performance on real machines, especially as
problem sizes get large. Furthermore, because the PRAM is a powerful model, the
lower bounds obtained for it are automatically applicable to a wide variety of
less powerful, more realistic machines.
21.2 Restrictions on Access to Shared Memory
An important parameter of the PRAM model is the extent to which
concurrent access to shared memory cells is allowed. If at most one processor
can read from a single memory cell at a particular step and at most one
processor can write to a single memory cell at a particular step, then the PRAM
is called exclusive read exclusive write (EREW) [LPV81]. In a concurrent
read exclusive write (CREW) PRAM [FW78], any number of processors can
simultaneously read from the same memory cell, but at most one processor
can write to each memory cell at a given step. Many algorithms designed for
CREW and EREW PRAMs avoid write conflicts by having each processor
own one cell to which all its writes are performed. These restricted models
are called the concurrent-read owner-write (CROW) PRAM [DR86] and the
exclusive-read owner-write (EROW) PRAM [FW90], respectively.
When concurrent writes are allowed, it is necessary to specify how
conflicts are resolved. Three of the most frequently used concurrent read
concurrent write (CRCW) PRAMs are considered here. A number of others are
discussed in the literature. (For example, see [FRW88a, GR90, EG88, HR90].)
The COMMON model [Kuc82] requires that all processors simultaneously
writing to the same memory cell write a common value. On the ARBITRARY
model [Vis83], an arbitrary one of the values written to a memory cell at a
given step will appear in the cell. Any algorithm for this model must work
regardless of which value is chosen (say, by an adversary) to resolve each write
conflict that arises. In the PRIORITY model [Gol82], processors are assigned
fixed, distinct priorities. The processor of highest priority among those that
simultaneously write to a memory cell succeeds. Without loss of generality,
we assume that lower indexed processors have higher priority.
Any algorithm that runs on ARBITRARY will run unchanged on
PRIORITY. Thus PRIORITY is at least as powerful as ARBITRARY. Similarly,
ARBITRARY is at least as powerful as COMMON, COMMON is at least as
Page 27
powerful as the CREW PRAM, and the CREW PRAM is at least as powerful
as the EREW PRAM. Diagrammatically,
PRIORITY ≥ ARBITRARY ≥ COMMON ≥ CREW PRAM ≥ EREW PRAM.
It is important to understand the relationships between these models.
Specifically, are some of these models strictly more powerful than others?
Can additional resources compensate for restricted access to shared memory?
Tradeoffs can help a computer architect choose among a number of design
alternatives or help a programmer compare algorithms written on different
models. A step by step simulation of a more powerful model by a less powerful
model is particularly useful because it allows the straightforward
transformation of algorithms designed for the former into algorithms that can run on
the latter.
TABLE 21.1
Relationships Between PRAM Models
The fourth column contains bounds on the amount of time for the simulating
model using the indicated number of processors to simulate one step of the
original model with p processors. For the last three rows, the input is initially
distributed among the processors' local memories and both the original and
simulating models have m shared memory cells.
Original Model | Simulating Model | Number of Processors | Amount of Time | Reference
(The table body is garbled in the extraction; consult the original PDF. Recoverable entries include: PRIORITY, ARBITRARY, and COMMON simulated by an EREW or CREW PRAM with p processors in Θ(log p) time (Thm 21.2, 21.3, 21.23, 21.7); and PRIORITY simulated by COMMON or ARBITRARY with p processors in Θ(log(p/m)) time (Thm 21.42, Ex 21.41).)
Table 21.1 summarizes some of the known relationships between the
PRAM models discussed above. The remainder of this section discusses these
Page 28
and related results and the conditions under which they apply. Some of the
lower bounds are presented in Section 4.
The first simulation is quite simple. Moreover, it contains ideas that are
used in other simulations in this section.
THEOREM 21.2 [FRW88a]
One step of PRIORITY with p processors and m shared memory cells
can be simulated by an EREW PRAM in O(log p) steps with p processors
and mp shared memory cells.
PROOF
Each PRIORITY processor is simulated by a corresponding EREW
PRAM processor. For each original shared memory cell in PRIORITY,
the EREW PRAM will have p shared memory cells. One of these
will contain the same sequence of values as the original cell. The other
p − 1 cells are assumed to be initialized to 0 and will be used to resolve
conflicting accesses to the cell.
To simulate the write phase, each processor must know whether it is the
leftmost processor (i.e. the processor of lowest index) wishing to write
into some cell and, if so, it can perform the write. This leads to the
definition of the following problem.
LEFTMOST WRITERS
Each processor Pi has a value in the range {0, ..., m} in its
local memory, denoting the shared memory cell that Pi wants
to access. The value 0 indicates that Pi does not want to
access any cell. Each processor Pi must determine whether it
is the leftmost processor among those with the same nonzero
value.
One way to solve this problem is to solve m simultaneous instances of
the following problem, one for each of the original shared memory cells.
LEFTMOST PRISONER
Each processor has a bit, known only to itself, which is 1 if
the processor wants to access the given memory cell. Each
processor with value 1 must determine whether it is the
leftmost processor with value 1. Only processors with value 1 can
participate in the computation.
Page 29
The name of this problem comes from imagining the processors to be
prisoners who cooperate with one another to determine the lowest num-
bered occupied prison cell. Note that all m instances can be solved
simultaneously because every processor participates in at most one com-
putation.
The additional p ; 1 memory cells associated with each original memory
cell are used to solve its LEFTMOST PRISONER problem. They are
viewed as representing the internal nodes of a binary tree with p leaves
(one corresponding to each processor) and depth dlog2 pe.
All processors with value 1 are considered to be initially located at the
corresponding leaves of the tree. The internal nodes of the tree are
processed one level at a time, starting from the bottom. Each processor
located at the right child of a node in the current level reads the value
stored at the left child. When this value is 0, there is no processor located
at the node's left child. In this case, the processor proceeds to the node
and writes the value 1 there. Simultaneously, each processor located at
the left child of a node in the current level proceeds to the node and
writes the value 1 there. The processor that reaches the root of the tree
is the leftmost processor that wants to access the given shared memory
cell. Hence, it can perform the write to the cell. (An example appears
in Figure 21.1.) Finally, processors retrace their steps back down the
tree, erasing the values they have written at the nodes.
To simulate the read phase, it is sucient to choose one processor to read
each desired memory cell and distribute its contents to all interested pro-
cessors. This can also be accomplished by performing the LEFTMOST
PRISONER algorithm and having the processor that reaches the root
read the given shared memory cell. As the processors go back down the
tree, this value is written at each internal node that a processor reaches
and is then erased. A processor waiting at a node can read the value
from its parent at the appropriate time.
A simplified version of this algorithm can be performed on COMMON
without using more shared memory than the original machine.
EXERCISE 21.5
[FRW88a] Prove that one step of PRIORITY with p processors and m
shared memory cells can be simulated by COMMON in O(log p) steps
with p processors and m shared memory cells.
Strana 30
21.2 Restrictions on Access to Shared Memory 9
[Figure 21.1: tree diagrams omitted from the text extraction; see the original PDF.]
FIGURE 21.1
An example illustrating the processors going up the tree in the LEFTMOST PRISONER algorithm in the proof of Theorem 21.2. Here processors P4, P5, and P7 want to access the shared memory cell and processors P1, P2, P3, and P6 do not.
Strana 31
The following simulation is more complicated than the simulation in
Theorem 21.2, but uses fewer shared memory cells. When m ∈ Ω(p), Exercise
21.3 and Theorem 21.1 can be applied to this result to show that no additional
processors or shared memory cells are needed for simulating PRIORITY by
an EREW PRAM in O(log p) steps.
THEOREM 21.3 [Eck79, Vis83, NS81]
One step of PRIORITY with p processors and m shared memory cells
can be simulated by an EREW PRAM in O(log p) steps with p processors
and m + p shared memory cells.
PROOF
As in the proof of Theorem 21.2, each original shared memory cell has
a corresponding shared memory cell in the EREW PRAM and the local
operations of each original processor are performed by a corresponding
processor in the EREW PRAM. An auxiliary array A of size p is also
used.
At a given read or write phase in the PRIORITY algorithm, if processor
Pi wants to access cell Mj, it writes the pair (j, i) to memory cell Ai. If
Pi doesn't want to access any shared memory cell, it writes (0, i) to Ai.
Notice that each of the indices 1, …, p occurs as the second component
of exactly one element of A. For example, if P1 wants to access cell M2 ,
P2 wants to access cell M4, P3 wants to access cell M2 , P4 wants to
access cell M1 , P5 wants to access cell M4, P6 wants to access cell M2 ,
and P7 does not want to access any cell, the array A would have the
following contents.
(2,1) (4,2) (2,3) (1,4) (4,5) (2,6) (0,7)
The array A is then sorted into lexicographic order. This takes O(log p)
steps on an EREW PRAM with p processors. (See Chapter 12.)
(0,7) (1,4) (2,1) (2,3) (2,6) (4,2) (4,5)
Next, each processor Pi appends a bit to the contents of cell Ai. This
bit is 0 when the first component of Ai is either 0 or the same as the
first component of A_{i-1}; otherwise this bit is 1. Observe that processor
Pi is the highest priority processor wishing to access cell Mj if the triple
(j, i, 1) appears in the array A.
(0,7,0) (1,4,1) (2,1,1) (2,3,0) (2,6,0) (4,2,1) (4,5,0)
Strana 32
In the example, P4 is the highest priority processor that wants to access
M1 , P1 is the highest priority processor that wants to access M2 , and
P2 is the highest priority processor that wants to access M4.
From this point, it is easy to finish the simulation of a write phase. First,
each processor Pk reads the triple (j, i, b) in Ak and writes this triple
into Ai. Then processor Pi reads the triple (j, i, b) from Ai. If the bit b
has value 1, then Pi is the highest priority processor that wants to write
to Mj , so it can perform its write.
(2,1,1) (4,2,1) (2,3,0) (1,4,1) (4,5,0) (2,6,0) (0,7,0)
In this case, P1 writes to cell M2 , P2 writes to cell M4 , and P4 writes to
cell M1 .
Once the appropriate bit has been appended to the contents of each cell
of A, a read phase can be simulated as follows. Each processor Pk reads
the triple (j, i, b) in Ak. If the bit b has value 1, then Pk reads shared
memory cell Mj and overwrites the third component of Ak with the value
vj it read. These values are then duplicated as appropriate. Specifically,
for l = 1, …, ⌈log2 p⌉, processor Pk overwrites the third component of
A_{k+2^(l-1)} with the third component of Ak, provided they have the same
first components. Here are the successive contents of the array in the
example.
(0,7,0) (1,4,v1) (2,1,v2) (2,3,0) (2,6,0) (4,2,v4) (4,5,0)
(0,7,0) (1,4,v1) (2,1,v2) (2,3,v2) (2,6,0) (4,2,v4) (4,5,v4)
(0,7,0) (1,4,v1) (2,1,v2) (2,3,v2) (2,6,v2) (4,2,v4) (4,5,v4)
Finally, each processor Pk reads the triple (j, i, b) in Ak and writes it
into Ai.
(2,1,v2) (4,2,v4) (2,3,v2) (1,4,v1) (4,5,v4) (2,6,v2) (0,7,0)
Now processor Pi can read the value it needs from Ai.
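The whole sort-and-flag scheme can be traced with a short sequential sketch (ours, not the book's); the O(log p) EREW sorting and routing steps are collapsed into ordinary Python operations:

```python
# Sequential sketch of Theorem 21.3's write-phase simulation: given each
# processor's requested cell (0 = no request), compute, for every requested
# cell j, the highest-priority (lowest-index) processor that accesses it.

def priority_writers(requests):
    """requests[i] = cell that processor i+1 wants to access, or 0."""
    p = len(requests)
    # Step 1: build the array A of pairs (j, i); processors numbered from 1.
    A = [(j, i + 1) for i, j in enumerate(requests)]
    # Step 2: sort lexicographically (O(log p) EREW steps; see Chapter 12).
    A.sort()
    # Step 3: append bit b = 1 iff this is the first pair with cell j != 0.
    triples = []
    for k, (j, i) in enumerate(A):
        b = 1 if j != 0 and (k == 0 or A[k - 1][0] != j) else 0
        triples.append((j, i, b))
    # Step 4: route triple (j, i, b) back to position i (one EREW write each).
    back = [None] * (p + 1)
    for (j, i, b) in triples:
        back[i] = (j, i, b)
    return {i: j for (j, i, b) in back[1:] if b == 1}

# The running example from the text: P1 -> M2, P2 -> M4, P4 -> M1.
print(priority_writers([2, 4, 2, 1, 4, 2, 0]))  # -> {1: 2, 2: 4, 4: 1}
```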
Concurrent write PRAMs with p processors cannot, in general, be simulated
by exclusive write PRAMs without an Ω(log p) factor increase in time,
even with an infinite number of processors and shared memory cells. This is
because the OR of n Boolean variables can be computed in one step on COMMON
with n processors and one shared memory cell, but requires Ω(log n)
Strana 33
steps on a CREW PRAM with an infinite number of processors and shared
memory cells. (See Theorem 21.23.)
EXERCISE 21.6
Prove that the OR of n Boolean variables can be computed in O(log n)
steps on an EROW PRAM with n/log2 n processors and n/log2 n shared
memory cells.
When the amount of shared memory m is small, significantly more time
is needed to compute the OR of n input bits on a CREW PRAM, even with
an infinite number of processors. If the input bits are initially located in
the processors' local memories, then Ω(n/m) steps are required (see Exercise
21.44), whereas, if they are located in a separate read-only shared memory,
Ω(√(n/m)) steps are required [Bea87, VW85].
EXERCISE 21.7
Prove that one step of PRIORITY with p processors and m shared
memory cells can be simulated by an EREW PRAM in O(p/m) steps
with p + m^2 processors and m shared memory cells.
EXERCISE 21.8
[VW85] Prove that the OR of n Boolean variables located in a separate
read-only shared memory can be computed in O(√(n/m) + log m) steps
on an EREW PRAM with O(√(nm)) processors and m shared memory
cells.
EXERCISE 21.9
Suppose the input consists of n bits located in a separate read-only
shared memory. Prove that, in O(√(n/m) + log m) steps, an EREW
PRAM with O(√(nm)) processors and m shared memory cells can determine
whether the input contains two consecutive variables with value 1.
Prove that this problem can be solved in O(1) steps on COMMON with
n processors and 1 shared memory cell.
Both COMMON and ARBITRARY can simulate PRIORITY with only
a constant factor increase in time, by increasing the number of processors and
the amount of shared memory.
THEOREM 21.4 [Kuc82]
Strana 34
One step of PRIORITY with p processors and m shared memory cells
can be simulated by COMMON in O(1) steps with p(p-1)/2 processors and
m + p shared memory cells.
PROOF
Processors P1, …, Pp are responsible for simulating the original processors.
Simulation of the read and compute phases is trivial, since the two
models do not differ in this respect. To simulate the write phase, each
processor Pi writes the index of the location to which it wishes to write
into the auxiliary shared memory cell Ai. As in the proof of Theorem
21.3, the value 0 denotes the fact that Pi does not wish to write during
this step. For example, the array A might contain the following values.
2 4 2 1 4 2 0
Each of the p(p-1)/2 processors then reads a different pair of memory cells in
the array. If Ai and Aj contain the same value and i < j, the processor
that read these two cells writes 0 into Aj. This results in the following
array.
2 4 0 1 0 0 0
Finally, for 1 ≤ i ≤ p, memory cell Ai is read by processor Pi. If
Ai contains a nonzero value j, then processor Pi is the lowest indexed
processor that wants to write to location Mj and, thus, it can perform
the write.
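A minimal sequential sketch of this pairwise knock-out (ours, not the book's; the frozen copy `orig` models the fact that all comparisons concurrently read the values as originally written):

```python
# Sketch of the constant-time COMMON write phase in Theorem 21.4. The pair
# loop plays the role of the p(p-1)/2 simultaneous processors.

from itertools import combinations

def common_write_phase(A):
    """A[i] = index of the cell processor P(i+1) wants to write to, 0 = none."""
    orig = list(A)               # values as written in the first step
    A = list(A)
    for i, j in combinations(range(len(A)), 2):   # all pairs, i < j
        # If two processors target the same cell, the higher-indexed one
        # is knocked out by a (possibly concurrent) write of 0 into Aj.
        if orig[i] != 0 and orig[i] == orig[j]:
            A[j] = 0
    return A

print(common_write_phase([2, 4, 2, 1, 4, 2, 0]))  # -> [2, 4, 0, 1, 0, 0, 0]
```

Only COMMON's guarantee is needed: every processor that writes into a given Aj writes the same value, 0.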
The number of processors used to achieve a constant time simulation
can be reduced if more shared memory is available.
EXERCISE 21.10
[FRW88a] Prove that one step of PRIORITY with p processors and m
shared memory cells can be simulated by COMMON in O(1/ε) steps
with min{p + mp^ε, p^(1+ε)} processors and mp^(ε/2) shared memory cells, for
any ε ∈ O(1).
EXERCISE 21.11
[Cha91] A semi-oblivious PRAM algorithm is a PRAM algorithm such
that, for each processor and at each step of the computation, there is
at most one shared memory cell to which it writes. However, whether a
Strana 35
particular processor decides to write at a particular step may depend on
the state it is in (and hence on the values of the input). Prove that any
step of a semi-oblivious PRIORITY algorithm that uses p processors
and m shared memory cells can be simulated in constant time by a
semi-oblivious COMMON algorithm using p processors and m+p shared
memory cells.
THEOREM 21.5 [CDHR88]
One step of PRIORITY with p processors and m shared memory cells
can be simulated by COMMON in O(1) steps with p log2 p processors and
mp + p shared memory cells.
PROOF
As in the proof of Theorem 21.2, for each shared memory cell Mj , the
LEFTMOST PRISONER problem is solved using a binary tree with p
leaves. However, instead of having processors proceed up the tree one
level at a time, the bit values at all the internal nodes are determined
simultaneously.
For every original processor Pi, there are log2 p processors in the
simulating machine. If processor Pi wants to write to memory cell Mj, these
processors simultaneously write the value 1 to each ancestor of the ith
leaf that has the ith leaf in its left subtree.
Note that, after this step, an internal node has value 1 if and only if
there is a leaf in its left subtree corresponding to an original processor
that wants to write to Mj . Thus, processor Pi is the leftmost processor
that wants to write to memory cell Mj if and only if every ancestor of
the ith leaf that has the ith leaf in its right subtree contains the value 0.
This can be determined in constant time by computing the OR of these
values using one additional memory cell associated with processor Pi.
Consider the example in Figure 21.2. Since P3 is in its parent's and
great grandparent's left subtree and in its grandparent's right subtree,
the OR that is computed has value 0. However, P4 is in its parent's and
grandparent's right subtrees and P5 is in its great grandparent's right
subtree, so the ORs that are computed for both these processors have
value 1.
Strana 36
[Figure 21.2: tree diagram omitted from the text extraction; see the original PDF.]
FIGURE 21.2
The result of the first step of the LEFTMOST PRISONER algorithm in the proof of Theorem 21.5. In this example, p = 7 and processors P3, P4, and P5 wish to write to the shared memory cell.
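The simultaneous ancestor marking can be traced sequentially; in the sketch below (ours, with 0-based leaves padded to a power of two), the two loops stand for the two constant-time phases: concurrent marking of left-subtree ancestors, then one OR per candidate processor.

```python
# Sequential sketch of Theorem 21.5's constant-time LEFTMOST PRISONER
# solution. len(values) must be a power of two (pad with zeros).

def leftmost_via_ancestor_bits(values):
    size = len(values)
    depth = size.bit_length() - 1
    # Leaf i's ancestor l levels up is node i >> l at that level, and i lies
    # in its left subtree iff bit l-1 of i is 0.
    marks = {}
    for i, v in enumerate(values):        # all marks written in one step
        if v == 1:
            for l in range(1, depth + 1):
                if (i >> (l - 1)) & 1 == 0:        # left-subtree ancestors
                    marks[(l, i >> l)] = 1
    winners = []
    for i, v in enumerate(values):        # one OR per candidate processor
        if v == 1:
            right_side = [marks.get((l, i >> l), 0)
                          for l in range(1, depth + 1)
                          if (i >> (l - 1)) & 1 == 1]
            if not any(right_side):       # no competitor to the left
                winners.append(i)
    return winners[0] if winners else None

# Figure 21.2's instance (p = 7 padded to 8): P3, P4, P5, i.e. 0-based 2, 3, 4.
print(leftmost_via_ancestor_bits([0, 0, 1, 1, 1, 0, 0, 0]))  # -> 2
```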
Even without increasing the number of processors, it is still possible to
simulate PRIORITY faster on COMMON than on the EREW PRAM.
THEOREM 21.6 [FRW88b]
One step of PRIORITY with p processors and m shared memory cells
can be simulated by COMMON in O(log p / log log p) steps with p processors
and O(mp) shared memory cells.
PROOF
Here the LEFTMOST PRISONER problem is solved using a completely
different approach. For p = 2, the problem can be solved in one write
phase and one read phase using one shared memory cell that is initialized
to 0. Specifically, if P1 has value 1, it writes 1 into the cell. In this case,
it is the leftmost processor with value 1. If P2 has value 1, it reads the
cell. It is the leftmost processor with value 1 if and only if it reads the
value 0.
Given an algorithm that solves the problem for p = t! in t - 1 write
and read phases using m_{t-1} shared memory cells, we can construct an
algorithm that solves the problem for p = (t + 1)! in t write and read
phases using m_t = (t + 1) m_{t-1} + 1 shared memory cells, as follows. The
problem is divided into t + 1 subproblems of size t!. Group 1 consists
of the first t! processors, group 2 consists of the next t! processors, etc.
Each subproblem is solved independently, using m_{t-1} shared memory
Strana 37
cells and t - 1 of the write and read phases. One shared memory cell
is used for the groups to interact with one another, so that the group
containing the leftmost processor with value 1 can be determined. This
cell initially contains the value 0. In the ith write phase, all processors
(with value 1) in group i write 1 to the cell and in the next read phase
all processors (with value 1) in group i + 1 read the cell. Notice that
after the ith write phase, the cell contains the value 0 if and only if
no processors in the first i groups have the value 1. Thus if there are
any processors with value 1 in group i + 1, they will learn whether their
group contains the leftmost processor with value 1.
The number of shared memory cells used to solve each leftmost prisoner
problem satisfies the recurrence
m_1 = 1
m_t = (t + 1) m_{t-1} + 1 for t > 1.
This implies m_t ∈ O((t + 1)!) = O(p). Therefore the total number of
shared memory cells used in the simulation is O(mp).
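The recurrence is easy to check numerically; this small script (ours) confirms m_t < (t + 1)! for the first few values of t, i.e. the memory per LEFTMOST PRISONER instance stays in O(p) for p = t!.

```python
# Check of the recurrence in Theorem 21.6: m_1 = 1, m_t = (t+1)*m_{t-1} + 1,
# and the claim m_t in O((t+1)!) = O(p) for p = t!.

import math

def memory_cells(t):
    m = 1
    for s in range(2, t + 1):
        m = (s + 1) * m + 1
    return m

for t in range(1, 10):
    p = math.factorial(t)
    m = memory_cells(t)
    assert m < math.factorial(t + 1), (t, m)   # m_t stays below (t+1)!
    print(f"t={t}  p=t!={p}  m_t={m}  (t+1)!={math.factorial(t + 1)}")
```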
The ideas in the proofs of the two preceding theorems can be combined
to get increasingly faster simulations of PRIORITY by COMMON as the
number of processors in the simulating machine increases.
EXERCISE 21.12
[Ragb, Bop89] Prove that one step of PRIORITY with p processors
and m shared memory cells can be simulated by COMMON in t steps
with kp processors, for 1 ≤ k ≤ log2 p, provided t log2 t ∈ O((log2 p)/k), i.e.
t ∈ O(log p / (k(log log p - log k))). How many shared memory cells do you use?
A matching Ω(log p / (k(log log p - log k))) lower bound for solving the LEFTMOST
PRISONER problem on COMMON has been achieved [Ragb]. This does not
imply that the simulation of PRIORITY by COMMON in Exercise 21.12 is
optimal; it only implies that a better simulation cannot be achieved by this
approach. However, the algorithm is known to be optimal for simulating
either PRIORITY or ARBITRARY with sufficiently large amounts of shared
memory. Consider the following problem.
ELEMENT DISTINCTNESS
Given n input values x1, …, xn in the range {1, …, r}, determine
if they are pairwise distinct (i.e. xi ≠ xj for i ≠ j).
Strana 38
ELEMENT DISTINCTNESS is easy to solve on ARBITRARY (and, hence,
PRIORITY), given sufficient shared memory.
EXERCISE 21.13
Prove that ELEMENT DISTINCTNESS can be solved in O(1) steps on
ARBITRARY using n processors and r shared memory cells.
An algorithm for solving ELEMENT DISTINCTNESS on COMMON can be
obtained by simulating the algorithm in Exercise 21.13 using the simulation
in Exercise 21.12. If the size, r, of the input domain is sufficiently large, then
the resulting algorithm is optimal.
THEOREM 21.7 [Bop89, Edm91]
COMMON with kn processors and an infinite amount of shared memory
requires Ω(log n / (k(log log n - log k))) steps to solve ELEMENT DISTINCTNESS,
provided the size of the input domain is sufficiently large.
This result implies that, for r sufficiently large, COMMON with kn processors
and infinite shared memory requires Ω(log n / (k(log log n - log k))) times as many
steps to solve ELEMENT DISTINCTNESS as PRIORITY or ARBITRARY
with n processors and r shared memory cells. Thus, the simulation mentioned
in Exercise 21.12 is optimal whenever the machine being simulated
has a sufficiently large shared memory. However, when the amount of shared
memory m is smaller (for example, only exponential in the number of processors),
the optimality of this simulation is unknown. The difficulty with using
ELEMENT DISTINCTNESS to separate COMMON from ARBITRARY and
PRIORITY in this case is that, if the domain size r is small, there are not
good lower bounds known for ELEMENT DISTINCTNESS on COMMON
and, if r is large, there are not good upper bounds known for ELEMENT
DISTINCTNESS on ARBITRARY or PRIORITY.
ARBITRARY can simulate PRIORITY substantially faster than COMMON can.
THEOREM 21.8 [CDHR88]
One step of PRIORITY with p processors and m shared memory cells
can be simulated by ARBITRARY in O(log log p) steps using p processors
and m(p - 1) shared memory cells.
PROOF
Strana 39
This algorithm also works by solving the LEFTMOST PRISONER problem.
The processors are divided into √p contiguous groups of size √p.
Each group chooses a representative processor with value 1, if there is
one. This is done as follows, using one cell for each group. The cell is
initialized to 0. Then all processors that are in the group and have value
1 attempt to write their indices into the cell. The processor whose index
appears in the cell is chosen to be the representative for the group. All
processors (with value 1) in the group can then read the cell to find out
which processor is the representative.
For example, if there are nine processors, of which P5, P6, P7, and P9
have value 1, then the memory cells associated with the three groups
{1, 2, 3}, {4, 5, 6}, and {7, 8, 9} could contain the values 0, 6, and 7,
respectively.
Determining the leftmost group that contains a processor with value
1 is a subproblem of size √p that is solved recursively by the chosen
representatives. Those processors (with value 1) that were not chosen as
representatives are used to recursively determine the leftmost processors
with value 1 in their groups. Each of these subproblems has size less
than √p.
The total time to solve a LEFTMOST PRISONER problem of size p
satisfies the recurrence t(p) = t(√p) + O(1). This implies t(p) ∈ O(log log p).
It can also be shown inductively that at most p - 1 memory cells are
used to solve a LEFTMOST PRISONER problem of size p.
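Counting the recursion levels of this √p splitting confirms the doubly logarithmic bound; a tiny sketch (ours), using integer square roots:

```python
# Recursion depth of the sqrt-splitting scheme in Theorem 21.8:
# t(p) = t(sqrt(p)) + O(1), hence t(p) in O(log log p).

import math

def rounds(p):
    """Number of recursion levels for a problem of size p."""
    if p <= 2:
        return 1
    return 1 + rounds(math.isqrt(p))

for p in (4, 16, 256, 65536, 2**32):
    print(p, rounds(p))   # grows like log2(log2(p)) + 1
```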
EXERCISE 21.14
[Ragb] Prove that one step of PRIORITY with p processors and m
shared memory cells can be simulated by ARBITRARY in O(log log p / log(k + 1))
steps with kp processors, for 1 ≤ k ≤ log2 p. How many shared memory
cells do you use?
In particular, the result in Exercise 21.14 implies that one step of PRIORITY
with p processors can be simulated by ARBITRARY in O(1/ε) steps
using O(p(log p)^ε) processors, for 0 < ε ≤ 1.
There is a matching lower bound of Ω(log log p / log(k + 1)) steps on ARBITRARY
for solving a generalized version of the LEFTMOST PRISONER problem in
which each input bit is known to a group of k processors [Ragb]. It is open
whether a better simulation of PRIORITY by ARBITRARY can be achieved
in some other way.
Strana 40
When the amount of shared memory is small and the input is initially
distributed among the processors' local memories, o(log p) time simulations of
CRCW PRAMs with stronger write conflict resolution mechanisms by CRCW
PRAMs with weaker ones are not possible without increasing the number of
processors or shared memory cells. Specifically, Ω(log(p/m)) steps are required
to simulate either one step of ARBITRARY by COMMON or one step
of PRIORITY by ARBITRARY, using the same number of processors and
the same number of shared memory cells. (See Exercise 21.42 and Theorem
21.41.) If m ∈ O(p^(1-ε)) for some constant ε > 0, this implies that Ω(log p)
steps are required.
It is more difficult to obtain lower bounds when the input is located
in a separate read-only shared memory. In this case, Ω(log log(p/m)) steps
are known to be required to simulate ARBITRARY with p processors and m
shared memory cells by COMMON with p processors and m shared memory
cells, and Ω(log(p/m) / log log(p/m)) steps are known to be required to simulate
PRIORITY with p processors and m shared memory cells by ARBITRARY
with m shared memory cells and any number of processors [FLRY89]. In
contrast, PRIORITY and ARBITRARY require the same amount of time
to compute any symmetric Boolean function using one shared memory cell
[LY86, LY87].
21.3
Relationships Between PRAMs and Other Models
The PRAM is closely related to other models of computation. This
section considers relationships between PRAMs and four theoretical models:
Boolean circuits, unbounded fan-in Boolean circuits, Turing machines, and
decision trees. These relationships are important for obtaining a better un-
derstanding of parallel computation and because good upper and lower bounds
on PRAMs can often be obtained from corresponding bounds on these models.
A Boolean circuit is a directed acyclic graph whose nodes are either
inputs, which have fan-in 0, or gates, which have fan-in at most 2. The inputs
are labelled by distinct variables. Each gate is labelled by a Boolean function
whose arity is equal to the fan-in of the gate. Some nodes are also designated
as outputs. The depth of a Boolean circuit is the length of the longest directed
path from an input to an output and its size is the number of gates it contains.
A CROW PRAM can easily simulate a Boolean circuit using one processor
for each gate. The processor reads the values at the nodes with edges
Strana 41
directed into the gate as they become available, performs the gate computa-
tion, and writes the result in its shared memory cell. If the fan-out of every
node in the circuit is bounded by a constant, then the simulation can be done
using only exclusive reads. However, any Boolean circuit can be converted
to a Boolean circuit with nodes of fan-out at most 2 and only constant
factor increases in size and depth [HKP]. Therefore concurrent reads are not
necessary.
THEOREM 21.9
Any Boolean circuit of depth d and size s can be simulated by an
EROW PRAM in O(d) steps using s processors and s shared memory
cells.
An unbounded fan-in Boolean circuit is a directed acyclic graph com-
posed of input nodes with fan-in 0, NOT gates with fan-in 1, and unbounded
fan-in AND and OR gates. Its depth is the length of the longest directed path
from an input to an output and its size is the number of edges in the graph.
Since COMMON can compute the AND or OR of n bits in one step using n
processors and one shared memory cell, it can easily simulate any unbounded
fan-in Boolean circuit.
THEOREM 21.10 [SV84]
Any unbounded fan-in Boolean circuit of depth d and size s can be sim-
ulated by COMMON in d steps using s processors and s shared memory
cells.
Any Boolean function f : {0,1}^n → {0,1} can be expressed by a formula
of size O(n·2^n) in disjunctive or conjunctive normal form and thus can be
computed by COMMON in two steps using exponentially many processors
and cells. More generally, COMMON can use table look-up to compute any
function f : {0,1}^n → R with Boolean inputs in two steps with n·2^n processors
and 2^n shared memory cells. The idea is to allocate n processors to each
of the 2^n possible inputs. Each group checks if it corresponds to the true
input by reading the input and performing an AND. One processor from the
unique group whose AND is 1 writes the answer. Thus it is impossible to get
nontrivial lower bounds on time for CRCW PRAMs without restricting the
number of processors or shared memory cells.
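The table look-up idea can be sketched as follows (ours, not the book's; the loop over g stands for the 2^n concurrent groups of n processors, and `f` plays the role of the precomputed answer table):

```python
# Sketch of the two-step table look-up: group g (one of 2^n) tests via an
# AND whether its index pattern equals the input; the unique matching
# group writes f's answer.

def table_lookup(f, bits):
    n = len(bits)
    answer = None                          # the single output cell
    for g in range(2 ** n):                # 2^n groups of n processors each
        pattern = [(g >> k) & 1 for k in range(n)]
        # Each of the group's n processors compares one bit; the AND is 1
        # only for the group whose pattern equals the true input.
        if all(pattern[k] == bits[k] for k in range(n)):
            answer = f(tuple(pattern))     # table entry for this input
    return answer

parity = lambda xs: sum(xs) % 2
print(table_lookup(parity, [1, 0, 1, 1]))  # -> 1
```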
EXERCISE 21.15
Strana 42
[CSV84] Prove that any Boolean function computed by a Boolean circuit
of depth O(log log n) can be computed by an unbounded fan-in circuit
of depth two and n^O(1) size.
The complexity classes NC^k and AC^k consist of those Boolean functions
that can be computed in O((log n)^k) depth and n^O(1) size by Boolean circuits
and unbounded fan-in Boolean circuits, respectively. Theorem 21.9 implies
that any problem in NC^k can be solved by an EROW PRAM in O((log n)^k)
steps using n^O(1) processors. Similarly, by Theorem 21.10, any problem in
AC^k can be solved by COMMON in O((log n)^k) steps using n^O(1) processors.
It follows from Exercise 21.15 that any problem in NC^k can be computed
by an unbounded fan-in circuit of depth O((log n)^k / log log n) and size n^O(1).
Combined with Theorem 21.10, this result shows that any problem in NC^k
can be solved by COMMON in O((log n)^k / log log n) steps using n^O(1) processors
and n^O(1) memory cells. Thus the upper bound on the time for solving
problems in NC^k can be improved by allowing concurrent writes.
The word size of a PRAM is the maximum number of bits that can be
contained in each cell of shared memory. Note that Theorems 21.9 and 21.10
and the corollaries mentioned above remain valid even with the restriction
that the word size is 1.
Next, we consider some sequential complexity classes. Specifically, let
DLOG denote the class of problems that can be solved by deterministic Tur-
ing machines using logarithmic space. Let NLOG be the analogous class for
nondeterministic Turing machines. Theorems 21.11 and 21.12 will show that
all problems in the classes DLOG and NLOG can be solved in logarithmic
time by the EROW PRAM and COMMON, respectively, using a polynomial
number of processors.
The configuration graph of a Turing machine on a given input has one
node for each possible combination of state, head positions, work tape
contents, and time. Directed edges correspond to transitions. Note that the
configuration graph is acyclic. Furthermore, since a Turing machine has a
constant size alphabet, every node has constant fan-in and fan-out.
If a Turing machine uses O(log n) space, then it uses n^O(1) time and
its configuration graph has n^O(1) nodes. Moreover, the configuration graph
can easily be constructed by an EROW PRAM in O(log n) steps using n^O(1)
processors. Each node in the configuration graph of a deterministic Turing
machine has fan-out at most one. In this case, parallel tree contraction (see
Chapter 2) can be used to determine whether there is a path from the initial
Strana 43
configuration to an accepting configuration.
THEOREM 21.11 [FW78, KR90]
Any problem in DLOG can be solved by an EROW PRAM in O(log n)
steps using n^O(1) processors.
For nondeterministic Turing machines, it suffices to compute the transitive
closure of the configuration graph. This can be done by an unbounded
fan-in Boolean circuit with O(log n) depth and n^O(1) size [Bor77] and, hence,
from Theorem 21.10, by COMMON in O(log n) steps using n^O(1) processors.
THEOREM 21.12
Any problem in NLOG can be solved by COMMON in O(log n) steps
using n^O(1) processors.
For any problem in NP, a superlogarithmic lower bound on the EREW
PRAM with a polynomial number of processors would imply that DLOG is
strictly contained in NP. Similarly, ω(log n / log log n) lower bounds on COMMON
with a polynomial number of processors for problems in NP would imply
that NC^1 is strictly contained in NP. Since it is currently unknown whether
the inclusions NC^1 ⊆ DLOG ⊆ NP are strict, such lower bounds could be
difficult to obtain.
Unbounded fan-in Boolean circuits can also simulate PRAMs. A re-
stricted instruction set PRAM is a PRAM in which only addition, subtraction,
comparison, bitwise Boolean operations, read, write, and indirect addressing
are allowed. Multiplication and division of small numbers can also be permit-
ted without changing the results.
THEOREM 21.13 [SV84]
A restricted instruction set PRAM with inputs of word size w using t
steps and p processors can be simulated by an unbounded fan-in circuit
of depth O(t) and size (ptw)^O(1).
In particular, any Boolean function that can be computed by a restricted
instruction set PRAM in O((log n)^k) steps using n^O(1) processors is in AC^k.
If COMMON(t) denotes those Boolean functions that can be computed by a
Strana 44
restricted instruction set COMMON PRAM in O(t) steps using n^O(1) processors, then
AC^0 ⊆ NC^1 ⊆ DLOG ⊆ NLOG ⊆ AC^1 ⊆ … ⊆ NC^k ⊆ AC^k,
where AC^0 = COMMON(1), AC^1 = COMMON(log n), and AC^k = COMMON((log n)^k), and where
NC^1 ⊆ COMMON(log n / log log n) ⊆ COMMON(log n) and NC^k ⊆ COMMON((log n)^k / log log n) ⊆ COMMON((log n)^k).
Note that by Theorem 21.4, these relationships are also valid when COMMON
is replaced by ARBITRARY or PRIORITY.
PRAMs that allow a limited number of other instructions such as arbitrary
multiplication, division, and shifts have also been studied [Tra88, TLR88].
An easy counting argument [Ruz] shows that almost all Boolean functions
require exponential size unbounded fan-in circuits. Combining this result
with Theorem 21.13 gives an analogous result for restricted instruction
set PRAMs.
COROLLARY 21.14
Almost all Boolean functions require an exponential number of processors
to be computed by a restricted instruction set PRAM in polynomial time.
Lower bounds for specific functions on unbounded fan-in Boolean circuits
can be translated into lower bounds on restricted instruction set PRAMs. For
example, consider the following lower bound for PARITY.
THEOREM 21.15
[Has87] Any unbounded fan-in circuit of depth k that computes PARITY
has size 2^Ω(n^(1/(k-1))).
In particular, any unbounded fan-in circuit with n^O(1) size computing the
PARITY of n input bits (i.e. x1 ⊕ … ⊕ xn) has depth Ω(log n / log log n).
Together with Theorem 21.13, this implies that any restricted instruction
set PRAM with n^O(1) processors requires Ω(log n / log log n) time to compute
Strana 45
PARITY. Since PARITY ∈ NC^1, it follows that this lower bound is tight to
within a constant factor.
Lower bounds for computing specific problems, such as integer addition,
have also been directly obtained on restricted instruction set CRCW PRAMs
[MR84]. These proofs and the lower bounds obtained using Theorem 21.13
depend in an essential way on the restricted instruction set.
In contrast, an abstract or ideal PRAM places no restriction on the
instruction set. At each step, a processor can compute any function of its
local information. Furthermore, shared memory cells are allowed to contain
arbitrarily large values. Although this model is unrealistic, lower bounds in
this model have great generality since they do not depend on any assumptions
about the instruction set. These lower bounds are actually lower bounds on
the amount of communication, as opposed to the amount of computation, nec-
essary to solve a problem. They give insight into the nature of communication
between processors and may point out where bottlenecks in communication
arise. Abstract PRAMs also approximate the situation where communication
to and from shared memory is much more expensive than local operations,
for example, where each processor is located on a separate chip and access to
shared memory is through a combining network.
Not surprisingly, abstract PRAMs can be much more powerful than
restricted instruction set PRAMs.
THEOREM 21.16
Any function of n variables can be computed by an abstract EROW
PRAM in O(log n) steps using n/log2 n processors and n/(2 log2 n) shared
memory cells.
PROOF
Each processor begins by reading log2 n input values and combining
them into one large value. The information known by the processors is
combined in a binary-tree-like fashion. In each round, the remaining
processors are grouped into pairs. In each pair, one processor communicates
the information it knows about the input to the other processor
and then leaves the computation. After ⌈log2 n⌉ rounds, one processor
knows all n input values. Then this processor computes the answer in a
single additional step.
Strana 46
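The pairing rounds of this proof can be traced sequentially (sketch ours, not the book's; `f` may be any function of all the inputs, which is exactly the point of the abstract model):

```python
# Sketch of the binary-tree information gathering in Theorem 21.16:
# processors pair up each round; one partner passes on everything it knows
# and retires. After ceil(log2 q) rounds one processor knows all inputs.

import math

def abstract_erow_compute(f, inputs, q):
    """q processors, each initially holding a block of len(inputs)/q values."""
    block = math.ceil(len(inputs) / q)
    knows = [inputs[i * block:(i + 1) * block] for i in range(q)]
    rounds = 0
    while len(knows) > 1:
        # Odd-indexed processors send their knowledge to even-indexed ones.
        knows = [knows[i] + (knows[i + 1] if i + 1 < len(knows) else [])
                 for i in range(0, len(knows), 2)]
        rounds += 1
    return f(knows[0]), rounds

f = lambda xs: max(xs) - min(xs)     # any function of all n inputs
print(abstract_erow_compute(f, list(range(16)), 4))  # -> (15, 2)
```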
In particular, superlogarithmic lower bounds for time cannot be obtained on
an abstract PRAM without severely restricting the numbers of processors or
shared memory cells.
EXERCISE 21.16
[Bea87, Bea88] Prove that any function f : {0,1}^n → R can be computed
by abstract COMMON in log2 n - log2 log2 (p/n) + O(1) steps using
p ≥ 2n processors and p/log2 (p/n) shared memory cells.
Processors in an abstract PRAM can read and write arbitrarily large
values. However, to handle large input and output values, unbounded fan-in
Boolean circuits need large numbers of gates. Except for this technicality,
unbounded fan-in Boolean circuits can simulate abstract PRAMs.
THEOREM 21.17 [LY89]
If an (abstract CRCW) PRAM computes a Boolean function in t steps
using p processors, then the function can be computed by an unbounded
fan-in Boolean circuit of depth O(t) and size p^(2^(t+O(1))).
This result was proved using Kolmogorov complexity. Specifically, given
a PRAM, the number of bits needed to describe the states of processors and
the addresses and contents of accessed shared memory cells does not grow too
quickly with time. In constant depth, an unbounded fan-in Boolean circuit
can compute these descriptions (rather than the actual states, addresses, and
contents) at any given step from the descriptions at the previous step.
COROLLARY 21.18 [BH89, Bea87]
Almost all Boolean functions of n variables require log₂ n − log₂ log₂ p +
Ω(1) steps to be computed by an (abstract CRCW) PRAM with p processors.
Together, Theorems 21.15 and 21.17 imply that a PRAM with n^{O(1)} processors
requires Ω(√log n) steps to compute the PARITY of n bits. Although
this lower bound is not tight, the proof is much easier than the direct proof
of Theorem 21.33.
Limiting the word size can significantly affect the amount of time required
to solve certain problems. Different organizations of memory, such as
a small number of shared memory cells with large word size or a large number
Strana 47
with small word size, can be better for different problems [Bel88, Bel91]. For
reasonable bounds on the word size, improved versions of Theorem 21.17 and
Corollary 21.18 can be obtained.
THEOREM 21.19 [Bel88, Bel91]
If an (abstract CRCW) PRAM with word size λ computes a Boolean
function using t steps and p processors, then the function can be computed
by an unbounded fan-in circuit of depth O(t) and size p^{O(1)}·2^{O(t)}.
COROLLARY 21.20
Almost all Boolean functions of n variables require Ω((n − log₂ p)/λ) steps
to be computed by an (abstract CRCW) PRAM with word size λ using p
processors.
From Theorems 21.15 and 21.19, it also follows that an (abstract CRCW)
PRAM with (log n)^{O(1)} word size and n^{(log n)^{O(1)}} processors requires
Ω(log n/log log n) steps to compute the PARITY of n input bits. Since
PARITY is in NC¹, this lower bound on time is tight to within a constant
factor.
The decision tree is a widely used model for obtaining sequential lower
bounds. An algorithm consists of a tree in which every internal node is labelled
by an input variable and every leaf is labelled by a possible answer. An internal
node has one child corresponding to each possible value of the input variable
that labels it. The execution of a decision tree algorithm begins at its root.
When an internal node is visited, the value of the input variable labelling the
node determines the appropriate child to visit next. The output is the label
of the leaf that is reached. The decision tree complexity of a problem is the
minimum depth of any decision tree that solves the problem (expressed as
a function of the number of input variables). Like the abstract PRAM, the
decision tree ignores the computation used to solve a problem. Rather, it
focuses attention on how much of the input must be examined to determine
the answer. There is a very close correspondence between the decision tree
model and the CROW PRAM.
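The decision tree model just described is easy to make concrete. In the sketch below (illustrative only, not the book's notation), a tree is either ("leaf", answer) or ("node", i, children), where children maps each possible value of the input variable x_i to a subtree:

```python
def evaluate(tree, x):
    """Walk from the root to a leaf, reading one input variable per node."""
    while tree[0] == "node":
        _, i, children = tree
        tree = children[x[i]]      # child selected by the value of x_i
    return tree[1]                 # output = label of the leaf reached

def height(tree):
    """Depth of the tree; its minimum over all correct trees for a given
    function is that function's decision tree complexity."""
    if tree[0] == "leaf":
        return 0
    return 1 + max(height(c) for c in tree[2].values())

# height-2 decision tree computing x[0] AND x[1]
t_and = ("node", 0, {
    0: ("leaf", 0),
    1: ("node", 1, {0: ("leaf", 0), 1: ("leaf", 1)}),
})
```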
THEOREM 21.21 [Raga]
Strana 48
Any function that can be computed by a CROW PRAM in t steps can
be computed by a decision tree of height 2^t.
PROOF
Consider any CROW PRAM that computes a function in t steps using
p processors. For 1 ≤ i ≤ p and 1 ≤ t' ≤ t, let S(i, t') denote an ordered
pair consisting of the state of processor Pi and the contents of its corresponding
memory cell Mi immediately after step t'. We will inductively
define decision trees T(i, t') that compute S(i, t'). The desired decision
tree can then be obtained from T(1, t) if the ordered pair labelling each
leaf is replaced by its second component.
Let qi be the initial state of processor Pi. If Mi initially contains the ith
input value, then T(i, 0) is a tree of height 1. Its root is labelled xi and
there is a leaf labelled (qi, v) for each possible value v of xi. Otherwise,
Mi initially contains the value 0 and T(i, 0) consists of a single node
labelled (qi, 0).
The decision tree T(i, t' + 1) is created by modifying the decision tree
T(i, t'). Each leaf labelled (q, v) in T(i, t') indicates that processor Pi
is in state q and memory cell Mi contains v at the end of step t' on
those inputs that lead to the leaf. Suppose that, in this state, Pi reads
memory cell Mi' during step t' + 1. Then each leaf of T(i, t') labelled
(q, v) is replaced by the decision tree T(i', t') and the labels of the leaves
are changed appropriately. Specifically, the label (q', v') of a leaf in this
subtree is changed to (q'', v'') if Pi goes into state q'' and writes the value
v'' as a result of reading the value v' from Mi'.
It is easy to verify that the height of T(i, t') is at most 2^{t'}.
THEOREM 21.22 [Raga]
If a function can be computed by a decision tree of height h, then
it can be computed by a CROW PRAM in 1 + ⌈log₂ h⌉ steps.
PROOF
Consider any decision tree of height h. Associate one CROW PRAM
processor with each node in the decision tree. Let processor P1 be associ-
ated with the root. To begin the computation, each processor associated
with an internal node reads the value of the input variable labelling its
Strana 49
node and writes a pointer to the processor associated with the child cor-
responding to that value. Then, using pointer jumping, the path in the
decision tree from the root to a leaf can be determined in ⌈log₂ h⌉ more
steps.
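The simulation in this proof can be mimicked sequentially. In the sketch below (an illustration under assumed encodings, not the book's code), the first step writes one child pointer per internal node, and the remaining ⌈log₂ h⌉ steps shortcut the root to a leaf by pointer jumping:

```python
import math

def crow_search(nodes, root, h, x):
    """Sketch of Theorem 21.22: one write step, then pointer jumping.

    nodes: id -> ("leaf", answer) or ("node", var_index, {value: child_id}).
    Step 1: each internal node's processor reads its input variable and
    writes a pointer to the matching child (leaves point to themselves).
    Then ceil(log2 h) pointer-jumping steps compose the pointers.
    """
    succ = {}
    for v, node in nodes.items():
        succ[v] = v if node[0] == "leaf" else node[2][x[node[1]]]
    for _ in range(math.ceil(math.log2(h))):
        succ = {v: succ[succ[v]] for v in succ}   # pointer doubling
    return nodes[succ[root]][1]

# height-2 tree for x[0] AND x[1]; node 0 is the root
nodes = {
    0: ("node", 0, {0: 1, 1: 2}),
    1: ("leaf", 0),
    2: ("node", 1, {0: 3, 1: 4}),
    3: ("leaf", 0),
    4: ("leaf", 1),
}
```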
From Theorem 21.21, a lower bound of h on a decision tree implies
a lower bound of ⌈log₂ h⌉ on a CROW PRAM. Thus the logarithm of the
decision tree complexity of a problem almost exactly characterizes the time it
takes to solve the problem on a CROW PRAM.
EXERCISE 21.17
[Sni] Prove that a CROW PRAM requires ⌈log₂ n⌉ steps to compute the
OR of n input bits, even if it is known that all possible inputs contain
at most one bit with value 1.
EXERCISE 21.18
[FR90] Prove that a CROW PRAM requires ⌈log₂(n−1)⌉ steps to reverse
the direction of the links in a singly linked list of length n.
21.4 Lower Bound Techniques
Many of the lower bound proofs created specifically for (abstract) PRAMs
look at how processors and memory cells accumulate knowledge about the
input as computation proceeds and what kinds of knowledge can be accumulated.
They also find ways of measuring that knowledge and then showing
that it cannot increase too quickly.
One of the facts that makes proving PRAM lower bounds difficult is
that processors can communicate information by deciding not to write to a
particular shared memory cell. This is nicely illustrated by the following
example in which x1 ∨ x2 ∨ x3 ∨ x4 ∨ x5 is computed in 2 steps on an EREW
PRAM. Note that, by Exercise 21.17, a CROW PRAM requires 3 steps for
this computation.
Initially, memory cells M1, M2, M3, M4, and M5 contain the input bits
x1, x2, x3, x4, and x5, respectively. During step 1,
processor P1 reads M2 and, if x2 = 1, writes 1 into M1,
processor P3 reads M4 and, if x4 = 1, writes 1 into M3, and
processor P5 reads M5,
Strana 50
resulting in the following memory contents.
x1 ∨ x2,  x2,  x3 ∨ x4,  x4,  x5
At step 2, processor P5 reads M3. If x5 ∨ (x3 ∨ x4) = 1, then P5 writes 1
into M1. At this point, memory cell M1 contains the value x1 ∨ x2 ∨ x3 ∨ x4 ∨ x5,
as desired. Note that processor P5 communicates the fact that x3 ∨ x4 ∨ x5 = 0
by not writing into memory cell M1.
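The two steps above can be replayed directly. The following sketch is illustrative, with the Python list M[0..4] standing in for cells M1..M5; it confirms that M1 always ends up holding the OR:

```python
def erew_or5(x):
    """Replay the two-step EREW computation of x1 v x2 v x3 v x4 v x5.

    Communication by *not* writing: P5 writes only when its condition
    holds, yet M[0] ends up with the OR on every input.
    """
    M = list(x)                 # M[i] initially holds input bit x_{i+1}
    # step 1 (all reads and writes are exclusive)
    if M[1] == 1:               # P1 reads M2 ...
        M[0] = 1                # ... and writes 1 into M1
    if M[3] == 1:               # P3 reads M4 ...
        M[2] = 1                # ... and writes 1 into M3
    p5 = M[4]                   # P5 reads M5
    # step 2: P5 reads M3 (now x3 v x4) and conditionally writes M1
    if p5 | M[2] == 1:
        M[0] = 1
    return M[0]
```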
This idea can be generalized to show that the OR of n bits can be
computed almost 1.4 times faster on an EREW PRAM than on a CROW
PRAM. The t-th Fibonacci number, Ft, is defined by the recurrence
F0 = 0,
F1 = 1, and
Ft = F_{t−1} + F_{t−2} for t ≥ 2,
and satisfies the inequalities ((1+√5)/2)^{t−2} ≤ Ft ≤ ((1+√5)/2)^{t−1} for t > 0.
EXERCISE 21.19
[CDR86] Prove that an EREW PRAM can compute the OR of n input
bits using n processors and shared memory cells in t steps, provided
F_{2t+1} ≥ n.
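Numerically, the threshold F_{2t+1} ≥ n from this exercise can be compared with the CROW bound ⌈log₂ n⌉ of Exercise 21.17; the helper names below are illustrative:

```python
import math

def fib(t):
    """Iterative Fibonacci with F0 = 0, F1 = 1."""
    a, b = 0, 1
    for _ in range(t):
        a, b = b, a + b
    return a

def erew_or_steps(n):
    """Smallest t with F_{2t+1} >= n (the step count of Exercise 21.19)."""
    t = 0
    while fib(2 * t + 1) < n:
        t += 1
    return t

def crow_or_steps(n):
    """Lower bound of Exercise 21.17 for a CROW PRAM."""
    return math.ceil(math.log2(n))
```

For n = 100 this gives 6 EREW steps against 7 CROW steps; as n grows the ratio approaches 2·log₂((1+√5)/2) ≈ 1.39, the "almost 1.4 times faster" mentioned above.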
In the example above, there is at most one processor that writes into a
particular memory cell at a particular time step on any input. This is not the
case for all EREW and CREW PRAM algorithms. Depending on the input,
different processors may write into a particular cell at a given step. However,
the conditions under which the different processors perform these writes are
mutually exclusive. The fact that none of these conditions is satisfied by
the input is conveyed when no write occurs. Under some circumstances, this
could be a lot of information. For example, if the input is known to contain at
most one bit with value 1, then the OR of the input bits can be computed in
one step on an EREW PRAM. This is accomplished by having each processor
read an input bit and, if it reads the value 1, write 1 to memory cell M1.
In contrast, from Exercise 21.17, a CROW PRAM requires ⌈log₂ n⌉ steps to
perform this computation.
21.4.1 Sensitivity, Block Sensitivity, and Degree
The algorithm in Exercise 21.19 is optimal. More generally, it is possible
to characterize, to within a small constant factor, the time needed to compute
total n-ary functions on a CREW PRAM in terms of certain simple properties.
Strana 51
A function f with domain D = D1 × ⋯ × Dn and range R is sensitive
to the set of coordinates S ⊆ {1, …, n} on input x ∈ D if there exists an input
y ∈ D such that f(x) ≠ f(y) and xj = yj for all j ∉ S. The sensitivity or
critical complexity of f is
max_{x ∈ D} #{i | f is sensitive to {i} on input x}.
For example, the OR of n Boolean variables has sensitivity n (consider the
all-0 input) as does their PARITY (consider any input). The function that
computes the MAXIMUM of any n input values also has sensitivity n.
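For small n the definition can be checked by brute force. The sketch below is illustrative (Boolean domain only); it flips one coordinate at a time, which suffices for sensitivity to singleton sets:

```python
from itertools import product

def sensitivity(f, n):
    """Brute-force the sensitivity of f: {0,1}^n -> R (tiny n only)."""
    best = 0
    for x in product([0, 1], repeat=n):
        fx = f(x)
        s = sum(1 for i in range(n)
                if f(x[:i] + (1 - x[i],) + x[i+1:]) != fx)
        best = max(best, s)
    return best

OR = lambda x: int(any(x))
PARITY = lambda x: sum(x) % 2
```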
The sensitivity of a function can be used to obtain a lower bound on the
amount of time necessary to compute it on a CREW PRAM. This is done by
showing that the number of coordinates affecting the state of a processor or
the contents of a memory cell cannot grow too quickly as the computation
proceeds. Formally, coordinate i affects processor P (or memory cell M)
on input x ∈ D1 × ⋯ × Dn at time t if the state of P (or contents of M)
immediately after step t is different on the inputs x and y, for some input
y ∈ D1 × ⋯ × Dn that is the same as x except in coordinate i. Since the
maximum number of coordinates that affect memory cell M1 at the end of
the computation must be at least the sensitivity of the function f, this leads
to a lower bound on time.
THEOREM 21.23 [CDR86]
A CREW PRAM requires at least log_b(sensitivity(f)) steps to compute
a function f : D1 × ⋯ × Dn → R, where b = (5 + √21)/2.
PROOF
Let s(t) and c(t) denote the maximum number of coordinates affecting
any processor or shared memory cell, respectively, on any input at time
t. Then
s(0) = 0 and c(0) = 1.
Suppose processor P reads from memory cell M on input x during step
t. Then any coordinate i affecting P on input x at time t either affects
P on input x at time t − 1 or affects M on input x at time t − 1. To see
this, consider any input y that differs from x only in coordinate i and
causes P to have a different state immediately after step t. If, on input
y, P is in the same state at time t − 1 as it is on x, then it reads from
cell M during step t. Furthermore, if it finds the same value there, then
Strana 52
processor P is in the same state immediately after step t on both these
inputs, a contradiction. Thus
s(t) ≤ s(t − 1) + c(t − 1).
Now consider any shared memory cell M on any input x. There are two
cases to consider. If some processor P writes to M on input x during
step t, then any coordinate affecting M on input x at time t also affects
P on input x at time t. There are at most s(t) such coordinates.
Otherwise no processor writes to M on input x during step t. Let i be
any coordinate that affects M on input x at time t. If i does not affect
M on input x at time t − 1, there must be an input y^{(i)} that differs from
x only in coordinate i and a processor P^{(i)} that writes to M on input
y^{(i)} during step t.
Let j be any other coordinate that affects M on input x at time t.
If P^{(i)} ≠ P^{(j)}, then either coordinate j affects P^{(i)} on input y^{(i)} or
coordinate i affects P^{(j)} on input y^{(j)} at time t. Otherwise, during step
t, processors P^{(i)} and P^{(j)} would both write to M on input z, where
z_k = y_i^{(i)} if k = i,  z_k = y_j^{(j)} if k = j,  and z_k = x_k otherwise.
Suppose there are exactly v coordinates that affect M on input x at
time t, but not at time t − 1. Consider a graph whose vertices are these
v coordinates. In this graph there is an edge from i to j if and only if
P^{(i)} ≠ P^{(j)} and coordinate j affects P^{(i)} on input y^{(i)} at time t. Since
no processor is affected by more than s(t) coordinates at time t on a
given input, this graph contains at most v·s(t) edges.
For any processor P, at most s(t) of the v coordinates i are such that
P = P^{(i)}. This is because these coordinates all affect P on input x at
time t. Thus there are at least v·(v − s(t)) ordered pairs (i, j) such
that P^{(i)} ≠ P^{(j)}. At least half of these must be edges in the graph.
Hence v·s(t) ≥ v·(v − s(t))/2, which implies that v ≤ 3s(t). Since M
is affected by at most c(t − 1) + v coordinates at time t,
c(t) ≤ c(t − 1) + 3s(t).
From the solution of the resulting recurrence, it follows that s(t) ≤ c(t) <
b^t, so at least log_b(sensitivity(f)) steps are required to compute f.
Strana 53
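Taking the two recurrences of this proof with equality gives the worst-case growth, which indeed stays below b^t = ((5+√21)/2)^t; the sketch below simply iterates them (illustrative):

```python
def sc_growth(T):
    """Iterate s(t) = s(t-1) + c(t-1), c(t) = c(t-1) + 3*s(t) from
    s(0) = 0, c(0) = 1, pairing each value with b**t for comparison."""
    b = (5 + 21 ** 0.5) / 2
    s, c = 0, 1
    rows = []
    for t in range(1, T + 1):
        s = s + c
        c = c + 3 * s
        rows.append((t, s, c, b ** t))
    return rows
```

Here b is the larger eigenvalue of the matrix ((1, 1), (3, 4)) behind the pair of recurrences, which is where the base of the logarithm in Theorem 21.23 comes from.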
EXERCISE 21.20
[CDR86] Show that the OR of n input bits can be computed in k +
O(log k) steps on a CREW PRAM if the input is known to contain at
most k bits with value 1.
EXERCISE 21.21
Where does the proof of Theorem 21.23 break down for computing the
OR of n input bits, when the input is known to contain at most one bit
with value 1?
EXERCISE 21.22
Prove that a CREW PRAM requires Ω(min{k, log n}) steps to compute
the OR of n input bits, when the input is known to contain at most k
bits with value 1.
EXERCISE 21.23
[Bea87] Prove that no coordinate affects more than (2+√3)^t processors
or memory cells after t steps of an EREW PRAM computing a Boolean
function.
The function f depends on coordinate i if f is sensitive to {i} on some
input x ∈ D1 × ⋯ × Dn. In other words, there are two inputs x and y,
differing only on coordinate i, such that f(x) ≠ f(y).
THEOREM 21.24 [Sim82]
Every function f : {0,1}^n → {0,1} that depends on k of its coordinates
has sensitivity Ω(log k).
COROLLARY 21.25 [Sim82]
A CREW PRAM requires Ω(log log k) steps to compute any function
f : {0,1}^n → {0,1} that depends on k of its coordinates.
EXERCISE 21.24
Consider the Boolean addressing function f : {0,1}^n → {0,1} defined
by
f(x1, …, xr, y0, …, y_{2^r−1}) = yj,
where j = Σ_{i=1}^{r} xi·2^{r−i} and n = r + 2^r. It uses the binary number
formed by concatenating the first r bits as an index to select one of the
Strana 54
remaining bits. Prove that this Boolean function has sensitivity r + 1
and depends on all n of its coordinates. How quickly can you compute
this function on a CREW PRAM?
EXERCISE 21.25
Show that the addressing function f : {0, …, n−1} × {0,1}^n → {0,1}
defined by
f(i, y0, …, y_{n−1}) = yi
can be computed by a PRAM in constant time using only one processor.
Explain the difference between this result and Exercise 21.24.
A very useful generalization of sensitivity is block sensitivity [Nis89]. For
any function f with domain D = D1 × ⋯ × Dn, it is defined to be
max_{x ∈ D} max{k | f is sensitive to k disjoint subsets of coordinates on input x}.
Clearly, the sensitivity of a function f is bounded above by its block sensitivity.
However, there are functions whose sensitivity is less than their block
sensitivity. (See Exercises 21.26 and 21.27.) It is an open question whether
the sensitivity of a function is always at least some polynomial of its block
sensitivity or whether it can be exponentially smaller.
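For tiny n, block sensitivity can also be brute-forced. In the sketch below (illustrative, exponential time), a block B counts as sensitive on x when flipping every bit of B changes f; any sensitive set in the sense defined above can be shrunk to such a block, so the maximum number of disjoint blocks is unaffected:

```python
from itertools import product, combinations

def block_sensitivity(f, n):
    """Brute-force block sensitivity of f: {0,1}^n -> R (tiny n only)."""
    idx = list(range(n))

    def flip(x, B):
        return tuple(1 - x[i] if i in B else x[i] for i in idx)

    def pack(blocks):
        """Maximum number of pairwise disjoint blocks (exhaustive search)."""
        if not blocks:
            return 0
        first, rest = blocks[0], blocks[1:]
        skip = pack(rest)
        take = 1 + pack([b for b in rest if not (b & first)])
        return max(skip, take)

    best = 0
    for x in product([0, 1], repeat=n):
        fx = f(x)
        blocks = [frozenset(B)
                  for r in range(1, n + 1)
                  for B in combinations(idx, r)
                  if f(flip(x, B)) != fx]
        best = max(best, pack(blocks))
    return best
```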
EXERCISE 21.26
[Nis89, WZ88] Consider the Boolean function of n variables that has
value 1 when exactly ⌊n/2⌋ or ⌊n/2⌋ + 1 of the input bits have value
1. What is the sensitivity, the block sensitivity, and the decision tree
complexity of this function?
EXERCISE 21.27
[Rub] Exhibit a Boolean function of n variables with sensitivity O(√n)
and block sensitivity Ω(n).
The lower bound in Theorem 21.23 can be extended to block sensitivity.
THEOREM 21.26 [Nis89]
A CREW PRAM requires at least log_b(block sensitivity(f)) steps to
compute a function f : D1 × ⋯ × Dn → R, where b = (5 + √21)/2.
PROOF
Strana 55
Suppose the block sensitivity of f is k and, on input x, f is sensitive to
the disjoint sets of coordinates S1, …, Sk. For j = 1, …, k, let y^{(j)} ∈
D1 × ⋯ × Dn be an input such that f(y^{(j)}) ≠ f(x) and y_i^{(j)} = x_i for all
i ∉ Sj. Given (z1, …, zk) ∈ {0,1}^k, define g(z1, …, zk) to be the value
of the function f on the input w = (w1, …, wn), constructed as follows.
If zj = 1, then make w agree with y^{(j)} on all coordinates in the set Sj.
If zj = 0, then make w agree with x on all coordinates in the set Sj. On
those coordinates not in any of the sets S1, …, Sk, also make w agree
with x. In other words,
w_i = y_i^{(j)} if i ∈ Sj and zj = 1, and w_i = x_i otherwise.
The function g : {0,1}^k → R has sensitivity k and can be computed by
a CREW PRAM at least as quickly as the function f. It follows that
any CREW PRAM requires at least log_b(k) steps to compute f.
A set of coordinates C is a certificate for a function f on an input x
if f(x) = f(y) for every input y that agrees with x on all coordinates in C.
In other words, knowing the values xi for all i ∈ C determines the value of
f(x). For example, a set of coordinates corresponding to the variables in a
minterm or maxterm of a Boolean function is a certificate. The certificate
complexity or nondeterministic decision tree complexity of a function f with
domain D = D1 × ⋯ × Dn is
max_{x ∈ D} min{#C | C is a certificate for f on input x}.
The block sensitivity, certificate complexity, and decision tree complexity of
a function f : D1 × ⋯ × Dn → R are always closely related.
EXERCISE 21.28
Prove that, for any function f : D1 × ⋯ × Dn → R, block sensitivity(f) ≤
certificate complexity(f) ≤ decision tree complexity(f).
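Certificate complexity is likewise easy to brute-force on tiny Boolean examples (illustrative sketch, exponential time; `TWO_ANDS` is an assumed example name, an OR of two disjoint 2-variable ANDs):

```python
from itertools import product, combinations

def certificate_complexity(f, n):
    """Brute-force certificate complexity of f: {0,1}^n -> R (tiny n).

    C certifies x if every y agreeing with x on C has f(y) = f(x);
    we take the max over x of the smallest certifying C.
    """
    inputs = list(product([0, 1], repeat=n))

    def certifies(C, x):
        return all(f(y) == f(x)
                   for y in inputs
                   if all(y[i] == x[i] for i in C))

    worst = 0
    for x in inputs:
        size = min(len(C)
                   for r in range(n + 1)
                   for C in combinations(range(n), r)
                   if certifies(C, x))
        worst = max(worst, size)
    return worst

OR3 = lambda x: int(any(x))
TWO_ANDS = lambda x: int((x[0] and x[1]) or (x[2] and x[3]))
```

On TWO_ANDS every input has a size-2 certificate (one satisfied AND, or one falsified variable per AND), while the all-zero input of OR3 needs all three coordinates.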
THEOREM 21.27 [Nis89]
For any function f : D1 × ⋯ × Dn → R, certificate complexity(f) ≤
(block sensitivity(f))².
PROOF
Strana 56
Suppose the block sensitivity of f is k. For any input x, let B be a
maximal collection of disjoint sets of coordinates such that each S ∈ B
is a minimal set of coordinates to which f is sensitive on input x. Note
that |B| ≤ k. Let C = ∪{S | S ∈ B} be the coordinates that occur in the
sets in B.
Then C is a certificate for f on input x. To see why, suppose there
was an input y that agreed with x on all coordinates in C such that
f(x) ≠ f(y). Let C' be the set of coordinates on which y differs from
x. Then f is sensitive to C' on input x. Let C'' be a minimal subset
of C' to which f is sensitive on input x. Since C and C' are disjoint,
C'' is disjoint from each S ∈ B. But then B ∪ {C''} is a collection of
disjoint minimal sets of coordinates to which f is sensitive on input x,
contradicting the maximality of B.
To complete the proof, it suffices to show that |S| ≤ k for each S ∈ B,
because this implies that |C| ≤ k². Consider any S ∈ B and let y be
an input that agrees with x on all coordinates not in S and such that
f(x) ≠ f(y). Then, by the minimality of S, for each i ∈ S, {i} is a set
of coordinates to which f is sensitive on input y. Clearly {{i} | i ∈ S} is a
disjoint collection of sets. Since the block sensitivity of f is k, it follows
that |S| ≤ k.
THEOREM 21.28 [BI87, HH86, Tar88]
For any function f : D1 × ⋯ × Dn → R, decision tree complexity(f) ≤
(certificate complexity(f))².
PROOF
First note that if f(x) ≠ f(y), then every certificate C for f on input
x intersects every certificate C' for f on input y. Otherwise, it would
be possible to construct an input z consistent with x at all coordinates
in C and consistent with y at all coordinates in C', which would imply
that f(x) = f(z) = f(y).
Suppose 0 ≤ l ≤ k and, for some r ∈ R, every input x ∈ f⁻¹(r) has a
certificate of size at most k and every input y ∉ f⁻¹(r) has a certificate of
size at most l. We prove by induction that decision tree complexity(f) ≤
kl.
Strana 57
If |R| = 1, then f is a constant function and decision tree complexity(f) =
0 ≤ kl. Therefore, assume |R| > 1.
If f has no input in its domain with value r, consider the function f' that
is identical to f except that it does not contain r in its codomain. Since
every input has a certificate of size at most l, decision tree complexity(f) =
decision tree complexity(f') ≤ l² ≤ kl.
Otherwise, choose an input in f⁻¹(r) and construct a decision tree T
whose first k levels are labelled by the variables in a length-k certificate
for f on that input. For each node v at depth k, consider the restriction
f_v of f to the subdomain in which these k variables have the values
specified by the path to v. Since every input x ∈ f⁻¹(r) has a certificate
of size at most k and every input y ∉ f⁻¹(r) has a certificate of size
at most l that contains at least one of these k variables, every input
x ∈ f_v⁻¹(r) has a certificate of size at most k and every input y ∉ f_v⁻¹(r)
has a certificate of size at most l − 1. Thus f_v has a decision tree of height
at most k(l − 1). This tree is rooted at node v of T. The resulting decision
tree T computes f and has height at most k + k(l − 1) = kl.
EXERCISE 21.29
[BSVW86] Prove that, for any monotone Boolean function f, sensitivity(f)
= certificate complexity(f).
Both log(block sensitivity(f)) and log(decision tree complexity(f)) characterize
the time to compute a function f : D1 × ⋯ × Dn → R on a CREW
PRAM to within a small constant factor. This follows from Theorems 21.27
and 21.28, the fact that log_b(block sensitivity(f)) is a lower bound on the
time for a CREW PRAM to compute f (Theorem 21.26), and the fact that
1 + ⌈log₂(decision tree complexity(f))⌉ is an upper bound on the time for a
CROW PRAM and, hence, a CREW PRAM to compute f (Theorem 21.22).
Moreover, given a CREW PRAM algorithm computing such a function, it
is possible to construct a CROW PRAM algorithm to compute the function
to within a constant factor as fast. However, the new algorithm is not
necessarily a step-by-step simulation of the original algorithm and it might use
exponentially more processors. It is an open question whether such a large
blowup in the number of processors is necessary.
For many of these results, it is essential that the domain of the function
f is complete, i.e. if there are n inputs, then the domain of f can be expressed
as the direct product of n sets of values. In other words, the value of any
Strana 58
input variable is not constrained by the values of other input variables. The
results are not necessarily true when the domain of f is not complete.
For example, the OR of n bits when it is known that at most one bit is
1 can be computed in one step on a CREW PRAM, but the sensitivity of this
function is n and it requires ⌈log₂ n⌉ steps on a CROW PRAM. (See Exercises
21.20 and 21.17.) Similarly, reversing a singly linked list or a disjoint collection
of circular singly linked lists of length n has decision tree complexity n − 1
(Exercise 21.18), although this can be done in one step on a CREW PRAM
[FR90]. In contrast, when reverse pointers are also present, the situation is
quite different.
THEOREM 21.29 [FR90]
A CROW PRAM can compute any function of a disjoint collection of
circular doubly linked lists to within a constant factor as fast as a CREW
PRAM.
Additional work needs to be done to understand what problems can be
solved more quickly by a CREW PRAM than by a CROW PRAM.
Lower bounds that exactly match the upper bounds for certain Boolean
functions have been obtained by considering yet another property. For any
Boolean function f : {0,1}^n → {0,1}, there is a unique multilinear polynomial
P_f(x) = Σ_{I ⊆ {1,…,n}} a_I Π_{i ∈ I} xi
with integer coefficients that represents f in the sense that P_f(x) = f(x)
whenever x ∈ {0,1}^n. Moreover, the coefficients a_I have absolute value at
most 2^{n−1} [Smo87]. For example, if f(x1, …, xn) = x1 ∧ ⋯ ∧ xn, then
P_f(x1, …, xn) = x1 ⋯ xn, and if f(x1, …, xn) = x1 ∨ ⋯ ∨ xn, then P_f(x1, …, xn) =
1 − (1 − x1) ⋯ (1 − xn). The degree of the Boolean function f is defined to
be the degree of the polynomial P_f.
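The representing polynomial can be computed explicitly by Möbius inversion over subsets: a_I = Σ_{J ⊆ I} (−1)^{|I|−|J|} f(1_J), where 1_J is the input that is 1 exactly on J. The sketch below is illustrative (tiny n only):

```python
from itertools import combinations

def multilinear_coeffs(f, n):
    """Coefficients a_I of the unique multilinear polynomial for f."""
    def point(J):
        return tuple(1 if i in J else 0 for i in range(n))
    coeffs = {}
    for r in range(n + 1):
        for I in combinations(range(n), r):
            # Moebius inversion over the subsets of I
            a = sum((-1) ** (len(I) - len(J)) * f(point(J))
                    for s in range(len(I) + 1)
                    for J in combinations(I, s))
            coeffs[frozenset(I)] = a
    return coeffs

def degree(f, n):
    """Degree of the Boolean function f = degree of its polynomial P_f."""
    return max((len(I) for I, a in multilinear_coeffs(f, n).items() if a != 0),
               default=0)

OR2 = lambda x: int(any(x))
```

For OR2 this recovers P_f(x1, x2) = x1 + x2 − x1·x2, matching 1 − (1 − x1)(1 − x2) above.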
EXERCISE 21.30
[DKR90] Prove the following facts about the degrees of Boolean functions.
1. degree(¬f) = degree(f).
2. degree(f ∧ g) ≤ degree(f) + degree(g).
3. degree(f ∨ g) ≤ degree(f) + degree(g).
Strana 59
4. If f ∧ g ≡ 0, then degree(f ∨ g) ≤ max{degree(f), degree(g)}.
Since the OR of n bits has degree n, the following result shows that the
algorithm in Exercise 21.19 is the fastest possible on a CREW PRAM.
THEOREM 21.30 [DKR90]
If a CREW PRAM computes a Boolean function f : {0,1}^n → {0,1} in
t steps, then F_{2t+1} ≥ degree(f).
PROOF
Without loss of generality, we make the following two assumptions.
First, a processor's state is merely (an encoding of) the sequence of
values it has read at each step (so a processor never forgets information).
Second, whenever a processor writes, it identifies itself and its
current state (i.e. it communicates everything it knows).
For each processor, partition the set of inputs {0,1}^n so that inputs are
in the same block if and only if they cause the processor to be in the
same state immediately after step t. Let s(t) denote the maximum degree
of the characteristic function of any block of one of these partitions.
Similarly, let c(t) denote the maximum degree of the characteristic function
of any block of the partition of {0,1}^n induced by a shared memory
cell's contents immediately after step t. Then, as in the proof of Theorem
21.23,
s(0) = 0,
c(0) = 1, and
s(t) ≤ s(t − 1) + c(t − 1) for t > 0.
Now consider the characteristic function of any block of the partition
induced by the contents of a shared memory cell M, immediately after
step t. If, in state q, processor Pi writes into M during step t, then the
characteristic function g_{iq} : {0,1}^n → {0,1} of the set of inputs that
cause Pi to be in state q at time t is also the characteristic function of
the set of inputs for which M contains the value (i, q) immediately after
step t. This function has degree at most s(t), by definition.
Suppose, instead, that no processors write to M during step t and M
contains the value v immediately after step t. Let g be the characteristic
function identifying those inputs for which M contains the value v
Strana 60
immediately after step t − 1. Then the characteristic function g' of the
corresponding block of the partition induced by M's contents at time t
can be expressed as
g' = g ∧ ¬( ∨_{i,q} g_{iq} ),
where the OR is taken over all values of i and q such that, in state q,
processor Pi writes to M during step t. Since concurrent writes to M
cannot occur, at most one of the functions g_{iq} has value 1 for a given
input. From Exercise 21.30, it follows that degree( ∨_{i,q} g_{iq} ) ≤ s(t). By
definition, degree(g) ≤ c(t − 1). Thus degree(g') ≤ c(t − 1) + s(t) and,
hence,
c(t) ≤ c(t − 1) + s(t) for t > 0.
It is easy to prove by induction that s(t) ≤ F_{2t} and c(t) ≤ F_{2t+1} for
all t ≥ 0. The theorem then follows from the fact that degree(f) is
bounded above by the maximum degree of the characteristic functions
of the blocks of the partition induced by M1's contents at the end of the
computation.
EXERCISE 21.31
[DKR90] Let n = k² and consider the function
f(x1, …, xn) = (x1 ∧ ⋯ ∧ xk) ∨ ⋯ ∨ (x_{n−k+1} ∧ ⋯ ∧ xn).
Prove that the certificate complexity of f is k and its degree is n. What
is its decision tree complexity?
EXERCISE 21.32
Prove that the degree of any Boolean function is always bounded above
by its decision tree complexity.
THEOREM 21.31 [Sze89]
The block sensitivity of any Boolean function is bounded above by the
square of its degree.
For special classes of Boolean functions, better results can be shown.
Strana 61
EXERCISE 21.33
[DKR90] Prove that the sensitivity of a monotone Boolean function is
bounded above by its degree.
THEOREM 21.32 [DKR90]
Any nonconstant, symmetric, Boolean function of n variables has degree
larger than n/2.
The time taken by a randomized CREW PRAM to compute a Boolean
function is also related to the function's block sensitivity and degree [Nis89,
DKR90]. These results imply that a randomized CREW PRAM cannot compute
a Boolean function more than a constant factor faster than a deterministic
CROW PRAM.
21.4.2 Simplifying the Algorithmic Structure by Restricting the Input
A variety of lower bounds for the PRAM have been obtained as follows.
Given an algorithm to solve a problem, find a restricted problem, either on
a smaller number of input variables or a smaller input domain, on which the
algorithm behaves in a considerably simpler manner. Then lower bounds are
proved directly for this class of simpler algorithms.
It is useful to consider the partitions of the set of inputs into blocks
that are indistinguishable to a processor or a memory cell during the first t
steps of the computation. As time increases, these partitions become more
complicated. The proofs of the next results use the technique of random
restrictions to show that these partitions do not become sufficiently complex
too quickly. Essentially, after each step of the computation, the values of
relatively few randomly chosen bits are fixed. This leaves only a slightly
smaller instance of the problem. It can be shown that the resulting partitions
of the set of inputs are likely to remain quite simple in structure.
THEOREM 21.33 [BH89]
If PRIORITY computes the PARITY of n input bits in t steps, then it
uses 2^{Ω(n^{1/t})} processors and 2^{Ω((n/t!)^{1/t})} shared memory cells.
COROLLARY 21.34 [BH89]
Strana 62
PRIORITY with n^{O(1)} processors or shared memory cells requires
Ω(log n/log log n) steps to compute the PARITY of n input bits.
THEOREM 21.35 [BH89]
For all t ≤ (1/3)·log n/log log n − ω(log n/(log log n)²), there is a Boolean
function of n variables that can be computed by COMMON in t steps
using n processors and memory cells, but cannot be computed by PRIORITY
in t − 1 steps using n^{O(1)} processors or using n^{O(1)} memory
cells.
Thus, even one extra time step can be more useful than increasing the number
of processors or shared memory cells by a polynomial factor or using a more
powerful write resolution rule.
An early example of this approach was applied to the following problem
[Sni85].
SEARCH AN ORDERED LIST
Given x1, …, xn, y ∈ {1, …, r} such that x1 ≤ x2 ≤ ⋯ ≤ xn,
determine that either y < x1 or xn ≤ y, or find the index i such
that xi ≤ y < x_{i+1}.
Using (p + 1)-ary search, a CREW PRAM with p processors can solve this
problem in time O(log n/log(p + 1)). A nontrivial lower bound can be obtained
on an EREW PRAM, even for the restricted version of the problem
in which x1, …, xn, y ∈ {0,1}, by bounding the number of variables affecting
processors and memory cells as a function of time. This lower bound remains
valid even if concurrent writes are allowed.
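The (p + 1)-ary search can be sketched with the p probes of one round done as a batch; `pram_search` below is an illustrative sequential stand-in, not the book's algorithm. It keeps the invariant xs[lo] ≤ y < xs[hi] (with implicit sentinels xs[-1] = −∞ and xs[n] = +∞) and shrinks the range by roughly a factor of p + 1 per round:

```python
import bisect   # used only to cross-check the answer

def pram_search(xs, y, p):
    """One (p+1)-ary search round per simulated PRAM step.

    Each round, p 'processors' compare y against p evenly spaced
    elements of (lo, hi), so O(log n / log(p+1)) rounds suffice.
    Returns (lo, rounds); lo = -1 means y < xs[0], lo = n-1 means
    xs[n-1] <= y.
    """
    lo, hi = -1, len(xs)
    rounds = 0
    while hi - lo > 1:
        probes = sorted({lo + (hi - lo) * j // (p + 1)
                         for j in range(1, p + 1)} - {lo})
        for pos in probes:            # p comparisons, one per processor
            if xs[pos] <= y:
                lo = max(lo, pos)
            else:
                hi = min(hi, pos)
        rounds += 1
    return lo, rounds

xs = list(range(0, 100, 2))           # a sorted list of 50 even numbers
```

With p = 1 this degenerates to binary search; larger p trades processors for rounds, mirroring the O(log n/log(p + 1)) bound quoted above.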
EXERCISE 21.34
[Sni85] Prove that an EREW PRAM with p processors requires Ω(log n −
log p) steps to SEARCH AN ORDERED LIST of length n.
EXERCISE 21.35
[Sni85] Prove that an EREW PRAM can SEARCH AN ORDERED
LIST of length n in O(log n − log p) steps using p processors, provided
p copies of y are given as part of the input.
EXERCISE 21.36
Strana 63
[Sni85] Prove that an EREW PRAM can SEARCH AN ORDERED
LIST of length n in O(√log n) steps using n processors and memory
cells.
Even with an arbitrarily large number of processors, the upper bound
in Exercise 21.36 cannot be improved. This is a consequence of the limited
ability of the EREW PRAM to access the input variable y.
THEOREM 21.36 [Sni85]
An EREW PRAM requires Ω(√(log n)) steps to SEARCH AN ORDERED
LIST of length n, provided the domain size, r, is sufficiently large.
The idea of the proof is to show that any algorithm has a simple structure
for a large subset of the inputs. Then a lower bound is obtained assuming
this simple structure.
Two inputs x, x′ ∈ D^n are order equivalent if, for all i, j ∈ {1, ..., n},
xi < xj if and only if x′i < x′j.
A function f : D^n → R depends only on the relative values of its variables if
f(x) = f(x′) for all order-equivalent inputs x, x′ ∈ D^n. The address functions
of a PRAM algorithm are the functions of the input that describe where in
shared memory each processor reads from and writes to at each step in the
computation. A PRAM algorithm depends only on the relative values of its
variables if its address functions depend only on the relative values of their
variables.
LEMMA 21.37
Consider any EREW PRAM algorithm to SEARCH AN ORDERED
LIST. If the domain size r is sufficiently large (as a function of the
number of processors, memory cells, and time steps), then there is a
large subset S ⊆ {1, ..., r} such that the EREW PRAM algorithm depends
only on the relative values of its variables when restricted to inputs
x1, ..., xn, y ∈ S.
This lemma can be proved by applying the following result from Ramsey
theory to each of the address functions.
THEOREM 21.38 [GRS80]
Given a function f : D^n → R, where |D| is sufficiently large in terms
of n, |R|, and s, there is a subset S ⊆ D, with |S| ≥ s, such that f restricted
to S^n depends only on the relative values of its variables.
Finally, any EREW PRAM algorithm to SEARCH AN ORDERED LIST that
depends only on the relative values of its variables can be shown to require
Ω(√(log n)) steps, using arguments similar to those needed for Exercise 21.34
[Sni85].
Furthermore, if the EREW PRAM algorithm stores at most a constant
number of input values in each shared memory cell, then it requires
Ω(log n/log log n) steps to SEARCH AN ORDERED LIST. If, in addition,
each processor only stores a constant number of input values in its local memory,
Ω(log n) steps are necessary.
These results demonstrate that concurrent read can be more powerful
than exclusive read (because a CREW PRAM with n processors can SEARCH
AN ORDERED LIST of n numbers in O(1) steps), although not necessarily
for problems with complete domains. However, the lower bounds in Exercise
21.34 and Theorem 21.36 also apply to any problem that has SEARCH AN
ORDERED LIST as a special case. Unfortunately, the natural extension to a
complete domain, the problem of searching an unordered list, is also difficult
for a CREW PRAM.
EXERCISE 21.37
Prove that a CREW PRAM requires Ω(log n) time to search an unordered
list x1, ..., xn for an element y.
It is possible to construct a problem with complete domain that has
SEARCH AN ORDERED LIST as a special case and can be computed substantially
more quickly by a CREW PRAM than by an EREW PRAM [GNR89].
A comparison tree is a binary decision tree in which each internal node is labelled
by a comparison between two input variables instead of by a single input
variable. The two children of an internal node correspond to the outcomes <
and ≥ of the comparison. As in a decision tree, computation begins at the
root and the output is the label of the leaf that is reached. Consider the following
comparison tree with n internal nodes and depth ⌈log2 n⌉, an example
of which is illustrated in Figure 21.3. At the ith internal node (encountered in
an inorder traversal), the input variables y and xi are compared. The leaves
are labelled sequentially from left to right with the numbers 0, 1, ..., n. The
problem is to determine the output of this comparison tree, given the input
x1, ..., xn, y ∈ {1, ..., r}. When the input variables x1, ..., xn are restricted
to be in sorted order, this problem is equivalent to SEARCH AN ORDERED
LIST.
                      y : x4
              <                 ≥
          y : x2                y : x6
        <        ≥            <        ≥
     y : x1    y : x3      y : x5    y : x7
     <    ≥    <    ≥      <    ≥    <    ≥
     0    1    2    3      4    5    6    7

FIGURE 21.3
The comparison tree problem, for n = 7.
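The tree of Figure 21.3 can be evaluated sequentially by descending from the root. The sketch below is mine, not the text's: it exploits the inorder layout, in which (for a complete tree with n = 2^d − 1 internal nodes) the subtree spanning leaf labels lo..hi is rooted at the node with inorder position (lo + hi + 1)/2.

```python
def comparison_tree_output(x, y):
    """Evaluate the comparison tree with n = 2^d - 1 internal nodes
    y : x_i (numbered in inorder) and leaves labelled 0..n: branch
    left on y < x_i, right on y >= x_i, return the leaf label."""
    lo, hi = 0, len(x)               # leaf labels still reachable
    while lo < hi:
        i = (lo + hi + 1) // 2       # inorder position of subtree root
        if y < x[i - 1]:
            hi = i - 1               # outcome <
        else:
            lo = i                   # outcome >=
    return lo
```

On sorted input the output is exactly the answer to SEARCH AN ORDERED LIST: for the illustrative values x = [10, 20, 30, 40, 50, 60, 70] and y = 35, the walk visits y : x4, y : x2, y : x3 and stops at leaf 3.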
EXERCISE 21.38
[GNR89] Prove that on a CREW PRAM, the comparison tree problem can be
solved in O(log log n) steps, using n processors, and requires Ω(log log n)
steps, regardless of the number of processors.
EXERCISE 21.39
Prove that an exclusive-read PRAM with an infinite number of processors
requires Ω(√(log p/log log p)) steps to simulate a general step of a
concurrent-read PRAM with p processors, when the input domain size
is sufficiently large.
EXERCISE 21.40
[GNR89] Construct a function with domain {0, 1}^n and range {0, 1}
that can be computed in O(log log n) steps on a CREW PRAM using n
processors, but requires Ω(log n − log p) steps to be computed by an
EREW PRAM with p processors.
Lower bounds for a number of other problems have been obtained using
the Ramsey theory technique. The ELEMENT DISTINCTNESS problem
was studied by Fich, Meyer auf der Heide, and Wigderson [FMW87]. They
showed that COMMON with n processors and an infinite amount of shared
memory requires Ω(log log log n) steps to solve this problem for a domain
of size 2^{Ω(n log n)}. Ragde, Steiger, Szemerédi, and Wigderson [RSSW88] improved
the lower bound on time to Ω(√(log n)) and Boppana [Bop89] improved
it further to Ω(log n/log log n), matching the upper bound. (See Theorem
21.7 and Exercise 21.13.) Both results require substantially larger domains.
Recently, Edmonds [Edm91] was able to obtain an Ω(log n/log log n) lower
bound using a domain only doubly exponential in n. Boppana [Bop89] and
Edmonds [Edm91] also proved that PRIORITY with n processors requires
Ω(log n/log log n) steps to solve ELEMENT DISTINCTNESS on these domains,
when the amount of shared memory does not increase as a function
of the domain size r. On PRIORITY with n processors (or even n(log n)^{O(1)}
processors) and infinite memory, Fich, Meyer auf der Heide, and Wigderson
[FMW87] showed that finding the MAXIMUM of n elements requires
Ω(log log n) steps (matching Shiloach and Vishkin's upper bound [SV81]) and
Meyer auf der Heide and Wigderson [MW87] showed that SORTING a list of
length n requires Ω(√(log n)) steps. Schieber and Vishkin [SV90] used similar
ideas to obtain a tight Ω(log log n) lower bound on the number of steps needed
by PRIORITY with n(log n)^{O(1)} processors and infinite memory to merge two
sorted lists of length n or find the nearest neighbour of each vertex in an
n-vertex convex polygon.
A serious limitation of these lower bounds, and one that is inherent in
the use of Ramsey theory, is that they are only applicable when the size of
the problem domain is very large. For example, provided the domain size
r ∈ O(m), ARBITRARY can solve ELEMENT DISTINCTNESS in O(1)
steps using n processors and m shared memory cells. (See Exercise 21.13.)
Berkman and Vishkin [BV89] showed that two sorted lists of length n containing
numbers in the range {1, ..., r} can be merged on a CREW PRAM
in O(log log log r) steps using n/log log log r processors.
EXERCISE 21.41
[FRW88a, EG88] Prove that the MAXIMUM of n elements with values
in the range {1, ..., n^{O(1)}} can be found on COMMON using n processors
and memory cells and O(1) time.
At present, it is unknown whether an EREW PRAM can compute every
Boolean function as quickly as a CREW PRAM can. However, the following
Boolean function requires substantially more time to compute on an EROW
PRAM than on a CROW PRAM [FW90].
BOOLEAN DECISION TREE EVALUATION
Given the values of 2^m Boolean variables x0, ..., x_{2^m−1} and (the
binary encoding of) a complete decision tree of height h, in which
each internal node is labelled by one of these 2^m variables and the
leaves are alternately labelled 0 and 1, determine the label of the
leaf that is reached. The size of the input is n = 2^m + m(2^h − 1).
For example, when m = 2 and h = 3, the first four bits of the input
111000100111100110 represent the values x0 = 1, x1 = 1, x2 = 1, and x3 = 0.
The remaining bits represent the indices of the labels of the interior nodes
of the decision tree illustrated in Figure 21.4, when these nodes are arranged
according to the inorder traversal of the tree.
                 x3
         x2               x1
     x0       x1      x2       x2
    0   1    0   1   0   1    0   1

FIGURE 21.4
The Boolean decision tree of height 3 represented by the encoding 00100111100110.
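The worked example can be checked mechanically. The decoder below is a sketch of mine, under two conventions the passage leaves implicit (both consistent with Figure 21.4): a node labelled x_j branches left when x_j = 0 and right when x_j = 1, and the leaves read 0, 1, 0, 1, ... from left to right.

```python
def eval_decision_tree(bits, m, h):
    """Decode and evaluate a BOOLEAN DECISION TREE EVALUATION input.

    bits: '0'/'1' string of length 2^m + m*(2^h - 1): first the 2^m
    variable values, then the m-bit label index of each of the
    2^h - 1 internal nodes, in inorder."""
    nvars = 1 << m
    x = [int(b) for b in bits[:nvars]]
    enc = bits[nvars:]
    nnodes = (1 << h) - 1
    # m-bit label index of each internal node, in inorder
    label = [int(enc[i * m:(i + 1) * m], 2) for i in range(nnodes)]
    lo, hi = 0, nnodes             # reachable leaf positions 0..2^h - 1
    while lo < hi:
        node = (lo + hi + 1) // 2  # inorder position of the subtree root
        if x[label[node - 1]] == 0:
            hi = node - 1          # branch left
        else:
            lo = node              # branch right
    return lo % 2                  # leaves alternate 0, 1, 0, 1, ...
```

On the example input 111000100111100110 (m = 2, h = 3), the walk reads x3 = 0, then x2 = 1, then x1 = 1, ending at the fourth leaf, so the output is 1.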
THEOREM 21.39 [FW90]
For an appropriate choice of h and m, a CROW PRAM can solve the
BOOLEAN DECISION TREE EVALUATION problem in O(log log n)
steps, but any (probabilistic) EROW PRAM that computes it requires
(expected) Ω(√(log n)) steps.
From Theorem 21.22, it follows that a CROW PRAM can solve the
BOOLEAN DECISION TREE EVALUATION problem in time O(log m +
log h). To prove the lower bound, it is sufficient to consider the behaviour of
an EROW PRAM algorithm on a randomly chosen input from {0, 1}^n. After
each step of the EROW PRAM computation, a longer initial segment of the
root to leaf path is revealed. It is very unlikely that any processor knows both
the value of a variable that has not been revealed and (any bit of) the label of
a node in the subtree rooted at the end of this initial segment. In particular,
for sufficiently short computations, if a processor knows anything about which
variable labels the parent of the leaf that is reached, then it is unlikely to know
anything about the value of that variable. It is conjectured that an EREW
PRAM requires (log n)^{Ω(1)} steps to solve the BOOLEAN DECISION TREE
EVALUATION problem, for appropriate h and m.
21.4.3 Small Memory
There are certain techniques that are particularly useful for proving
lower bounds on a PRAM with a small amount of shared memory. When a
PRAM has only one shared memory cell, we may assume that all processors
read the shared memory cell at every step of the computation. The sequence of
contents of the m shared memory cells is called the history of the computation.
Then a processor's knowledge can be expressed as a function of the history
and any input values initially located in its local memory. A lower bound
of Ω(t/m) steps on a PRAM using m shared memory cells can be obtained
from a lower bound of t steps on a PRAM using one shared memory cell by
applying the result of Exercise 21.3.
The following information theoretic lower bound is obtained by considering
the partition of the set of all possible inputs into blocks that cause the
shared memory to have exactly the same history up to a given point in time.
Since two processors may not attempt to write different values in the same
memory cell at the same time, it must be possible to infer some "mutual
exclusion" information from the history. This mutual exclusion information
takes time to set up and is not reusable, so the partition cannot become too
fine too quickly.
THEOREM 21.40 [FRW88a]
(Probabilistic) COMMON with n processors and m shared memory cells
that computes a surjective function f : {0, 1}^n → R requires (expected
time) Ω((log |R|)/m) steps for some input.
EXERCISE 21.42
[FRW88a] Prove that COMMON with p processors and m shared memory
cells requires Ω(log(p/m)) steps to simulate one step of ARBITRARY
or PRIORITY with p processors and m memory cells.
Adversary arguments are also useful for proving lower bounds with small
amounts of shared memory. As the computation proceeds, the adversary fixes
the history by fixing the values of some of the input variables and, in the case
of ARBITRARY, the outcomes of write conflicts. Unless the computation is
sufficiently long, this information is insufficient to determine the answer.
THEOREM 21.41 [FRW88a]
ARBITRARY with p processors and m shared memory cells requires
Ω(log(p/m)) steps to simulate one step of PRIORITY with p processors
and m memory cells.
PROOF
It suffices to prove the lower bound for the following restricted version
of the LEFTMOST WRITERS problem. The p processors are divided
into m groups of approximately equal size and, for i ∈ {1, ..., m}, each
processor in group i must have value 0 or i. Consider any ARBITRARY
algorithm that solves this problem.
A processor whose value has not yet been fixed by the adversary is said
to be free. The adversary fixes the values of certain processors after each
step of the computation in a way that allows any free processor to be
the leftmost writer in its group. This is accomplished by never fixing
the value of a processor to anything but 0 if there is a free processor of
lower index within the same group. As long as there is at least one free
processor, the algorithm cannot have terminated.
Initially, the rightmost processor in each group has its value fixed to the
index of the group. The values of the other processors are free. Once the
history for the first t steps has been fixed, the action of each processor
at step t + 1 can be viewed as a function of its value. Based on these
actions, the adversary fixes the contents of each of the m shared memory
cells at step t + 1, as follows.
If possible, the adversary fixes the contents of a cell by choosing the
value written at step t + 1 by a processor whose value has already been
fixed. Otherwise, if there is a free processor that writes to the cell at step
t + 1 when its value is 0, the adversary fixes the value of that processor
to 0 and fixes the contents of the cell by choosing the value written by
that processor.
The remaining unfixed cells are only written to by free processors when
their values are nonzero. If possible, the adversary fixes the contents of
such a cell by choosing the value written by a processor among the right
half of the free processors in some group. The value of the processor is
set to the index of its group to ensure that the write takes place. The
value of every free processor of higher index within the same group is
fixed to 0.
Note that each time the contents of a cell is fixed in one of these ways,
the number of possible answers decreases by at most a factor of 2.
Finally, the left half of the free processors in each group have their values
fixed to 0, ensuring that no processors write to the remaining unfixed
cells. The adversary fixes the contents of those cells to be the same as
at step t. This decreases the number of possible answers by a factor of
2^m.
Initially, there are Ω((n/m)^m) possible answers. This is because any
processor in group i can be the leftmost writer to cell i. At each step,
the total number of possible answers decreases by a factor of at most
2^{2m}. Thus any algorithm must perform Ω(log(n/m)) steps to determine
the answer in the worst case.
EXERCISE 21.43
[FLRY89] Prove that PRIORITY with an infinite number of processors
and m shared memory cells requires Ω(n/m) steps to determine
whether the input x1, ..., xn contains two consecutive variables with
value 1, assuming that the input is initially located in the processors'
local memories.
EXERCISE 21.44
Prove that a CREW PRAM with an infinite number of processors and
m shared memory cells requires Ω(n/m) steps to compute the OR of n
input bits that are initially located in the processors' local memories.
21.4.4 Reductions
The most frequently used lower bound technique is to find a problem
for which a good lower bound is known and reduce it to the problem of
interest. Specifically, let f be a problem with n input variables and let g be
a problem with n′ input variables. Suppose a PRAM can map each instance
(x1, ..., xn) of f to an instance (y1, ..., yn′) of g in t steps and can map the
answer g(y1, ..., yn′) to the answer f(x1, ..., xn) in t′ steps. If the PRAM
requires at least T steps to solve f, then it requires at least T − t − t′ steps
to solve g. To get a nontrivial lower bound for g, it is important that the
time t + t′ used to perform the reduction is significantly less than the lower
bound T for f. For example, the problem of merging two sorted lists of length
n can be reduced in constant time to the problem of triangulating a monotone
polygon with Θ(n) vertices by an EREW PRAM using O(n) processors
[BSV88]. Since the former problem requires Ω(log log n) steps on PRIORITY
with n(log n)^{O(1)} processors (see Section 21.2.1), it follows that this lower
bound also applies to the latter problem. More generally, if merging two sorted
lists of length n can be reduced to a problem with O(n) inputs on PRIORITY
using o(log log n) steps and n(log n)^{O(1)} processors, then that problem
requires Ω(log log n) steps on PRIORITY with n(log n)^{O(1)} processors.
Sometimes it is possible to express each component of g's input as a
function of at most one component of f's input and each component of f's
answer as a function of at most one component of g's answer. Then f can
be reduced to g by a CREW PRAM in constant time using one processor
for each component of g's input and one processor for each component of f's
answer. This type of reduction is called a projection.
THEOREM 21.42
The PARITY of n bits can be reduced via a projection to LIST RANKING
in a list of length 3n + 1.
PROOF
Consider an instance (x1, ..., xn) of the PARITY problem. Create an
instance of LIST RANKING with nodes 1, ..., 2n + 1 arranged in order
starting at 1. If xi = 0, then node 2n + i + 1 is placed between nodes
n + i and n + i + 1, and if xi = 1, then it is placed between nodes i and
i + 1. In particular, the distance from node n + 1 to the end of the list
is the number of input variables with value 1 plus twice the number of
input variables with value 0. Hence, the least significant bit of the rank
of node n + 1 is the PARITY of x1, ..., xn.
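The construction in this proof is concrete enough to execute. The sequential check below is my own sketch (the names succ and parity_via_list_ranking are not from the text): it builds the (3n + 1)-node list and reads PARITY off the low bit of node n + 1's rank. Note that each successor pointer depends on at most one x_i, which is exactly what makes the reduction a projection.

```python
def parity_via_list_ranking(x):
    """Build Theorem 21.42's LIST RANKING instance on nodes 1..3n+1
    and return the least significant bit of the rank (distance to the
    end of the list) of node n+1, which equals PARITY(x)."""
    n = len(x)
    succ = {v: v + 1 for v in range(1, 2 * n + 1)}  # base list 1 -> ... -> 2n+1
    succ[2 * n + 1] = None
    for i in range(1, n + 1):
        # splice node 2n+i+1 between i and i+1 when x_i = 1,
        # and between n+i and n+i+1 when x_i = 0
        a = i if x[i - 1] == 1 else n + i
        new = 2 * n + i + 1
        succ[new] = succ[a]
        succ[a] = new
    rank, v = 0, n + 1
    while succ[v] is not None:      # rank = #ones + 2 * #zeros
        v = succ[v]
        rank += 1
    return rank % 2
```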
From Theorems 21.23 and 21.33, it follows that LIST RANKING requires
Ω(log n) steps on a CREW PRAM using any number of processors
and Ω(log n/log log n) steps on PRIORITY using a polynomial number of
processors.
EXERCISE 21.45
[FSS84] Prove that there is a projection mapping the PARITY of n bits
to the MULTIPLICATION of two n bit numbers.
Many other examples of projections appear in [SV85] and [CSV84].
An analogue of Turing reducibility is also useful for proving lower bounds.
Let f be a problem with n inputs and let {g_i} be a family of problems, where
g_i has i inputs. Suppose there is a PRAM that can solve f using t time steps
and p processors, given access to an oracle that solves any instance of g_i for
i ≤ s(n). If the PRAM can solve g_i using t′(i) time steps and p′(i) processors,
then it can solve f using at most t · t′(s(n)) time steps and p · p′(s(n))
processors. Furthermore, if no input to an oracle computation is a function of
an output of another oracle computation, then the PRAM can solve f using
only t + t′(s(n)) time steps.
A special case of such a reduction is one that can be performed by a
constant depth, polynomial size unbounded fan-in Boolean circuit with gates
that compute any bit of the output of a problem in {g_i}. This reduction
can also be performed by COMMON (and, hence, PRIORITY) in constant
time using a polynomial number of processors. (See Theorem 21.10.) Constant
depth, polynomial size reductions from determining the PARITY of
n bits to ADDING, SORTING, and determining the MAJORITY of O(n)
bits and computing the TRANSITIVE CLOSURE of an n + 2 node graph
[FSS84, CSV84] imply Ω(log n/log log n) lower bounds for these problems on
PRIORITY with n^{O(1)} processors. Other examples of constant depth, polynomial
size reductions appear in [FSS84, CSV84, ACG+88].
EXERCISE 21.46
[CSV84] Prove that SORTING n n-bit integers is reducible to determining
(the bits of) the binary representation of the sum of n bits and vice
versa.
Acknowledgements
I am grateful to Jeff Edmonds, David Neto, Naomi Nishimura, Prabhakar
Ragde, and Jeannine St. Jacques for carefully reading various drafts of this
chapter and making valuable comments. Preparation of this chapter was supported
by the Natural Sciences and Engineering Research Council of Canada
(grant A9176) and the Information Technology Research Centre of Ontario.
Bibliography
[ACF90] S. Akl, M. Cosnard, and A. Ferreira. Data-movement-intensive problems: Two folk theorems in parallel computation revisited. Technical Report 90-18, Ecole Normale Supérieure de Lyon, June 1990.
[ACG+88] A. Aggarwal, B. Chazelle, L. Guibas, C. O'Dunlaing, and C. Yap. Parallel computational geometry. Algorithmica, 3:293–327, 1988.
[AHMP87] H. Alt, T. Hagerup, K. Mehlhorn, and F. Preparata. Deterministic simulation of idealized parallel computers on more realistic ones. SIAM Journal on Computing, 16(5):808–835, October 1987.
[AHU74] A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[Bea87] P. Beame. Lower Bounds in Parallel Machine Computation. PhD thesis, University of Toronto, 1987. Department of Computer Science Tech. Report 198/87.
[Bea88] P. Beame. Limits on the power of concurrent-write parallel machines. Information and Computation, 76(1):13–28, 1988.
[Bel88] S. Bellantoni. Parallel RAMs with bounded memory wordsize. Master's thesis, University of Toronto, 1988.
[Bel91] S. Bellantoni. Parallel RAMs with bounded memory wordsize. Information and Computation, 91(2):259–273, April 1991.
[BH89] P. Beame and J. Håstad. Optimal bounds for decision problems on the CRCW PRAM. Journal of the ACM, 36(3):643–670, 1989.
[BI87] M. Blum and R. Impagliazzo. Generic oracles and oracle classes. In FOCS, pages 118–126, 1987.
[Bop89] R. Boppana. Optimal separations between concurrent-write parallel machines. In STOC, pages 320–326, 1989.
[Bor77] A. Borodin. On relating time and space to size and depth. SIAM Journal on Computing, 6(4):733–744, December 1977.
[BSV88] O. Berkman, B. Schieber, and U. Vishkin. Some doubly logarithmic optimal parallel algorithms based on finding all nearest smaller values. Technical Report UMIACS-TR-88-79, submitted to J. Algorithms, University of Maryland, 1988.
[BSVW86] S. Bublitz, U. Schürfeld, B. Voigt, and I. Wegener. Properties of complexity measures for PRAMs and WRAMs. Theoretical Computer Science, 48:53–73, 1986.
[BV89] O. Berkman and U. Vishkin. Recursive *-tree parallel data structure. In FOCS, pages 196–202, 1989.
[CDHR88] B. Chlebus, K. Diks, T. Hagerup, and T. Radzik. Efficient simulations between CRCW PRAMs. In 13th MFCS, Lecture Notes in Computer Science 324, pages 230–239. Springer-Verlag, 1988.
[CDR86] S. Cook, C. Dwork, and R. Reischuk. Upper and lower bounds for parallel random access machines without simultaneous writes. SIAM Journal on Computing, 15:87–97, 1986.
[Cha91] S. Chaudhuri. Tight bounds for the chaining problem. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 62–70, 1991.
[CR73] S. Cook and R. Reckhow. Time bounded random access machines. Journal of Computer and System Sciences, 7(4):354–375, 1973.
[CSV84] A. Chandra, L. Stockmeyer, and U. Vishkin. Constant depth reducibility. SIAM Journal on Computing, 13:423–439, 1984.
[DKR90] M. Dietzfelbinger, M. Kutylowski, and R. Reischuk. Exact time bounds for computing boolean functions on PRAMs without simultaneous writes. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 125–137, July 1990.
[DR86] P. Dymond and W.L. Ruzzo. Parallel RAMs with owned global memory and deterministic context-free language recognition. In ICALP, pages 95–104, July 1986.
[Eck79] D. Eckstein. Simultaneous memory access. Technical Report TR-79-6, Iowa State University, 1979.
[Edm91] J.A. Edmonds. Lower bounds with smaller domain size on concurrent write parallel machines. In Structures in Complexity Theory, pages 322–333, July 1991.
[EG88] D. Eppstein and Z. Galil. Parallel algorithmic techniques for combinatorial computing. Ann. Rev. Comput. Sci., 3:233–283, 1988.
[FLRY89] F. Fich, M. Li, P. Ragde, and Y. Yesha. Lower bounds for parallel random access machines with read only memory. Information and Computation, 83(2):234–244, November 1989.
[FMW87] F. Fich, F. Meyer auf der Heide, and A. Wigderson. Lower bounds for parallel random access machines with unbounded shared memory. In F. Preparata, editor, Advances in Computing Research, volume 4, pages 1–15. JAI Press Inc., 1987.
[FR90] F. Fich and V. Ramachandran. Lower bounds for parallel computation on linked structures. In Proceedings of the Symposium on Parallel Algorithms and Architectures, July 1990.
[FRW88a] F. Fich, P. Ragde, and A. Wigderson. Relations between concurrent-write models of parallel computation. SIAM Journal on Computing, 17:606–627, 1988. A preliminary version appeared in the Proceedings of the Third ACM Symposium on Principles of Distributed Computing, 1984, pages 179–189.
[FRW88b] F. Fich, P. Ragde, and A. Wigderson. Simulations among concurrent-write PRAMs. Algorithmica, 3:43–51, 1988.
[FSS84] M. Furst, J.B. Saxe, and M. Sipser. Parity, circuits, and the polynomial-time hierarchy. Math. Sys. Th., 17:13–27, 1984.
[FW78] S. Fortune and J. Wyllie. Parallelism in random access machines. In STOC, pages 114–118, 1978.
[FW90] F. Fich and A. Wigderson. Towards understanding exclusive read. SIAM Journal on Computing, 19(4):718–727, August 1990.
[GNR89] E. Gafni, J. Naor, and P. Ragde. On separating the EREW and CROW models. Theoretical Computer Science, 68(3):343–346, 1989.
[Gol82] L. Goldschlager. A unified approach to models of synchronous parallel machines. Journal of the ACM, 29:1073–1086, 1982.
[Goo89] M. Goodrich. Intersecting line segments in parallel with an output-sensitive number of processors. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 127–137, June 1989.
[GR90] V. Grolmusz and P. Ragde. Incomparability in parallel computation. Discrete Applied Mathematics, 29(1):63–78, November 1990.
[GRS80] R.L. Graham, B.L. Rothschild, and J.H. Spencer. Ramsey Theory. John Wiley, New York, 1980.
[Has87] J. Håstad. Computational Limitations for Small Depth Circuits. MIT Press, Cambridge, Mass., 1987.
[HH86] J. Hartmanis and L. Hemachandra. Complexity classes without machines: On complete sets for UP. In ICALP, pages 123–135, July 1986.
[HKP] J. Hoover, M. Klawe, and N. Pippenger. Bounding Fan-Out in Logical Networks. Journal of the ACM, 31(1):13–18, 1984.
[HP89] S. Hornick and F. Preparata. Deterministic P-RAM simulation with constant redundancy. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 103–109, June 1989.
[HR90] T. Hagerup and T. Radzik. Every robust CRCW PRAM can efficiently simulate a PRIORITY PRAM. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 117–124, 1990.
[KR90] R. Karp and V. Ramachandran. A survey of parallel algorithms for shared-memory machines. In Handbook of Theoretical Computer Science, volume A. MIT Press, 1990.
[KRS90] C. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71(1):95–132, 1990.
[Kuc82] L. Kučera. Parallel computation and conflicts in memory access. Information Processing Letters, 14(2):93–96, 1982.
[LPV81] G. Lev, N. Pippenger, and L.G. Valiant. A fast parallel algorithm for routing in permutation networks. IEEE Trans. Comput., C-30:93–100, 1981.
[LY86] M. Li and Y. Yesha. New lower bounds for parallel computation. In STOC, pages 177–187, 1986.
[LY87] M. Li and Y. Yesha. The probabilistic and deterministic parallel complexity of symmetric functions. In ICALP, pages 326–335, 1987.
[LY89] M. Li and Y. Yesha. New lower bounds for parallel computation. Journal of the ACM, 36(3):671–680, 1989.
[MR84] F. Meyer auf der Heide and R. Reischuk. Limits to speed up parallel machines by large hardware and unbounded communication. In FOCS, volume 25, pages 56–64, 1984.
[MV84] K. Mehlhorn and U. Vishkin. Randomized and deterministic simulation of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Informatica, 21:339–374, 1984.
[MW87] F. Meyer auf der Heide and A. Wigderson. The complexity of parallel sorting. SIAM Journal on Computing, 16(1):100–107, February 1987.
[Nis89] N. Nisan. CREW PRAMs and decision trees. In STOC, pages 327–335, 1989.
[NS81] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Trans. Comput., C-30:101–107, 1981.
[NS92] N. Nisan and M. Szegedy. On the degree of boolean functions as real polynomials. In STOC, 1992.
[PY91] I. Parberry and P. Yan. Improved upper and lower time bounds for parallel random access machines without simultaneous writes. SIAM Journal on Computing, 20:88–99, 1991.
[Raga] P. Ragde. Unpublished manuscript.
[Ragb] P. Ragde. Processor-time tradeoffs in PRAM simulations. Journal of Computer and System Sciences, 44, 1992.
[Ran87] A. Ranade. How to emulate shared memory. In FOCS, pages 185–194, 1987.
[RSSW88] P. Ragde, W. Steiger, E. Szemerédi, and A. Wigderson. The parallel complexity of element distinctness is Ω((log n)^{1/2}). SIAM J. Dis. Math., 1:399–410, 1988.
[Rub] D. Rubinstein. Personal communication.
[Ruz] W.L. Ruzzo. Personal communication, cited in [BH89, Bea87].
[Sch80] J. T. Schwartz. Ultracomputers. ACM Trans. Programming Lang. Systems, 2:484–521, 1980.
[Sim82] H. Simon. A tight Ω(log log n) bound on the time for parallel RAM's to compute nondegenerate boolean functions. Information and Control, 55:102–107, 1982.
[Smi90] Burton Smith. The Tera computer system. In Proceedings of the ACM International Conference on Supercomputing, pages 1–7, 1990.
[Smo87] R. Smolensky. Algebraic methods in the theory of lower bounds for boolean circuit complexity. In STOC, pages 77–82, 1987.
[Sni] M. Snir. Private communication.
[Sni85] M. Snir. On parallel searching. SIAM Journal on Computing, 14(3):688–708, August 1985.
[SV81] Y. Shiloach and U. Vishkin. Finding the maximum, merging, and sorting in a parallel computation model. J. Algorithms, 2:88–102, 1981.
[SV84] L. Stockmeyer and U. Vishkin. Simulation of parallel random access machines by circuits. SIAM Journal on Computing, 13:404–422, 1984.
[SV85] S. Skyum and L. Valiant. A complexity theory based on boolean algebra. Journal of the ACM, 32:484–502, 1985.
[SV90] B. Schieber and U. Vishkin. Finding all nearest neighbors for convex polygons in parallel: A new lower bound technique and a matching algorithm. Discrete Applied Mathematics, 29(1):97–112, November 1990.
[Sze89] G. Szegedy. Algebraic Methods in Lower Bounds for Computational Models with Limited Communication. PhD thesis, University of Chicago, 1989.
[Tar88] G. Tardos. Query complexity, or why is it difficult to separate NP^A ∩ co-NP^A from P^A by a random oracle A?, 1988. Manuscript.
[TLR88] J. Trahan, M.C. Loui, and V. Ramachandran. Multiplication, division, and shift instruction in parallel random access machines. In Proc. 22nd Conf. Inf. Sci. Syst., pages 126–130, 1988.
[Tra88] J. Trahan. Instruction Sets for Parallel Random Access Machines. PhD thesis, University of Illinois, Urbana-Champaign, 1988.
[UW87] E. Upfal and A. Wigderson. How to share memory in a distributed system. Journal of the ACM, 34:116–127, 1987.
[Vis83] U. Vishkin. Implementation of simultaneous memory access in models that forbid it. J. Alg., 4:45–50, 1983.
[VW85] U. Vishkin and A. Wigderson. Trade-offs between depth and width in parallel computation. SIAM Journal on Computing, 14:303–314, 1985. FOCS 1983, pages 146–153.
[WZ88] I. Wegener and L. Zádori. A note on the relations between critical and sensitive complexity. Technical Report 256, Universität Dortmund, 1988.
1
Prefix Sums
and Their Applications
Guy E. Blelloch
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890
36 Chapter 1. Prefix Sums and Their Applications
1.1 Introduction
Experienced algorithm designers rely heavily on a set of building blocks
and on the tools needed to put the blocks together into an algorithm. The
understanding of these basic blocks and tools is therefore critical to the un-
derstanding of algorithms. Many of the blocks and tools needed for parallel
algorithms extend from sequential algorithms, such as dynamic-programming
and divide-and-conquer, but others are new.
This chapter introduces one of the simplest and most useful building
blocks for parallel algorithms: the all-prefix-sums operation. The chapter de-
fines the operation, shows how to implement it on a PRAM and illustrates
many applications of the operation. In addition to being a useful building
block, the all-prefix-sums operation is a good example of a computation that
seems inherently sequential, but for which there is an efficient parallel algo-
rithm. The operation is defined as follows:
DEFINITION
The all-prefix-sums operation takes a binary associative operator ⊕, and
an ordered set of n elements
[a0 , a1 , ..., an−1 ],
and returns the ordered set
[a0 , (a0 ⊕ a1 ), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1 )].
For example, if ⊕ is addition, then the all-prefix-sums operation on the ordered
set
[3 1 7 0 4 1 6 3],
would return
[3 4 11 11 15 16 22 25].
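As a sanity check, the definition can be transcribed directly into Python. This is a sequential sketch, not the parallel implementation; `op` stands in for the abstract operator ⊕ and may be any binary associative function:

```python
from operator import add

def all_prefix_sums(op, a):
    """Inclusive scan: out[i] = a[0] op a[1] op ... op a[i]."""
    out = []
    acc = None
    for x in a:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

print(all_prefix_sums(add, [3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```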
The uses of the all-prefix-sums operation are extensive. Here is a list of
some of them:
1. To lexically compare strings of characters. For example, to deter-
mine that "strategy" should appear before "stratification" in
a dictionary (see Problem 2).
Strana 83
2. To add multiprecision numbers. These are numbers that cannot
be represented in a single machine word (see Problem 3).
3. To evaluate polynomials (see Problem 6).
4. To solve recurrences. For example, to solve the recurrences
xi = ai xi−1 + bi xi−2 and xi = ai + bi /xi−1 (see Section 1.4).
5. To implement radix sort (see Section 1.3).
6. To implement quicksort (see Section 1).
7. To solve tridiagonal linear systems (see Problem 12).
8. To delete marked elements from an array (see Section 1.3).
9. To dynamically allocate processors (see Section 1.6).
10. To perform lexical analysis. For example, to parse a program into
tokens.
11. To search for regular expressions. For example, to implement the
UNIX grep program.
12. To implement some tree operations. For example, to find the depth
of every vertex in a tree (see Chapter 3).
13. To label components in two dimensional images.
In fact, all-prefix-sums operations using addition, minimum and max-
imum are so useful in practice that they have been included as primitive
instructions in some machines. Researchers have also suggested that a sub-
class of the all-prefix-sums operation be added to the PRAM model as a “unit
time” primitive because of their efficient hardware implementation.
Before describing the implementation we must consider how the def-
inition of the all-prefix-sums operation relates to the PRAM model. The
definition states that the operation takes an ordered set, but does not specify
how the ordered set is laid out in memory. One way to lay out the elements
is in contiguous locations of a vector (a one dimensional array). Another way
is to use a linked-list with pointers from each element to the next. It turns
out that both forms of the operation have uses. In the examples listed above,
the component labeling and some of the tree operations require the linked-list
version, while the other examples can use the vector version.
Sequentially, both versions are easy to compute (see Figure 1.1). The
vector version steps down the vector, adding each element into a sum and
writing the sum back, while the linked-list version follows the pointers while
keeping the running sum and writing it back. The algorithms in Figure 1.1 for
both versions are inherently sequential: to calculate a value at any step, the
result of the previous step is needed. The algorithms therefore require O(n)
time.
Strana 84

Vector Version:
proc all-prefix-sums(Out, In)
  i ← 0
  sum ← In[0]
  Out[0] ← sum
  while (i < length − 1)
    i ← i + 1
    sum ← sum + In[i]
    Out[i] ← sum

List Version:
proc all-prefix-sums(Out, In)
  i ← 0
  sum ← In[0].value
  Out[0] ← sum
  while (In[i].pointer ≠ EOL)
    i ← In[i].pointer
    sum ← sum + In[i].value
    Out[i] ← sum

FIGURE 1.1
Sequential algorithms for calculating the all-prefix-sums operation with oper-
ator + on a vector and on a linked-list. In the list version, each element of In
consists of two fields: a value (.value), and a pointer to the next position in
the list (.pointer). EOL means the end-of-list pointer.

To execute the all-prefix-sums operation in parallel, the algorithms must
be changed significantly.
The remainder of this chapter is concerned with the vector all-prefix-
sums operation. We will henceforth use the term scan for this operation.1
DEFINITION
The scan operation is a vector all-prefix-sums operation.
Chapters 2, 3 and 4 discuss uses of the linked-list all-prefix-sums operation
and derive an optimal deterministic algorithm for the problem on the PRAM.
Sometimes it is useful for each element of the result vector to contain
the sum of all the previous elements, but not the element itself. We call such
an operation a prescan.
DEFINITION
The prescan operation takes a binary associative operator ⊕ with identity
I, and a vector of n elements
[a0 , a1 , ..., an−1 ],
1 The term scan comes from the computer language APL.
Strana 85
and returns the vector
[I, a0 , (a0 ⊕ a1 ), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2 )].
A prescan can be generated from a scan by shifting the vector right by one and
inserting the identity. Similarly, the scan can be generated from the prescan
by shifting left, and inserting at the end the sum of the last element of the
prescan and the last element of the original vector.
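The two shift rules can be checked in a few lines of Python, a sequential sketch in which `scan` is the inclusive scan from before and 0 is the identity of +:

```python
from operator import add

def scan(op, a):
    """Inclusive all-prefix-sums."""
    out, acc = [], None
    for x in a:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def prescan_from_scan(op, identity, a):
    # Shift the scan right by one and insert the identity at the front.
    return [identity] + scan(op, a)[:-1]

def scan_from_prescan(op, pre, a):
    # Shift left and append (last prescan element) op (last input element).
    return pre[1:] + [op(pre[-1], a[-1])]

a = [3, 1, 7, 0, 4, 1, 6, 3]
pre = prescan_from_scan(add, 0, a)
print(pre)                             # [0, 3, 4, 11, 11, 15, 16, 22]
print(scan_from_prescan(add, pre, a))  # [3, 4, 11, 11, 15, 16, 22, 25]
```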
1.2 Implementation
This section describes an algorithm for calculating the scan operation in
parallel. For p processors and a vector of length n on an EREW PRAM, the
algorithm has a time complexity of O(n/p + lg p). The algorithm is simple
and well suited for direct implementation in hardware. Chapter 4 shows
how the time of the scan operation with certain operators can be reduced to
O(n/p + lg p/ lg lg p) on a CREW PRAM.
Before describing the scan operation, we consider a simpler problem,
that of generating only the final element of the scan. We call this the reduce
operation.
DEFINITION
The reduce operation takes a binary associative operator ⊕ with identity
i, and an ordered set [a0 , a1 , ..., an−1 ] of n elements, and returns the
value a0 ⊕ a1 ⊕ ... ⊕ an−1 .
Again we consider only the case where the ordered set is kept in a vector.
A balanced binary tree can be used to implement the reduce operation by
laying the tree over the values, and using ⊕ to sum pairs at each vertex (see
Figure 1.2a). The correctness of the result relies on ⊕ being associative. The
operator, however, does not need to be commutative since the order of the
operands is maintained. On an EREW PRAM, each level of the tree can be
executed in parallel, so the implementation can step from the leaves to the
root of the tree (see Figure 1.2b); we call this an up-sweep. Since the tree is
of depth ⌈lg n⌉, and one processor is needed for every pair of elements, the
algorithm requires O(lg n) time and n/2 processors.
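The up-sweep can be sketched sequentially in Python; the inner `for` loop models the "in parallel" step over one tree level. Only associativity of `op` is needed, as the string example below (a non-commutative operator) illustrates:

```python
def tree_reduce(op, a):
    """Up-sweep reduce; len(a) must be a power of two.
    The result lands in the last slot, mirroring Figure 1.2b."""
    a = list(a)
    n = len(a)
    d = 1
    while d < n:                       # d = 2**level
        for i in range(0, n, 2 * d):   # one tree level; parallel on a PRAM
            a[i + 2 * d - 1] = op(a[i + d - 1], a[i + 2 * d - 1])
        d *= 2
    return a[-1]

print(tree_reduce(lambda x, y: x + y, [3, 1, 7, 0, 4, 1, 6, 3]))      # 25
print(tree_reduce(lambda x, y: x + y, ["a", "b", "c", "d"]))          # abcd
```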
If we assume a fixed number of processors p, with n > p, then each
processor can sum an n/p section of the vector to generate a processor sum;
the tree technique can then be used to reduce the processor sums (see
Figure 1.3).
Strana 86

(a) Executing a +-reduce on a tree.

for d from 0 to (lg n) − 1
  in parallel for i from 0 to n − 1 by 2^(d+1)
    a[i + 2^(d+1) − 1] ← a[i + 2^d − 1] + a[i + 2^(d+1) − 1]

Step   Vector in Memory
 0     [3 1 7  0 4 1 6  3]
 1     [3 4 7  7 4 5 6  9]
 2     [3 4 7 11 4 5 6 14]
 3     [3 4 7 11 4 5 6 25]

(b) Executing a +-reduce on a PRAM.

FIGURE 1.2
An example of the reduce operation when ⊕ is integer addition. The boxes
in (b) show the locations that are modified on each step. The length of the
vector is n and must be a power of two. The final result will reside in a[n − 1].
The time taken to generate the processor sums is ⌈n/p⌉, so the total time
required on an EREW PRAM is:
TR (n, p) = ⌈n/p⌉ + ⌈lg p⌉ = O(n/p + lg p). (1.1)
When n/p ≥ lg p the complexity is O(n/p). This time is an optimal speedup
over the sequential algorithm given in Figure 1.1.
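The two-phase reduce behind equation (1.1) can be sketched as follows. This is sequential Python: each iteration of the outer loop stands for one processor, the final loop stands in for the tree reduce, and p is assumed to divide n evenly as in Figure 1.3:

```python
def blocked_reduce(op, a, p):
    """Each of p 'processors' sums its n/p block; the p block sums
    are then reduced. Assumes p divides len(a)."""
    n = len(a)
    block = n // p
    sums = [None] * p
    for i in range(p):                 # in parallel on a PRAM
        s = a[block * i]
        for j in range(1, block):
            s = op(s, a[block * i + j])
        sums[i] = s
    result = sums[0]
    for s in sums[1:]:                 # stands in for the tree reduce
        result = op(result, s)
    return result

data = [4, 7, 1, 0, 5, 2, 6, 4, 8, 1, 9, 5]
print(blocked_reduce(lambda x, y: x + y, data, 4))  # 52
```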
We now return to the scan operation.
Strana 87

in parallel for each processor i
  sum[i] ← a[(n/p)i]
  for j from 1 to (n/p) − 1
    sum[i] ← sum[i] + a[(n/p)i + j]
result ← +-reduce(sum)

[ 4 7 1 | 0 5 2 | 6 4 8 | 1 9 5 ]
 processor 0   processor 1   processor 2   processor 3

Processor Sums = [12 7 18 15]
Total Sum = 52

FIGURE 1.3
The +-reduce operation with more elements than processors. We assume that
n/p is an integer.

We actually show how to implement the prescan operation; the scan is then
determined by shifting the result
and putting the sum at the end. If we look at the tree generated by the reduce
operation, it contains many partial sums over regions of the vector. It turns
out that these partial sums can be used to generate all the prefix sums. This
requires executing another sweep of the tree with one step per level, but this
time starting at the root and going to the leaves (a down-sweep). Initially,
the identity element is inserted at the root of the tree. On each step, each
vertex at the current level passes to its left child its own value, and it passes
to its right child, ⊕ applied to the value from the left child from the up-sweep
and its own value (see Figure 1.4a).
Let us consider why the down-sweep works. We say that vertex x pre-
cedes vertex y if x appears before y in the preorder traversal of the tree (depth
first, from left to right).
THEOREM 1.1
After a complete down-sweep, each vertex of the tree contains the sum
of all the leaf values that precede it.
PROOF
The proof is inductive from the root: we show that if a parent has the
correct sum, both children must have the correct sum. The root has no
elements preceding it, so its value is correctly the identity element.
Strana 88
(a) Executing a +-prescan on a tree.
procedure down-sweep(A)
  a[n − 1] ← 0                                   % Set the identity
  for d from (lg n) − 1 downto 0
    in parallel for i from 0 to n − 1 by 2^(d+1)
      t ← a[i + 2^d − 1]                         % Save in temporary
      a[i + 2^d − 1] ← a[i + 2^(d+1) − 1]        % Set left child
      a[i + 2^(d+1) − 1] ← t + a[i + 2^(d+1) − 1]  % Set right child
Step Vector in Memory
0 [ 3 1 7 0 4 1 6 3 ]
up 1 [3 4 7 7 4 5 6 9 ]
2 [3 4 7 11 4 5 6 14 ]
3 [3 4 7 11 4 5 6 25 ]
clear 4 [3 4 7 11 4 5 6 0 ]
down 5 [3 4 7 0 4 5 6 11 ]
6 [3 0 7 4 4 11 6 16 ]
7 [ 0 3 4 11 11 15 16 22 ]
(b) Executing a +-prescan on a PRAM.
FIGURE 1.4
A parallel prescan on a tree using integer addition as the associative operator
⊕, and 0 as the identity.
Strana 89
FIGURE 1.5
Illustration for Theorem 1.1.
Consider Figure 1.5. The left child of any vertex has exactly the same
leaves preceding it as the vertex itself (the leaves in region A in the
figure). This is because the preorder traversal always visits the left child
of a vertex immediately after the vertex. By the induction hypothesis,
the parent has the correct sum, so it need only copy this sum to the left
child.
The right child of any vertex has two sets of leaves preceding it, the leaves
preceding the parent (region A), and the leaves at or below the left child
(region B). Therefore, by adding the parent’s down-sweep value, which is
correct by the induction hypothesis, and the left-child’s up-sweep value,
the right-child will contain the sum of all the leaves preceding it.
Since the leaf values that precede any leaf are the values to the left of
it in the scan order, the values at the leaves are the results of a left-to-right
prescan. To implement the prescan on an EREW PRAM, the partial sums
at each vertex must be kept during the up-sweep so they can be used during
the down-sweep. We must therefore be careful not to overwrite them. In fact,
this was the motivation for putting the sums on the right during the reduce
in Figure 1.2b. Figure 1.4b shows the PRAM code for the down-sweep. Each
step can execute in parallel, so the running time is 2 ⌈lg n⌉.
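Both sweeps together give the full prescan. A sequential Python sketch, where the loops model the parallel tree levels and len(a) is assumed to be a power of two:

```python
def tree_prescan(op, identity, a):
    """Up-sweep (reduce) followed by down-sweep; len(a) a power of two."""
    a = list(a)
    n = len(a)
    d = 1
    while d < n:                                   # up-sweep
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] = op(a[i + d - 1], a[i + 2 * d - 1])
        d *= 2
    a[n - 1] = identity                            # clear the root
    d = n // 2
    while d >= 1:                                  # down-sweep
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]            # left child: parent value
            a[i + 2 * d - 1] = op(t, a[i + 2 * d - 1])  # right: left ⊕ parent
        d //= 2
    return a

print(tree_prescan(lambda x, y: x + y, 0, [3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

The output matches step 7 of Figure 1.4b.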
If we assume a fixed number of processors p, with n > p, we can use
a similar method to that in the reduce operation to generate an optimal
algorithm.
Strana 90

[ 4 7 1 | 0 5 2 | 6 4 8 | 1 9 5 ]
 processor 0   processor 1   processor 2   processor 3

Sum       = [12  7 18 15]
+-prescan = [ 0 12 19 37]

[ 0 4 11 | 12 12 17 | 19 25 29 | 37 38 47 ]
  processor 0   processor 1   processor 2   processor 3

FIGURE 1.6
A +-prescan with more elements than processors.

Each processor first sums an n/p section of the vector to generate
a processor sum, the tree technique is then used to prescan the processor
sums. The results of the prescan of the processor sums are used as an offset
for each processor to prescan within its n/p section (see Figure 1.6). The time
complexity of the algorithm is:
TS (n, p) = 2(⌈n/p⌉ + ⌈lg p⌉) = O(n/p + lg p) (1.2)
which is the same order as the reduce operation and is also an optimal speedup
over the sequential version when n/p ≥ lg p.
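The blocked prescan of Figure 1.6 can be sketched as follows (sequential Python, + as the operator; p is assumed to divide n):

```python
def blocked_prescan(a, p):
    """+-prescan with n > p: block sums, a prescan of the block sums as
    per-processor offsets, then a local prescan inside each block."""
    n = len(a)
    block = n // p
    sums = [sum(a[block * i: block * (i + 1)]) for i in range(p)]
    offsets, acc = [], 0
    for s in sums:                     # prescan of the processor sums
        offsets.append(acc)
        acc += s
    out = [0] * n
    for i in range(p):                 # in parallel on a PRAM
        acc = offsets[i]
        for j in range(block):
            out[block * i + j] = acc
            acc += a[block * i + j]
    return out

print(blocked_prescan([4, 7, 1, 0, 5, 2, 6, 4, 8, 1, 9, 5], 4))
# [0, 4, 11, 12, 12, 17, 19, 25, 29, 37, 38, 47]
```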
This section described how to implement the scan (prescan) operation.
The rest of the chapter discusses its applications.
1.3 Line-of-Sight and Radix-Sort
As an example of the use of a scan operation, consider a simple line-of-
sight problem. The line-of-sight problem is: given a terrain map in the form of
a grid of altitudes and an observation point X on the grid, find which points
are visible along a ray originating at the observation point (see Figure 1.7).
A point on a ray is visible if and only if no other point between it and the
observation point has a greater vertical angle. To find if any previous point has
a greater angle, the altitude of each point along the ray is placed in a vector
(the altitude vector). These altitudes are then converted to angles and placed
in the angle vector (see Figure 1.7).
Strana 91

procedure line-of-sight(altitude)
  in parallel for each index i
    angle[i] ← arctan(scale × (altitude[i] − altitude[0]) / i)
  max-previous-angle ← max-prescan(angle)
  in parallel for each index i
    if (angle[i] > max-previous-angle[i])
      result[i] ← "visible"
    else
      result[i] ← "not visible"

FIGURE 1.7
The line-of-sight algorithm for a single ray. The X marks the observation
point. The visible points are shaded. A point on the ray is visible if no
previous point has a greater angle.

A prescan using the operator maximum
(max-prescan) is then executed on the angle vector, which returns to each point
the maximum previous angle. To test for visibility each point only needs to
compare its angle to the result of the max-prescan. This can be generalized to
finding all visible points on the grid. For n points on a ray, the complexity of
the algorithm is the complexity of the scan, TS (n, p) = O(n/p + lg p) on an
EREW PRAM.
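A minimal Python sketch of the routine, with the max-prescan done sequentially. One convention is added for this sketch: the observation point itself (index 0) gets angle −∞ so it never blocks later points and the division by i is avoided:

```python
import math

def line_of_sight(altitude, scale=1.0):
    """Visibility along one ray; altitude[0] is the observation point."""
    n = len(altitude)
    # Angle of index 0 set to -inf: a convention for this sketch.
    angle = [-math.inf] + [
        math.atan(scale * (altitude[i] - altitude[0]) / i) for i in range(1, n)
    ]
    # max-prescan: maximum over all strictly earlier angles.
    max_prev, acc = [], -math.inf
    for a in angle:
        max_prev.append(acc)
        acc = max(acc, a)
    return [angle[i] > max_prev[i] for i in range(n)]

print(line_of_sight([1, 2, 3, 2, 5, 1, 10]))
# [False, True, False, False, False, False, True]
```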
We now consider another example, a radix sort algorithm. The algorithm
loops over the bits of the keys, starting at the lowest bit, executing a split
operation on each iteration (assume all keys have the same number of bits).
Strana 92

procedure split-radix-sort(A, number-of-bits)
  for i from 0 to (number-of-bits − 1)
    A ← split(A, A⟨i⟩)

A                  = [5 7 3 1 4 2 7 2]
A⟨0⟩               = [1 1 1 1 0 0 1 0]
A ← split(A, A⟨0⟩) = [4 2 2 5 7 3 1 7]
A⟨1⟩               = [0 1 1 0 1 1 0 1]
A ← split(A, A⟨1⟩) = [4 5 1 2 2 7 3 7]
A⟨2⟩               = [1 1 0 0 0 1 0 1]
A ← split(A, A⟨2⟩) = [1 2 2 3 4 5 7 7]

FIGURE 1.8
An example of the split radix sort on a vector containing three-bit values. The
A⟨n⟩ notation signifies extracting the nth bit of each element of the vector A.
The split operation packs elements with a 0 flag to the bottom and with a 1
flag to the top.
The split operation packs the keys with a 0 in the corresponding bit to the
bottom of a vector, and packs the keys with a 1 in the bit to the top of
the same vector. It maintains the order within both groups. The sort works
because each split operation sorts the keys with respect to the current bit
(0 down, 1 up) and maintains the sorted order of all the lower bits since we
iterate from the bottom bit up. Figure 1.8 shows an example of the sort.
We now consider how the split operation can be implemented using a
scan. The basic idea is to determine a new index for each element and then
permute the elements to these new indices using an exclusive write. To de-
termine the new indices for elements with a 0 in the bit, we invert the flags
and execute a prescan with integer addition. To determine the new indices of
elements with a 1 in the bit, we execute a +-scan in reverse order (starting at
the top of the vector) and subtract the results from the length of the vector
n. Figure 1.9 shows an example of the split operation along with code to
implement it.
Since the split operation just requires two scan operations, a few steps
of exclusive memory accesses, and a few parallel arithmetic operations, it has
the same asymptotic complexity as the scan: O(n/p + lg p) on an EREW
PRAM.2
Strana 93

procedure split(A, Flags)
  I-down ← +-prescan(not(Flags))
  I-up ← n − +-scan(reverse-order(Flags))
  in parallel for each index i
    if (Flags[i])
      Index[i] ← I-up[i]
    else
      Index[i] ← I-down[i]
  result ← permute(A, Index)

A                 = [5 7 3 1 4 2 7 2]
Flags             = [1 1 1 1 0 0 1 0]
I-down            = [0 0 0 0 0 1 2 2]
I-up              = [3 4 5 6 6 6 7 7]
Index             = [3 4 5 6 0 1 7 2]
permute(A, Index) = [4 2 2 5 7 3 1 7]

FIGURE 1.9
The split operation packs the elements with a 0 in the corresponding flag
position to the bottom of a vector, and packs the elements with a 1 to the
top of the same vector. The permute writes each element of A to the index
specified by the corresponding position in Index.

If we assume that n keys are each O(lg n) bits long, then the overall
algorithm runs in time:

O((n/p + lg p) lg n) = O((n/p) lg n + lg n lg p).
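The split and the radix sort over it can be sketched in Python, with sequential stand-ins for the prescan, the reverse-order scan, and the permute of Figure 1.9:

```python
def plus_scan(a):
    out, acc = [], 0
    for x in a:
        acc += x
        out.append(acc)
    return out

def split(a, flags):
    """Pack flag-0 elements to the bottom, flag-1 elements to the top,
    preserving order within each group."""
    n = len(a)
    # +-prescan of the inverted flags gives destinations of the 0s.
    i_down = [0] + plus_scan([1 - f for f in flags])[:-1]
    # Reverse-order +-scan, subtracted from n, gives destinations of the 1s.
    i_up = [n - s for s in reversed(plus_scan(list(reversed(flags))))]
    index = [i_up[i] if flags[i] else i_down[i] for i in range(n)]
    out = [None] * n
    for i in range(n):                 # the permute (exclusive write)
        out[index[i]] = a[i]
    return out

def split_radix_sort(a, bits):
    for b in range(bits):
        a = split(a, [(x >> b) & 1 for x in a])
    return a

print(split(a=[5, 7, 3, 1, 4, 2, 7, 2], flags=[1, 1, 1, 1, 0, 0, 1, 0]))
# [4, 2, 2, 5, 7, 3, 1, 7]
print(split_radix_sort([5, 7, 3, 1, 4, 2, 7, 2], 3))
# [1, 2, 2, 3, 4, 5, 7, 7]
```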
1.4 Recurrence Equations
This section shows how various recurrence equations can be solved using
the scan operation. A recurrence is a set of equations of the form
xi = fi (xi−1 , xi−2 , · · · , xi−m ), m≤i<n (1.3)
2 On a CREW PRAM we can use the scan described in Chapter 4 to get a time of
O(n/p + lg p / lg lg p).
Strana 94
along with a set of initial values x0 , · · · , xm−1 .
The scan operation is the special case of a recurrence of the form
    xi = { a0              i = 0
         { xi−1 ⊕ ai       0 < i < n,                            (1.4)
where ⊕ is any binary associative operator. This section shows how to reduce
a more general class of recurrences to equation (1.4), and therefore how to
use the scan algorithm discussed in Section 1.2 to solve these recurrences in
parallel.
1.4.1 First-Order Recurrences
We initially consider first-order recurrences of the following form
    xi = { b0                    i = 0
         { (xi−1 ⊗ ai) ⊕ bi      0 < i < n,                      (1.5)
where the ai ’s and bi ’s are sets of n arbitrary constants (not necessarily scalars)
and ⊕ and ⊗ are arbitrary binary operators that satisfy three restrictions:
1. ⊕ is associative (i.e. (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)).
2. ⊗ is semiassociative (i.e. there exists a binary associative operator
⊙ such that (a ⊗ b) ⊗ c = a ⊗ (b ⊙ c)).
3. ⊗ distributes over ⊕ (i.e. a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)).
The operator ⊙ is called the companion operator of ⊗. If ⊗ is fully associative,
then ⊙ and ⊗ are equivalent.
We now show how (1.5) can be reduced to (1.4). Consider the set of
pairs
ci = [ai , bi ] (1.6)
and define a new binary operator • as follows:
ci • cj ≡ [ci,a ⊙ cj,a , (ci,b ⊗ cj,a ) ⊕ cj,b ] (1.7)
where ci,a and ci,b are the first and second elements of ci , respectively.
Given the conditions on the operators ⊕ and ⊗, the operator • is asso-
ciative as we show below:
(ci • cj ) • ck
= [ci,a ⊙ cj,a , (ci,b ⊗ cj,a ) ⊕ cj,b ] • ck
= [(ci,a ⊙ cj,a ) ⊙ ck,a , (((ci,b ⊗ cj,a ) ⊕ cj,b ) ⊗ ck,a ) ⊕ ck,b ]
Strana 95
= [ci,a ⊙ (cj,a ⊙ ck,a ), ((ci,b ⊗ cj,a ) ⊗ ck,a ) ⊕ ((cj,b ⊗ ck,a ) ⊕ ck,b )]
= [ci,a ⊙ (cj,a ⊙ ck,a ), (ci,b ⊗ (cj,a ⊙ ck,a )) ⊕ ((cj,b ⊗ ck,a ) ⊕ ck,b )]
= ci • [cj,a ⊙ ck,a , (cj,b ⊗ ck,a ) ⊕ ck,b ]
= ci • (cj • ck )
We now define the ordered set si = [yi , xi ], where the yi obey the recurrence

    yi = { a0              i = 0
         { yi−1 ⊙ ai       0 < i < n,                            (1.8)
and the xi are from (1.5). Using (1.5), (1.6) and (1.8) we obtain:
s0 = [y0 , x0 ]
= [a0 , b0 ]
= c0
si = [yi , xi ] 0<i<n
= [yi−1 ⊙ ai , (xi−1 ⊗ ai ) ⊕ bi ]
= [yi−1 ⊙ ci,a , (xi−1 ⊗ ci,a ) ⊕ ci,b ]
= [yi−1 , xi−1 ] • ci
= si−1 • ci .
Since • is associative, we have reduced (1.5) to (1.4). The results xi are just
the second values of si (the si,b ). This allows us to use the scan algorithm of
Section 1.2 with operator • to solve any recurrence of the form (1.5) on an
EREW PRAM in time:
(T⊙ + T⊗ + T⊕ )TS (n, p) = 2(T⊙ + T⊗ + T⊕ )(n/p + lg p) (1.9)
where T⊙ , T⊗ and T⊕ are the times taken by ⊙, ⊗ and ⊕ (• makes one call to
each). If all that is needed is the final value xn−1 , then we can use a reduce
instead of scan with the operator •, and the running time is:
(T⊙ + T⊗ + T⊕ )TR (n, p) = (T⊙ + T⊗ + T⊕ )(n/p + lg p) (1.10)
which is asymptotically a factor of 2 faster than (1.9).
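A concrete instance: xi = ai·xi−1 + bi over ordinary numbers, where ⊗ is ×, ⊕ is +, and the companion operator ⊙ is also × (since × is fully associative). The accumulation below applies the pair operator • of (1.7) left to right; it is a sequential stand-in for the parallel scan of Section 1.2:

```python
def pair_scan(a, b):
    """Solve x_i = a_i * x_{i-1} + b_i with x_0 = b_0 (a_0 unused, as
    in (1.5)), via the pair operator
        c_i • c_j = (a_i * a_j, b_i * a_j + b_j)."""
    pairs = list(zip(a, b))
    acc = pairs[0]                       # s_0 = c_0
    out = [acc]
    for ai, bi in pairs[1:]:             # s_i = s_{i-1} • c_i
        acc = (acc[0] * ai, acc[1] * ai + bi)
        out.append(acc)
    return [x for _, x in out]           # x_i is the second component

print(pair_scan([0, 2, 3, 1], [1, 1, 2, 5]))  # [1, 3, 11, 16]
```

Checking directly: x0 = 1, x1 = 2·1 + 1 = 3, x2 = 3·3 + 2 = 11, x3 = 1·11 + 5 = 16.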
Applications of first-order linear recurrences include the simulation of
various time-varying linear systems, the backsubstitution phase of tridiagonal
linear-systems solvers, and the evaluation of polynomials.
Strana 96
1.4.2 Higher Order Recurrences
We now consider the more general order-m recurrences of the form:

    xi = { bi                                              0 ≤ i < m
         { (xi−1 ⊗ ai,1) ⊕ · · · ⊕ (xi−m ⊗ ai,m) ⊕ bi      m ≤ i < n     (1.11)
where ⊕ and ⊗ are binary operators with the same three restrictions as
in (1.5): ⊕ is associative, ⊗ is semiassociative, and ⊗ distributes over ⊕.
To convert this equation into the form (1.5), we define the following
vector of variables:
si = [ xi · · · xi−m+1 ]. (1.12)
Using (1.11) we can write (1.12) as:

                                  ⎡ ai,1     1    0   ···   0 ⎤
                                  ⎢ ai,2     0    1         0 ⎥
    si = [ xi−1 · · · xi−m ] ⊗(v) ⎢   ⋮                ⋱    0 ⎥ ⊕(v) [ bi 0 · · · 0 ]
                                  ⎢ ai,m−1   0   ···   0    1 ⎥
                                  ⎣ ai,m     0   ···   0    0 ⎦

       = (si−1 ⊗(v) Ai ) ⊕(v) Bi                                 (1.13)
where ⊗(v) is vector-matrix multiply and ⊕(v) is vector addition. If we use
matrix-matrix multiply as the companion operator of ⊗(v) , then (1.13) is in
the form (1.5). The time taken for solving equations of the form (1.11) on an
EREW PRAM is therefore:
(Tm⊗m (m)+Tv⊗m (m)+Tv⊕v (m))TS (n, p) = O((n/p+lg p)Tm⊗m (m)) (1.14)
where Tm⊗m (m) is the time taken by an m ⊗ m matrix multiply. The sequen-
tial complexity for solving the equations is O(nm), so the parallel complexity
is optimal in n when n/p ≥ lg p, but is not optimal in m—the parallel algo-
rithm performs a factor of O(TM⊗M (m)/m) more work than the sequential
algorithm.
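For a concrete order-2 instance, take xi = xi−1 + xi−2 with x0 = x1 = 1, the Fibonacci recurrence; form (1.13) then accumulates 2×2 companion matrices under matrix multiply. The loop below is a sequential sketch of that scan:

```python
def mat_mul(A, B):
    """2x2 matrix product; matrix multiply is the companion operator."""
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def order2_scan(n):
    """x_0..x_{n-1} for x_i = x_{i-1} + x_{i-2}, x_0 = x_1 = 1; n >= 2."""
    A = [[1, 1], [1, 0]]             # companion matrix of the recurrence
    s = [1, 1]                       # s_1 = [x_1, x_0]
    xs = [1, 1]
    acc = [[1, 0], [0, 1]]           # identity matrix
    for _ in range(n - 2):
        acc = mat_mul(acc, A)        # associative, hence scannable in parallel
        xs.append(s[0] * acc[0][0] + s[1] * acc[1][0])
    return xs

print(order2_scan(8))  # [1, 1, 2, 3, 5, 8, 13, 21]
```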
Applications of the recurrence (1.11) include solving recurrences of the
form xi = ai + bi /xi−1 (see problem 10), and generating the first n Fibonacci
numbers x0 = x1 = 1, xi = xi−1 + xi−2 (see problem 11).
Strana 97
a = [5 1 3 4 3 9 2 6]
f = [1 0 1 0 0 0 1 0]
segmented +-scan = [5 6 3 7 10 19 2 8]
segmented max-scan = [5 5 3 4 4 9 2 6]
FIGURE 1.10
The segmented scan operations restart at the beginning of each segment. The
vector f contains flags that mark the beginning of the segments.
1.5 Segmented Scans
This section shows how the vector operated on by a scan can be broken into
segments with flags so that the scan starts again at each segment boundary
(see Figure 1.10). Each of these scans takes two vectors of values: a data vector
and a flag vector. The segmented scan operations present a convenient way
to execute a scan independently over many sets of values. The next section
shows how the segmented scans can be used to execute a parallel quicksort,
by keeping each recursive call in a separate segment, and using a segmented
+-scan to execute a split within each segment.
The segmented scans satisfy the recurrence:
    xi = { a0                    i = 0
         { ai                    0 < i < n, fi = 1
         { (xi−1 ⊕ ai)           0 < i < n, fi = 0               (1.15)
where ⊕ is the original associative scan operator. If ⊕ has an identity I⊕ ,
then (1.15) can be written as:
    xi = { a0                      i = 0
         { (xi−1 ×s fi) ⊕ ai       0 < i < n,                    (1.16)
where ×s is defined as:
    x ×s f = { I⊕      f = 1
             { x       f = 0.                                    (1.17)
This is in the form (1.5) and ×s is semiassociative with logical or as the
companion operator (see Problem 9). Since we have reduced (1.15) to the
Strana 98
form (1.5), we can use the technique described in Section 1 to execute the
segmented scans in time
(Tor + T×s + T⊕ )TS (n, p) . (1.18)
This time complexity is only a small constant factor greater than the
unsegmented version since or and ×s are trivial operators.
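A sequential Python sketch of the segmented-scan semantics of (1.15), producing the rows of Figure 1.10:

```python
def segmented_scan(op, a, flags):
    """Inclusive scan that restarts wherever flags[i] == 1."""
    out, acc = [], None
    for x, f in zip(a, flags):
        acc = x if (f == 1 or acc is None) else op(acc, x)
        out.append(acc)
    return out

a = [5, 1, 3, 4, 3, 9, 2, 6]
f = [1, 0, 1, 0, 0, 0, 1, 0]
print(segmented_scan(lambda x, y: x + y, a, f))  # [5, 6, 3, 7, 10, 19, 2, 8]
print(segmented_scan(max, a, f))                 # [5, 5, 3, 4, 4, 9, 2, 6]
```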
1.5.1 Example: Quicksort
To illustrate the use of segmented scans, we consider a parallel version
of quicksort. Similar to the standard sequential version, the parallel version
picks one of the keys as a pivot value, splits the keys into three sets—keys
lesser, equal and greater than the pivot—and recurses on each set.3 The
parallel algorithm has an expected time complexity of O(TS (n, p) lg n) =
O((n/p) lg n + lg² n).
The basic intuition of the parallel version is to keep each subset in its
own segment, and to pick pivot values and split the keys independently within
each segment. Figure 1.11 shows pseudocode for the parallel quicksort and
gives an example. The steps of the sort are outlined as follows:
1. Check if the keys are sorted and exit the routine if they are.
Each processor checks to see if the previous processor has a lesser
or equal value. We execute a reduce with logical and to check if all
the elements are in order.
2. Within each segment, pick a pivot and distribute it to the other
elements.
If we pick the first element as a pivot, we can use a segmented scan
with the binary operator copy, which returns the first of its two
arguments:
a ← copy(a, b) .
This has the effect of copying the first element of each segment
across the segment. The algorithm could also pick a random ele-
ment within each segment (see Problem 15).
3. Within each segment, compare each element with the pivot and
split based on the result of the comparison.
For the split, we can use a version of the split operation described in
Section 1.3 which splits into three sets instead of two, and which is
3 We do not need to recursively sort the keys equal to the pivot, but the algorithm as
described below does.
Strana 99
procedure quicksort(keys)
  seg-flags[0] ← 1
  while not-sorted(keys)
    pivots ← seg-copy(keys, seg-flags)
    f ← pivots <=> keys
    keys ← seg-split(keys, f, seg-flags)
    seg-flags ← new-seg-flags(keys, pivots, seg-flags)
Key = [6.4 9.2 3.4 1.6 8.7 4.1 9.2 3.4]
Seg-Flags = [1 0 0 0 0 0 0 0]
Pivots = [6.4 6.4 6.4 6.4 6.4 6.4 6.4 6.4]
F = [= > < < > < > <]
Key ← split(Key, F) = [3.4 1.6 4.1 3.4 6.4 9.2 8.7 9.2]
Seg-Flags = [1 0 0 0 1 1 0 0]
Pivots = [3.4 3.4 3.4 3.4 6.4 9.2 9.2 9.2]
F = [= < > = = = < =]
Key ← split(Key, F) = [1.6 3.4 3.4 4.1 6.4 8.7 9.2 9.2]
Seg-Flags = [1 1 0 1 1 1 1 0]
FIGURE 1.11
An example of parallel quicksort. On each step, within each segment, we dis-
tribute the pivot, test whether each element is equal-to, less-than or greater-
than the pivot, split into three groups, and generate a new set of segment
flags. The operation <=> returns one of three values depending on whether
the first argument is less than, equal to or greater than the second.
segmented. To implement such a segmented split, we can use a seg-
mented version of the +-scan operation to generate indices relative
to the beginning of each segment, and we can use a segmented copy-
scan to copy the offset of the beginning of each segment across the
segment. We then add the offset to the segment indices to generate
the location to which we permute each element.
4. Within each segment, insert additional segment flags to separate
the split values.
Knowing the pivot value, each element can determine if it is at the
beginning of the segment by looking at the previous element.
5. Return to step 1.
Strana 100
Each iteration of this sort requires a constant number of calls to the scans
and to the primitives of the PRAM. If we select pivots randomly within each
segment, quicksort is expected to complete in O(lg n) iterations, and therefore
has an expected running time of O(lg n · TS (n, p)).
The technique of recursively breaking segments into subsegments and
operating independently within each segment can be used for many other
divide-and-conquer algorithms, such as mergesort.
1.6 Allocating Processors
Consider the following problem: given a set of processors, each contain-
ing an integer, allocate that integer number of new processors to each initial
processor. Such allocation is necessary in the parallel line-drawing routine
described in Section 1. In this line-drawing routine, each processor calcu-
lates the number of pixels in the line and dynamically allocates a processor
for each pixel. Allocating new elements is also useful for the branching part
of many branch-and-bound algorithms. Consider, for example, a brute force
chess-playing algorithm that executes a fixed-depth search of possible moves
to determine the best next move. We can test or search the moves in parallel
by placing each possible move in a separate processor. Since the algorithm
dynamically decides how many next moves to generate (depending on the
position), we need to dynamically allocate new processing elements.
More formally, given a length-l vector A with integer elements ai, allocation
is the task of creating a new vector B of length

          l−1
    L  =  Σ  ai                                                  (1.19)
          i=0
with ai elements of B assigned to each position i of A. By assigned to, we
mean that there must be some method for distributing a value at position i of
a vector to the ai elements which are assigned to that position. Since there is
a one-to-one correspondence between elements of a vector and processors, the
original vector requires l processors and the new vector requires L processors.
Typically, an algorithm does not operate on the two vectors at the same time,
so that we can use the same processors for both.
Allocation can be implemented by assigning a contiguous segment of
elements to each position i of A. To allocate segments we execute a +-prescan
on the vector A that returns a pointer to the start of each segment (see
Figure 1.12).
Strana 101

V                        = [v1 v2 v3]
A                        = [4  1  3]
Hpointers ← +-prescan(A) = [0  4  5]

Segment-flag             = [1  0  0  0  1  1  0  0]
distribute(V, Hpointers) = [v1 v1 v1 v1 v2 v3 v3 v3]
index(Hpointers)         = [0  1  2  3  0  0  1  2]

FIGURE 1.12
An example of processor allocation. The vector A specifies how many new
elements each position needs. We can allocate a segment to each position by
applying a +-prescan to A and using the result as pointers to the beginning
of each segment. We can then distribute the values of V to the new elements
with a permute to the beginning of the segment and a segmented copy-scan
across the segment.

We can then generate the appropriate segment flags by writing a
flag to the index specified by the pointer. To distribute values from each posi-
tion i to its segment, we write the values to the beginning of the segments and
use a segmented copy-scan operation to copy the values across the segment.
Allocation and distribution each require one call to a scan and therefore have
complexity TS (l, p) and TS (L, p) respectively.
Once a segment has been allocated for each initial element, it is often
necessary to generate indices within each segment. We call this the index
operation, and it can be implemented with a segmented +-prescan.
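The allocate-distribute-index sequence of Figure 1.12 can be sketched as follows; this is sequential Python, and the inner loop stands in for the permute, the segmented copy-scan, and the segmented +-prescan:

```python
def allocate(values, counts):
    """Distribute values[i] to counts[i] new slots and give each slot
    an index within its segment."""
    hpointers, acc = [], 0
    for c in counts:                   # +-prescan of the counts
        hpointers.append(acc)
        acc += c
    total = acc                        # L = sum of the counts
    dist = [None] * total
    index = [0] * total
    for i, v in enumerate(values):     # permute + segmented scans
        for j in range(counts[i]):
            dist[hpointers[i] + j] = v
            index[hpointers[i] + j] = j
    return dist, index

d, ix = allocate(["v1", "v2", "v3"], [4, 1, 3])
print(d)   # ['v1', 'v1', 'v1', 'v1', 'v2', 'v3', 'v3', 'v3']
print(ix)  # [0, 1, 2, 3, 0, 0, 1, 2]
```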
1.6.1 Example: Line Drawing
As an example of how allocation is used, consider line drawing. The
line-drawing problem is: given a set of pairs of points
    ⟨(x0, y0) : (x̂0, ŷ0)⟩, . . . , ⟨(xn−1, yn−1) : (x̂n−1, ŷn−1)⟩,

generate all the locations of pixels that lie between one of the pairs of points.
Figure 1.13 illustrates an example. The routine we discuss returns a vector of
56 Chapter 1. Prefix Sums and Their Applications
procedure line-draw(p1, p2)
    in parallel for each line i
        % determine the length of the line
        lengths[i] ← max(|p2[i].x − p1[i].x|, |p2[i].y − p1[i].y|)
        % determine the x and y increments
        ∆[i].x ← (p2[i].x − p1[i].x) / lengths[i]
        ∆[i].y ← (p2[i].y − p1[i].y) / lengths[i]
    % distribute values and generate index
    p′1 ← distribute(p1, lengths)
    ∆′ ← distribute(∆, lengths)
    index ← index(lengths)
    in parallel for each pixel j
        % determine the final position
        result[j].x ← p′1[j].x + round(index[j] × ∆′[j].x)
        result[j].y ← p′1[j].y + round(index[j] × ∆′[j].y)
FIGURE 1.13
The pixels generated by a line drawing routine. In this example the endpoints
are ⟨(11, 2) : (23, 14)⟩, ⟨(2, 13) : (13, 8)⟩, and ⟨(16, 4) : (31, 4)⟩. The algorithm
allocates 12, 11 and 16 pixels respectively for the three lines.
(x, y) pairs that specify the position of each pixel along every line. If a pixel
appears in more than one line, it will appear more than once in the vector.
The routine generates the same set of pixels as the simple digital differential
analyzer (DDA) sequential technique.
The basic idea of the routine is for each line to allocate a processor
for each pixel in the line, and then for each allocated pixel to determine, in
parallel, its final position in the grid. Figure 1.13 shows the code. To allocate
a processor for each pixel, each line must first determine the number of pixels
in the line. This number can be calculated by taking the maximum of the x
and y differences of the line’s endpoints. Each line now allocates a segment of
processors for its pixels, and distributes one endpoint along with the per-pixel
x and y increments across the segment. We now have one processor for each
pixel and one segment for each line. We can view the position of a processor
in its segment as the position of a pixel in its line. Based on the endpoint, the
slope, and the position in the line (determined with an index operation), each
pixel can determine its final (x, y) location in the grid.
This routine has the same complexity as a scan, TS(m, p), where m is
the total number of pixels. To actually place the points on a grid, rather than
just generating their position, we would need to permute a flag to a position
based on the location of the point. In general, this will require the simplest
form of concurrent-write (one of the values gets written), since a pixel might
appear in more than one line.
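The routine in Figure 1.13 can be emulated serially; line_draw below follows the pseudocode step for step, with loop-based stand-ins for the distribute and index primitives. Representing points as (x, y) tuples and assuming each line has distinct endpoints (so its length is nonzero) are our choices for this sketch, not the text's.

```python
def distribute(v, a):
    # Copy v[i] across a segment of a[i] slots (serial stand-in).
    return [val for val, count in zip(v, a) for _ in range(count)]

def index(a):
    # Per-segment indices 0, 1, ..., a[i]-1 (serial stand-in).
    return [j for count in a for j in range(count)]

def line_draw(p1, p2):
    # One length per line: the max of the x and y differences.
    lengths = [max(abs(q[0] - p[0]), abs(q[1] - p[1]))
               for p, q in zip(p1, p2)]
    # Per-line x and y increments (assumes lengths are nonzero).
    deltas = [((q[0] - p[0]) / n, (q[1] - p[1]) / n)
              for p, q, n in zip(p1, p2, lengths)]
    # Allocate one slot per pixel; spread the start point and the
    # increment across each line's segment, and index the segment.
    p1d = distribute(p1, lengths)
    dd = distribute(deltas, lengths)
    idx = index(lengths)
    # Each pixel: start point plus round(index * increment).
    return [(p[0] + round(i * d[0]), p[1] + round(i * d[1]))
            for p, d, i in zip(p1d, dd, idx)]
```

For example, line_draw([(0, 0)], [(3, 3)]) yields the three pixels (0, 0), (1, 1), (2, 2), one per unit of the line's length.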
1.7
Exercises
1.1 Modify the algorithm in Figure 1.4 to execute a scan instead of a prescan.
1.2 Use the scan operation to compare two strings of length n in O(n/p + lg p)
time on an EREW PRAM.
1.3 Given two vectors of bits that represent nonnegative integers, show how a
prescan can be used to add the two numbers (return a vector of bits that
represents the sum of the two numbers).
1.4 Trace the steps of the split-radix sort on the vector
[2 11 4 5 9 6 15 3].
1.5 Show that subtraction is semiassociative and find its companion operator.
1.6 Write a recurrence equation of the form (1.5) that evaluates a polynomial
y = b_1 x^{n−1} + b_2 x^{n−2} + · · · + b_{n−1} x + b_n
for a given x.
1.7 Show that if ⊗ has an inverse, the recurrence of the form (1.5) can be solved
with some local operations (not involving communication among processors)
and two scan operations (using ⊗ and ⊕ as the operators).
1.8 Prove that vector-matrix multiply is semiassociative.
1.9 Prove that the operator ×s defined in (1.17) is semiassociative.
1.10 Show how the recurrence x(i) = a(i) + b(i)/x(i − 1), where + is numeric
addition and / is division, can be converted into the form (1.11) with two
terms (m = 2).
1.11 Use a scan to generate the first n Fibonacci numbers.
1.12 Show how to solve a tridiagonal linear system using the recurrences in
Section 1.4. Is the algorithm asymptotically optimal?
1.13 In the language Common Lisp, the ; character means that what follows the
character up to the end of the line is a comment. Use the scan operation to
mark all the comment characters (everything between a ; and an end-of-line).
1.14 Trace the steps of the parallel quicksort on the vector
[27 11 51 5 49 36 15 23].
1.15 Describe how quicksort is changed so that it selects a random element within
each segment for a pivot.
1.16 Design an algorithm that given the radius and number of sides on a regular
polygon, determines all the pixels that outline the polygon.
Notes and References
The all-prefix-sums operation has been around for centuries as the recurrence
x_i = a_i + x_{i−1}. A parallel circuit to execute the scan operation was first
suggested by Ofman (1963) for the addition of binary numbers. A parallel
implementation of scans on a perfect shuffle network was later suggested by
Stone (1971) for polynomial evaluation. The optimal algorithm discussed in
Section 1.2 is a slight variation of algorithms suggested by Kogge and Stone
(1973) and by Stone (1975) in the context of recurrence equations.
Ladner and Fischer (1980) first showed an efficient general-purpose circuit
for implementing the scan operation. Brent and Kung (1980), in the
context of binary addition, first showed an efficient VLSI layout for a scan
circuit. More recent work on implementing scan operations in parallel include
the work of Fich (1983) and of Lakshmivarahan, Yang and Dhall (1987), which
give improvements over the circuit of Ladner and Fischer, and of Lubachevsky
and Greenberg (1987), which demonstrates the implementation of the scan
operation on asynchronous machines. Blelloch (1987) suggested that certain
scan operations be included in the PRAM model as primitives and shows how
this affects the complexity of various algorithms. Work on the linked-list-
based all-prefix-sums operation is considered and referenced in Chapters 2, 3
and 4.
The line-of-sight and radix-sort algorithms are discussed by Blelloch
(1988, 1990). The parallel solution of recurrence problems was first discussed
by Karp, Miller and Winograd (1967), and parallel algorithms to solve them
are given by Kogge and Stone (1973), Stone (1973, 1975) and Chen and Kuck
(1975). Hyafil and Kung (1977) show that the complexity (1.10) is a lower
bound.
Schwartz (1980) and, independently, Mago (1979) first suggested the
segmented versions of the scans. Blelloch (1990) suggested many uses of
these scans, including the quicksort algorithm and the line-drawing algorithm
presented in Sections 1.5 and 1.6.
I would like to thank Siddhartha Chatterjee, Jonathan Hardwick and
Jay Sipelstein for reading over drafts of this chapter.
Bibliography
Blelloch, G.E., Scans as Primitive Parallel Operations. IEEE Transactions on Com-
puters, C-38(11):1526–1538, November 1989.
Blelloch, G.E., Vector Models for Data-Parallel Computing. MIT Press, Cambridge,
MA, 1990.
Blelloch, G.E., and Little, J.J., Parallel Solutions to Geometric Problems on the Scan
Model of Computation. In Proceedings International Conference on Parallel
Processing, pages Vol 3: 218–222, August 1988.
Brent, R.P., and Kung, H.T., The Chip Complexity of Binary Arithmetic. In Pro-
ceedings ACM Symposium on Theory of Computing, pages 190–200, 1980.
Chen, S., and Kuck, D.J., Time and Parallel Processor Bounds for Linear Recurrence
Systems. IEEE Transactions on Computers, C-24(7), July 1975.
Fich, F.E., New Bounds for Parallel Prefix Circuits. In Proceedings ACM Symposium
on Theory of Computing, pages 100–109, April 1983.
Hyafil, L., and Kung, H.T., The Complexity of Parallel Evaluation of Linear Recur-
rences. Journal of the Association for Computing Machinery, 24(3):513–521,
July 1977.
Karp, R.M., Miller, R.E., and Winograd, S., The Organization of Computations for
Uniform Recurrence Equations. Journal of the Association for Computing
Machinery, 14:563–590, 1967.
Kogge, P.M., and Stone, H.S., A Parallel Algorithm for the Efficient Solution of a
General Class of Recurrence Equations. IEEE Transactions on Computers,
C-22(8):786–793, August 1973.
Ladner, R.E., and Fischer, M.J., Parallel Prefix Computation. Journal of the Asso-
ciation for Computing Machinery, 27(4):831–838, October 1980.
Lakshmivarahan, S., Yang, C.M., and Dhall, S.K., Optimal Parallel Prefix Circuits
with (size + depth) = 2n − 2 and ⌈log n⌉ ≤ depth ≤ ⌈2 log n⌉ − 3. In Proceedings
International Conference on Parallel Processing, pages 58–65, August 1987.
Lubachevsky, B.D., and Greenberg, A.G., Simple, Efficient Asynchronous Parallel
Prefix Algorithms. In Proceedings International Conference on Parallel Pro-
cessing, pages 66–69, August 1987.
Mago, G.A., A network of computers to execute reduction languages. International
Journal of Computer and Information Sciences, 1979.
Ofman, Y., On the Algorithmic Complexity of Discrete Functions. Soviet Physics
Doklady, 7(7):589–591, January 1963.
Schwartz, J.T., Ultracomputers. ACM Transactions on Programming Languages
and Systems, 2(4):484–521, October 1980.
Stone, H.S., Parallel Processing with the Perfect Shuffle. IEEE Transactions on
Computers, C-20(2):153–161, 1971.
Stone, H.S., An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear
System of Equations. Journal of the Association for Computing Machinery,
20(1):27–38, January 1973.
Stone, H.S., Parallel Tridiagonal Equation Solvers. ACM Transactions on Mathe-
matical Software, 1(4):289–307, December 1975.