8a525c121aa47215ecd4b4844600e00e.ppt
- Number of slides: 50
Chapter 21: Rollback Recovery Protocols in Message-Passing Systems
Course: Dependable Systems Design
Outline
• Introduction
• Definitions
• Checkpointing protocols
• Log-based protocols
• Comparison
• References
Rollback Fault Tolerance
Introduction
• Distributed systems are now everywhere and make many kinds of work possible. Client-server systems, the WWW, and scientific computing are among the examples.
• The potential of these systems is limited by the scale of their computations and by their sensitivity to failures.
• For this reason, many techniques have been developed to give distributed systems reliability and high availability. They include:
  - Transactions: focus on data-oriented applications.
  - Group communications: offer the abstraction of an ideal communication system on which the programmer can build applications reliably.
  - Rollback recovery: focuses on long-running applications, such as scientific computing and communication applications.
System model
• A message-passing system consists of a fixed number of processes that exchange messages.
• The processes cooperate to execute a distributed application and interact with the outside world by receiving input messages and sending output messages.
• Following the fail-stop model, a process may fail by losing its volatile state or by stopping execution.
The concept of rollback recovery in distributed systems
• In such a system, fault tolerance is achieved by saving the state of the processes on stable storage at certain points during failure-free execution, according to a given policy, and by returning to one of those states when a failure occurs. This reduces the amount of lost computation.
• Each saved state is called a checkpoint. The recovery operation that a process performs after a failure in order to return to one of these checkpoints is called rollback recovery.
  - Rollback recovery treats a distributed system as a collection of processes that communicate over a network.
• Rollback recovery protocols fall into two classes:
  - Checkpoint-based: to avoid wasting computation, a checkpoint of each process's state is taken at points chosen by a specific policy. Depending on how checkpoints are taken, these protocols are coordinated, uncoordinated, or communication-induced.
  - Log-based: in addition to checkpointing, the nondeterministic events of the processes are logged so that more of the completed work can be recovered. Depending on how events are logged, these protocols are pessimistic, optimistic, or causal.
Consistent global system state
• Because processes exchange messages, a checkpoint may show that a process has received a message while no checkpoint of any other process shows that the message was sent. Such a message is called an orphan.
• During recovery, the combination of checkpoints to which the processes roll back is called the system state. Depending on whether orphan messages are present, a system state is one of two kinds:
  - A state that contains an orphan message is inconsistent.
  - A state that contains no orphan message and represents a correct execution is consistent.
• The goal of recovery is to find a combination of checkpoints that represents a consistent global state and to roll the system back to it.
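The consistency condition above can be checked mechanically. The following is a minimal Python sketch, assuming each checkpoint records the ids of the messages its process had sent and received when the checkpoint was taken, and that message ids are unique. The `Checkpoint` class and the helper names are illustrative and do not come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    process: str
    sent: set = field(default_factory=set)      # ids of messages sent before the checkpoint
    received: set = field(default_factory=set)  # ids of messages received before the checkpoint

def orphan_messages(checkpoints):
    """Return ids of messages recorded as received but never recorded as sent."""
    all_sent = set().union(*(c.sent for c in checkpoints))
    all_received = set().union(*(c.received for c in checkpoints))
    return all_received - all_sent

def is_consistent(checkpoints):
    # A set of checkpoints is consistent when it contains no orphan message.
    return not orphan_messages(checkpoints)

# P2 recorded receiving m2, but no checkpoint records sending it: m2 is an orphan.
inconsistent = [Checkpoint("P1", sent={"m1"}), Checkpoint("P2", received={"m1", "m2"})]
# m1 was sent but not yet received: it is only in transit, which is allowed.
consistent = [Checkpoint("P1", sent={"m1"}), Checkpoint("P2")]
```

A message that appears only in a `sent` set is in transit and does not make the state inconsistent; only the reverse case does.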
• An in-transit message is a message that has been sent but not yet received. One panel of the figure shows this situation: the message has left the sender and is still travelling through the network.
• An orphan message is a message that has been received but has no sender. The other panel shows this situation: the state of process P2 shows that m2 was received, but the state of process P1 does not reflect sending it.
Z-paths and Z-cycles
• A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints. Examples from the figure: {m1, m2} and {m3, m4}.
• A Z-cycle is a Z-path whose start and end points are the same. Example from the figure: {m5, m3, m4}.
• A checkpoint inside a Z-cycle cannot be part of a consistent state in a system that relies on checkpoints alone.
[Figure: Z-path and Z-cycle]
In-transit messages
• Whether in-transit messages arise depends on whether the system model assumes reliable communication channels.
• Assuming reliable communication simplifies protocol design but makes implementation harder.
Checkpointing information and process dependencies
• In checkpointing, each process periodically saves its state on stable storage. The saved state of a process contains enough information to restart that process.
• In a message-passing system, the messages exchanged during failure-free operation create dependencies between processes, which is what makes rollback recovery complicated.
• Any consistent global set of checkpoints can be used to restart the processes after a failure. The most recent consistent set of checkpoints forms the recovery line; this line is determined during recovery and the system returns to that state.
Rollback propagation and the domino effect
• Processes communicate with each other as needed during execution, and this creates dependencies between them. When one or more processes fail, these dependencies may force other processes to roll back as well, in addition to the failed ones. This phenomenon is called rollback propagation. A consistent global checkpoint can limit rollback propagation.
• If, in some failure scenario, rollback propagation forces all processes back to their initial state, the domino effect has occurred. All computation done before the failure is lost and the system returns to a state in which no work had been done, which is why the domino effect is undesirable.
• To avoid the domino effect, processes must either coordinate their checkpoints, which lets the recovery line advance, or combine checkpointing with event logging.
[Figure: domino effect and rollback propagation; checkpoints numbered 1 to 9 and the initial state]
Checkpointing and the domino effect
• When each process takes its checkpoints independently (uncoordinated checkpointing), the domino effect can occur.
• One way to coordinate checkpointing is for the system to save a system-wide consistent state.
• Another way is communication-induced checkpointing: each process is forced to take checkpoints based on information piggybacked on the messages it receives from other processes.
• With either approach, a system-wide consistent set of checkpoints always exists on stable storage, so the domino effect is avoided.
Interaction with the outside world
• A message-passing system usually interacts with the outside world to receive input data or to present the results of a computation. Unlike a process in the system, however, the outside world cannot be relied on to roll back when a failure occurs.
• Rollback protocols must therefore handle interaction with the outside world specially.
• Before sending output to the outside world, the system must ensure that the state from which the output is sent can be recovered despite any future failure (the output commit problem).
• For input messages, the solution is to save each input message on stable storage before the application is allowed to process it.
Logging Protocols vs. Checkpointing
• Logging is preferred when interaction with the outside world is frequent, because it lets a process replay its execution and commit output to the outside world consistently without paying the high cost of taking a checkpoint before each output.
[Figure: logging compared with checkpointing; labels: orphan message, replay delivery to recover messages, with checkpointing]
Stable Storage & Garbage Collection
• Rollback recovery uses stable storage to save process checkpoints, event logs, and other recovery-related information.
• Garbage collection deletes recovery information that is no longer useful (in effect, a trash bin for checkpoints).
  - One approach is to identify the recovery line and delete all information about events that occurred before that line.
• Running a dedicated algorithm to delete useless information adds overhead to the system.
Uncoordinated checkpointing
• Uncoordinated checkpointing gives each process maximum autonomy in deciding when to take a checkpoint.
  - Main advantage of this autonomy: each process takes its checkpoints when it is most convenient. For example, a process can reduce overhead by checkpointing when the amount of state to be saved is small.
  - Disadvantages:
    1. The domino effect is possible, and it can cause the loss of a large amount of completed work.
    2. A process may take useless checkpoints that will never be part of a consistent global state. Such checkpoints are undesirable because they add overhead and do not advance the recovery line.
    3. Processes are forced to keep multiple checkpoints and to run a garbage collection algorithm periodically to discard checkpoints that are no longer useful.
    4. It is unsuitable for applications that produce output, because output requires global coordination to compute the recovery line.
Dependency information
• Let Ci,x be the x-th checkpoint of process Pi (x is the checkpoint index).
• Let Ii,x denote the interval between checkpoints Ci,x-1 and Ci,x.
• If Pi sends a message m to Pj during Ii,x, it piggybacks the pair (i, x) on m.
• When Pj receives m during Ij,y, it notes the dependency and records it on stable storage when it takes checkpoint Cj,y.
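The piggybacking rule above can be illustrated with a small in-memory sketch. It assumes that every process starts from an initial checkpoint with index 0, so its first interval has index 1, and it models stable storage as a dictionary. The `Process` class and its method names are hypothetical.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1      # current interval index x (assumes an initial checkpoint C_{i,0})
        self.pending = set()   # dependencies noted during the current interval
        self.stable = {}       # checkpoint index -> dependencies saved with that checkpoint

    def send(self, payload):
        # Piggyback the pair (i, x) on the outgoing message.
        return {"payload": payload, "dep": (self.pid, self.interval)}

    def receive(self, message):
        # Note the sender's (i, x); it is saved at the next checkpoint.
        self.pending.add(message["dep"])
        return message["payload"]

    def take_checkpoint(self):
        # C_{j,y} closes interval I_{j,y}; store the interval's dependencies with it.
        self.stable[self.interval] = frozenset(self.pending)
        self.pending = set()
        self.interval += 1

p0, p1 = Process(0), Process(1)
p0.take_checkpoint()             # C_{0,1}; P0 is now in I_{0,2}
p1.receive(p0.send("hello"))     # sent in I_{0,2}, received in I_{1,1}
p1.take_checkpoint()             # C_{1,1} records the dependency on (0, 2)
```

After this run, checkpoint C1,1 carries the pair (0, 2), which is exactly the information a recovering process later asks for.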
How is recovery performed?
• The recovering process sends a dependency request to the other processes (Process 0, Process 1, Process 2, ..., Process N).
• Each process replies with its dependency information.
• The recovering process calculates the recovery line based on the global dependency information and sends a rollback request that carries the recovery line.
• If a process's current state lies on the recovery line, it continues execution; otherwise it rolls back to the checkpoint indicated by the recovery line.
Dependency graph and checkpoint graph in computing the recovery line
• Dependency graph
  - Node: a checkpoint.
  - D-edge from ci,x to cj,y if either:
    - i ≠ j and a message m is sent from Ii,x and received in Ij,y, or
    - i = j and y = x + 1.
• Checkpoint graph
  - When a message is sent from Ii,x to Ij,y, the edge is drawn from ci,x-1 to cj,y (instead of from ci,x to cj,y).
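The dependency graph lends itself to a simple reachability computation. The sketch below is one way to do it under stated assumptions: each process's current volatile state is treated as a virtual checkpoint with the highest index, the volatile states of the failed processes are marked, marks are propagated along D-edges, and the recovery line is the last unmarked checkpoint of each process. The function name and the data layout are assumptions made for illustration.

```python
from collections import defaultdict

def recovery_line(last_index, messages, failed):
    """last_index[i]: index of process i's volatile state (checkpoints 0..last_index[i]-1 are stable).
    messages: tuples (i, x, j, y) for a message sent in I_{i,x} and received in I_{j,y}.
    failed: set of failed process ids."""
    edges = defaultdict(list)
    for i, n in last_index.items():
        for x in range(n):
            edges[(i, x)].append((i, x + 1))   # D-edge within a process (y = x + 1)
    for i, x, j, y in messages:
        edges[(i, x)].append((j, y))           # D-edge created by a message

    marked = set()
    stack = [(i, last_index[i]) for i in failed]   # volatile states of failed processes
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)
        stack.extend(edges[node])

    # Last unmarked checkpoint of every process; index 0 can never be marked.
    return {i: max(x for x in range(n + 1) if (i, x) not in marked)
            for i, n in last_index.items()}

# P0 sent a message in I_{0,2} that P1 received in I_{1,1}; then P0 fails.
line = recovery_line({0: 2, 1: 2}, [(0, 2, 1, 1)], failed={0})
# Without any messages, only the failed process rolls back.
independent = recovery_line({0: 2, 1: 2}, [], failed={0})
```

In the first scenario P0 restarts from C0,1, which does not record the send, so P1 must drop C1,1 (it records an orphan receive) and fall back to C1,0.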
Garbage Collection
• Any checkpoint that precedes the recovery lines for all possible combinations of process failures can be garbage-collected.
• When the recovery line does not advance because of rollback propagation, a large number of unnecessary checkpoints may have to be retained.
1. Mark all volatile checkpoints and remove all edges ending in a marked checkpoint.
2. Use reachability analysis to determine the worst-case recovery line.
Coordinated checkpointing
• A coordinated checkpointing protocol requires the processes to coordinate their checkpoints so that together they form a consistent global state.
  - Advantages:
    - Recovery is simplified.
    - The domino effect does not occur, because every process always restarts from its most recent checkpoint.
    - Each process needs to keep only one checkpoint on stable storage, which reduces storage overhead and removes the need for garbage collection.
  - Main disadvantage:
    - Output commit has a long latency, because a global checkpoint is required before output can be sent to the outside world.
How are checkpoints coordinated?
• Communications are blocked while the checkpointing protocol executes.
• The coordinator takes a checkpoint and sends a request message ("take a checkpoint") to Process 0, Process 1, Process 2, ..., Process N.
• On receiving the request, each process stops execution, flushes all communication channels, takes a tentative checkpoint, and sends an acknowledgment message.
• The coordinator then sends a commit message.
• On receiving the commit, each process removes its old checkpoint and makes the tentative checkpoint permanent; it is then free to resume execution and exchange messages.
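The blocking protocol above can be sketched as a single-threaded simulation. This is only an illustration: there is no real network, channel flushing is assumed to have happened, failures during the protocol are not modelled, and the coordinator is treated as one of the participants. The `Participant` class and `coordinated_checkpoint` function are hypothetical names.

```python
class Participant:
    def __init__(self, name, state=0):
        self.name = name
        self.state = state       # application state (a plain number here)
        self.permanent = None    # last committed checkpoint
        self.tentative = None
        self.blocked = False

    def on_request(self):
        self.blocked = True              # stop execution (channels assumed flushed)
        self.tentative = self.state      # take a tentative checkpoint
        return "ack"

    def on_commit(self):
        self.permanent = self.tentative  # overwrite (remove) the old checkpoint
        self.tentative = None
        self.blocked = False             # free to resume execution

def coordinated_checkpoint(participants):
    # Phase 1: request tentative checkpoints and collect acknowledgments.
    acks = [p.on_request() for p in participants]
    if not all(ack == "ack" for ack in acks):
        return False
    # Phase 2: commit, making every tentative checkpoint permanent.
    for p in participants:
        p.on_commit()
    return True

procs = [Participant(f"P{i}", state=10 * (i + 1)) for i in range(3)]
committed = coordinated_checkpoint(procs)
```

Because every process overwrites its previous checkpoint on commit, each one ends up holding exactly one permanent checkpoint.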
Non-blocking Checkpoint Coordination
• A fundamental problem in coordinated checkpointing is preventing a process from receiving messages that could make the checkpoint inconsistent.
Checkpointing with synchronized clocks
• Synchronized clocks can trigger the local checkpointing actions of all processes at approximately the same time, without a checkpoint initiator.
• A process takes a checkpoint and then waits for a period equal to the sum of the maximum deviation between clocks and the maximum time needed to detect a failure in another process in the system.
• The processes can then be sure that all checkpoints were taken consistently, without exchanging any messages.
Minimal Checkpoint Coordination
• Coordinated checkpointing requires all processes to take part in every checkpointing operation. This requirement limits scalability.
• Reducing the number of processes that take part in a coordinated checkpoint is therefore desirable.
  - During the first phase, the checkpoint initiator identifies all processes with which it has communicated since its last checkpoint and sends them a request.
  - On receiving the request, each process identifies all processes with which it has communicated since its last checkpoint and sends them a request, and so on until no further processes can be identified.
  - During the second phase, all processes identified in the first phase take a checkpoint.
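The first phase above amounts to a transitive closure over the "has communicated with since the last checkpoint" relation. A minimal sketch follows, assuming that relation is available as a dictionary; in a real system each process would know only its own entry and the requests would travel as messages. All names are illustrative.

```python
from collections import deque

def processes_to_checkpoint(initiator, talked_to):
    """talked_to[p]: set of processes p has communicated with since its last checkpoint."""
    identified = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in talked_to.get(p, set()):
            if q not in identified:      # send a request only to newly found processes
                identified.add(q)
                queue.append(q)
    return identified

talked_to = {
    "P0": {"P1"},
    "P1": {"P0", "P2"},
    "P2": {"P1"},
    "P3": {"P4"},   # P3 and P4 only talked to each other
    "P4": {"P3"},
}
participants = processes_to_checkpoint("P0", talked_to)
```

Here P3 and P4 are left out of the checkpoint because nothing links them to the initiator, which is the scalability gain the slide describes.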
Communication-induced Checkpointing
• It avoids the domino effect while still allowing processes to take some of their checkpoints independently.
• Process independence is constrained, however, in order to guarantee that the recovery line keeps advancing. Processes may therefore be forced to take additional checkpoints (forced checkpoints).
  - A forced checkpoint must be taken before the application processes the content of the message, which can cause considerable latency and overhead.
  - In contrast to coordinated checkpointing, no special coordination messages are exchanged.
Log-Based Rollback Recovery
• The execution of a process can be modelled as a sequence of deterministic state intervals, each of which starts with a nondeterministic event.
  - The state at the start of a deterministic interval depends only on the sequence of nondeterministic events that precede the interval.
[Figure: processes P0 and P1 exchanging messages m1 to m4; labels: deterministic interval, nondeterministic event]
Log-Based Rollback Recovery: Concepts
• This approach relies on piecewise determinism. It assumes that all nondeterministic events can be identified and that their determinants can be saved on stable storage.
  - By logging nondeterministic events and replaying them in their original order, a process can deterministically reconstruct its pre-failure state, even if that state was never checkpointed.
• It is especially attractive for applications that interact with the outside world, which consists of input/output devices that cannot roll back.
• Each process still takes checkpoints in order to limit the amount of re-execution during recovery.
  - Log-based rollback recovery can reach a state beyond the most recent consistent set of checkpoints, and it guarantees that the system produces no orphan processes.
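A toy example may help show why logging plus replay reproduces the pre-failure state. The sketch assumes that message deliveries are the only nondeterministic events, that the transition function `step` is deterministic, and that the in-memory `log` list stands in for stable storage. The class and method names are hypothetical.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []   # determinants logged since the last checkpoint

    @staticmethod
    def step(state, value):
        # Any deterministic function of (state, event) would do.
        return state * 10 + value

    def deliver(self, value):
        self.log.append(value)                      # log the determinant first
        self.state = self.step(self.state, value)   # then apply the event

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # older determinants are no longer needed

    def recover(self):
        # Restart from the checkpoint and replay events in their original order.
        state = self.checkpoint
        for value in self.log:
            state = self.step(state, value)
        return state

p = LoggedProcess()
p.deliver(1)
p.take_checkpoint()
p.deliver(2)
p.deliver(3)
pre_failure = p.state
p.state = None            # crash: the volatile state is lost
p.state = p.recover()
```

The recovered state equals the state just before the crash even though the last checkpoint was taken two events earlier.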
Pessimistic versus optimistic logging
• In pessimistic logging, the application must block until the determinant of each nondeterministic event has been logged, and the determinant must be logged before the effect of the event can be seen by other processes or by the outside world.
  - It assumes that a failure can occur after any nondeterministic event.
• In optimistic logging, the application does not block. Determinants are kept in a volatile log and flushed to stable storage asynchronously.
  - It assumes that logging will complete before a failure occurs.
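The difference between the two policies shows up in what survives a crash. The sketch below models stable storage as a list that survives `crash()` and the volatile log as a list that does not; the flush in the optimistic logger is called by hand here, whereas a real system would run it asynchronously. Both classes are illustrative assumptions.

```python
class PessimisticLogger:
    def __init__(self):
        self.stable = []

    def log(self, determinant):
        # Synchronous write: the caller is blocked until the determinant is stable.
        self.stable.append(determinant)

    def crash(self):
        return list(self.stable)      # every logged determinant survives

class OptimisticLogger:
    def __init__(self):
        self.volatile = []
        self.stable = []

    def log(self, determinant):
        self.volatile.append(determinant)   # returns immediately

    def flush(self):
        # Would run asynchronously in a real system.
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash(self):
        self.volatile.clear()         # determinants not yet flushed are lost
        return list(self.stable)

pess = PessimisticLogger()
pess.log("m1")
pess.log("m2")
pess_survivors = pess.crash()

opt = OptimisticLogger()
opt.log("m1")
opt.flush()
opt.log("m2")                         # the crash happens before the next flush
opt_survivors = opt.crash()
```

Losing the determinant of m2 is what can turn processes that depend on it into orphans under optimistic logging.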
Log-based recovery comes in three variants, depending on how determinants are logged
• Pessimistic protocols guarantee that a failure never creates orphans. This simplifies recovery, garbage collection, and output commit, at the cost of higher performance overhead during failure-free operation.
• Optimistic protocols reduce the failure-free performance overhead but allow failures to create orphans. The possibility of orphans complicates recovery, garbage collection, and output commit.
• Causal protocols try to combine the advantages of low performance overhead and fast output commit, but they may require complex recovery and garbage collection.
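How does pessimistic logging recover?
• Each process logs the determinants of the messages it receives; the determinant sets shown in the figure are {m0, m4, m7}, {m1, m3, m6}, and {m2, m5}.
• P1 and P2 fail and restart from their checkpoints.
• They roll forward, using their determinant logs to deliver the same sequence of messages again.
• When recovery is complete, both states Z and Y are consistent with state X, which includes the receipt of message m7 from P1.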
In a pessimistic logging system
• The observable state of every process is always recoverable.
• Advantages:
  - Processes can commit output to the outside world without running a special protocol.
  - After a failure, processes restart from their most recent checkpoint.
    - This limits the extent of re-execution.
  - Recovery is simple, because the effects of a failure are confined to the processes that failed.
    - A process never becomes an orphan, because a failed process always recovers to the state that includes its most recent interaction with other processes or with the outside world.
  - Recovery information can be discarded easily.
    - Older checkpoints and the determinants of nondeterministic events that precede the most recent checkpoint can be deleted.
• The price paid for these advantages is the performance penalty of synchronous logging.
Reducing overhead: Sender-Based Message Logging (SBML)
• SBML keeps the determinants that correspond to the delivery of a message m in the volatile memory of the sender.
• The determinant of m, which includes its content and the order in which it was delivered, is logged in two steps:
  - Before sending m, the sender logs its content in volatile memory.
  - The receiver then responds with an ack that includes the order in which the message was delivered, and the sender adds this ordering information to the determinant.
• SBML can tolerate only one failure and cannot handle nondeterministic events internal to a process.
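The two logging steps can be sketched as follows. This is a simplified single-sender, single-receiver illustration in which message passing is a direct method call; the `Sender` and `Receiver` classes and the `replay_order` helper are hypothetical names, and a full SBML protocol involves more than is shown here.

```python
class Sender:
    def __init__(self):
        self.log = {}   # volatile log: message id -> [content, delivery order]

    def send(self, msg_id, content):
        self.log[msg_id] = [content, None]   # step 1: log the content before sending
        return msg_id, content

    def on_ack(self, msg_id, order):
        self.log[msg_id][1] = order          # step 2: add the ordering information

class Receiver:
    def __init__(self):
        self.delivered = []

    def deliver(self, msg_id, content):
        self.delivered.append(content)
        return msg_id, len(self.delivered)   # the ack carries the delivery order

def replay_order(sender):
    """Messages with complete determinants, in the receiver's original delivery order."""
    entries = [(order, content) for content, order in sender.log.values()
               if order is not None]
    return [content for _, content in sorted(entries)]

s, r = Sender(), Receiver()
s.on_ack(*r.deliver(*s.send("m1", "a")))
s.on_ack(*r.deliver(*s.send("m2", "b")))
before_crash = list(r.delivered)
r = Receiver()                      # the receiver fails and loses its volatile state
for content in replay_order(s):     # the sender's log drives the replay
    r.delivered.append(content)
```

Since the log lives in the sender's volatile memory, a simultaneous failure of sender and receiver would lose it, which matches the single-failure limitation stated above.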
Reducing overhead: Relaxing Logging Atomicity
• A message or event is delivered, and its logging is deferred until the host communicates with other hosts or with the outside world.
  - In P0, the logging of messages m4 and m7 is deferred until P0 communicates with other processes or with the outside world.
  - Messages m4 and m7 are allowed to affect process P0, but the effect is local: no other process and nothing in the outside world can see it until the messages have been logged.
• Under this scheme, logging an event and delivering it need not be performed as a single atomic operation.
  - The scheme can reduce overhead, because several events can be logged in one operation, which lowers the frequency of stable storage access.
  - The latency of interprocess communication and of output commit is not reduced, because a logging operation is usually still required before a message is sent.
How does optimistic logging recover?
• If a process fails, the determinants in its volatile log are lost.
• The optimistic approach does not enforce the always-no-orphans condition.
• Optimistic protocols need to keep multiple checkpoints.
• Because logging is asynchronous, output commit requires coordination among multiple hosts.
[Figure: a failure occurs before m5 is logged; P1 becomes an orphan; P0 rolls back to undo the effects of m7; P2 rolls back; a process restarts from B instead of D; labels: ask to log, log, need to commit output]
Synchronous recovery
• All processes run a recovery protocol to compute the maximum recoverable system state from the dependency information and the logged information, and then perform the rollbacks.
• Direct dependency
  - The sender's state interval index is piggybacked on every outgoing message, so that the receiver can record the dependency created directly by that message.
• Transitive dependency
  - Transitive dependency tracking generally incurs higher overhead for piggybacking on messages and for maintaining the dependency vector, but it gives faster output commit and recovery. It works as follows:
  - Each process Pi maintains a size-N vector TDi, where TDi[i] is Pi's current state interval index, and TDi[j], j ≠ i, records the highest index of any state interval of Pj on which Pi depends.
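The vector TDi described above can be maintained with a small update rule. The slides define the vector but not how it is updated, so the rule used here is an assumption: each delivery starts a new state interval, and every other entry takes the component-wise maximum with the vector piggybacked by the sender. The `TDProcess` class is a hypothetical name.

```python
class TDProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.td = [0] * n    # TD_i; td[pid] is the current state interval index

    def send(self, payload):
        return payload, list(self.td)    # piggyback a copy of TD_i

    def deliver(self, message):
        payload, sender_td = message
        self.td[self.pid] += 1           # a delivery starts a new state interval
        for j, index in enumerate(sender_td):
            if j != self.pid:
                # Keep the highest interval of P_j that this process depends on.
                self.td[j] = max(self.td[j], index)
        return payload

p0, p1, p2 = (TDProcess(i, 3) for i in range(3))
p0.deliver(("input", [0, 0, 0]))   # an input with no dependencies starts I_{0,1}
p1.deliver(p0.send("a"))           # P1 now depends on interval 1 of P0
p2.deliver(p1.send("b"))           # P2 depends on P0 transitively, through P1
```

P2 never received anything from P0, yet its vector records the dependency on P0's interval 1, which is what lets it decide locally whether it has become an orphan.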
Asynchronous Recovery
• Multiple incarnations of the same process may coexist during asynchronous recovery.
• A single failure can cause a process to roll back an exponential number of times: in general, process Pi, i > 0, rolls back 2^(i-1) times in response to P0's failure.
• One approach is to piggyback the original rollback announcement on any subsequent rollback announcement (P1 piggybacks r0 on r1).
[Figure: P0 fails; rollback announcement r1 reaches P2 before r0; [i, x] denotes the x-th interval of the i-th incarnation]
• P0 at state X has logged the determinants of m0, m1, m2, m3, and m4; the determinants of m5 and m6 may be lost.
• The determinant of each event contains the order in which its original receiver delivered the corresponding message.
• P0 is able to guide the recovery of P1 and P2, since it knows the order in which P1 should replay messages m1 and m3 to reach the state from which P1 sends message m4.
• Note that information about m5 and m6 is not available anywhere.
Comparison:
[Table: comparison of the protocols; contents not preserved in the extracted text]
References: Survey
• E. N. Elnozahy, D. B. Johnson, and Y. M. Wang, "A survey of rollback-recovery protocols in message-passing systems," Tech. Rep. No. CMU-CS-96-181, Dept. of Computer Science, Carnegie Mellon University, 1996.
• L. Alvisi and K. Marzullo, "Message Logging: Pessimistic, Optimistic, and Causal," Proceedings of the 15th IEEE International Conference on Distributed Computing Systems, Vancouver, Canada, June 1995, pp. 229-236.
References: Model & Consistency
• K. M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. on Computer Syst., vol. 3, no. 1, pp. 63-75, Feb. 1985.
• Y. M. Wang, A. Lowry, and W. K. Fuchs, "Consistent global checkpoints based on direct dependency tracking," Information Processing Letters, vol. 50, no. 4, pp. 223-230, May 1994.
• Y. M. Wang, "Maximum and minimum consistent global checkpoints and their applications," in Proc. IEEE Symp. Reliable Distributed Syst. (SRDS), pp. 86-95, Sept. 1995.
• Jian Xu and Robert H. B. Netzer, "Necessary and Sufficient Conditions for Consistent Global Snapshots," (cs93-32.ps), IEEE Trans. on PADS, vol. 6, no. 2, February 1995.
• D. Manivannan and M. Singhal, "A Low-overhead Recovery Technique Using Quasi-Synchronous Checkpointing," in Proceedings of the 16th International Conference on Distributed Computing Systems, May 1996, pp. 100-107.
• D. Manivannan, Robert H. B. Netzer, and M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation," (OSU-CISRC-3/96-TR16), IEEE Transactions on Parallel and Distributed Systems, 8(6): 623-627, June 1997.
• D. Manivannan and M. Singhal, "Quasi-Synchronous Checkpointing: Models, Characterization, and Classification," submitted to IEEE Transactions on Parallel and Distributed Systems (1999).
References: Checkpointing (No logging)
• Jian Xu and Robert H. B. Netzer, "Adaptive Independent Checkpointing for Reducing Rollback Propagation," (cs93-25.ps), in Proc. 5th IEEE Symp. on Parallel and Distributed Processing, pp. 754-761, December 1993.
• B. Bhargava and S. R. Lian, "Independent Checkpointing and Concurrent Rollback for Recovery - An Optimistic Approach," in Proc. of IEEE Symp. on Reliable Distributed Syst., pp. 2-12, 1988.
• R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. on Software Eng., vol. SE-13, no. 1, pp. 23-31, Jan. 1987.
• J. L. Kim and T. Park, "An Efficient Protocol for Checkpointing Recovery in Distributed Systems," IEEE Trans. on Parallel and Distributed Syst., vol. 4, no. 8, pp. 955-960, Aug. 1993.
• Y. M. Wang and W. K. Fuchs, "Lazy checkpoint coordination for bounding rollback propagation," in Proc. IEEE Symp. on Reliable Distributed Systems (SRDS-12), pp. 78-85, Oct. 1993.
• Y. M. Wang, P. Y. Chung, I. J. Lin, and W. K. Fuchs, "Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems," IEEE Trans. Parallel and Distributed Syst., 6(5): 546-554, May 1995.
References: Implementation & Performance
• Elmootazbellah Nabil Elnozahy, David B. Johnson, and Willy Zwaenepoel, "The Performance of Consistent Checkpointing," in Proceedings of the 11th Symposium on Reliable Distributed Systems, pp. 39-47, IEEE Computer Society, Houston, TX, October 1992.
• Y. Huang, C. Kintala, and Y. M. Wang, "Software Tools and Libraries for Fault Tolerance," in Bulletin of the Technical Committee on Operating Systems and Application Environment (TCOS), vol. 7, no. 4, pp. 5-9, Winter 1995.
• Y. M. Wang, Y. Huang, K.-P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its applications," in Proc. IEEE Fault-Tolerant Computing Symposium (FTCS-25), pp. 22-31, June 1995.
• Roberto Baldoni, Jean Michel Helary, Achour Mostefaoui, and Michel Raynal, "Consistent Checkpointing in Message Passing Distributed Systems," Institut National de Recherche en Informatique et en Automatique, June 1995.
• Gerard P. Kavanaugh and William H. Sanders, "Performance Analysis of two Time-Based Coordinated Checkpointing Protocols," Center for Reliable & High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; from Pacific Rim International Symposium on Fault-Tolerant Systems, Taipei, Taiwan, December 15-16, 1997.
• B. Bhargava and S. R. Lian, "Independent checkpointing and concurrent rollback for recovery - An optimistic approach," in Proc. IEEE Symp. Reliable Distributed Syst., pp. 3-12, 1988.
References: Miscellaneous
• E. Cohen, Y. M. Wang, and G. Suri, "When piecewise determinism is almost true," in Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pp. 66-71, Dec. 1995.
• Y. M. Wang, P. Y. Chung, and W. K. Fuchs, "Tight upper bound on useful distributed system checkpoints," Tech. Rep. CRHC-95-16, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1995.
• B. Ramamurthy, S. Upadhyaya, and B. Bhargava, "Design and analysis of an integrated checkpointing and recovery scheme for distributed applications," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, March-April 2000, pp. 174-186.
• Dan Pei, Dongsheng Wang, Meiming Shen, and Weimin Zheng, "Design and implementation of a low-overhead file checkpointing approach," High Performance Computing in the Asia-Pacific Region, 2000, Proceedings of the Fourth International Conference/Exhibition on, vol. 1, 2000, pp. 439-441.
• K. Z. Meth and W. G. Tuel, "Parallel checkpoint/restart without message logging," International Workshops on Parallel Processing, 2000, pp. 253-258.
• Yi Zhang and Jianping Hu, "Checkpointing and process migration in network computing environment," Info-tech and Info-net, 2001, Proceedings, ICII 2001 - Beijing, 2001 International Conferences on, vol. 3, 2001.
• M. Kasbekar and C. R. Das, "Selective checkpointing and rollbacks in multithreaded distributed systems," Distributed Computing Systems, 2001, 21st International Conference on, 2001, pp. 39-46.