What is a lights out data center?

Like the paperless office, the lights-out data center has proven elusive.

February 17, 2021 By Peter Judge

The idea of a lights-out data center has been in circulation for some years, but why would you want one? And will they ever exist?

The underlying idea is simple. A lights-out data center is a fully automated facility which can operate without any staff. The lights can be turned out and the operator saves energy and administration costs.

The idea comes from manufacturing industries, and dates back to a 1955 short story by Philip K Dick. In the Netherlands, a Philips factory makes razors, supervised by a handful of quality control staff, while at the Fanuc factory in Japan, the air-conditioning and heat are turned off for 30 days at a time, to allow robots to make robots undisturbed.

When data centers were first conceived, computer systems needed regular maintenance and care. On-site staff frequently had to step into the chilled white space, to manually reset servers and rewire switches.

But that is changing. IT kit has been getting more reliable, and software-defined networking (SDN) means racks can be cabled up once, with connections made and remade by software. Virtualization makes workloads independent of physical servers, and automation means resets and adjustments can be done remotely.

For ten years or more, admins have been managing servers - hundreds or even thousands of them - from their desks. No one needs to visit the aisles of a data center until hardware changes are needed. The mechanical and electrical parts of the data center are also automated, so chillers run unattended and can prompt engineers or call the manufacturer for occasional preventive maintenance.

It has been regularly observed that data centers waste space and energy by maintaining conditions suitable for humans to work in, and by being set up to meet all their other needs: safety, bathrooms, and secure entry and exit.

The idea of lights out got its first big outing in 2011 when AOL, an Internet provider long past its prime, made a big show announcing a move to a radical model, using small unattended micro-facilities it called the ATC. AOL’s VP of technology Mike Manos, a data center expert who had previously worked at Microsoft, praised lights-out systems in a blog, crediting them with fundamentally changing business as usual.

Lights-out hype

Ten years on, AOL is long gone, and data centers still have an insatiable demand for staff.

It’s true that some ATC ideas have lived on. Data centers often practice "rack and roll", in which racks are shipped to the data center with servers pre-installed.

Designers have pointed out that racks and servers could be placed closer together, and cooling bills could be slashed by running them at a hotter temperature, if not for the need to make the building habitable. Removing oxygen from the air could prevent fires completely, and reduce corrosion.

But by and large, data centers are still mostly large facilities, with staff on-site all the time.

The Uptime Institute, the go-to expert for data center reliability, has always recommended that data centers have staff on-site ready to deal with any problems. "For business objectives that are critical enough to require Tier III or IV facilities, Uptime Institute recommends a minimum of one to two qualified operators on site 24 hours per day, 7 days per week, 365 days per year (24 x 7)," said Richard F. Van Loo in a 2015 Uptime briefing, "Proper Data Center Staffing is Key to Reliable Operations".

There’s been some change since that came out, particularly with providers serving lesser cities, such as EdgeConneX.

"Our whole business premise was based on lights out data centers," EdgeConneX CIO Lance Devin told DCD. "We have 2MW sites, not 100MW behemoths. I can’t afford to put three engineers and 17 security people and two maintenance people in a site like that."

EdgeConneX has wholesale customers, and runs a segmented management system, which gives customers control of the IT hardware, while EdgeConneX manages the power and cooling infrastructure.

It’s not entirely lights-out, but EdgeConneX has remote control security, so customer staff can be buzzed in through a mantrap, without meeting any of the operator’s personnel.

Lockdown and lights-out

Operators with larger facilities haven’t felt the need to do anything like that. But they all have the ability to manage some things remotely - and those powers got tested in 2020, because of the Covid-19 pandemic.

When people were told to stay at home, data center operators saw a big surge in use of remote control services. According to Brent Bensten, CTO at QTS Data Centers, logins to the company’s remote management portal (the service delivery platform, or SDP) jumped by 30 percent in the first three weeks of restrictions, with users spending twice as much time on the system.

Visits were still allowed, but people stayed away, and found that the sites could still operate with much less intervention. A lot of people discovered the value of remote management. As Bensten put it: "Covid-19 is a perfect case to use the tools, so they can do remotely what used to be done on-site."

Lights-out or skills-out?

In many cases, lights-out is a thinly disguised way to de-skill data centers, either as a cost-cutting measure, or as a way to deal with the real difficulty in finding skilled staff.

Schneider Electric’s Steven Carlini promises to explain "Why every data center in future will be lights-out" in a blog post, which in fact argues that companies should make in-house data centers as ‘lights out’ as possible - partly in response to the pandemic, and partly to deal with the shortage of skills.

"Lights out and unmanned may not be entirely accurate," says Carlini, "as security staff will most likely be on-site." He suggests data centers should hire security guards with basic mechanical skills, and have them do plug-and-play hardware replacements: "Companies are already experimenting with Zoom-guided maintenance and repairs."

In a lot of cases, the idea of the lights-out data center has morphed into one where skills aren’t needed.

Underwater exploration

So have truly lights-out data centers ever really existed? There may be facilities operating in this manner, but generally they haven’t spoken to DCD about it. That may be for reasons of secrecy or because, like AOL’s ATC, they failed.

But we do know of one major exception.

Microsoft operated a small (240kW) data center very publicly for two years, with no site visits at all - because that facility was located on the sea bed.

In 2018, a Microsoft research team called Project Natick filled twelve data center racks with servers, loaded them in a pressure vessel, and sank it in the ocean off the coast of Scotland. For two years, the servers were untouched, and the project’s only communication with them was via power and network cables.

When Microsoft retrieved SSDC-002 (subsea data center 2) in 2020, the project had run workloads from the Azure cloud on Natick’s 864 servers and 27.6 petabytes of storage, unattended in a sealed cylinder filled with unreactive nitrogen gas.

"We operated this thing for 25 months and eight days, with nobody touching it," Natick leader David Cutler told DCD. And the results were favorable.

Reliability and Moore’s Law

The underwater servers seem to have been about seven times more reliable than equivalent ones on land. Natick used a batch of second-hand machines, placing 135 in a land-based data center, and the rest in the sub-sea container.

"From the 135 land servers, we lost eight," says Cutler. "In the water, we lost six out of 855." The servers all ran the same tasks and none had any maintenance, but it seems that the vibration and oxygen atmosphere of a standard data center took a toll.
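As a rough check on the reliability comparison, the failure rates implied by the figures Cutler quotes can be worked out directly. This is a minimal sketch rather than anything from the Natick team: the variable names are illustrative, and with so few failures in either group the resulting ratio is only indicative.

```python
# Failure counts as quoted by Cutler; variable names are illustrative.
land_failed, land_total = 8, 135
sea_failed, sea_total = 6, 855

# Fraction of servers lost in each environment over the trial.
land_rate = land_failed / land_total
sea_rate = sea_failed / sea_total

# How many times more often the land-based servers failed.
ratio = land_rate / sea_rate

print(f"land failure rate:   {land_rate:.2%}")
print(f"subsea failure rate: {sea_rate:.2%}")
print(f"land/subsea ratio:   {ratio:.1f}x")
```

On these raw counts the land-based servers failed at roughly eight times the rate of the subsea ones, the same order of magnitude as the "about seven times" figure given for the project.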

One big objection to lights-out operation is the fact that servers and storage need to be replaced periodically, not because they wear out, but because they are obsolete. For decades, IT hardware followed Moore’s Law. With performance per watt doubling every 18 months or so, new servers would pay for themselves every three years, simply through energy costs.
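The payback argument can be made concrete with a little arithmetic. The sketch below assumes, as the paragraph does, that performance per watt doubles every 18 months; the function name and the slower 60-month doubling period are hypothetical, chosen only to illustrate what happens as that pace fades.

```python
def relative_energy(age_months: float, doubling_months: float = 18.0) -> float:
    """Energy a new server needs for a fixed workload, relative to a server
    bought `age_months` earlier, if performance per watt doubles every
    `doubling_months` months."""
    return 0.5 ** (age_months / doubling_months)


# Classic assumption from the text: doubling every 18 months.
three_year_classic = relative_energy(36)

# Hypothetical slower pace: doubling only every 60 months.
three_year_slow = relative_energy(36, doubling_months=60.0)

print(f"18-month doubling: new server uses {three_year_classic:.0%} of the energy")
print(f"60-month doubling: new server uses {three_year_slow:.0%} of the energy")
```

Under the 18-month assumption a three-year-old server needs four times the energy of a new one for the same work, which is what made frequent refreshes pay off. Under the slower, assumed doubling period the gap shrinks to about one and a half times, so the energy case for a three-year refresh is much weaker.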

Now silicon processes are hitting limits, Moore’s Law is coming to its end, and servers will have a longer lifetime. "There is still a very strong case for savings in energy," says Rabih Bashroush, research director at the Uptime Institute, "when replacing servers that are up to nine years old."

Cutler predicts this will make operators move towards lights-out: "A huge percentage of the cost of a data center over its lifetime is the servers. In a post-Moore’s Law world, there’s really no reason to change the infrastructure every two years."

Lights out on the Edge

While conventional data centers remain resolutely staffed, a new development may actually require lights-out operation: the much-hyped area of Edge computing.

New developments like the Internet of Things, and people streaming media and applications to their homes, are leading to a requirement for low-latency resources that are very distributed.

This means a large number of small facilities, placed close to the people and data sources. Most will be much smaller than Natick’s SSDC-002, and some will be weatherproof boxes on lamp-posts.

Servicing Edge capacity will be an economic nightmare, unless site visits can be all but eliminated, much as the telephone network has done for fiber cabinets.

"They will tend to be lights out, like what we did," says Cutler. "When you think about the Edge you’re gonna end up with things that operate on their own. People don’t go there for a long time because it’s too hard to get there."

That takes us right back to the birth of lights-out. When Mike Manos launched the idea at AOL, he was actually talking about Edge facilities, designed to get AOL’s user-driven content out close to customers. In a somewhat ironic dig against the centralized approach of an upstart called Facebook, Manos said AOL was moving to become a big content player: "You do need coverage where the readers are."

Lights-out will require serious technology, but it won’t be glamorous. A set of servers in a box on a wall simply cannot demand love and attention. Lights-out will come in because we will have kit that we simply need to neglect.

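Chinese translation:

Leaf Zhu (祝叶)

General Manager of ICT-Event

Member, DKV (Deep Knowledge Volunteer)

Proofreading:

Eric

Founding member, DKV (Deep Knowledge Volunteer)

Public account statement:

This article is not an officially endorsed Chinese version of the original. It is provided for study and reference only and not for any commercial use. Copyright belongs to DCD and the author, and the English original prevails. Do not reproduce the Chinese version without written authorization from the DeepKnowledge public account.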
