Digital Preservation 2013 Panel, Green Bytes: Sustainable Approaches to Digital Stewardship

From the Library of Congress in Washington, DC.

Good morning, everyone. I'm Josh Sternfeld, Senior Program Officer with the National Endowment for the Humanities, Division of Preservation and Access. First I want to thank Erin Engle, Bill, and the entire conference committee for their tremendous support in helping to organize this panel today. So why are we here today? Well, from emerging fields such as digital forensics to press coverage of the massive server farms constructed by Apple, Google, and other major corporations, we are developing an acute social awareness of the materiality of data and, consequently, of the environmental costs of preserving our digital heritage. The question remains whether we can achieve mutually beneficial compatibility between stewardship of our digital collections and stewardship of the planet. The formula for achieving greater environmental sustainability is relatively clear-cut: a reduction in energy and resource consumption can effectively lead to greater cost efficiency. But the models for how we arrive at that end point can be enormously varied and complex, with roadblocks scattered along the entire pathway. As will become apparent from the presentations by our esteemed panelists, a comprehensive examination of the topic requires an interdisciplinary approach that engages multiple interest groups, including preservationists, IT specialists, administrators, computer engineers, environmentalists, and many others. Any solutions would likely also challenge our shared notions of best practices surrounding digital stewardship, such as selection, curation, and acquisition, and even our perception of long-term preservation and access. While we hope to cover a lot of ground with our session today, its success depends on continuing the discussion throughout this conference and beyond. For that reason we have included a brief abstract, as Erin indicated, in your registration packet, complete with suggested readings to kick-start further inquiry; that abstract is also posted online. We hope to see the conversation picked up on social media, where we have a designated hashtag for green digital preservation, as well as on various websites and blogs.

So allow me now to introduce our three presenters. Starting us off will be David Rosenthal, who is perhaps best known in the NDSA community for having started the LOCKSS program at Stanford University. Before coming to Stanford, David worked at Sun Microsystems, Nvidia, and Vitria Technology. He currently holds 23 patents, and these days you can find him, more often than not, traveling the globe to discuss economic models for long-term storage. Kris Carpenter is the director of the web archive at the Internet Archive, where she works with national libraries, archives, and universities to provide technical expertise and services in web archiving and web search; for the last 15 years she has divided her time between the online consumer and business-to-business services and software sectors. And finally, Krishna Kant is both a research professor at the Center for Secure Information Systems at George Mason University and a program director in the Computer and Network Systems division within the National Science Foundation. Previously Krishna worked for Intel on future server architectures and technologies, and he currently allocates his time among research on sustainability and energy issues in data center design, cloud computing, and Internet infrastructure. After their brief presentations we will open up the floor for an extended Q&A discussion. So, David, thank you.
Two quick notes: as usual, the text of this talk will go up on my blog after the talk, so you don't need to take notes, and there will be links to the sources and everything. And secondly, I have to apologize for reading the talk; I'm getting to an age where my mind tends to wander, and I want to stay on track because we don't have much time.

However much we would all like to be environmentally responsible in our digital preservation activities, it's an unfortunate fact that reducing our energy demand isn't the biggest problem we face. As the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out three years ago, to no one's surprise, the big problem is economic: no one has the money to preserve even the stuff they think is a high priority, and the only way that funds for preservation can be justified is by a commitment to provide access. The research into the historical costs of digital preservation can be summarized by this rule of thumb: ingest takes about one half, preservation (mainly storage) takes about one third, and access about one sixth of the total.

How much of the storage cost does power represent? Let's take a Seagate 4TB drive, which has a retail price of one hundred and seventy dollars and a typical operating power draw of seven and a half watts, and the current Palo Alto Utilities small-business rate, which is about 13 cents per kilowatt-hour. Assuming the drive operates for the whole of a four-year service life, its power would cost about $35. Backblaze uses 45 similar but more expensive Hitachi drives to build each Storage Pod 3.0, which has dual redundant 760-watt power supplies. Assuming that the pod can survive a power supply failure with all the drives operating, the drives would take 337 watts and the rest of the system 422 watts, or about two and a quarter times the disk power in total. The drive's share of the total system build cost is 213 dollars odd; with its share of the system power, the drive would use almost seventy-nine dollars' worth of power over four years, or twenty-seven percent of the total four-year cost. So, as you can see, power is a significant cost of preservation, but even if disk is the only medium you're using, it's probably only about ten percent of the total, about one third as important as the cost of the disk media. Eventually, though, power is going to become a priority.
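A minimal sketch in Python of the arithmetic above, using the figures quoted in the talk; the 24-hours-a-day, 365-days-a-year duty cycle is my simplifying assumption, which is why the results land within a dollar or two of the numbers David gives.

```python
# Rough reproduction of the disk-power arithmetic from the talk.
# Inputs are the figures quoted above; hours-per-year is a simplifying assumption.
DRIVE_PRICE_USD = 170.0      # retail price of the 4TB drive
DRIVE_WATTS = 7.5            # typical operating power draw
RATE_USD_PER_KWH = 0.13      # Palo Alto Utilities small-business rate
YEARS = 4                    # assumed service life
HOURS = 24 * 365 * YEARS

# Power cost of one drive on its own over four years (about $35 in the talk).
drive_only_kwh = DRIVE_WATTS * HOURS / 1000.0
print("drive-only power cost: $%.0f" % (drive_only_kwh * RATE_USD_PER_KWH))

# Backblaze Storage Pod 3.0 figures from the talk: 45 drives drawing 337 W,
# the rest of the system drawing 422 W.
DRIVES_PER_POD = 45
POD_DRIVE_WATTS = 337.0
POD_OTHER_WATTS = 422.0
pod_watts = POD_DRIVE_WATTS + POD_OTHER_WATTS
print("system/drive power ratio: %.2f" % (pod_watts / POD_DRIVE_WATTS))  # ~2.25

# Each drive's share of total pod power, costed over four years (~$79 in the talk).
share_kwh = (pod_watts / DRIVES_PER_POD) * HOURS / 1000.0
power_cost = share_kwh * RATE_USD_PER_KWH
DRIVE_SHARE_OF_BUILD_USD = 213.0  # drive's share of the pod build cost, per the talk
print("4-year power cost per drive slot: $%.0f" % power_cost)
print("power as a share of 4-year cost: %.0f%%" %
      (100 * power_cost / (DRIVE_SHARE_OF_BUILD_USD + power_cost)))
```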
Kryder's law, the exponential increase in disk density, used to mean that you could grow your collection forty percent a year at approximately constant power. About three years ago it became clear that Kryder's law was slowing; now even the industry's optimistic projections are for no more than twenty percent. Many archives grow more than twenty percent a year, and the more they do, the more their power costs will increase. Although the demands for computation during ingest can be high, they're over quickly and don't contribute much to the energy demand of long-term preservation.

We do have a very low energy way to store data: write it to durable offline media and put them in a salt mine. But this doesn't provide usable access, and it raises preservation issues around media obsolescence and integrity verification. We can get a little bit of access by putting the offline media in robots, which is actually a lot less safe than a salt mine, but the robot infrastructure uses energy all the time, and the exponential increase in storage media density means that we don't keep the media for their theoretical life; we migrate, using energy, to new media when the old ones are no longer dense enough to justify their slot in the robot. And increasingly, as the first talk today pointed out, the access scholars want is keyword search and other forms of data mining, and robots full of offline media can't support this.

A recent paper on the characteristics of low-carbon data centers shows that the key to reducing power consumption is to use the servers efficiently, assigning and migrating tasks to keep the powered-up servers fully utilized and to keep as many as possible powered down. While it's easy to migrate tasks among servers to keep only a fraction of them powered up, this just isn't practical for storage. Even before the demand for search and data mining, research showed that there were very few hot spots in the access patterns to preserved data, and search and data mining will spread access fairly evenly across the entire collection. This is likely to raise both the proportion of costs attributable to access, above one sixth, and the proportion attributable to power. So to satisfy the demands of scholars, at least one copy of your preserved data has to be on disk, or some other medium with equivalent access latency and bandwidth. Can we design a storage medium that provides rapid yet energy-efficient access, together with very low energy usage through time?

In 2009 a team from CMU showed that a fabric consisting of a very large number of very low power CPUs, each with a fairly small amount of flash memory, could answer key-value queries at the same speed as conventional servers while using two orders of magnitude less energy per query. They called their architecture FAWN, a fast array of wimpy nodes. It worked well because the key-value problem parallelizes well, and because the I/O performance of the wimpy CPU and the flash memory was actually much better than disk. In 2011 Ian Adams, Ethan Miller, and I showed that, if the lifecycle costs were properly accounted for, a similar approach could be cost-competitive with disk for long-term storage despite the much higher initial cost; it would provide rapid access with a much lower energy demand. We called our architecture DAWN, a durable array of wimpy nodes. It worked well because of a series of synergistic effects that greatly reduced power consumption and led to a much longer media service life.

Unfortunately, there's an important caveat, namely "if lifecycle costs were properly accounted for": the much higher capital costs of DAWN are balanced by much lower running costs over a much longer media service life than disk systems like Backblaze's. I don't know any institution operating a digital archive that has a planning horizon long enough to make that trade-off; most operate on an annual budget cycle, and large savings in, say, years 4 through 10 are ignored. Now, Amazon, a company that's notorious for not caring about making a profit, does have a long enough planning horizon; it's one of the reasons they dominate the market for web services, and especially storage.
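As a toy illustration of the planning-horizon argument above, here is a hypothetical comparison: a conventional disk system with lower capital cost but media replaced every four years and higher power draw, versus a DAWN-like system costing twice as much up front but with a ten-year media life and a tenth of the power. All of the numbers are invented for illustration; they are not figures from the talk or from the DAWN paper.

```python
# Hypothetical numbers purely to illustrate the planning-horizon effect;
# they are not figures from the talk or from the DAWN paper.
def total_cost(horizon_years, capex, media_life_years, annual_power_cost):
    """Capex is re-incurred every media_life_years; power accrues every year."""
    replacements = -(-horizon_years // media_life_years)  # ceiling division
    return replacements * capex + horizon_years * annual_power_cost

disk = dict(capex=100.0, media_life_years=4, annual_power_cost=20.0)
dawn = dict(capex=200.0, media_life_years=10, annual_power_cost=2.0)

for horizon in (3, 10):
    print(horizon, "years:",
          "disk", total_cost(horizon, **disk),
          "dawn", total_cost(horizon, **dawn))
# 3 years:  disk 160.0  dawn 206.0   (disk looks cheaper on a short horizon)
# 10 years: disk 500.0  dawn 220.0   (the cheaper-to-run system wins over time)
```

The point of the sketch is only that the ranking flips with the horizon, which is why an annual budget cycle systematically favors the higher-running-cost option.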
So because your short-termism means that you aren't going to buy the initially more expensive but in the longer term cheaper DAWN systems, vendors aren't going to make them. They have their own version of short-termism: they're very happy to sell you a product with a limited service life. Doing so is called planned obsolescence, and it has a long history in the storage world, driven by Kryder's law. In 2009 I blogged about Dave Anderson's description of Seagate's investigation of the idea of a disk drive specifically for archival use. Technologically it was easy; they could build a very reliable, very long-lived drive, but there was no way to make money building it. One reason was that customers wouldn't buy it: the economics of replacing older drives with newer ones that were identical in all respects except for greater capacity were irresistible. The other reason was that, even if customers did want to buy these drives, they would be a niche product sold in small volumes, so they would cost a lot more per byte than the consumer drives. Customers are, with good reason, skeptical of manufacturers' claims for reliability; thus, even if the special archival disks actually did repay the additional cost in greater reliability, it would likely not be possible to persuade the customers of this. Of course, as Kryder's law slows down it makes sense to keep the drives longer, but it slows down slowly, never providing, at any one given time, a big enough motivation to invest the extra up front to get the lower running costs out to the planning horizon.

So the pessimistic conclusion is that the bulk of preserved data is going to be on hard disk, burning more power than it should, for a good long time. Economics means that even dramatic technological change, even if it can reduce power consumption by orders of magnitude, isn't viable in the marketplace. This is both because power isn't yet a big part of the total cost of preservation and because institutions systematically discount the impact of future costs. Nevertheless, there are things that you can do, while we wait for the dramatic technological change, that can reduce your power consumption. They won't make a big difference, but every little helps, and this is where I hand over to the people who are trying to do that.

May I just remain seated, is that all right? I'm going to use notes, for a couple of different reasons: I have a lot of very detailed information that I would love to share with you as part of the Internet Archive use case that I'll get into in a few minutes, and I didn't want to misrepresent any of the specific details. And I think I've got this a little too close to my face. Before I begin, I wanted to thank one of my colleagues on the Internet Archive operations team, a gentleman named Ralph Newlin, who was able to compile the statistics associated with the data I'm going to share with you. We thought it was important to provide a perspective from practitioners, rolling up our sleeves in the field and trying to deal with some of the challenges associated with data at scale and ongoing power consumption.

I think that, among cultural heritage organizations, motivations for green digital preservation and sustainability often either never emerge or are driven by a wide range of factors specific to an institution's mission, its operating budget, the amount of control it has over IT decisions, and the overall social and political climate in which that organization operates. Maybe this doesn't reach; sorry, I'll stand. All right. As David has already highlighted, green digital preservation usually isn't at the top of the list of issues facing a library, archive, or museum; keeping pace with the rate of digital data creation and ingest can be daunting enough. Motivations to preserve in a sustainable way vary depending upon where you are in the globe, how long you've been ingesting born-digital content and digitizing physical assets, how rapidly your digital collections are growing, and the range of preservation and access services you support for those collections. There are limits to what any institution can achieve alone, or even in partnership with other like-minded institutions.
But eventually every organization or program faces a challenge that could include a green choice as a solution. Many of us in this community tend to be attracted to the promise of preserving more content and making more data accessible for the same investment of resources. Reducing overall power consumption is not yet a requirement we face, but doing more with the same amount of power remains a seductive option. Of course, for some organizations the reputation of being an environmentally responsible institution can also be a driver; especially if you're investing tens of millions of dollars in green stacks, it makes complete sense to attempt to do the same with virtual collections wherever and whenever feasible.

For centuries now, memory institutions and cultures have engaged in preservation of their sacred texts and objects using what we now term green techniques, for example hiding them in mountain caves or salt mines, or placing them into berms that seem to disappear into the side of a hill in some corners of the globe. Modern digital stewards are doing the same thing today, and I want to use the example of our colleagues at the National Library of Norway, who opened their mountain vault about ten years ago now. They had in mind both physical and digital collections when they created the vaults, and they built isolated server rooms deep within the mountain to take advantage of naturally cool conditions, simply adding ventilation to remove the heat produced by spinning drives. They've since had to tackle some issues with humidity, but none of those have affected their ongoing operations. They maintain one copy of all their data on spinning drives today and two copies on tape; one tape copy is actually based in Mo i Rana, the other in Oslo. In the event of local data loss they can restore content within minutes, and the Oslo copy really serves the purpose of disaster recovery in the event of a total catastrophe in Mo i Rana itself. They're also considering putting an additional replica outside their national borders. This is a really robust example of a memory institution that has been able to lower its long-term cost of data and still provide a significant level of access, and they now have well over four petabytes of data assembled within this infrastructure. But in general that is not the case: you're going to find a trade-off between lowering your costs and being able to continue to sustain your access, or you're going to have some impact on the amount of time access takes. Many of us are using techniques to try to address this, like those listed on this slide, but they're trade-offs no matter how you frame them.

Now, there's also much we can learn from other sectors, whether we're talking about cloud computing providers, local governments, businesses, higher ed, or high-tech for-profits and not-for-profits. Many of these organizations are supporting extensive digital collections and service operations, and although individual case studies vary, the enhancements that many of those organizations have been able to implement are still applicable to cultural heritage, even though they're often optimized for real-time access rather than long-term storage and access: things like innovative physical data center facility designs and cooling systems, relocation of data to different geographies, increases in storage densities, improved airflows, and increased average temperatures in server rooms, sometimes in excess of 85 degrees.
Some examples of commercial efforts that have emerged in the last three to five years, and that you might want to take a look at, are Yahoo's chicken-coop data center model up in New York State, Google's data center in Hamina, Finland, and the evaporative cooling techniques deployed at Facebook, Google, and Yahoo. A consortium of interested parties called the Green Grid, not unlike the NDSA in its structure, has been defining a body of standard metrics by which to measure usage and efficiency in data centers, including power usage effectiveness (PUE), server compute efficiency, and data center compute efficiency. PUE is the only metric from that body I'm going to speak to in any detail today; it's defined as the total energy needed to operate a facility divided by the energy needed to operate the IT equipment within the facility. The industry-average PUE in 2008 was about 1.5 to 2, meaning that for every hundred kilowatts of compute power, 50 to 100 kilowatts of cooling power were also needed to support those operations. Often, organizations that embark on efforts to reduce their PUE are initially motivated by what it can mean for a reduction in operational expenditures: maybe an increase in overall capacity for the same cost, or the operation of services at a fraction of the cost relative to an equivalent facility, thanks to the methods implemented.

These are the very same motivations that drove the Internet Archive in recent years to make modifications to our data center operations in San Francisco, so I'm going to turn to those now. For those of you not familiar with IA, we're a digital library established in 1996, a not-for-profit that maintains over 14 petabytes of publicly accessible digital content, including ebooks, digitized texts, films, video, audio, software, Internet content, and television news. I'm going to offer some limited insights into what making green choices might look like for an organization on a relatively small budget and with limited staff. But keep in mind that we tend to be a little bit wacky; at some of the things we choose to do, others might go, oh my gosh. So think of this as inspiration and as directional opportunities to experiment with something different.

We began under pressure to lower our operational expenditures by reducing our power consumption; this came as a mandate in 2010 from our founder, Brewster Kahle. We had recently moved into the building that you saw pictured, and we really wanted to operate differently and in a more sustainable fashion. The key metrics we identified to drive the evaluation of our processes were improving our PUE, as defined by the Green Grid and the broader tech community, and reducing the kilowatts per petabyte required to support storage and access services for our digital collections. We've already talked about PUE, but I want to spend a quick minute on kilowatts per petabyte. Operations costs have been the second-highest item in our budget, behind human resources, for our entire history, and the kilowatts-per-petabyte metric is a good way for us to measure our needs and to estimate the budget implications of the operational choices we are making at any given point in time. It's a metric we've tracked since 2004, so we have a little bit of historical context to bring.
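As a quick sketch of what these two metrics mean in practice, the snippet below computes PUE from the Green Grid definition quoted above (total facility energy over IT energy) and turns a kilowatts-per-petabyte figure into an annual electricity bill. The example numbers (a 100 kW IT load with 80 kW of overhead, a 40 kW/PB figure, a 4 PB collection, a 13-cents-per-kWh rate) are hypothetical and only meant to show the arithmetic.

```python
# PUE per the Green Grid definition quoted in the talk:
# total facility energy divided by the energy used by the IT equipment.
def pue(total_facility_kw, it_kw):
    return total_facility_kw / it_kw

# Example: 100 kW of IT load plus 80 kW of cooling/overhead gives a PUE of 1.8,
# in the middle of the 1.5 to 2.0 industry range mentioned for 2008.
print(pue(total_facility_kw=180.0, it_kw=100.0))  # 1.8

# Kilowatts per petabyte turned into an annual power bill.
# The collection size, kW/PB figure, and electricity rate are made-up examples.
def annual_power_cost(kw_per_pb, petabytes, usd_per_kwh=0.13):
    kw = kw_per_pb * petabytes
    return kw * 24 * 365 * usd_per_kwh

print(round(annual_power_cost(kw_per_pb=40, petabytes=4)))  # ~182,000 USD/year
```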
I'm going to spend just a minute going back in time and giving you a little bit of context, in order to understand why we made some of the modifications that we did as of 2010. We don't have any data prior to 2004, other than capex, because we didn't really track the operational layers, so we're going to start our history in 2004 even though our operations go back to '96. From 2004 to 2010 we rented traditional air-conditioned data center facilities, much like what many of you might be familiar with. We replicated data locally and globally to foster distributed preservation, and we designed and built our own hardware to reduce power consumption and minimize the risk of data loss, biasing longevity over compute capacity; in fact, we cooled drives instead of processors in individual nodes. The upside to this approach was that PUE fell within industry averages, as best as we could estimate, at around 1.8 in 2010. This was largely due to the increases in storage density, and some of the increases in component quality, that occurred during this time frame without increases in the kilowatts required to operate individual components. In fact, over two to three generations of hardware migration we were able to go from 117 kilowatts per petabyte down to 39 kilowatts per petabyte, not including cooling, all for the same stable up-front investment of about 120K USD per rack. The downside was that, due to the rental arrangements, we had no real control over power consumption in aggregate, or its optimization, nor over cooling and airflow in our individual facilities, and by 2010 we had done everything we could to optimize hardware. We needed to try something different.

In 2008, toward the end of that phase of our operations, we partnered with Sun Microsystems, in the pre-Oracle days, and deployed one of the first-generation modular data centers at one of their Santa Clara facilities. The promise was that it would be cheaper and faster to build out than a traditional data center, and it was, but PUE for the installation was estimated to be at the high end of the range of industry averages at the time, at about two. We estimated the requirements at 67 kilowatts per petabyte; subsequent generations of containers had better ratios due to higher density and hence lower kilowatt-per-petabyte factors. Cooling was water-based, though, so when it came time for us to relocate the container, at Oracle's request, from Santa Clara to Richmond, California, we were unable to do so. We actually joked internally that if only we were operating a swimming pool, then we could make a green cost justification: heat the pool and cool the data center. But alas, we don't operate a swimming pool.

In 2005 we also began experimenting with cloud computing on Amazon Web Services, to facilitate data mining and indexing of web objects at significant scale. We've retained on average about 30 terabytes of data over a seven-year period in S3, with no data loss or corruption. Because we weren't able to make the data public, we could not afford at the time to sustain that model at petabyte scale, but the cloud continues to represent on-demand resources for us, for hosting research data sets and for supporting large-scale computational needs without investment in additional infrastructure. Our largest jobs tend to run for about 24 hours on thousands of nodes at a time, but that's become a small job on Amazon, if you can believe it.

The next set of experiments really came to fruition, as I mentioned, in 2010, when we moved into our Funston Avenue facility. That's a historic Christian Science church, constructed in the 1890s, in the Inner Richmond district of San Francisco, which is pretty foggy most of the time, and for the first time we had complete control over our physical plant.
However, we were handed a set of rules that we had to operate under in order to modify our facilities. Specifically, we were to make a concerted effort to reduce power consumption by measuring actual usage, analyzing, and evolving solutions, not by guessing; that sounds simple, and of course it's what you'd want to do, but there's always a caveat when you try to implement these things. Racks needed to coexist with humans: we didn't want to hide them, and we needed to make sure we weren't ruining the historic space with ducts and AC plant. In fact, there was a mandate for no air conditioning whatsoever, since we're two miles from the ocean, with 49 weeks per year of natural cooling. We also decided it was time to use commercially manufactured equipment rather than continue to assemble our own hardware, because we needed to focus on our core missions. And we needed to make it as pretty as the Jedi library. On the hardware front we were able to acquire commercial off-the-shelf servers, although we did make some modifications to them: we ended up removing some server fans, slowing the remaining fans, dampening wall covers, and repurposing waste heat to actually heat our building. We're using Supermicro on the manufacturing side and King Star for assembly and local support.

In parallel, we were handed a mandate across the Internet Archive to ensure that we were deduplicating all materials upon capture or creation, prior to ingest into our catalog systems. So there was a massive effort to deploy a hash-based implementation, for example to ensure that, for ongoing crawling, everything was deduplicated in real time. In terms of the other mandates, we also looked at whether we should go back into our existing data sets and deduplicate back through time; we determined that it would actually cost us more in resources than it would return in benefit, so we did not deduplicate retroactively, and anything written prior to 2010 is as it was originally written and contains a fair amount of duplication in our collections.
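The deduplication-at-capture idea can be sketched very simply: keep an index of content digests and only store payloads whose digest has not been seen before. This is a generic hash-based sketch, not the Internet Archive's actual crawler implementation, and the class and method names are invented for illustration.

```python
import hashlib

class DedupStore:
    """Minimal sketch of deduplication at ingest: store a payload only once,
    keyed by its content digest; later captures of identical bytes just record
    a reference to the existing copy."""
    def __init__(self):
        self.blobs = {}       # digest -> payload (stand-in for real storage)
        self.references = []  # (item_id, digest) records for every capture

    def ingest(self, item_id, payload: bytes) -> bool:
        digest = hashlib.sha1(payload).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = payload
        self.references.append((item_id, digest))
        return is_new  # True only if new bytes were actually stored

store = DedupStore()
print(store.ingest("crawl-1/page-a", b"<html>same bytes</html>"))  # True, stored
print(store.ingest("crawl-2/page-a", b"<html>same bytes</html>"))  # False, deduplicated
```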
Now, we hit a series of pitfalls along the way. You probably can't read this, but when you go to look at the slides later you'll love this Dilbert; it's absolutely hysterical. Our biggest issue was the power company: there were smart meters deployed, collecting real-time data over a network, but they'd be damned if they'd give us access to an API for our energy data. So we faced the problem of how to take measurements ourselves, and how to afford to do that; most of the available solutions were beyond our budget at the time. So Ralph, whom I mentioned earlier, a senior member of our ops team, actually built a whole-data-center, real-time, networked power meter for less than five hundred dollars. He plans to write up and open-source the instructions for how he accomplished that, so that other institutions can do the same, if you're interested.

Once we'd stabilized the physical infrastructure and the collection data management policies, we turned our attention toward server efficiencies, in terms of timing energy use to avoid peak utility windows; we also wanted to make general improvements to our productivity per server node. The image on the left, which you can't really read, shows an example from our monitors of our web crawl and index-merging operations. The first spike you see is a compute job; we were able to change routines to smooth out usage and avoid repeats of those intervals. The second spike is an incident of weather, specifically an increase in outdoor temperature. Sometimes we can increase natural airflow and bring temperatures down internally as they rise externally, but about two to three days per year we'll get external temperatures in excess of 90 degrees, and unfortunately they usually fall in a row, and there's really nothing to do in those circumstances. So it is actually our official policy that if we believe our equipment is at risk, we will shut it down; in this case, green preservation trumps access. This is only one replica of our collections, so there may be another data center within our network that can pick up publicly available access, or you'll get the friendly message explaining that we're unavailable for a period of time. The image on the right, by the way, is showing a series of Hadoop jobs. I've only two more slides, so I'm going to go a little faster. Among the other things we were able to put in place was more sophisticated automation to manage environmental change: we combined software with network devices to measure, monitor, and make data-driven adjustments, both virtually and physically. For example, we can look at the temperature and whether or not a fan is actually operating, then power up an alternate fan and alert the humans; that type of logic is now widely distributed within our infrastructure.
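The kind of monitoring rule described here (check a temperature reading and a fan state, bring up an alternate fan, and alert a human) can be expressed in a few lines. The sketch below is illustrative only, with made-up thresholds and stubbed-out device functions rather than anything from the Internet Archive's actual tooling.

```python
# Illustrative environmental-automation rule; the threshold and the device/alert
# callables are stand-ins, not the Internet Archive's real infrastructure code.
TEMP_LIMIT_F = 85.0

def check_rack(read_temp_f, fan_is_running, start_alternate_fan, alert_humans):
    """If the rack is hot and its fan has failed, bring up the spare and page someone."""
    temp = read_temp_f()
    if temp > TEMP_LIMIT_F and not fan_is_running():
        start_alternate_fan()
        alert_humans("rack over %.0f F with primary fan down (reading: %.1f F)"
                     % (TEMP_LIMIT_F, temp))
        return "mitigated"
    return "ok"

# Example wiring with stub functions:
status = check_rack(
    read_temp_f=lambda: 91.2,
    fan_is_running=lambda: False,
    start_alternate_fan=lambda: None,
    alert_humans=print,
)
print(status)  # "mitigated"
```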
The good news is that we've seen pretty substantial success over the past three years. PUE is usually less than one, because heat gets reused within the building; not all days do we attain this level, but most do, and we deviate from it at our own discretion. The kilowatts-per-petabyte metric has dropped from 8.5 to 2.8 kilowatts per petabyte; you can add about a kilowatt if we're running compute VMs, and we do on most nodes but not all. We're not expecting many additional efficiencies from hardware optimization, so we're really looking at other ways of greening in the future. Our data center operations are quiet; in fact, we hear more complaints about a loud human nearby, or the fact that there's no place to make a phone call, than we ever do about our servers. And I wanted to leave you with one more wacky thought. One of the members of our ops team sent me a reference to this project more as a joke than anything, but I thought it was a fun way to close out this talk. If you haven't heard about Server Sky, it's a concept for putting data centers in space, on sort of little microsats, proposed by a fellow named Keith Lofstrom up in Oregon. Especially in the context of the raging debate about putting solar plants in the desert, it seems relevant that we might avoid the debate altogether by just launching out into space. Thanks.

Okay, good morning. One of the advantages of being the last speaker is that most of the things have already been said, so I will not bore you with the stuff that's already been said; instead I'll focus on some of the new points that perhaps need to be brought out. I'm going to be talking about sustainability, and there's obviously a natural conflict between sustainability and preservation, so in some sense I'll be talking not so much about preservation as about how you can get away without preserving. Preservation obviously has a significant environmental footprint, and some of those aspects have already been talked about. There is the issue of storage media life: disks, for example, last for three to four years, you have to replace them, and what do you do with all the garbage that you generate? There is obviously the question of housing and access management of the media (there is the robot that you see there, picking out the tapes), and then of course the data centers, which you need for processing and for online storage.

So let me start by asking the question: what do we want to preserve? Most certainly we don't want to preserve data; we want to preserve information, we want to preserve data that is useful. In fact, what we would ideally like to preserve is the knowledge. Of course, the real challenge is how you preserve knowledge, or the kind of knowledge that you might need to derive, without preserving the raw data. You can do that to some extent, by throwing out parts of the data that are not interesting, but in general it's a huge challenge.

So I'm going to talk about sustainability issues. First of all, as has been said many times this morning, it's not enough to just store the data; the real value of the data is what you can do with it and what kind of insights you can gain from it, and therefore the processing is very important. As a result, if you are talking about sustainability, you want to reduce the environmental impact of the data centers. The second issue has to do with the data itself: if you look at the data in the wild out there, there's a lot of junk, and the question is whether we can get rid of that junk and preserve only what we really need to preserve. The third is that there is obviously a trade-off between keeping everything and processing everything from scratch, versus keeping partially processed data and perhaps not the original data, and there are sustainability trade-offs there in terms of the computing impact versus the storage impact, and so on.

You have already heard a lot about data centers, so I will not spend a whole lot of time on them, and in particular I will not talk a lot about the power aspect. We already know that data centers consume about two percent of the total electricity, that a lot of the power gets wasted, and so on and so forth, and it continues to increase. However, that's not the only aspect we need to worry about. In addition to power, data centers are physical entities: they use a lot of materials, they use a lot of water, there are manufacturing costs. So if you look at data centers in terms of their overall environmental impact, for example the carbon impact, there is a lot more to it than just the energy you consume in processing, and that's why it's a good idea to look not just at the energy but at the entire environmental impact of the data center. In fact, the energy itself does not really matter so much; what really matters is the carbon footprint. Imagine for a moment that your data center were entirely operated on renewable sources with low carbon impact; in that case it's not so much how much energy you use but ultimately how much carbon impact you have.
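The point that carbon, not raw energy, is the quantity to minimize comes down to multiplying energy consumed by the carbon intensity of its source. The intensity values and consumption figure below are rough, illustrative assumptions, not numbers from the talk.

```python
# Carbon footprint = energy consumed x carbon intensity of the supply.
# The intensity values (kg CO2 per kWh) and the annual consumption are
# rough, illustrative assumptions, not figures from the talk.
CARBON_INTENSITY = {
    "coal-heavy grid": 0.9,
    "average grid": 0.5,
    "mostly renewable": 0.05,
}

annual_kwh = 1_000_000  # hypothetical data center consumption
for source, kg_per_kwh in CARBON_INTENSITY.items():
    print("%-18s %8.0f kg CO2/year" % (source, annual_kwh * kg_per_kwh))
# Same energy use, very different carbon impact depending on the supply.
```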
Let me give you some quick examples. If you look at the power distribution infrastructure in a data center, starting from the transmission lines all the way to the point where you can actually use the power, there is a lot of infrastructure there, and it has a carbon impact. What if you could reduce that? So there is this idea, which is being investigated quite actively these days, and my research also includes it, of using renewable energy as opposed to just using energy from the grid. What are the challenges? Renewable energy is often variable, and therefore there might be times when you don't have enough energy available; then what do you do? To do that, you need some adaptation: you should be able to adapt your data center to the available energy. The cooling infrastructure, again, has a huge environmental impact, as has already been discussed. Can you operate the data center without any cooling, using ambient air, for example? That of course brings the challenge that things might overheat, and as a result you cannot do all the processing that you want to do, so there is a question of adaptation there as well.

Over-design is another issue that's often overlooked. We love to over-design things just to be safe: for example, if you need a thousand watts in your server, you will probably buy not a thousand-watt power supply but a two-thousand or even a three-thousand-watt power supply, just in case. What happens when you do that is that you are operating a lot of these devices and systems at much lower utilization than their full capacity, and they tend to behave very inefficiently. For example, a power supply operating at twenty percent of its capacity will probably give you only fifty percent efficiency, whereas if you were to operate it near eighty percent of its capacity it would be more like eighty-five percent efficient. So again, if you were to cut down this over-design, then you need adaptation for the periods where you run out of capacity.

This led to the notion of energy adaptive computing that I have been looking at for the last couple of years, and the idea is that you replace over-design with right-sizing along with smart adaptation. There are many instances of this: you can do it within a data center, you can do it among the clients, you can do it for the entire infrastructure. There are a lot of challenges in being able to do this; obviously you want to maintain the processing and the quality-of-service requirements while at the same time trying to minimize the carbon impact. That's all I'm going to say about it, and I will not bore you with research results; a lot of results are available, and some of the papers are on my website. But basically that's the idea we have been looking at for the last couple of years: if you approach this from the energy adaptation perspective, coordinating the use of energy across different systems and subsystems, you can significantly reduce the environmental impact of processing.
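To put numbers on the over-design example above: a server drawing 400 watts from an oversized 2000-watt supply sits at about 20 percent load, where, per the efficiencies quoted in the talk, roughly half the input power is lost, while a right-sized 500-watt supply runs near 80 percent load at about 85 percent efficiency. The wattages are my own illustrative choices; only the two efficiency figures come from the talk.

```python
# Wall power drawn for a given DC load at a given power-supply efficiency.
def wall_power(load_watts, efficiency):
    return load_watts / efficiency

LOAD_W = 400.0  # hypothetical server draw

oversized = wall_power(LOAD_W, efficiency=0.50)    # ~20% load on a 2000 W supply
right_sized = wall_power(LOAD_W, efficiency=0.85)  # ~80% load on a 500 W supply

print("oversized supply draws %.0f W (%.0f W wasted)" % (oversized, oversized - LOAD_W))
print("right-sized supply draws %.0f W (%.0f W wasted)" % (right_sized, right_sized - LOAD_W))
# oversized: 800 W drawn, 400 W wasted; right-sized: ~471 W drawn, ~71 W wasted.
```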
Let me switch over to another topic, again one that has been talked about: the data has been growing. It's not only the data generation rate that has been growing; the total amount of accumulated data has also been growing at an exponential rate. However, if you look at this data, it turns out that the more data you have, the more junk there is in it, so the proportion of useful information in the data is going down. What can we do about this? The first challenge is how you define what is useful; that's always hard, but still there are a number of things you can do. Along with that there are challenges in terms of increased power consumption, all the disk drives that you need to throw out, and the cumulative impact, because once people store data they don't want to delete it, which is not sustainable.

So there are a number of opportunities for data reduction. You have opportunities within individual objects, objects being your web pages, files, databases, and so on; you have opportunities at the level of your administrative domain, which could be your house, your office, your business, and so on; and you have opportunities across administrative domains. A number of options are available: compression; compressive sampling, which is basically keeping only the samples you really need; and delta encoding, where, if you already have some old data, you store only the difference from the old data. Very often people bundle a VM along with the data, so that all you have to do is transfer it somewhere and start running the VM, and there's a lot of duplication there; you can do a lot of deduplication across objects within an administrative domain, though across administrative domains it becomes a lot more challenging. In doing all of this there are a number of trade-offs: a trade-off of storage versus data movement, which takes energy and of course time (moving petabytes of data is not easy; in fact, you don't want to do it), and versus processing; there are issues of fidelity versus cost, since you could have reduced representations that are more effective from a sustainability perspective, but then there is the question of what you are losing; and there are issues of access, privacy, and security across domains.

Finally, I would also like to talk a little bit about the role of content creators. We are all content creators, and we all behave very badly, but we can use some best practices to really reduce the growth of data. We just take a file and blindly send it to ten of our friends, and they download the file and then forget all about it; whether or not they need it, that file is always there. That's something we, as content creators, can address to a large extent: purge obsolete data, defective data, unneeded data. And then finally there is the issue of metadata, which was talked about this morning. In some sense metadata is more important than data, but even where there are standards for metadata, most people don't bother; as was said this morning, they will just fill in some of the entries and not really bother with the rest, and that reduces the value of the data and the value of preserving the data. And that's the end of my presentation.

Well, I want to thank all of our presenters one last time, and I think we do have some time for Q&A. I will just kick-start the conversation, but then I'm going to hand it off to all of you to pick it up in whatever direction you would like to take it. I think, actually, Krishna, you raised some of the most complex issues, issues that are on the minds of a lot of people in the room, in terms of how we value our data and how we assess it, how we think about issues such as selection and perhaps even reduction of our digital collections. So I wanted to give Kris and David an opportunity to respond to some of those issues, and perhaps that will kick-start us.

We had this conversation before, and I drew a graph showing the effect of IDC's projected sixty percent a year compound growth in data, the cost implications of a projected twenty percent Kryder rate, and the projected one percent growth in IT budgets, which worked out so that, if this is true, then ten years from now storage is going to be consuming everyone's entire IT budget, with nothing left for processing it. We've been lulled into a false sense of security by the very high Kryder rates of the past, and storage going forward is going to be a lot less free than it used to be.
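David's projection can be reproduced by compounding three rates: data to be stored grows 60 percent a year, cost per byte falls 20 percent a year (the projected Kryder rate), and IT budgets grow 1 percent a year, so the storage bill grows by a factor of 1.6 x 0.8 = 1.28 each year while the budget barely moves. The assumption that storage starts at roughly a tenth of the IT budget is mine, added only to show when the curves cross.

```python
# Compounding the three rates quoted above: 60%/yr data growth, a 20%/yr
# Kryder (cost-per-byte decline) rate, and 1%/yr IT budget growth.
DATA_GROWTH = 1.60
KRYDER_DECLINE = 0.80
BUDGET_GROWTH = 1.01

# Assume storage starts at 10% of the IT budget (my assumption, for illustration).
storage_share = 0.10

for year in range(1, 11):
    storage_share *= DATA_GROWTH * KRYDER_DECLINE / BUDGET_GROWTH
    print("year %2d: storage consumes %5.1f%% of the IT budget" % (year, 100 * storage_share))
# By year 10 the share is roughly 10x its starting value, i.e. essentially the whole budget.
```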
So we are going to have to take some tough decisions about what we decide to keep, and it's better to take those decisions up front than to run into the wall and then have to spend money to get rid of data that you can no longer afford to store.

I'll use one example from our web-wide crawling context. We actually went so far as to hire a group of interns, college students who were willing to go in and look at the most data-volume-intensive resources we were crawling, to confirm which fell into certain categories of content versus others; we took the top, I think it was, twenty thousand resources. We did that because some subset of content ends up being mirrored in lots of corners of the web, and we can get a representative sample of that type of material, and of what the web is like today from a researcher's perspective, without needing to have all of it from every resource. So we're trying to come up with innovative methodologies to start to pare down and still have representative samples of everything that's out there that can be aggregated together, without going so far as to suggest that we're covering everything at all depths and all breadths.

Are there questions in the audience? Let me make one point first. When we think about raw data, there is this somewhat religious idea that we need to keep the raw data all the time, and that's not necessarily the case. You can always come up with a counterexample: if you filter the data in some way, someday somebody will come up with a question that you could not answer, and that becomes a kind of fear, and because of that fear we say, okay, let's not worry about it, let's just preserve the raw data, and then we can do the deep learning on it, or whatever it is we need. I think we have to somehow get away from that mindset and start thinking about, given the type of data, what are the interesting kinds of questions one could possibly ask. I know that's not something that can be easily characterized, but we need to go in that direction and ask those types of questions. There are models that try to capture, at least in a theoretical sense, the idea that if we keep maybe twenty-five percent of the data, selected in a particular way, then we can answer maybe ninety-nine percent of the questions; if we can get there, even approximately, I think it will be huge progress.

Of course, if you look at the really high-data-rate scientific experiments, like the Large Hadron Collider or the Square Kilometre Array, they do massive data reduction before they store anything, and for two reasons: one is that they can't buy the storage bandwidth that they need, and the other is that they can't afford to keep the stuff for long enough to be interesting. So in those fields that's already happening. It's much less easy to do in the humanities, because you don't have that depth of understanding of what the data really is. But at the same time, what we saw this morning with the Common Crawl talk, and what the Internet Archive is doing, shows that in the humanities having a random sample of what's going on is often almost as useful as having the whole thing. And this is actually the tradition with archival materials: archivists normally collect a few percent of the stuff that arrives in boxes, and we need to translate those techniques into the work that we're doing in the digital space, because we're not going to be able to afford to keep everything, and there are no technological fixes for that on the horizon.
Where do you see the gaps in our R&D knowledge? Where would you like to see the field go in terms of addressing some of these issues? It sounds like there's some consensus here on the stage about thinking about the ingestion of data up front, perhaps automating that ingestion and conducting the selection at that early stage. There's a whole room full of preservationists here; where do you think some of those areas could be examined further in research?

Coming from industry, I would say that you always start with the requirements: can we specify the requirements in terms of, for example, what kind of fidelity you need, or what it is that you want to preserve? As I said earlier, the answer should not be that you want to preserve data; the answer should be that you want to preserve something derived from the data, some sort of knowledge that can be specified. I don't think we have the mechanisms right now to even talk about this in an intelligent way, but can we begin to talk about these requirements, and have a language for expressing them? Once we are there, then perhaps we can think about how to take those requirements, work backwards, and come up with techniques that reduce the data and still preserve what the requirements demand.

Well, I think there's a little bit of a chicken-and-egg problem. Lisa pointed this out this morning: we're so focused on collecting and storing and preserving that we don't always have the benefit of putting compute capacity close to our storage, and of mining and analyzing at machine scale in a way that gives us insights into what else we might do differently. I think many of us operate a little bit from a place of fear: well, I had better keep that, because I have no idea when I'm going to be able to put it to a useful purpose, and somebody might do something amazing with it. So I do think that to the extent we can invest in the discipline of data science as a community, really foster the development of that type of scholarship, and make our archives available, if not within our four walls then outside of them, in contexts where researchers can have more interaction, we'll learn immense amounts about where we need to go from here, and hopefully get out of the cycle of "we'd darn well better preserve everything because we don't know what we might want to do with it later."

I think that's a good place to end on, so I'd like to thank our presenters one last time. I'm sure they're going to be around for further questions, so feel free to come find me, or one of our presenters, if you have further questions.

This has been a presentation of the Library of Congress. Visit us at loc.gov.
