Wednesday, July 29, 2015

BitCurator 1.5.1 VM and ISO released

BitCurator 1.5.1 VM and ISO released. Bitcurator Group. July 21, 2015.
The latest release of BitCurator is now available. The BitCurator wiki has the BitCurator 1.5.1 VM and the BitCurator 1.5.1 Installation ISO. The release supports the following tasks:
  • Create forensic disk images: Disk images packaged with metadata about devices, file systems, and the creation process.
  • Analyze files and file systems: View details on file system contents from a wide variety of file systems.
  • Extract file system metadata: File system metadata is a critical link in the chain of custody and in records of provenance.
  • Identify and redact sensitive information: Locate private and sensitive information on digital media and prepare materials for public access.
  • Locate and remove duplicate files: Know what files to keep and what can be discarded.
It also contains a new bootstrap and upgrade automation tool, and support for USB 3.0 devices.

Tuesday, July 28, 2015


detect-empty-folders. Ross Spencer. Github. 22 July 2015.
A tool to detect empty folders in a DROID CSV. A blacklist allows you to simulate the deletion of non-record objects, which may render a folder empty.  The heuristics used here can be implemented in any language; this tool is in Python.
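
The kind of heuristic described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the column names (ID, PARENT_ID, FILE_PATH, NAME, TYPE) are assumed to follow a DROID CSV export, the sample rows are invented, and the function name `find_empty_folders` is hypothetical.

```python
from collections import defaultdict


def find_empty_folders(rows, blacklist=()):
    """Return paths of folders that hold no files once blacklisted names are ignored.

    rows: iterable of dicts keyed by ID, PARENT_ID, FILE_PATH, NAME, TYPE
    (column names assumed to match a DROID CSV export).
    """
    rows = list(rows)
    children = defaultdict(list)
    for row in rows:
        children[row["PARENT_ID"]].append(row)

    ignored = {name.lower() for name in blacklist}
    memo = {}

    def is_empty(folder):
        # A folder is empty when every child is either an empty folder
        # or a file whose name is on the blacklist.
        fid = folder["ID"]
        if fid not in memo:
            memo[fid] = all(
                is_empty(child) if child["TYPE"] == "Folder"
                else child["NAME"].lower() in ignored
                for child in children[fid]
            )
        return memo[fid]

    return sorted(r["FILE_PATH"] for r in rows if r["TYPE"] == "Folder" and is_empty(r))


# Invented sample rows standing in for a parsed CSV export.
sample = [
    {"ID": "1", "PARENT_ID": "", "FILE_PATH": "/root", "NAME": "root", "TYPE": "Folder"},
    {"ID": "2", "PARENT_ID": "1", "FILE_PATH": "/root/a", "NAME": "a", "TYPE": "Folder"},
    {"ID": "3", "PARENT_ID": "2", "FILE_PATH": "/root/a/Thumbs.db", "NAME": "Thumbs.db", "TYPE": "File"},
    {"ID": "4", "PARENT_ID": "1", "FILE_PATH": "/root/b", "NAME": "b", "TYPE": "Folder"},
    {"ID": "5", "PARENT_ID": "4", "FILE_PATH": "/root/b/report.doc", "NAME": "report.doc", "TYPE": "File"},
    {"ID": "6", "PARENT_ID": "1", "FILE_PATH": "/root/c", "NAME": "c", "TYPE": "Folder"},
]

print(find_empty_folders(sample))
print(find_empty_folders(sample, blacklist=["Thumbs.db"]))
```

With no blacklist only /root/c is reported. Once Thumbs.db is blacklisted, /root/a is reported as well, which is the "simulated deletion" effect the summary mentions.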

Storage Trends Around Computex 2015

Storage Trends Around Computex 2015. Tom Coughlin. Forbes. June 8, 2015.
     The 2015 Computex Conference attracted many digital storage vendors. There were announcements about flash-based storage products, new memory products and optical archives. CMC Magnetics said that it is selling 100 GB Blu-ray optical discs to Facebook for archiving applications. The company expects other internet service companies to follow suit. In May, Sony announced that it was buying Optical Archive, a start-up created by a former Facebook executive. Sony is making a big push to create digital archiving solutions using Blu-ray disc technology, and the acquisition is seen as an extension of this effort.

Monday, July 27, 2015

Now available: 100 GB capacity M-Disc

Now available: 100 GB capacity M-Disc. Press release. Millenniata. June 2015.
     The new 100 GB Blu-ray M-discs are now available. The new disc has all the features of the original M-Disc. Previously, the product line consisted of a standard DVD and a 25 GB Blu-ray disc. Archiving large data sets is now much more convenient.
[The new 100 GB discs, which completely sold out in a short time, are now available again.]

Researchers Open Repository for ‘Dark Data’

Researchers Open Repository for ‘Dark Data’. Mary Ellen McIntire. Chronicle of Higher Education.  July 22, 2015.
     Researchers are working to create a one-stop shop to retain data sets after the papers they were produced for are published. The DataBridge project will attempt to extend the life cycle of so-called dark data by creating an archive for data sets and metadata, and will group them into clusters of information to make relevant data easier to find. The data sets can then be re-purposed and reused by others to further science. A key aspect of the project will be to allow researchers to make connections and pull in other data of a similar nature.

The researchers also want to include archives of social-media posts by creating algorithms to sort through tweets for researchers studying the role of social media. This could save time for people who might otherwise spend a lot of it cleaning their data and reinventing the wheel. The project could serve as a model for libraries at research institutions that are looking to better track data in line with federal requirements and extend researchers’ “trusted network” of colleagues with whom they share data.

Friday, July 24, 2015

Announcing the ArchivesDirect Price Drop

Announcing the ArchivesDirect Price Drop: Affordable Preservation, Evaluation and Workflows Plus DuraCloud Storage. Carol Minton Morris. DuraSpace. July 21, 2015.
     The ArchivesDirect hosted service from Artefactual Systems and storage in DuraCloud now has reduced pricing. This price includes a hosted instance of Archivematica, training and replicated DuraCloud storage (with a copy in Amazon S3 and one in Amazon Glacier).

The subscription plans are:
  1. Assessment. A three month plan with 500 GB of storage. Cost: $4,500
  2. Standard. An annual plan with 1 TB of storage. Cost: $9,999
  3. Professional. A custom plan. Cost: not available.

Thursday, July 23, 2015

First Large Scale, In Field SSD Reliability Study Done At Facebook

First Large Scale, In Field SSD Reliability Study Done At Facebook. Adam Armstrong. Storage Review. June 22, 2015.
    Carnegie Mellon University has released a study titled “A Large-Scale Study of Flash Memory Failures in the Field.” The study was conducted using Facebook’s datacenters over the course of four years and millions of operational hours. The study looks at how errors manifest and aims to help others develop novel flash reliability solutions.
Conclusions drawn from the study include:
  • SSDs go through several distinct failure periods – early detection, early failure, usable life, and wearout – during their lifecycle, corresponding to the amount of data written to flash chips.
  • The effect of read disturbance errors is not a predominant source of errors in the SSDs examined.
  • Sparse data layout across an SSD’s physical address space (e.g., non-contiguously allocated data) leads to high SSD failure rates; dense data layout (e.g., contiguous data) can also negatively impact reliability under certain conditions, likely due to adversarial access patterns.
  • Higher temperatures lead to increased failure rates, but do so most noticeably for SSDs that do not employ throttling techniques.
  • The amount of data reported to be written by the system software can overstate the amount of data actually written to flash chips, due to system-level buffering and wear reduction techniques.
The study doesn’t state one type of drive is better than another.

Digital Preservation Business Case Toolkit

Digital Preservation Business Case Toolkit. Jisc / Digital Preservation Coalition. May 2014.
     A comprehensive toolkit to help practitioners and middle managers build business cases to fund digital preservation activities. It includes a step-by-step guide to building a case for digital preservation, covering the key activities for preparing, planning and writing a digital preservation business case. It also includes templates, case studies, and other resources that go in chronological order through the steps needed when constructing a business case.

The key activities include
  1. Preparation; look at timing, your organization's strategy, and what others are doing 
  2. Audit your organization's readiness and do a risk assessment
  3. Assess where you are and what you need, your collections, your organizational risks
  4. Think hard about your stakeholders and intended audience
  5. Decide your objectives for your digital preservation activity and define what you want to achieve
  6. List your digital preservation benefits and map to your organization's strategy
  7. Look at benefits, risks and cost benefit analysis 
  8. Validate / refine your business case; Identify weaknesses and gaps in your business case
  9. Deliver your business case with maximum impact; Create an Elevator Pitch, so you have the right language ready to make your case to potential advocates in your organization. 
The elements of the template include:
  • The key features of the business case and a compelling argument for what you want to achieve.
  • Decide where you want the plan to be by a specific time
  • The key background and foundational sections of your business case, a focus on your digital assets and an assessment of the key stakeholders, and the risks facing your digital assets.
  • A description of the business activity that your business case will enable.
  • The possible options along with an assessment of the benefits, and associated costs and risks. 

Wednesday, July 22, 2015

Information Governance: Why Digital Preservation Should Be a Part of Your IG Strategy

Information Governance: Why Digital Preservation Should Be a Part of Your IG Strategy. Robert Smallwood. AIIM Community. July 6, 2015.
     The post looks at Information Governance and digital preservation. The post author wrote the first textbook on information governance (IG). He used key models as part of this, such as the Information Governance Reference Model (IGRM), the E-discovery Reference Model and the OAIS model. The question to answer is whether or not long term digital preservation should be a part of an information governance strategy.

Information Governance is defined as: 
a set of multi-disciplinary structures, policies, procedures, processes and controls to manage information at an enterprise level that supports an organization's current and future regulatory, legal, risk, environmental and operational requirements. 
  • "Long term digital preservation applies to digital records that organizations need to retain for more than 10 years."
  • Digital preservation decisions need to be made early in the records lifecycle, ideally before creation.
  • Digital preservation becomes more important as repositories grow and age.

"The decisions governing these long term records - such as digital preservation budget allocation, file formats, metadata retained, storage medium and storage environment - need to be made well in advance of the process of archiving and preserving."

"All this data - these records - cannot simply be stored with existing magnetic disk drives. They have moving parts that wear out. The disk drives will eventually corrupt data fields, destroy sectors, break down, and fail. You can continue to replace these disk drives or move to more durable media to properly maintain a trusted repository of digital information over the long term."

If you move to a cloud provider that makes preservation decisions for you, then "you must have a strategy for testing and auditing, and refreshing media to reduce error rates, and, in the future, migrating to newer, more reliable and technologically-advanced media."

Your information governance strategy is incomplete if you do not have a digital preservation strategy as well. Your organization "will not be well-prepared to meet future business challenges".


Arkivum: Long-term bit-level preservation of large repository content

Arkivum: Long-term bit-level preservation of large repository content.  Nik Stanbridge. Arkivum. DSpace User Group. 16 June 2015. [PDF slides]
     Based on the principles of ‘Active Archiving’, which are replication, escrow, and integrity checking. The aim is to preserve content for longer than 25 years. The principles rely on diversity and intervention, with different technologies and locations.
  • Adding media, a continual process
  • Monthly checks and maintenance updates
  • Annual data retrieval and integrity checks
  • 3-5 year obsolescence of servers, operating systems and software.
  • Tape format migration
Integration with DSpace. It has ISO 27001 validated processes and procedures and is designed for bit level preservation for large volumes of data.
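
The integrity checking mentioned above is commonly done by comparing stored checksums against freshly computed ones. The following is a generic sketch of that idea, not Arkivum's implementation: the choice of SHA-256, the in-memory manifest, and the function names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def failed_checks(root, manifest):
    """Return files that are missing or whose digest no longer matches."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "object.bin"
    sample_file.write_bytes(b"archived content")
    manifest = build_manifest(tmp)                 # recorded at ingest
    clean_report = failed_checks(tmp, manifest)    # nothing has changed yet
    sample_file.write_bytes(b"archived c0ntent")   # simulate silent corruption
    damaged_report = failed_checks(tmp, manifest)

print(clean_report, damaged_report)
```

In practice the manifest would be stored separately from the content (and ideally replicated), so that a periodic run of the check can detect damage in any one copy.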

Tuesday, July 21, 2015

File identification tools, part 8: NLNZ Metadata Extraction Tool

File identification tools, part 8: NLNZ Metadata Extraction Tool. Gary McGath. Mad File Format Science Blog.  July 10, 2015.
     This tool is for extracting metadata from files. It uses some basic tests to determine the format and then it looks at the following file formats: 
BMP, GIF, JPEG, TIFF, MS Word, Word Perfect, Open Office, MS Works, MS Excel, MS PowerPoint, PDF, WAV, MP3, BWF, FLAC, HTML, XML, and ARC.
The Java tool is available as open source on SourceForge. There are command line versions for Unix and Windows. [This tool is available to use in Rosetta.]

Toolkit for Managing Electronic Records

Toolkit for Managing Electronic Records. National Archives and Records Administration. May 13, 2015 updated.
The Toolkit for Managing Electronic Records is a spreadsheet that provides descriptions of a collection of guidance products for managing electronic records. It includes tools and resources that have been developed by NARA and other organizations. The separate tabs can be sorted or searched as needed.

Monday, July 20, 2015

Open Source Tools for Records Management

Open Source Tools for Records Management. National Archives and Records Administration. March 18, 2015. [PDF, 22pp.]
      NARA has identified open source tools that could be used for records management; the list does not include proprietary free software tools. Security is a concern with some implementations of open source tools. The tools are neither tested nor endorsed by NARA. The list of tools is approximately 18 pages; the tools address functionality such as:
managing workflows; identifying duplicates; extracting and managing metadata; handling email archives; web publishing; data analyzing; working with PDF files; preservation planning; scanning files; identifying confidential data; file renaming; web archiving; comparing web pages; document managing; format identification; file integrity; image processing; and natural language processing.

The document also contains lists of other resources and tools.

File identification tools, part 7: Apache Tika

File identification tools, part 7: Apache Tika. Gary McGath. Mad File Format Science Blog.  July 1, 2015.
     Apache Tika is a Java-based open source toolkit that can identify a wide range of formats and extract metadata from others. It doesn’t distinguish variants as much as DROID. Plugins can be added for formats that it does not regularly support.

Saturday, July 18, 2015

Library of Congress Recommended Formats Statement 2015 - 2016

Library of Congress Recommended Formats Statement 2015 - 2016. July 15, 2015.
     The Library of Congress, as part of its ongoing commitment to digital preservation, has provided the 2015-2016 version of the Recommended Formats Statement, and is seeking comments for next year's version. (The article in the Signal reviews some of the changes.)

There have been changes to the content and the layout, and the name has changed from "Specifications" to "Statement".
"The Statement provides guidance on identifying sets of formats which are not drawn so narrowly as to discourage creators from working within them, but will instead encourage creators to use them to produce works in formats which will make preserving them and making them accessible simpler."

Friday, July 17, 2015

Filling the Digital Preservation Gap. A Jisc Research Data Spring project. Phase One report - July 2015

Filling the Digital Preservation Gap. A Jisc Research Data Spring project. Phase One report - July 2015. Jenny Mitcham, et al. Jisc Report. 14 July 2015.
     Research data is a valuable institutional asset and should be treated accordingly. This data is often unique and irreplaceable. It needs to be kept to validate or verify conclusions recorded in publications. Preservation of the data in a usable form may be required by the research funders, publishers, or universities. The research data should be preserved and available for others to consult after the project that generated it is complete. This means the research data needs to be actively managed and curated. "Digital preservation is not just about implementing a good archival storage system or ‘preserving the bits’ it is about working within the framework set out by international standards (for example the Open Archival Information System) and taking steps to increase the chances of enabling meaningful re-use in the future."

Accessing research data is clearly already a problem for researchers when formats and media become obsolete. A 2013 survey showed that 25% of respondents had encountered the “Inability to read files in old software formats on old media or because of expired software licences”. A digital preservation program should address these issues. Archivematica, the system recommended in the report, is based on the Open Archival Information System model and uses standards such as PREMIS and METS to store metadata about the objects that are being preserved. A digital preservation system of this kind would consist of a variety of different systems performing different functions within the workflow. "Archivematica should not be seen as a magic bullet. It does not guarantee that data will be preserved in a re-usable state into the future. It can only be as good as digital preservation theory and practice is currently and digital preservation itself is not a fully solved problem."

Research data is particularly challenging from a preservation point of view because of the many data types and formats involved. Digital preservation tools and policies do not exist for many of these formats, so such files will not receive as high a level of curation when ingested into Archivematica.
The rights metadata within Archivematica may not fit the granularity that would be required for research data. This information would need to be held elsewhere within the infrastructure.

The value of research data can be subjective and difficult to assess and there may be disagreement on the value of the data. However, the bottom line is "in order to comply with funder mandates, publisher requirements and institutional policies, some data will need to be retained even if the researchers do not believe anyone will ever consult it." Knowing the types of formats used is a key to digital archiving and planning, and without that there will be problems later. In the OAIS Reference Model, information about file formats needs to be part of the ‘Representation Information’ that an end user must have to open and view a file.

Thursday, July 16, 2015

Toshiba's 3-D Magnetic Recording May Increase Hard Disk Drive Capacity

Toshiba's 3-D Magnetic Recording May Increase Hard Disk Drive Capacity. Tom Coughlin. Forbes. July 9, 2015.
     Toshiba demonstrated a method for using magnetic fields from microwave radiation to selectively reverse the magnetization direction in multi-layer magnetic recording media. This could lead to the development of 3-D magnetic recording, where independent data is written to and read from overlapping layers of a multi-layer recording medium, and could substantially increase hard disk drive capacity. "Magnetic storage is where most of the world’s accessible digital data now lives."

Wednesday, July 15, 2015

A Compressed View of Video Compression

A Compressed View of Video Compression. Richard Wright. Preservation Guide. 22 June 2015.
   Digital audio and digitised film can also be compressed, but there are particular issues. The basic principle is that audio and video signals carry information, though the efficiency may vary. "The data rate of the sequence of number representing a signal can be much higher than the rate of information carried by the signal. Because high data rates are always a problem, technology seeks methods to carry the information in concise ways." The video signal has been altered in order to squeeze it into limited bandwidth. Redundant data  may be sent in the signal to improve the odds that the information will be transmitted. It is important to know what matters and what can be discarded. With preservation, "a key issue is management: knowing what you’re dealing with, having a strategy, monitoring the strategy, keeping on top of things so loss is prevented." Basic principles of preservation also apply to compression:
  • Keep the original
  • Keep the best
  • Do no harm
 There are best practices in dealing with compressed materials, and in migrating compressed versions to new compressed versions. His estimate is that with storage costs decreasing "there will be no economic incentive for such a cascade of compressions." "The next migration will dispense with the issue by migrating away from compressed to lovely, stable uncompressed video."

Tuesday, July 14, 2015

Seagate Senior Researcher: Heat Can Kill Data on Stored SSDs

Seagate Senior Researcher: Heat Can Kill Data on Stored SSDs.  Jason Mick. Daily Tech. May 13, 2015.
   A research paper by Alvin Cox, a senior researcher, warns that those storing solid state drives should be careful to avoid storing them in hot locations. Average "shelf life" in a climate controlled environment is about 2 years but drops to 6 months if the temperature hits 95° F / 35° C. It also says that typically enterprise-grade SSDs can retain data for around 2 years without being powered on if the drive is stored at a temperature of 25°C / 77°F. For every 5°C / 9°F increase, the storage time halves.  This also applies to storage of solid-state equipped computers and devices. If only a few  sectors are bad it may be possible to repair the drive.  But if too many sectors are badly corrupted, the only option may be to format the device and start over.
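
The halving rule quoted above can be written as a simple formula: retention ≈ 2 years × 0.5^((T − 25) / 5), with T in °C. The sketch below only restates the figures in the summary; the function name and its parameters are hypothetical, and real drives will vary with workload and wear.

```python
def retention_years(temp_c, base_years=2.0, base_temp_c=25.0, halving_step_c=5.0):
    """Estimated unpowered retention, halving for every `halving_step_c` above the base.

    Defaults restate the summary's figures: about 2 years at 25°C,
    halving for each 5°C increase.
    """
    return base_years * 0.5 ** ((temp_c - base_temp_c) / halving_step_c)


for temp in (25, 30, 35, 40):
    print(f"{temp}°C: {retention_years(temp) * 12:.1f} months")
```

At 35°C the rule gives 0.5 years, which matches the 6-month figure quoted for 95°F / 35°C.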

A Large-Scale Study of Flash Memory Failures in the Field

A Large-Scale Study of Flash Memory Failures in the Field. Justin Meza, et al. ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. June 15-19, 2015.
     Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. "Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability." This can lead to data loss.

This is the first large-scale study of flash-based SSD reliability in the field. It analyzes data collected from flash-based solid state drives in Facebook data centers over about four years and millions of operational hours in order to understand failure properties and trends. The major observations:
  1. SSD failure rates do not increase monotonically with flash chip wear, but go through several distinct periods corresponding to how failures emerge and are subsequently detected, 
  2. the effects of read disturbance errors are not prevalent in the field, 
  3. sparse logical data layout across an SSD's physical address space can greatly affect SSD failure rate, 
  4. higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and 
  5. data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells
The findings will hopefully lead to other analyses and flash reliability solutions.

Monday, July 13, 2015

Risk Assessment as a Tool in the Conservation of Software-based Artworks

Risk Assessment as a Tool in the Conservation of Software-based Artworks. Patricia Falcao.
   The article looks at the use of risk assessment methodologies to identify and evaluate vulnerabilities of software-based artworks. Software-based art is dependent on technology. Two consequences of this:
1. Because electronic equipment is usually mass-produced, there are very few cases where one individual device is essential for the artwork.
2. On the other hand, when the equipment is no longer commercially available it becomes very difficult to replace any of its elements.
A sculpture conservator may be able to re-make a missing element for a sculpture by using the same or similar materials but a time-based media conservator cannot always re-make obsolete electronics. A particular artwork may use custom-made software. "Any software, in turn, usually requires a specific operating system. All the programs, from the firmware to the operating system, must run properly. All settings, plug-ins, and codecs must be in place. Without all of these, there is no artwork."

This means that each artwork is a custom-made system; the components may vary with each iteration of the work and as technology changes. A conservator must understand "how these components are used in the particular system and how they influence the risks and options for long-term preservation." With the conservation of contemporary art, obsolescence only affects an artwork once something stops working. But the effect of obsolescence will increase over time.

Software-based artworks are similar to time-based media, but they are more vulnerable to those risks because:
  1. Systems are customized for each artwork.
  2. Systems are easily changed; for example, connecting an archival computer to the Internet could trigger an automatic update after which the file will no longer run.
  3. The technical environment is rapidly changing.
The degree of significance can be evaluated by
  1. Provenance 
  2. Rarity or representativeness 
  3. Condition or completeness 
  4. Interpretive capacity
A procedure for the acquisition of software-based artworks being developed is composed of simple actions during acquisition that can diminish the impact of obsolescence in the medium-term. It is important to discuss the artwork, technology, and possible preservation measures with the artists and technical staff.  The conservator should identify and define:
  1. The display or presentation parameters 
  2. What can or cannot be changed, and within what limits. 
  3. Identify obsolescent elements and create a plan for recovery.  
  4. How the artist wants the artwork preserved. Identify core elements and migration strategy.  
  5. Understand the system (hardware, software). Test the system with the artist and staff. 
Over the lifetime of the artwork,
  1. Document the system and any changes over time  
  2. Prevent changes such as automatic updates
  3. Monitor obsolescence issues with the components of the work.  
  4. Re-evaluate preservation needs regularly. 
Some steps that can be taken to reduce the risk of failure and obsolescence:
  1. Make clones of the computer’s hard drive immediately upon acquisition. 
  2. Create an exhibition copy of the system, possibly with the artist and staff. 
  3. Gather operation manuals, service manuals, and hardware specifications. 
  4. Save the software versions, source code, libraries, and programming tools necessary to read project files. 
For long term preservation,
  1. Continue to implement the preservation strategies identified.
  2. Develop clear procedures for the acquisition of software-based artworks. 
  3. Identify software tools useful for preservation. 
  4. Test recovery strategies and confirm results over time. 
  5. Develop relationships with experts in the fields required for preservation. 
Software-based preservation will require more than just the conservator. It will also require help from the technology field and many tools.

Friday, July 10, 2015

Track the Impact of Research Data with Metrics; Gauge Archive Capacity

How to Track the Impact of Research Data with Metrics. Alex Ball, Monica Duke.  Digital Curation Centre. 29 June 2015.
   This guide from the DCC provides help on how to track and measure the impact of research data. It provides:
  • impact measurement concepts, services and tools for measuring impact
  • tips on increasing the impact of your data 
  • how institutions can benefit from data usage monitoring  
  • help to gauge capacity requirements in storage, archival and network systems
  • information on setting up promotional activities 
Institutions can benefit from data usage monitoring as they:
  • monitor the success of the infrastructure providing access to the data
  • gauge capacity requirements in storage, archival and network systems
  • create promotional activities around the data, sharing and re-use
  • create special collections around datasets
  • meet funder requirements to safeguard data for the established lifespan
Tips for raising research data impact
  • deposit data in a trustworthy repository
  • provide appropriate metadata
  • enable open access
  • apply a license to the data about what uses are permitted
  • raise awareness to ensure it is visible (citations, publication, provide the dataset identifier, etc)

Thursday, July 09, 2015

Respected US professor says libraries are places of knowledge creation and librarians are educators.

Respected US professor says libraries are places of knowledge creation and librarians are educators. CILIP. 2 July 2015.
  • R. David Lankes: librarians have the power to change the world by “promoting informed democracy”.
  • “Libraries are not about books, and librarians are not about collections, nor are they about waiting to serve. Our libraries are mandated, mediated spaces owned by the community, and librarians are educators dedicated to knowledge creation who exist to unleash the expertise held within their community.”
  • There is a need for a skilled workforce to properly understand and manage information
An innovation also mentioned is the Ideas Box, a durable, portable library in a box that is designed to provide access to vital information and culture in humanitarian crises. Pioneered by Bibliothèques Sans Frontières/Libraries Without Borders, it can be sent to refugee camps and other remote populations anywhere in the world and set up in under an hour.

Wednesday, July 08, 2015

Audiovisual Preservation: Sustainability is Paramount

Audiovisual Preservation: Sustainability is Paramount. David Braught. Crawford Media Services. July 6, 2015.
   Many organizations want the most pristine, uncompressed, high quality files possible. That may seem to make sense, but that is usually unrealistic for most organizations. The storage costs to store the massive files can "lead to paralysis in your digital initiatives and to significant long-term data loss (owing to lack of funds for digitization and archival storage upkeep)." While this may be the best way for some, don't automatically assume there is only one right way of audiovisual digitization. There are many options, file types, and organizational factors.  An important part of this is to define your primary goal. The web site includes a chart that shows estimated information about a project:

File Type                    Bitrate (Mbps)    Total Footprint for 500 Hours (TB)
4K DPX (no audio)                      9830                              2,109.29
2K DPX (no audio)                      2400                                514.98
Uncompressed 10 Bit HD                 1200                                257.49
Uncompressed 8 Bit HD                   952                                204.28
Uncompressed 10 Bit SD                  228                                 48.92
Uncompressed 8 Bit SD                   170                                 36.48
Lossless Jpg2K 10 Bit HD                445                                 95.49
Lossless Jpg2K 8 Bit HD                 330                                 70.81
Lossless Jpg2K 10 Bit SD                 85                                 18.24
Lossless Jpg2K 8 Bit SD                  65                                 13.95
DV25                                     31                                  6.65
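
The footprint column can be estimated from the bitrate column. The sketch below assumes a unit convention inferred from the chart's numbers rather than stated by the source: megabits are converted to megabytes by dividing by 8, then to terabytes by dividing by 1024 twice. The function name is hypothetical.

```python
def footprint_tb(bitrate_mbps, hours=500):
    """Storage footprint in TB for `hours` of video at a constant bitrate.

    Assumed convention: Mb -> MB by dividing by 8, then MB -> GB -> TB
    by dividing by 1024 at each step.
    """
    megabytes = bitrate_mbps * hours * 3600 / 8
    return megabytes / 1024 / 1024


for name, mbps in [("Uncompressed 10 Bit HD", 1200), ("Lossless Jpg2K 10 Bit HD", 445), ("DV25", 31)]:
    print(f"{name}: {footprint_tb(mbps):,.2f} TB")
```

Under this convention the results agree with the chart to two decimal places, for example 257.49 TB for 500 hours of uncompressed 10 bit HD at 1200 Mbps.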

Storage costs are "neither cheap nor long term". You can't just put files on a hard drive and expect them to survive indefinitely. A long term solution requires redundant, archivally sound storage that is  migrated to newer storage every five years. "It does no one any good to ingest thousands of hours of 4K scans and then have to pull the plug on the storage fifteen years down the line. Sustainability should always be paramount."  Each institution has to decide what is the best option for them.

Tuesday, July 07, 2015

With The Rise Of 8K Content, Exabyte Video Looms

With The Rise Of 8K Content, Exabyte Video Looms. Tom Coughlin. Forbes. June 25, 2015.
   Digital storage has an important role in the professional media and entertainment industry. The ever-growing archive of long-tail digital content and the increasing volume of digitized historical analog content are increasing demand for archives using tape, optical discs and hard drive arrays. There has been a noticeable increase in 8K content. "It is expected that single video projects generating close to 1 Exabyte of raw data will occur within 10 years." A recent survey on digital storage in media and entertainment showed important digital storage trends in the area.

Those using cloud storage:
  • 2015: 30.2% of participants
  • 2014: 25.6% 
  • 2013: 24.7%
  • 2012: 15.1%  
Those with over 1 TB in the cloud:
  • 2015: 32.9%
  • 2014: 28.1% 
  • 2013: 23%
  • 2012: 26.7%
Some other results from the survey concerning archiving:
  • 34% had over 2,000 hours of content in a long term archive in 2015
  • 26.9% added over 1,000 hours to their archive in 2015
  • 32.6% had over 2,000 hours of unconverted analog content in 2015
  • 42.8% said they have an annual analog conversion rate of 2% or less (4.5% was average)
Types of storage media in 2015:
  • Digital Tape: 40% 
  • External Hard Disk Drives: 28%  
  • Disk-based Local Storage Networks: 16%
  • Optical discs: 6%
  • Public cloud: 5%

In 10 Years A Single Movie Could Generate Close To 1 Exabyte Of Content

In 10 Years A Single Movie Could Generate Close To 1 Exabyte Of Content. Tom Coughlin. Forbes. October 5, 2014.
   Storage requirements for images and video are increasing. "In the near future, several petabytes of storage may be required for a complete digital movie project at 4K resolution.  By the next decade total video captured for a high end digital production could be hundreds of PB, even approaching 1 Exabyte." A recent survey shows that overall cloud storage for media and entertainment is expected to grow 37 times (322 PB to 11,904 PB) and that cloud storage revenue will exceed $1.5 billion by 2019.
  • The largest demand for storage is for digital conversion and preservation (including archiving of new digital content - 96.5%).  
  • Archiving and preservation in 2013 was about 47% of the total storage revenue. Active archiving will drive increased use of storage for long term archives.
  • By 2019 it is expected that 64% of archived content will be in near-line storage, up from 43% in 2013.
  •  Over 50 Exabytes of digital storage will be used by 2019 for digital archiving and content conversion and preservation

Cloud Storage Revenue

Collection, Curation, Citation at Source: Publication@Source 10 Years On

Collection, Curation, Citation at Source: Publication@Source 10 Years On. Jeremy G. Frey, et al. International Journal of Digital Curation. Vol 10, No 2, 2015.
   The article describes a scholarly knowledge cycle in which the accumulation of knowledge is based on the continuing use and reuse of data and information. Collection, curation, and citation are three processes intrinsic to the workflows of the cycle. The currency of collection, curation, and citation is metadata. "Policies should recognize that small amounts of adequately characterized, focused data are preferable to large amounts of inadequately defined and controlled data stored in a random repository." The increasing size of data sets and the growing risk of loss through catastrophic failure (such as a disk failure) have led researchers to use cloud storage, perhaps too uncritically.

The responsibilities of researchers for meeting the requirements of sound governance and ensuring the quality of their work have become more apparent. The article places the responsibility for curation firmly with the originator of the data. "Researchers should organize their data and preserve it with semantically rich metadata, captured at source, to provide short- and long-term advantages for sharing and collaboration."  Principal Investigators, as custodians, are particularly responsible for clinical data management and security (though curation and preservation activities exist in other research roles). "Curators usually attempt to add links to the original publications or source databases, but in practice, provenance records are often absent, incomplete or ad hoc, often despite curators’ best efforts. Also, manually managed provenance records are at higher risk of human error or falsification." There is a pressing need for training and education to encourage researchers to curate the data as they collect it at source.

"All science is strongly dependent on preserving, maintaining, and adding value to the research record, including the data, both raw and derived, generated during the scientific process. This statement leads naturally to the assertion that all science is strongly dependent on curation."

Monday, July 06, 2015


TIFF/A. Gary McGath. File Formats Blog.  July 3, 2015.
   The TIFF format has been around for a long time. There have been many changes and additions, such that "TIFF today is the sum of a lot of unwritten rules".  A group of academic archivists has been working on a version intended to remain readable over the long term, calling it TIFF/A. A white paper discusses the technical issues. Discussions starting in September aim to produce a version to submit for ISO consideration.

Presentation on Evaluating the Creation and Preservation Challenges of Photogrammetry-based 3D Models

Presentation on Evaluating the Creation and Preservation Challenges of Photogrammetry-based 3D Models. Michael J. Bennett. University of Connecticut. May 21, 2015.
    Photogrammetry allows for the creation of 3D objects from 2D photography, which mimics human stereo vision. The process has many steps and produces images, masks, depth maps, models, and textures. The question is, what should be archived for long-term digital preservation? When models are output into an open standard, there is data loss, since “native 3D CAD file formats cannot be interpreted accurately in any but the original version of the original software product used to create the model.”

The general lessons from archiving CAD files are that, when possible, the data should be normalized into open standards, but native formats, which are often proprietary, should also be archived. For photogrammetry data, the author reviews some of the options and recommendations. There are difficulties with archiving the files, and also with organizing them in a container that documents the relationships among the files. Digital repositories can play a role in the preservation of the 3D datasets.

Friday, July 03, 2015

Australian electronic books to be preserved at the National Library in Canberra under new laws

Australian electronic books to be preserved at the National Library in Canberra under new laws. Clarissa Thorp. ABC. 3 July 2015.
Starting in January of next year, digital materials including e-books, blogs, prominent websites, and important social media messages will be collected as a snapshot of Australian life. Under existing copyright laws, the National Library of Australia is able to collect all books produced by local publishers through the legal deposit system. Now, with new legislation adopted by the Federal Parliament, the Library will be able to preserve published items from the internet that could disappear from view in the future. "This legislation puts us in a position where we are able to ask publishers to deposit electronic material with the National Library in a comprehensive way." "So we will be able to open that up and collect the whole of the Australian domain, for websites for example it means we are able to collect e-books that are only published in digital form." This new legislation will expand the Library's digital preservation program and ensure that future collections reflect Australian society as a whole.

Thursday, July 02, 2015

Vatican Library digitizes ancient manuscripts, makes them available for free

Vatican Library digitizes ancient manuscripts, makes them available for free. Justin Scuiletti.  PBS NewsHour. October 22, 2014.
The Vatican Apostolic Library is digitizing its archive of ancient manuscripts and making them available to view. It is undertaking an extensive digital preservation of its 82,000 documents. The entire undertaking is expected to take at least 15 years and cost more than $63 million. “Technology gives us the opportunity to think of the past while looking towards the future, and the world’s culture, thanks to the web, can truly become a common heritage, freely accessible to all, anywhere and any time.” The current list of digitized manuscripts can be viewed through the Vatican Library website and the project website.

Wednesday, July 01, 2015

Over 28 exabytes of storage shipped last quarter

More than 28 billion gigabytes of storage shipped last quarter. Lucas Mearian. Computerworld. June 30, 2015.
Worldwide data storage hardware sales increased 41.4% over the same quarter in 2014. This past quarter, 28.3 exabytes of capacity was shipped out.  Traditional external arrays decreased while demand strongly increased for server-based storage and hyperscale infrastructure (distributed infrastructures that support cloud and big data processing, and can scale to thousands of servers). The largest revenue growth was in the server market (new server sales and not just upgrades to existing server infrastructures).  The most popular external storage arrays were all-flash models and hybrid flash arrays that combine NAND flash with hard disk drives.

Tuesday, June 30, 2015

National Archives kicks off 'born-digital' transfer

National Archives kicks off 'born-digital' transfer. Mark Say. UKAuthority. 24 June 2015.
The National Archives is looking at the long term issue of keeping records accessible as the technology in which they are originally created changes.

"To make sure born-digital records can be permanently preserved we’re engaged in what we call parsimonious preservation, in which we’re making sure it can be used by the next trends of technology being developed. We want them to be easily viewed in 10 years’ time, although we cannot plan for 100 years as there’s no way we can know what the technology will look like."

“To ensure records will still be used in the same way we want to see what the technology is going to do in the next 10 years.

“Digital preservation is a major international challenge. Digital technology is changing what it means to be an archive and we are responding to these changes.

“These records demonstrate how we are leading the archive sector in embracing the challenges of storing digital information for future generations. We are ensuring that we are ready to keep the nation’s public records safe and accessible for the future, whatever their format.”

Monday, June 29, 2015

File identification tools, part 5: FITS

File identification tools, part 5: FITS. Gary McGath. File Formats Blog.  June 25, 2015.
The File Information Tool Set (FITS), which aggregates results from several file identification tools, was created by the Harvard University Libraries and is available on GitHub. FITS uses Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard tools. The tool can be used in the ingest process; it processes directories and subdirectories, and produces a single XML output file in various schemas. It can be run as a standalone tool or incorporated with other tools, and can be configured to determine which tools to run and which extensions to examine. Documentation is found on Harvard’s website.
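
To illustrate the general idea of aggregating results from several identification tools, here is a minimal sketch. It is not FITS's actual implementation or output format: the tool names come from the list above, but the reported formats and the `consolidate` function are hypothetical.

```python
from collections import Counter


def consolidate(reports: dict[str, str]) -> dict:
    """Combine per-tool format identifications into one summary.

    `reports` maps a tool name to the format that tool reported for a file.
    """
    if not reports:
        raise ValueError("no tool reports to consolidate")
    votes = Counter(reports.values())
    top_format, _ = votes.most_common(1)[0]
    return {
        "format": top_format,
        "agreeing_tools": sorted(t for t, f in reports.items() if f == top_format),
        "conflict": len(votes) > 1,  # True when the tools disagree
    }


if __name__ == "__main__":
    # Hypothetical outputs for a single file; real tools report far richer data.
    sample = {
        "DROID": "image/tiff",
        "Apache Tika": "image/tiff",
        "ExifTool": "image/tiff",
        "JHOVE": "application/octet-stream",
    }
    print(consolidate(sample))
```

Flagging disagreement rather than hiding it is the useful part at ingest time: a conflict tells the archivist which files need a closer look.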

SIRF: Self-contained Information Retention Format

SIRF: Self-contained Information Retention Format. Sam Fineberg, et al. SNIA Tutorial. 2015. [PDF]
Generating and collecting very large data sets that need to be kept for long periods is a necessity for many organizations, including those in the sciences, archives, and commerce. The presentation describes the challenges of keeping data long term with Linear Tape File System (LTFS) technology and a Self-contained Information Retention Format (SIRF). The top external factors driving long-term retention requirements are legal risk, compliance regulations, business risk, and security risk.

What does long-term mean? In a poll, 70% of respondents said they require retention of 20 years or more.
  • 100 years: 38.8%
  • 50-100 years: 18.3%
  • 21-50 years: 13.1%
  • 11-20 years: 15.7%
  • 7-10 years: 12.3%
  • 3-5 years: 1.9%
The need for digital preservation:
  • Regulatory compliance and legal issues
  • Emerging web services and applications
  • Many other fixed-content repositories (Scientific data, libraries, movies, music, etc.)
Stored data should remain accessible, undamaged, and usable for as long as desired and at an affordable cost. What counts as affordable depends on the "perceived future value of information". There are problems with verifying the correctness and authenticity of semantic information over time. SIRF is the digital equivalent of a self-contained archival box. It contains:
  • set of preservation objects and a catalog (logical or physical)
  • metadata about the contents and individual objects
  • self describing standard catalog information so it can all be maintained
  • a "magic object" that identifies the container and version
The metadata contains basic information that can vary depending on the preservation needs. It allows a deeper description of the objects along with the content meaning and the relationships between the objects.
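
The container structure described above can be sketched as a small data model. This is an illustration of the idea only, not the SIRF specification: every class and field name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PreservationObject:
    object_id: str
    payload: bytes
    metadata: dict = field(default_factory=dict)  # per-object metadata


@dataclass
class Container:
    magic: dict                                   # identifies the container and version
    container_metadata: dict = field(default_factory=dict)
    objects: dict = field(default_factory=dict)   # object_id -> PreservationObject

    def add(self, obj: PreservationObject) -> None:
        self.objects[obj.object_id] = obj

    def catalog(self) -> dict:
        """Self-describing listing: container identity plus metadata per object."""
        return {
            "magic": self.magic,
            "container": self.container_metadata,
            "objects": {oid: o.metadata for oid, o in self.objects.items()},
        }


if __name__ == "__main__":
    box = Container(magic={"format": "example-container", "version": "1"})
    box.add(PreservationObject("obj-1", b"...", {"title": "Field recording"}))
    print(box.catalog())
```

Keeping the catalog inside the container is what makes the "archival box" self-contained: a future reader needs only the box itself to learn what it holds.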

When preserving objects, we need to keep all the information to make them fully usable in the future. No single technology will be "usable over the time-spans mandated by current digital preservation needs". LTFS technologies are "good for perhaps 10-20 years".

Saturday, June 27, 2015

Russian Official Proposes International Investigation Into U.S. Moon Landings. Cultural Preservation?

Russian Official Proposes International Investigation Into U.S. Moon Landings. Ingrid Burke. The Moscow Times.  June 16, 2015.
Russia's Investigative Committee spokesman, Vladimir Markin, called for an international investigation to (among other things) solve the mystery of the disappearance of film footage from the original moon landing in 1969. "But all of these scientific - or perhaps cultural - artifacts are part of the legacy of humanity, and their disappearance without a trace is our common loss. An investigation will reveal what happened."

[Interesting that the political wranglings have now reached the level of historical archiving and cultural preservation.]

Friday, June 26, 2015

ARSC Guide to Audio Preservation

ARSC Guide to Audio Preservation. Sam Brylawski, et al. National Recording Preservation Board of the Library of Congress. May 2015. [PDF, 252 pp.]
CLIR, the Association for Recorded Sound Collections (ARSC) and the National Recording Preservation Board (NRPB) of the Library of Congress have published CLIR Publication No. 164, an excellent guide to audio preservation.
"Our audio legacy is at serious risk because of media deterioration, technological obsolescence, and, often, lack of accessibility. This legacy is remarkable in its diversity, ranging from wax cylinders of extinct Native American languages to tapes of local radio broadcasts, naturalists’ and ethnographers’ field recordings, small independent record company releases, and much more. These recordings are held not by a few large organizations, but by thousands of large and small institutions, and by individuals. The publishers hope that this guide will support and encourage efforts at all institutions to implement best practices to help meet the urgent challenge of audio preservation."

Chapters include:

  • Preserving Audio (Recorded Sound at Risk, Preservation Efforts, Roles)
  • Audio Formats: Characteristics and Deterioration (Physical, digital)
  • Appraisals and Priorities (Tools; Selection/collection policies, decisions)
  • Care and Maintenance (Handling, assessment) and arrangement
  • Description of Audio Recordings (Metadata, standards, tools)
  • Preservation Reformatting (Conversion to digital files, metadata, funding)
  • Digital Preservation and Access: Process, storage infrastructure
  • Audio Preservation: The Legal Context (Copyright, control, donor agreements)
  • Disaster Prevention, Preparedness, and Response
  • Fair Use and Sound Recordings Lessons
Some notes from reading the publication:
  • the ultimate goals of preservation are sustained discovery and use
  • taken together, all these dissimilar recordings represent an audio DNA of our culture
  • our enjoyment of the recordings has far exceeded our commitment to preserve them
  • history is represented in sound recordings; it entertains and enriches us
  • if compressed files are the only versions available to the public, we have no assurances that anyone is maintaining the higher fidelity originals
  • efforts of large and small institutions and private collectors are needed to make a meaningful dent in the enormous volume of significant recordings not yet digitized for preservation
  • if we are to preserve our audio legacy, all institutions with significant recordings must be part of the effort
  • proactive attention, care, and planning are critical to the future viability and value of both analog and digital recordings
  • institutions often have more items in their care than they have resources for adequate processing, cataloging, and preservation
  • the potential technical obsolescence of the hardware to play a recording should influence priorities and resources allocated for preservation
  • perhaps the most crucial feature of a metadata schema is its degree of interoperability for sharing, searching, harvesting, and transformation or migration
  • the preservation choice is not binary: "either we implement intensive preservation immediately and forever; or we do nothing". We should not delay action because the ideal cannot be achieved
  • preservation metadata is the information needed to support the long-term management and usability of an object
  • the Broadcast Wave Format (BWF) is the de facto standard for digital audio archiving
  • monitoring and planning to avoid obsolescence are important aspects of a solid digital preservation strategy
  • audio preservation is an ongoing process that may be challenging and intimidating; setting priorities is central to a successful preservation strategy
  • digital preservation will enable the fulfillment of the goal of long-term use (whether focused on education, scholarship, broadcasting, marketing, or sales)
  • ensure that there is at least one geographically separate copy of all digital content
  • recognize the use of sound recordings as sources of information by students and researchers
  • libraries and memory institutions should provide points of cultural reference for the current generation of creators
Several free, open source software tools are available:
  • assessing audio collections for the purpose of setting preservation priorities
    • The Field Audio Collection Evaluation Tool (FACET)
    • Audio/Video Survey
    • Audiovisual Self-Assessment Tool (AvSAP)
    • MediaSCORE and MediaRIVERS
  • metadata tools
    • CollectiveAccess
    • Audio-Visual and Image Database (AVID)
    • AudioVisual Collaborative Cataloging (AVCC)
    • PBCore
"When libraries, archives, and museums exercise their legal rights to preserve and facilitate access to information, even without permission or payment, they are furthering the goals of copyright."

"The professional management of a collection requires the development of criteria for selecting and preserving collections of sound recordings. A selection or collection development policy defines and sets priorities for the types of collections that are most appropriate and suitable for an organization to acquire and to preserve. The basis for these criteria should be the goals and objectives of the individual institution."

Thursday, June 25, 2015

Arizona State University and Northern Arizona University Select Ex Libris Rosetta.

Arizona State University and Northern Arizona University Select Ex Libris Rosetta. Ex Libris Press Release. June 25, 2015.
Arizona State University and Northern Arizona University have adopted the Rosetta digital asset management and preservation solution. Rosetta will enable the libraries to manage and preserve their digital collections, including born-digital objects such as web sites and research data, in perpetuity. With Rosetta, the three institutions will be able to implement the solution together and work off one infrastructure, providing end-to-end digital asset management and preservation for the vast array of assets in all of their libraries.

The two Arizona schools join the University of Arizona, already a Rosetta customer, to provide shared digital asset management and preservation service for all public higher education in the state.

Storing Digital Data for Eternity

Storing Digital Data for Eternity. Betsy Isaacson. Newsweek Tech & Science. June 22, 2015.
“People think by digitizing photographs, maps, we have preserved them forever, but we’ve only preserved them forever if we can continue to read the bits that encode them.” An example of data loss is NASA's Viking probes, where mission data were saved on magnetic tape. After 10 years, no one had the skills or software to read the data, and a portion of the data was permanently lost. The moral of this is to be skeptical of the promises of technology. Cloud technologies may feel safe, but there is no guarantee that the data will continue to exist.

There are some projects underway to build storage for digital data that doesn’t degrade. Some of these use quartz glass (which is ultra expensive, with lasers that cost over $100,000); DNA (too slow to be practical for loading data, so complex that only specialized labs can manage it, and as volatile as magnetic tapes); metal etched disks that can be read with an optical microscope; and the Long Server, an ever-growing database of file-conversion resources. There is also Vint Cerf's suggestion of creating “digital vellum,” a technique for packing and storing digital files along with all the code that’s needed to decrypt them.

Wednesday, June 24, 2015

Rosetta - version 4.2 released

Rosetta - version 4.2 released. Ex Libris. June 22, 2015.
The latest release of Rosetta is now available. It contains many system improvements and updates. Some of these include:
  • Enhanced ability for depositing large SIPs containing multiple files
  • Improved security features
  • Improved deposit functionality
  • Publishing of Itemized Sets
  • SIP load management 

Monday, June 22, 2015

Why Libraries Matter More Than Ever in the Age of Google.

Why Libraries Matter More Than Ever in the Age of Google. Amien Essif. AlterNet. May 23, 2015.
This article is in response to the book BiblioTech: Why Libraries Matter More Than Ever in the Age of Google. Of all the public and private institutions we have, the public library is the truest democratic space. The library’s value is obvious.  A Gallup survey found that libraries are not just popular, they are extremely popular. "Over 90% of Americans feel that libraries are a vital part of their communities, compared to 53% for the police, 27% for public schools, and 7% for Congress. This is perhaps the greatest success of the public sector."

Yet a government report showed that while the nation’s public libraries served 298 million people in 2010 (96% of the U.S. population), funding has been cut drastically. “It seems extraordinary that a public service with such reach should be, in effect, punished despite its success.” Libraries are becoming more important, not less, to our communities and our democracy.

About 90% of all existing data is less than two years old.  Much of the information could be moderated for the public good, and libraries are able to do that. However, tech companies have put themselves into this role; "the risk of a small number of technically savvy, for-profit companies determining the bulk of what we read and how we read it is enormous."

Libraries are at risk because politicians are moving away from the public good, "favoring private enterprise and making conditions ripe for a Google-Apple-Amazon-Facebook oligopoly on information."
"It’s not too much of a stretch to say that the fate of well-informed, open, free republics could hinge on the future of libraries.”

Saturday, June 20, 2015

PREMIS Data Dictionary for Preservation Metadata, Version 3.0

PREMIS Data Dictionary for Preservation Metadata, Version 3.0. Library of Congress. June 10, 2015. [Full PDF]
The PREMIS Data Dictionary and its supporting documentation is a comprehensive, practical resource for implementing preservation metadata in digital archiving systems. The Data Dictionary is built on a data model that defines five entities: Intellectual Entities, Objects, Events, Rights, and Agents. Each semantic unit defined in the Data Dictionary is a property of one of the entities in the data model.

The new publications are:
  • PREMIS Data Dictionary. Version 3.0. This is the full document which includes the PREMIS Introduction, the Data Dictionary, Special Topics, and Glossary.
  • PREMIS Data Dictionary This document only has the Data Dictionary, introductory materials
  • Hierarchical Listing of Semantic Units: PREMIS Data Dictionary, Version 3.0
  • The Version 3.0 PREMIS Schema is not yet available
Version 3 of the Data Dictionary includes some major changes and additions to the Dictionary, which are:
  • Reposition Intellectual Entity as a category of Object to enable additional description within PREMIS and linking to related PREMIS entities.
  • Reposition Environments (i.e. hardware and software needed to use digital objects) so that they can be described and preserved reusing the Object entity. That is to say, they can be described as Intellectual Entities and preserved as Representation, File or Bitstream Objects.
  • Add physical Objects to the scope of PREMIS so that they can be described and related to digital objects.
  • Add a new semantic unit to the Object entity: preservationLevelType (O, NR) to indicate the type of preservation functions expected to be applied to the object for the given preservation level.
  • Add a new semantic unit to the Agent entity to express the version of software Agents: agentVersion (O, NR).
  • Add a new semantic unit to the Event entity: eventDetailInformation (O, R)

There are major additions in the “PREMIS Data Model” and “Environment” sections.
The entities in the PREMIS data model are:
  • Object: a unit subject to digital preservation. This can now be an environment.
  • Environment: technology supporting a Digital Object. Can now be described as an Intellectual Entity.
  • Event: an action concerning an Object or Agent associated with the preservation repository.
  • Agent: entity associated with Rights, Events, or an environment Object.
  • Rights Statement: Rights or permissions pertaining to an Object and/or Agent.
With the advent of Intellectual Entities in PREMIS 3.0, environments have been transformed. "Before version 3.0, there was an environment container within an Object that described the environment supporting that Object. If a non-environment Object needs to refer to an environment, it is now recommended that the environment is described as an Object in its own right and the two Objects are linked with a dependency relationship."
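
The recommendation in the quote above, describing an environment as an Object in its own right and linking it with a dependency relationship, can be sketched as follows. This is a simplified illustration, not the PREMIS schema: the class names, field names, and identifiers are hypothetical and do not correspond to PREMIS semantic units.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    rel_type: str    # e.g. "dependency"
    target_id: str   # identifier of the related Object


@dataclass
class PremisObject:
    identifier: str
    category: str                 # e.g. "file", "representation", "intellectual entity"
    is_environment: bool = False  # True when this Object describes an environment
    relationships: list = field(default_factory=list)

    def depends_on(self, environment: "PremisObject") -> None:
        """Link this Object to an environment Object with a dependency relationship."""
        if not environment.is_environment:
            raise ValueError("target is not described as an environment")
        self.relationships.append(Relationship("dependency", environment.identifier))


if __name__ == "__main__":
    viewer = PremisObject("env-1", "intellectual entity", is_environment=True)
    scan = PremisObject("file-42", "file")
    scan.depends_on(viewer)
    print(scan.relationships)
```

Because the environment is a separate Object, many files can point to the same environment description instead of each carrying its own copy.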