Saturday, October 03, 2015

Risk management guide for the secure disposal of electronic records

Secure destruction of electronic records. Archives New Zealand. 2 October 2015.
     Blog post on the secure and complete destruction of electronic records, plus all copies and backups. Destruction of paper records is mostly straightforward; however, it is not so easy to confidently delete electronic records. The processes to destroy digital records should be secure, irreversible, planned, documented and verifiable. The article has examples of the risks of not destroying records, as well as resources on how to implement the destruction of records. In addition there is a new guide on the benefits of disposal and the risks of not disposing of records: Risk management guide for disposal of records.

[Disposal and destruction of digital records may not seem to have anything to do with digital preservation, but it is an important part of records management. More than that, it can be a necessary part of the submission and ingest processes if multiple copies of sensitive content have been created before or while adding the content, or if you have been given media to add and must dispose of the media afterwards. -cle]

Friday, October 02, 2015

Bit Preservation: How do I ensure that data remains unharmed and readable over time?

Bitbevaring – hvordan sikrer jeg, at data forbliver uskadte og læsbare over tid? Eld Zierau, Det Kongelige Bibliotek.  Original November 2010; edited January 2015.
       Preservation of bits ensures that the values and order of digital bits are correct, undamaged and readable. The bits are the same as when they were received, and by managing them they will be available in the future. If the bits are changed, in the best case the object will appear different, and in the worst case the object will be unreadable in the future. Fixity checking can only ensure that the bits are the same; along with bit preservation it is important to plan for logical preservation as well, to make sure that the file can be rendered.
Bit security is based upon assessing the risks to the objects and then protecting the objects from events that will change the bits. The more you protect the bit integrity of the files, the more confidence you have that the files are accurately preserved.

The traditional method of file security is to make multiple copies. Those copies must be checked regularly for errors, which would then need to be corrected. All copies are equally important and must be checked. You must also make sure that the copies will not be affected by the same failure event. If that happens and the error is not discovered, you could lose all copies. This is part of the risk assessment process, and you should consider the following items in order to make sure at least one copy is intact:
  • Number of copies stored: The more copies stored, the more likely that at least one copy is intact
  • Frequency of checking copies: The more often copies are checked, the more likely that at least one copy is intact
  • Independence of copies: The more independently the copies are stored (different hardware, organizational custody, or geographical location), the greater the chance that they won't be affected by the same problem
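The effect of copy count can be illustrated with a back-of-the-envelope calculation. This is a simplified sketch (the function name is my own) that assumes copy failures are fully independent, which is exactly what the independence point above is meant to approximate:

```python
def p_at_least_one_intact(p_loss_per_copy, n_copies):
    """Probability that at least one copy survives a checking interval,
    assuming each copy is lost independently with the same probability."""
    return 1 - p_loss_per_copy ** n_copies

# With a 1% chance of losing any single copy between checks,
# one copy gives a 99% chance of survival, three copies 99.9999% -
# provided the failures really are independent.
```

Correlated failures (same hardware batch, same data center, same administrative error) break the independence assumption, which is why the guidance above stresses storing copies independently.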
Integrity Check: Use a checksum to verify the integrity of the file and store the information. This is like a fingerprint to determine which files have not changed.
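As a concrete illustration of the "fingerprint" idea, here is a minimal sketch (the function names are my own, not from the article) that computes a SHA-256 checksum per copy and flags the copies that no longer match a stored reference value:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def damaged_copies(copy_paths, stored_digest):
    """Return the copies whose checksum no longer matches the stored value,
    so they can be repaired from an intact copy."""
    return [p for p in copy_paths if sha256_of(p) != stored_digest]
```

The stored digest plays the role of the fingerprint: any copy whose recomputed checksum differs has changed and should be replaced from a copy that still matches.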

Media migration: Storage media do not last forever, so the digital content must be migrated regularly. It is important that the different copies are not exposed to the migration process at the same time.

Other considerations of bit preservation include understanding the costs, determining the desired level of object security, and the confidentiality of the materials. The Royal Library, the National Archives and the National Library are working together to provide Bitmagasinet, a shared hosted service that stores data cooperatively, with copies on different media, in different locations and at different organizations.

Thursday, October 01, 2015

Towards Sustainable Curation and Preservation: The SEAD Project’s Data Services Approach

Towards Sustainable Curation and Preservation: The SEAD Project’s Data Services Approach. James Myers, et al. IEEE International Conference on eScience. September 3, 2015. [PDF]
  This is a preview of a paper presented at the IEEE eScience conference about the Sustainable Environment: Actionable Data (SEAD) project. It details efforts to develop data management and curation services and to make those services available for active research groups to use. The introduction raises an apparent paradox: researchers face data management challenges, yet the curation practices that could help are applied only after research work is completed (if at all). By adding data and metadata incrementally as the data are produced, the metadata can be used to help organize the data during research.

If the system that preserved the data also generated citable persistent identifiers and dynamically updated the project’s web site with those citations, then completing the publication process would be in the best interest of the researcher. The discussions have revolved around two general areas that have been termed Active and Social Curation:
  1. Active Curation: focuses primarily on the activities of data producers and curators working during research projects to produce published data collections. 
  2. Social Curation: explores how the actions of the user community can be leveraged to provide further value. This could involve the ability of research groups to:
    1. publish derived value-added data products, 
    2. notify researchers when revisions or derived products appear, 
    3. monitor the mix of file formats and metadata to help determine migration strategies
SEAD’s initial capabilities are provided by three primary interacting components:
  1. Project Spaces: secure, self-managed storage and tools to work with data resources
  2. Virtual Archive: a service that manages publication of data collections from Project Spaces to long-term repositories
  3. Researcher Network: personal and organizational profiles that can include literature and data publications.
SEAD has developed the ability to manage, curate, and publish data for sustainability science projects through hosted project spaces. This is a new option for projects that is more powerful than just using a shared file system and also more cost effective than a custom project solution.

Wednesday, September 30, 2015

Checking Your Digital Content: What is Fixity, and When Should I be Checking It?

Checking Your Digital Content: What is Fixity, and When Should I be Checking It? Paula De Stefano, et al. NDSA. October 2014.
     A fundamental goal of digital preservation is to verify that an object has not changed over time or during transfer processes. This is done by checking the “fixity” or stability of the digital content. The National Digital Stewardship Alliance provides this guide to help answer questions about fixity.

Fixity, the property of a digital file or object being fixed or unchanged, is synonymous with bit-level integrity and offers evidence that one set of bits is identical to another. PREMIS defines fixity as "information used to verify whether an object has been altered in an undocumented or unauthorized way." The most widely used tools for fixity are checksums (CRCs) and cryptographic hashes (the MD5 and SHA algorithms). Fixity is a tool, but by itself it is not sufficient to ensure long-term access to digital information. The fixity information must be put to use, through audits of the objects, replacement or repair processes, and other methods that show the object is or will be understandable. Long-term access means the ability to "make sense of and use the contents of the file in the future".

Fixity information helps answer three primary questions:
  1. Have you received the files you expected?
  2. Is the data corrupted or altered from what you expected?
  3. Can you prove the data/files are what you intended and are not corrupt or altered? 
Fixity has other uses and benefits as well, which include:
  • Support the repair of corrupt or altered files by knowing which copy is correct 
  • Monitor hardware degradation: Fixity checks that fail at high rates may be an indication of media failure.
  • Provide confidence to others that the file or object is unchanged
  • Meet best practices such as ISO 16363/TRAC and NDSA Levels of Digital Preservation
  • Support the monitoring of content integrity as content is moved
  • Document provenance and history by maintaining and logging fixity information
Workflows for checking the fixity of digital content include:
  • Generating/Checking Fixity Information on Ingest
  • Checking Fixity Information on Transfer
  • Checking Fixity at Regular Intervals
  • Building Fixity Checking into Storage Systems
Considerations for Fixity Check Frequency include:
  • Storage Media: Fixity checks increase media use, which could increase the rate of failure
  • Throughput: Your rate of fixity checking will depend on how fast you can run the checks
  • Number and Size of Files or Objects: Resource requirements change as the scale of objects increase
Fixity information may be stored in different ways, which will depend on your situation, such as:
  • In the object metadata records
  • In databases and logs
  • Alongside content, such as with BagIt
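The BagIt approach to storing fixity alongside content can be sketched with the standard library alone. A BagIt payload manifest is a plain text file of `<checksum>  <path>` lines kept next to the content; the hypothetical functions below write and re-verify such a manifest (a real implementation, such as the Library of Congress bagit-python tool, also handles tag files and bag metadata):

```python
import hashlib, os

def write_manifest(payload_dir, manifest_path):
    """Record a SHA-256 checksum for every file under payload_dir,
    one '<digest>  <relative path>' line per file, BagIt manifest style."""
    with open(manifest_path, "w") as out:
        for root, _dirs, files in os.walk(payload_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                rel = os.path.relpath(path, payload_dir)
                out.write(f"{digest}  {rel}\n")

def failed_entries(payload_dir, manifest_path):
    """Re-check every manifest entry; return the paths that no longer match."""
    failures = []
    with open(manifest_path) as mf:
        for line in mf:
            digest, rel = line.rstrip("\n").split("  ", 1)
            with open(os.path.join(payload_dir, rel), "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != digest:
                    failures.append(rel)
    return failures
```

Because the manifest travels with the content, anyone who receives the package can re-run the verification without access to an external database of checksums.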

Tuesday, September 29, 2015

Do You Have an Institutional Data Policy?

Do You Have an Institutional Data Policy? A Review of the Current Landscape of Library Data Services and Institutional Data Policies. Kristin Briney, Abigail Goben, Lisa Zilinski. Journal of Librarianship and Scholarly Communication. 22 Sep 2015.  [PDF]
     This study looked at the correlation between policy existence and either library data services or the presence of a data librarian. Data services in libraries are becoming mainstream, and librarians have an opportunity to work with researchers at their institutions to help them understand the policies in place or to work toward a policy. Some items of note from the article:
  • Fewer universities have a data librarian on staff (37%) than offer data services.
  • Many libraries (65%) have a research data repository, either in an IR or in a repository specifically for data. 
  • Fewer universities (11%) have dedicated data repositories as compared with IRs that accept data (58%).
  • All universities with over $1 billion per year in research expenditures offer data services and a place to host data. Most (89%) of these institutions also have a data librarian, and 33% have a data repository.
  • Nearly half (44%) of all universities studied have some type of policy covering research data
    • Most of the policies (67%) designated an owner of university research data 
    • Data is required to be retained for some period of time (52%)
Standalone data policies covered many topics:
  • defined data (61%)
  • identified a data owner (62%)
  • stated a specific retention time (62%)
  • identified who can have access to the data (52%)
  • described disposition of the data when a researcher leaves the university (64%)
  • designated a data steward (46%)
Data services are becoming a standard at major research institutions. However, institutional data policies are often difficult to identify and may be confusing for researchers. Having a data policy, offering data services, and having a data librarian will become typical at major research institutions. 

Monday, September 28, 2015

AWS glitch hits Netflix and Tinder, offers a wake-up call for others. Katherine Noyes. IDG News Service. Sep 21, 2015.
     A number of major websites were affected for a time by glitches in Amazon Web Services' Northern Virginia facility. This is a cautionary lesson to organizations that rely on the cloud service for mission-critical capabilities. The problem resulted in higher-than-normal error rates. Mission-critical systems should have massive redundancies. "In the end, Amazon does not have adequate failover protection, which means its customers need to make sure they do."  Any outage is a significant one for a cloud provider, but "all providers have outages."  More than anything, this is a wake-up call to design your storage to account for problems.

Friday, September 25, 2015

Data Management Practices Across an Institution: Survey and Report

Data Management Practices Across an Institution: Survey and Report. Cunera Buys, Pamela Shaw. Journal of Librarianship and Scholarly Communication. 22 Sep 2015.
     Data management is becoming increasingly important to researchers in all fields. The results of a survey show that both short and long term storage and preservation solutions are needed. When asked, 31% of respondents did not know how much storage they will need, which makes establishing a correctly sized research data storage service difficult. This study presents results from a survey of digital data management practices across all disciplines at a university. In the survey, 65% of faculty said it was important to share data, but less than half of them "reported that they 'always' or 'frequently' shared their data openly, despite their belief in the importance of sharing".

Researchers produce a wide variety of data types and sizes, but most create no metadata or do not use metadata standards, and most researchers were uncertain about how to meet the NSF data management plan requirements (only 45% had a plan). A 2011 study of data storage and management needs across several academic institutions found that many researchers were satisfied with short-term data storage and management practices, but not satisfied with long-term data storage options. Researchers in the same study did not believe their institutions provided adequate funds, resources, or instruction on good data management practices. When asked where research data is stored:
  • Sixty-six percent use computer hard drives
  • 47% use external hard drives
  • 50% use departmental or school servers
  • 38% store data on the instrument that generated the data
  • 31% use cloud-based storage services
    •  Dropbox was the most popular service at 63%
  • 27% use flash drives
  • 6% use external data repositories.

Most researchers expected to store raw and published data “indefinitely”. Many respondents also selected 5-10 years, and very few said they keep data for less than one year. All schools suggest that data are relevant for long periods of time or indefinitely. Specific retention preferences by school were:
  • The college of arts and sciences prefers “indefinitely” for ALL data types
  • Published data: All schools prefer “indefinitely” for published data except
    • The law school prefers 1-5 years for published data
  • Other data:
    • The school of medicine prefers 5-10 years for all other data types
    • The school of engineering prefers 1-5 years for all other data types
    • The college of arts and sciences “Indefinitely” for raw data
    • The school of management “Indefinitely” for raw data

Keeping raw data / source material was considered useful because researchers may:
  • use it for future / new studies (77 responses)
  • use it for longitudinal studies (9 responses)
  • share it with colleagues (6 responses)
  • use it to replicate study results (10 responses)
  • respond to challenges of published results
  • keep data that would be difficult or costly to replicate
  • or simply because it is good scientific practice to retain data (4 responses)

When asked, 66% indicated they would need additional storage; most said 1-500 gigabytes or  “don’t know.” Also, when asked what services would be useful in managing research data the top responses were:
  • long term data access and preservation (63%), 
  • services for data storage and backup during active projects (60%), 
  • information regarding data best practices (58%), 
  • information about developing data management plans or other data policies (52%), 
  • assistance with data sharing/management requirements of funding agencies (48%), and 
  • tools for sharing research (48%).
Since most respondents said they planned to keep their data indefinitely, that means that institutional storage solutions would need to accommodate "many data types and uncertain storage capacity needs over long periods of time". The university studied lacks a long term storage solution for large data, but has short term storage available. Since many researchers store data on personal or laboratory computers, laboratory equipment, and USB drives, there is a greater risk of data loss. There appears to be a need to educate researchers on best practices for data storage and backup.

There appears to be a need to educate researchers about available external data repositories and about funding agencies’ requirements for data retention. The library decided to provide a clear set of funder data retention policies linked from the library’s data management web guide. Long-term storage of data is a problem for researchers because of the volume of data and the lack of stable storage solutions, and that limits data retention and sharing.

veraPDF releases prototype validation library for PDF/A-1b

veraPDF releases prototype validation library for PDF/A-1b. News release. veraPDF consortium. 16 September 2015.
     Version 0.4 of the veraPDF validation library is now available. This release delivers a working validation model and validator, an initial PDF/A-1b validation profile, and a prototype of the PDF feature reporting. This early version allows users to test this implementation of PDF/A-1b validation on single files. The roadmap for 2015 - 2017 is available.

Backblaze releases raw data on all 41,000 HDDs in its data center

Backblaze releases raw data on all 41,000 HDDs in its data center. Lucas Mearian. Computerworld. Feb 4, 2015.
     Backblaze has provided the files of raw data collected over three years on the reliability of over  41,000 hard disk drives in its data center. They have made the information available before, but these are the raw data files that can be used to examine the data in greater detail.  The two files contain the 2013 data and the 2014 data. The 2015 results will be released when they're available.

 "You may download and use this data for free for your own purpose, all we ask is three things 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone, it is free."  Backblaze uses all consumer-class drives to keep its costs down and says they "are as reliable as expensive enterprise-class drives."  The results show "Western Digital's drives lasted an average of 2.5 years, while Hitachi's and Seagate's lasted 2 and 1.4 years, respectively. Even so, some of the individual Hitachi models topped the reliability charts."

Thursday, September 24, 2015

Backblaze takes on Google, Amazon with storage at half a penny a gigabyte

Backblaze takes on Google, Amazon with storage at half a penny a gigabyte. Lucas Mearian. Computerworld. September  22, 2015.
     Backblaze announced decreased prices for its new B2 Cloud Storage. This cloud storage service is not encrypted or manipulated in any way, but users can encrypt their own files beforehand. The first 10GB are free, then the cost is "half a penny per gigabyte" regardless of the amount of storage. Upload is free and download is $0.05 per GB. The $0.005 price is lower than that of Amazon's Glacier, which also limits data retrieval and comes with the caveat that retrieving your data may take hours. "We'll offer access via a web GUI. Anyone can access it via our web interface. Then we offer a command line interface for IT people and if you're a developer, there's an API for the service."

Hill Museum & Manuscript Library awarded grant preserving manuscript collections

Hill Museum & Manuscript Library awarded $4 million grant from Arcadia Fund. Press release. St. Johns University. September 22, 2015.
     Arcadia Fund has awarded a grant to the Hill Museum & Manuscript Library (HMML) at Saint John's University for photographic manuscript preservation, which is HMML's core mission. Rev. Columba Stewart said HMML is "currently preserving manuscript collections at sites in Lebanon, Iraq, the Old City of Jerusalem, Egypt, Mali and Malta" and the digital images and related cataloging will be available to scholars throughout the world. HMML has formed partnerships with over 520 libraries and archives to photograph more than 145,000 manuscripts from Europe, Africa, the Middle East and India.

Wednesday, September 23, 2015

A Selection of Research Data Management Tools Throughout the Data Lifecycle

A Selection of Research Data Management Tools Throughout the Data Lifecycle. Jan Krause. Ecole polytechnique fédérale de Lausanne. September 9, 2015. [PDF]
     This article looks at the data lifecycle management phases and the many tools that exist to help manage data throughout the process. These tools help researchers make the most of their data, save time in the long run, promote reproducible research, and minimize risks to the data. The lifecycle management phases are: discovery, acquisition, analysis, collaboration, writing, publication and deposit in trusted data repositories. There are tools in each of these areas.
It is important to use appropriate data and metadata standards, especially data formats, which should be chosen at the beginning since they are difficult to change after the project is started.

Tuesday, September 22, 2015

Taking Control: Identifying Motivations for Migrating Library Digital Asset Management Systems

Taking Control: Identifying Motivations for Migrating Library Digital Asset Management Systems. Ayla Stein, Santi Thompson. D-Lib Magazine. September/October 2015.
     "Digital asset management systems (DAMS) have become important tools for collecting, preserving, and disseminating digitized and born digital content to library patrons." This article looks at why institutions are migrating to other systems and in what direction. Often migrations happen as libraries refine their needs. The literature on the migration process and its implications is limited; this article provides several case studies of repository migration. A presentation by Lisa Gregory "demonstrated the important role digital preservation plays in deciding to migrate from one DAMS to another and reiterated the need for preservation issues and standards to be incorporated into the tools and best practices used by librarians when implementing a DAMS migration". Repository migration gives institutions the opportunity to move from one type of repository, such as home grown or proprietary, to another type. Some of the reasons that institutions migrated to other repositories (ranked number 1 by respondents) are:
  • Implementation & Day-to-Day Costs
  • Preservation
  • Extensibility
  • Content Management
  • Metadata Standards
Formats they wanted in the new system included:

Response    Num.    %
PDF          28    98
JPEG         26    90
MP3          22    76
JPEG2000     21    72
TIFF         21    72
MP4          19    66
MOV          17    59
CSV          16    55
DOC          13    45
DOCX         12    41

For metadata, they wanted the new system to support multiple metadata schema; administrative, preservation, structural, and/or technical metadata standards; local and user created metadata, and linked data. In addition, METS and PREMIS were highly desirable.

The new system should support, among others:
  • Ability to create modules/plugins/widgets/APIs, etc.  
  • Support DOIs and ORCIDs
Preservation features and functionality included the ability to:
  • generate checksum values for ingested digital assets.
  • perform fixity verification for ingested digital assets.
  • assign unique identifiers for each AIP
  • support PREMIS or local preservation metadata schema.
  • produce AIPs.
  • integrate with other digital preservation tools.
  • synchronize content with other storage systems (including off site locations).
  • support multiple copies of the repository — including dark and light (open and closed) instances.
The survey suggests that "many information professionals are focused on creating a mechanism to ensure the integrity of digital objects." Other curatorial actions were viewed as important, but some "inconclusive results lend further support to claims of a disconnect between digital preservation theory and daily practices". About two-thirds were moving to open source repositories, while one fifth were moving to proprietary systems.

Monday, September 21, 2015

Archiving a digital history

Archiving a digital history: Preserving Penn State’s heritage one link at a time. Katie Jacobs Bohn. Penn State News. September 18, 2015.
     Archivists need a way to preserve digital artifacts so future historians have access to them. This includes content on the internet that can disappear in a short time. Archive-It is a service Penn State archivists are using to make copies of Web pages and arrange them in collections. They want to digitally preserve their cultural heritage, including the University’s academic and administrative information that is published on the Web.

“Web archiving is important because so much of Penn State’s media is ‘born digital,’ or in other words, there’s never a physical copy."  “But we still need a way to keep and preserve this material so it’s not lost forever.” Some quotes from the article:
  • Preservation requires more than just backup. 
  • "Technology is constantly evolving, and it’s hard to know what digital archiving will look like 50 years from now, let alone hundreds."
  • “In the right environment, paper will last hundreds of years, but digital information has a lot of dependencies. To be able to access digital files in the future, you may need a certain kind of hardware and operating system, a compatible version of the software to open the file, not to mention electricity.” 
  • “A lot of digital preservation work involves mitigating the risks associated with these dependencies. For example, trying to use open file formats so you don’t need specific software programs that may no longer be around to access them.”
Regardless of what they are trying to preserve, archivists have difficulties with trying to manage the ephemeral nature of culture and history.

Thursday, September 17, 2015

Tracing the contours of digital transformation, Part One

Tracing the contours of digital transformation, Part One. September 11, 2015.
     Interesting article about changing our institutions to encompass digital technology.  Some quotes and notes:
  • “digital” transformation is, at its most fundamental level, not about digital technologies, but about people, mindsets, relationships, and things. 
  • transforming our processes will deliver transformed products more effectively
  • Delivering innovative (and even revolutionary) experiences is a lot easier to do from a position of knowing what you are (and aren’t) about. 
  • there’s still plenty of work to be done to thoughtfully tackle the big issue of digital transformation and become a postdigital institution, "one that has normalized and internalized digital technologies to an extent that they permeate the whole institution and how the institution works".

Enduring Access to Rich Media Content: Understanding Use and Usability Requirements

Enduring Access to Rich Media Content: Understanding Use and Usability Requirements. Madeleine Casad, Oya Y. Rieger and Desiree Alexander. D-Lib Magazine. September 2015.
     Media art has been around for 40 years and presents serious preservation challenges and obsolescence risks, such as being stored on fragile media. Currently there are no archival best practices for these materials.
  • Interactive digital assets are far more complex to preserve and manage than regular files. 
  • A single interactive work may contain many media files with different types, formats, applications and operating systems. If any one of these fails, the entire presentation may be unviewable.
  • Even a minor problem can compromise an artwork's "meaning." 
  • Migrating information files to another storage medium is not enough to preserve their most important cultural content. 
  • Emulation is not always an ideal access strategy since it can introduce additional rendering problems and change the original experience.
The article surveyed media art researchers, curators, and artists in order to better understand the "relative importance of the artworks' most important characteristics for different kinds of media archives patrons." Some of the problems mentioned were the lack of documentation and metadata, discovery and access, and technical support, as well as problems with vanishing webpages, link rot, and poor indexing. 

Artists are concerned about the longevity of their creative work; it can be difficult selling works that may become obsolete within a year.  Curators of new media art may not include born-digital interactive media in their holdings because they are too complex or unsustainable. Some preservation strategies rely on migration, metadata creation, maintaining a media preservation lab, providing climate controlled storage, and collecting documentation from the artists. 

They are also concerned about "authenticity" in a cultural rather than technical sense. InterPARES defines an authentic record as "a record that is what it purports to be and is free from tampering or corruption". With digital art this becomes more difficult to do, since restoring ephemeral, technological or experiential artwork may alter its original form in ways that can affect its meaning. Authenticity may be more of "a sense that the archiving institution has made a good-faith commitment to ensuring that the artist's creative vision has been respected, and providing necessary context of interpretation for understanding that vision—and any unavoidable deviations from it".

Curators need to work with artists to ensure that artworks' most significant properties and interpretive contexts were preserved and documented.  This is more than ensuring bit-level fixity checks or technically accurate renderings of an artwork's contents. The key to digital media preservation is variability, not fixity; finding ways to capture the experience so future generations will get a glimpse of how early digital artworks were created, experienced, and interpreted.

Therefore, diligent curation practices are more essential than ever in order to identify unique or exemplary works, project future use, assess loss risks, and implement cost-efficient strategies.

AWS Storage Update: New Lower Cost S3 Storage Option & Glacier Price Reduction

AWS Storage Update: New Lower Cost S3 Storage Option & Glacier Price Reduction. Jeff Barr. Amazon. 16 September 2015.
      Changes in Amazon pricing and storage options:  Amazon is adding a new storage class for data that is accessed infrequently, the S3 Standard – Infrequent Access, along with Standard and Glacier. This has all of the existing S3 security and access management, data life-cycle policies, cross-region replication, and event notifications features.

Prices for Standard – IA start at $0.0125 / gigabyte / month with a 30 day minimum storage duration and a $0.01 / gigabyte charge for retrieval (plus transfer and request charges). Objects smaller than 128 kilobytes are charged for 128 kilobytes of storage. Data life-cycle policies can be defined to move data between Amazon S3 storage classes over time.
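The figures above can be turned into a quick cost sketch. This uses the 2015 prices quoted in the article; real AWS billing has more components and varies by region:

```python
GB = 1024 ** 3
STORAGE_PER_GB_MONTH = 0.0125    # USD, S3 Standard - IA (article's 2015 figure)
RETRIEVAL_PER_GB = 0.01          # USD per GB retrieved
MIN_BILLABLE = 128 * 1024        # objects under 128 KB are billed as 128 KB

def monthly_storage_cost(object_sizes):
    """One month of Standard - IA storage, rounding small objects up to 128 KB."""
    billable = sum(max(size, MIN_BILLABLE) for size in object_sizes)
    return billable / GB * STORAGE_PER_GB_MONTH

def retrieval_cost(bytes_retrieved):
    """Per-GB retrieval charge (transfer and request fees not included)."""
    return bytes_retrieved / GB * RETRIEVAL_PER_GB
```

The 128 KB minimum matters for archives of many tiny files: a million 4 KB objects are billed as roughly 122 GB of storage rather than roughly 4 GB.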

Also, the price of Amazon Glacier storage has decreased by up to 36%, based on the region, which is available for as little as $0.007/GB per month.

Challenges facing African music archives

Challenges facing African music archives. Diane Thram. Music In Africa. Sep 15, 2015.
     Music archives in Africa have cultural heritage collections of recordings and musical instruments that  need conservation, dissemination and return to their original communities. Current issues for music heritage archives include sustainability, archiving practice, digital preservation, and internet access. The role of music archives is still emerging. These issues are challenges that few music archives in Africa have been able to meet, largely due to lack of funding and expertise.

The issues of sustainability and the need for funding are perpetual issues faced by most music archives. Creation and publication of audio-visual and print materials, research projects and publications from research, outreach and education, and repatriation projects should all also be part of on-going operations.

Wednesday, September 16, 2015

Extending OAI-PMH over structured P2P networks for digital preservation

Extending OAI-PMH over structured P2P networks for digital preservation. Everton F. R. Seára, et al. International Journal on Digital Libraries. July 2012.  [PDF]
    An older but interesting article about OAI-PMH and peer-to-peer networks for preservation. OAI-PMH is a protocol for harvesting metadata from repositories that store digital objects with structured metadata. Since only metadata is shared, the digital objects themselves are not preserved and can easily be lost. One approach to digital preservation of content is to replicate the information across multiple storage repositories, creating a long-term preservation environment that ensures the reliability and availability of the objects. The article proposes an OAI-PMH extension that works over a distributed system, replicating digital objects between nodes in a distributed P2P archiving system.
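For readers unfamiliar with the protocol, a harvester issues HTTP requests such as `?verb=ListRecords&metadataPrefix=oai_dc` and parses the XML response. Here is a minimal sketch of that parsing step using only the standard library; the sample response and repository identifiers are illustrative, not from a real repository.

```python
# Sketch: parsing an OAI-PMH ListRecords response with the standard library.
# The sample response below is illustrative. A real harvester would fetch
# pages over HTTP and follow resumptionToken elements for large result sets.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item/1</identifier>
        <datestamp>2012-07-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:item/2</identifier>
        <datestamp>2012-07-02</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [(h.find(OAI + "identifier").text, h.find(OAI + "datestamp").text)
            for h in root.iter(OAI + "header")]

print(parse_headers(SAMPLE_RESPONSE))
```

The point the article makes is visible even in this sketch: the response carries identifiers and metadata only, so losing the repository means losing the objects themselves, which is what replication across P2P nodes is meant to prevent.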

Important Win for Fair Use

Important Win for Fair Use in ‘Dancing Baby’ Lawsuit. Electronic Frontier Foundation. September 14, 2015.
     A federal court affirmed in a ruling about Lenz v. Universal, often called the “dancing baby” lawsuit, that copyright holders must consider fair use of material before sending a copyright takedown notice. The United States Court of Appeals for the Ninth Circuit ruled that copyright holders must consider fair use before trying to remove content from the Internet. It also rejected the claim that "a victim of takedown abuse cannot vindicate her rights if she cannot show actual monetary loss."
  • “Today’s ruling sends a strong message that copyright law does not authorize thoughtless censorship of lawful speech.”
  • We’re pleased that the court recognized that ignoring fair use rights makes content holders liable for damages.”

Tuesday, September 15, 2015

UK Government: What we’re doing on open standards

What we’re doing on open standards. Government Technology Team. 7 September 2015.
     The UK government technology team has been selecting open standards to help government to adapt to changing needs and technologies. "Open Document Format (ODF) is an important standard - by making documents in a format that is open to all, we are ensuring that there are no barriers or bias when we provide services to the public."  The underlying principle is that people should not have to buy new equipment or software in order to read an official document. The technology team looked at the formats that should be used for government documents and chose the Open Document Format. Since then:
  • all of central government has committed to moving to the ODF 1.2 for their editable documents
  • most departments have published their implementation plans
  • the proportion of ODF documents on GOV.UK is increasing steadily
  • software suppliers are providing better support for open formats in their products
A guidance manual has been created to help the government departments as they move to ODF. The manual includes topics such as:
  • Introduction to Open Document Format (ODF)  
  • Procure ODF solutions
  • Validators and compliance testing 
  • Platforms and devices 
  • Accessibility, Privacy and security 
  • Best practices and other information
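One practical consequence of the ODF choice is that the format is easy to verify: per the ODF specification, an ODF document is a ZIP container whose `mimetype` entry declares the document type. The sketch below checks that declaration; it is a minimal pre-check, not a substitute for the validators and compliance testing the manual covers, and the file name is hypothetical.

```python
# Sketch: a minimal check of an ODF file's declared type. Per the ODF
# specification, an ODF document is a ZIP container whose "mimetype" entry
# holds the document's MIME type (an ODF text document declares
# "application/vnd.oasis.opendocument.text"). Full validators check far more.
import zipfile

def odf_mimetype(path):
    """Return the mimetype declared inside an ODF container, or None."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.read("mimetype").decode("ascii")
    except (zipfile.BadZipFile, KeyError):
        return None

# Build a tiny stand-in container to demonstrate the check:
with zipfile.ZipFile("demo.odt", "w") as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")

print(odf_mimetype("demo.odt"))  # application/vnd.oasis.opendocument.text
```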

Oklahoma State University Selects Ex Libris Rosetta for Digital Preservation

Oklahoma State University Chooses Ex Libris Rosetta for Digital Preservation. Press Release. Ex Libris. September 15, 2015.
     Oklahoma State University has adopted the Rosetta digital-asset management and preservation solution. "Rosetta encompasses the entire workflow for managing and preserving digital assets, including their validation, ingest, storage, preservation and delivery. The Rosetta solution handles institutional documents, research output in digital formats, digital images, websites and other digitally-born and digitized materials." The first Rosetta preservation project will be the Oklahoma Oral History Research Program.

The Dean has prioritized the creation of digital content at the OSU Libraries and recognizes the need for long-term preservation and management of these collections. "Rosetta will enable us to develop a sustainable digital preservation program. After evaluating several commercial digital preservation systems, we found Rosetta has the capabilities that we were seeking."

EPUB file validator and guidelines

Epubcheck 4.0.0 Available for Download. EPUBZone website. September 8, 2015.
     The latest version of Epubcheck is now available on Github.  This open source tool validates EPUB documents and makes sure they conform to the latest specifications. It is also used to provide validation information at the online idpf validator website and the iBooks Store.

The iBooks site also has support resources for resolving errors, along with tips about the EPUB namespace and adding alt text to images. Other EPUB guidelines can be found at: EPUB 3 Accessibility Guidelines.

Monday, September 14, 2015

MediaTrace: A Comprehensive Architecture Report for AudioVisual Data

Announcing MediaTrace: A Comprehensive Architecture Report for AudioVisual Data. website. September 13, 2015.
     MediaTrace is a new reporting tool that documents the structure and contents of digital files, particularly audiovisual data. While the complementary MediaInfo tool summarizes a file's significant characteristics, MediaTrace provides comprehensive documentation of the file information in an XML format, with an XML Schema and a Data Dictionary.

MediaTrace XML
  • itemizes and describes the parts of a digital file in a comprehensive file index
  • documents elemental contents such as text strings, short binary values, numbers, and dates
  • documents the size of each part of a media file
  • focuses on comprehensively documenting the file structure as a whole
The MediaTrace Schema contains the elements:
  • MediaTrace: provides the root level element of the document
  • block: documents a structural piece or elemental component of a digital file's bitstream data
  • data: documents the lowest-level and most granular aspect of the file's contents
MediaTrace has been developed by MediaArea in collaboration with the Museum of Modern Art. MediaTrace is also developed as part of MediaConch, a PREFORMA project.
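The element structure described above (a MediaTrace root containing nested block elements with low-level data leaves) can be sketched as follows. The sample XML is a hypothetical illustration, not actual MediaTrace output; real reports follow the published MediaTrace XML Schema.

```python
# Sketch: walking a MediaTrace-style report, where nested "block" elements
# describe structural pieces of a file and "data" elements hold the lowest-
# level values. The sample XML is hypothetical, not real MediaTrace output.
import xml.etree.ElementTree as ET

SAMPLE = """<MediaTrace>
  <block name="RIFF" offset="0" size="1024">
    <block name="fmt ">
      <data name="SamplingRate">44100</data>
      <data name="Channels">2</data>
    </block>
  </block>
</MediaTrace>"""

def walk(element, depth=0):
    """Print each block and data element with indentation showing nesting."""
    for child in element:
        label = child.get("name", "")
        if child.tag == "data":
            print("  " * depth + f"data {label} = {child.text}")
        else:
            print("  " * depth + f"block {label}")
            walk(child, depth + 1)

walk(ET.fromstring(SAMPLE))
```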

Testing Old Tapes For Playability

Testing Old Tapes For Playability. Katharine Gammon. Chemical & Engineering News. September 8, 2015.
     Many audio recordings, a large part of the world’s cultural history, are in danger of degrading and being lost forever. A new infrared spectroscopy technique offers a noninvasive way to quickly separate magnetic tapes that can still be played from those that can’t. This could help archivists decide  which tapes need special handling before they get any worse.
The article refers to a published paper:

Minimally Invasive Identification of Degraded Polyester-Urethane Magnetic Tape Using Attenuated Total Reflection Fourier Transform Infrared Spectroscopy and Multivariate Statistics. Brianna  M.  Cassidy, et al. American Chemical Society. August 26, 2015. [PDF]
     The Cultural Heritage Index estimates there are 46 million magnetic tapes (VHS, cassette, and others) in museums and archives in the U.S. and about 40% of them are of unknown quality. Many of these tapes are reaching the end of their playable lifetime and there is not enough equipment to digitize all of them before the world loses them. Heat and humidity increase the tape degradation. The project was to "develop an easy, noninvasive method to identify the tapes that are in the most danger, so that they can be prioritized for digitization.”

George Blood said “It’s definitely a race against time, and in around 20 years we won’t be able to play back anything.” The availability of modeling tools for identifying degraded tapes will increase efficiency in digitization and improve preservation efforts for magnetic tape.

Digital Preservation the Hard Way: I may have deleted the Electronic Theses community

Digital Preservation the Hard Way. Hardy Pottinger. University of Missouri Library. Open Repositories 2015 Poster Session. June 9, 2015. [PDF]
An interesting poster at OR2015 that examines how a set of DSpace collections went missing and what was done to restore them. Some interesting quotes from the poster and the abstract:

"This is not a story of how we used this tool set. This is a story of how we recovered from an accidental deletion of a significant number of items, collections, and communities--an entire campus's ETDs: 315 missing items, 878 missing bitstreams, 1.4GB of data, 7 missing communities, 11 missing collections--using a database snapshot and a tape backup. The SQL we developed to facilitate this restoration may be helpful, but it is our hope that in comparison, the effort required to implement a proper backup and preservation safeguard, such as DuraCloud and/or the Replication Task Suite, will rightly seem more appealing. In other words: here's how to do it the wrong way, but you'd really be better off doing things the right way.

“I may have deleted the Electronic Theses community. Is there any way to un-delete it?”
Seven missing communities, 11 missing collections, 315 missing items (containing 878 bitstreams). Only 1.4 GB of data. Born-digital data. This is a story of survival. Of that data, metadata, and everyone responsible for it.

“Disaster Recovery is not Digital Preservation.”
If you are a developer reading this, thinking about ways your institution could improve its backup strategy, I've got bad news for you. A backup strategy is not digital preservation.
     If you are the only person at your institution thinking of ways to improve your backup strategy, your first job, before you even start to change your backup strategy, is to find other people to work with you on digital preservation. Repository development is a difficult enough job, you do not need to also assume responsibility for ensuring the data you are storing is valid, usable and backed up.
     Digital preservation is a full time job. If you are not giving it your full attention, if you are just backing up your assetstore and database, you have already accepted the responsibility of digital preservation, you are just not doing the job.
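The restore-from-a-database-snapshot pattern the poster describes can be sketched as follows. This uses SQLite and a deliberately simplified, hypothetical `item` table; the actual recovery used SQL against DSpace's own (PostgreSQL) schema, which is far more involved.

```python
# Sketch of restoring accidentally deleted rows from a database snapshot,
# using SQLite and a simplified hypothetical "item" table (the real incident
# involved DSpace's own schema and a tape backup for the bitstreams).
import os
import sqlite3

# Hypothetical nightly snapshot, containing the later-deleted row:
if os.path.exists("snapshot.db"):
    os.remove("snapshot.db")
snap = sqlite3.connect("snapshot.db")
snap.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, title TEXT)")
snap.executemany("INSERT INTO item VALUES (?, ?)",
                 [(1, "Surviving thesis"), (2, "Deleted thesis")])
snap.commit()
snap.close()

# Live database, missing the accidentally deleted row:
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, title TEXT)")
live.execute("INSERT INTO item VALUES (1, 'Surviving thesis')")

# Attach the snapshot and copy back only the rows missing from live:
live.execute("ATTACH DATABASE 'snapshot.db' AS snap")
live.execute("""INSERT INTO main.item
                SELECT item_id, title FROM snap.item
                WHERE item_id NOT IN (SELECT item_id FROM main.item)""")
restored = [r[1] for r in live.execute("SELECT * FROM item ORDER BY item_id")]
print(restored)  # ['Surviving thesis', 'Deleted thesis']
```

Note how much this depends on the snapshot existing and being recent, which is exactly the poster's point: a lucky recovery is not a preservation strategy.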

Saturday, September 12, 2015

The Trouble With Digitizing History

The Trouble With Digitizing History. Tina Amirtha.  Fast Company. September 11, 2015.
     Sound and Vision and two other national institutions finished digitizing the Netherlands’ audiovisual archives last year at a cost of $202 million over seven years. The project digitized 138,932 hours of film and video, 310,566 hours of audio, and 2,418,872 photos. Of these, only 2.3% of the digitized archive is publicly available online. Schools and researchers are allowed to access 15% of the archive through the website, while copyright restrictions cover the rest.
  • "It doesn’t make sense to digitize everything. "You have ask yourself, ‘Who are you doing this for?’" Researchers may be interested in a narrow set of media, while the public may prefer a skim of the archives.
  • "Honestly, only a little bit of the funding should go towards digitization and the rest, towards digital preservation".
  • Digitization on its own won’t bring memory institutions into modernity, but the innovation will come from refining the methods used for preserving the digital files.
  • "Any new technology that better preserves and increases public access to these audio and video materials should aim to fulfill the greater mission of any national audiovisual archive: to be the "media memory" of the country."
  • "Collecting everything in one place online, it’s a very linear way of thinking."
The archivist sees an archive’s role in society changing by decentralizing the selection decisions and creating direct relationships with the creative community. "Sound and Vision has eliminated all of its curators and now trusts the community to curate its media memory. Archives need to be at the start of the creative process".

Regardless of which online content platform is used to host creative media, the important thing is for the library to continue professionally archiving today's digital recordings.

"There is no one single place that can serve the world’s creative output. The more we can collaborate, nationally and internationally, the more successful we’re going to be,"

Friday, September 11, 2015

Testing a Permanent Digital Storage Archive – Part 2: OAS and Rosetta

Testing Permanent Digital Storage Archive – Part 2: OAS and Rosetta. Chris Erickson. September 10, 2015.
     The Optical Archive System from Hitachi LG Data Storage fits in a server rack and can contain 10 units, called libraries. Each library unit contains 100TB of data storage on 500 long-term optical discs. More information at Rosetta Users Group 2015: New Sources and Storage Options For Rosetta (slides 13 – 16) or this YouTube video.

Connecting the OAS and Rosetta Systems:
Once the optical archive was installed in our Library, it was connected to our Rosetta system, which was very easy to do and only took a couple of minutes. In the Rosetta administrative module I created a new File storage group with the OAS path and the storage capacity. The IE and Metadata storage groups were left as they were, directed to our library server; the files in those groups are much smaller and accessed more often than the content files. I then added a new storage rule so Rosetta could determine whether to write files to our library server, to our Amazon storage account, or to the OAS.

Write functionality:
When the data is written to the optical discs, a fixity check is done to ensure that the file is 100% accurate. Once the file is written to the optical disc, the data is permanent. Even if the system were to go away, the discs could still be read on any Blu-ray device. I ingested a couple hundred GBs into Rosetta, which were then written out to the OAS discs. (Overall I added over 4 TB of data.) We never encountered any difficulties with writing data to the OAS. We did try to disrupt or corrupt the writing process to see if we could get it to fail or write bad data, but even our systems engineer with root access was unable to affect the data in any way.
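The write-then-verify fixity check described above can be sketched as: hash the source file, write the copy, read the copy back, and compare digests. The file names are hypothetical, and MD5 is used here only as an example checksum algorithm; the OAS's internal verification is its own implementation.

```python
# Sketch of a write-then-verify fixity check: hash the source, copy it,
# hash the stored copy, and compare. File names are hypothetical; MD5 is
# just an example digest algorithm.
import hashlib
import shutil

def file_digest(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large masters don't load into RAM."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def copy_with_verification(src, dst):
    """Copy src to dst and confirm the stored copy is bit-identical."""
    before = file_digest(src)
    shutil.copyfile(src, dst)
    after = file_digest(dst)
    if before != after:
        raise IOError(f"fixity mismatch writing {dst}")
    return after

# Demonstrate with a small stand-in "master" file:
with open("master.tif", "wb") as f:
    f.write(b"example master file contents")
print(copy_with_verification("master.tif", "master_copy.tif"))
```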

Normally our test Rosetta system is configured for only a small number of files, so there is limited processing space, about 45GB. (Our live production Rosetta system has 2 TB of processing space.) Because of the limited processing space on the test server, I could not run an unrestricted ingest without filling up the disk space. So I ingested a limited number of items at a time and cleared the processing space before ingesting more. The chart below shows the ingest amounts for two of the afternoons when the ingest processes were run; each run took about 5 hours. (An unrestricted ingest would likely result in at least four times as many items per day.)

Read functionality:
This is an optical device, so I did not know if Rosetta would be able to read the discs. And since it is an optical device the OAS has to locate the correct disc and load the disc in a drive to retrieve the data (there are 12 read / write drives for each library). The retrieval process can take up to 90 seconds. Our Rosetta system is used as a dark archive, so the retrieval time was not a problem. The question was whether or not Rosetta would wait while the file was being retrieved or if it would time out. From the first request, the OAS read functionality worked flawlessly. Rosetta worked well with the retrieval / access time while the disc was retrieved and the file read. Once the disc was in the drive, access for any other files on the disc was about as fast as if it were on spinning disc.

Here is a chart of access times for one of the groups that I checked, listing each item's size and access time. The sixteen items included genealogical articles and family histories, Jackson collection images, and Taj Mahal photographs.
From the access time column it is obvious when a new disc is retrieved, as the time is over 60 seconds. Once the disc has been loaded, the access time for subsequent files is much lower. These access times are for the master files, which can be quite large.

The setup process, writing and reading all went extremely well. The next step was to run an automated fixity check on the OAS files from within Rosetta.
(Updated to clarify and answer questions.)