FREE YOUR WORKFLOWS
Ashley Blewer, NYPL && Dinah Handel, CUNY TV
open sourcing audio visual archiving and preservation workflows, software, and file formats
[Introductions] Hi I'm Ashley this is Dinah, blahblahblah, Dinah is going to start off this talk by talking about open sourcing audio visual archiving and preservation workflows, software, and file formats in THEORY.
THEORY
There's a few assumptions that we need to make about library and archive work.
Librarians and archivists love good documentation.
First, librarians and archivists try to create, and think it is important to have, clear, concise, good documentation of workflows. A large part of many librarians' and archivists' work is to figure out how something is done (how materials are accessioned, processed, or digitized, for example), and translate those steps, what we consider a "workflow," into a piece of documentation.
Librarians and archivists care about making information accessible.
The second assumption is that librarians and archivists care about making information accessible to others. This is sort of a fundamental principle of librarianship, and although it certainly becomes more complicated in practice, we typically describe the library and archive work that we do in service of access.
These two concepts often are not paired together.
Yet, often, these two things don’t go hand in hand with each other. We’re more concerned, and rightly so, with making the materials we have in our collections accessible to patrons, rather than sharing how we made the material accessible. Occasionally, we might make presentations, write blog posts or journal articles, or even make public versions of internal workflows available on our library websites. But overall there isn't an infrastructure or expectation in place for sharing the nitty gritty details of library and archival workflows.
Why the gap between theory and practice?
"... but it’s not our fault!!!" We all want to do the best thing, but some things get in the way.
Time & Money
Time and money, like with many things, play a huge role in our ability to turn project dreams into project reality.
Writing the docs
(but, like, actually writing them though)
Writing documentation is hard! It’s a real skill to bridge the gap between a code base and human comprehension.
Scary
It’s also scary to put things out there on the internet. There’s a shared sense of imposter syndrome, regardless of the effort made or size of the institution, that the code will be “not good enough.” Not enough tests, the code “isn’t clean enough,” or not enough documentation.
Feedback
It can also be hard to deal with feedback. Of course dealing with negative feedback is hard, but even having to spend time answering questions or clarifying or helping other people use your codebase takes time that is likely not allocated and dedicated to that cause.
A/V is hard!
On top of all of this, a/v is *really* hard compared to text or images, especially when someone is more of a generalist. Codecs, wrappers, format migration, the differences between digital and analog, or even trying to answer *what video is* can be challenging. From an archival standpoint, quick initial accessioning just isn’t a reality because analog video and some digital video can’t even be played back prior to migration. And once analog materials are migrated to a stable digital format, the file sizes are huge. Their care is made more burdensome by a diverse range of formats, and there’s no one generally accepted archival-grade format. Just a lot of opinions. And no single format is gonna work for every institution. A small institution's solution is probably not going to be the same as the Library of Congress's solution (for example).
A/V is hard!
Historically, a/v has had to deal with proprietary formats and software more so than other forms of media found in libraries, likely due to the complexity and large file sizes of a/v materials. The processing power of the average consumer computer has only recently caught up with the needs of processing a/v materials, so standard video cards are now sufficient for processing archival-grade uncompressed digitized film and video. So we can now do in-house what we used to have to rely on vendors and companies to do for us.
open workflows
To contextualize the rest of our talk, it might be good to lay down some definitions, so that we’re all on the same page when we’re talking about these technologies. Each of the terms or concepts that I’ll talk about plays an important role in open sourcing workflows.
open source
First, what do we mean by open source? The Open Source Initiative defines open source software as “software that can be freely used, changed, and shared (in modified and unmodified form) by anyone.”
open access
What we mean by open access is unrestricted access to information. Open access is commonly used in the world of academic publishing to denote a work that is freely available to read without a subscription to journals or databases. The same applies to code: people have to be able to find and download the source code for free.
open file formats
Open file formats are a third significant component. A file format defines the structure and type of data that is stored in a file. An open file format, then, is one with an accessible, published specification for the structure and storage of data in a file. Open file formats are usually maintained by a standards organization. An open file format is platform independent, meaning that it can be used across software and operating systems. Using an open file format with publicly available specifications is necessary in digital preservation, as it allows us to continue to render digital objects as operating systems and software change, which is crucial in combating obsolescence. In particular, audio visual files contain a lot of information about themselves, and when the format is proprietary, that information is harder to interpret, which makes future playback difficult.
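To make that concrete, here is a minimal sketch of what a published specification makes possible. The byte signatures are the publicly documented ones for Matroska (its EBML header), FLAC, and RIFF; the function names are our own illustration and not part of any tool mentioned in this talk.

```python
# Signatures documented in the published specifications of each format.
SIGNATURES = [
    (b"\x1a\x45\xdf\xa3", "Matroska/WebM (EBML header)"),
    (b"fLaC", "FLAC"),
    (b"RIFF", "RIFF container (e.g. WAVE or AVI)"),
]


def identify_format(header: bytes) -> str:
    """Return a format name based on the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first bytes of a file on disk and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

Because the signatures are public, anyone can write (and check) a reader like this, which is much harder when the format is undocumented.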
microservices
A micro-services framework breaks down extensive, multi-step processes into distinct pieces. Each micro service accomplishes one discrete task. I like to think about micro-services as modular code that can be combined in numerous ways depending on the desired outcome.
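As a rough sketch of that idea (the service names and the record fields here are hypothetical, invented only for illustration), each function below does one discrete task, and a workflow is just an ordered list of them:

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Service = Callable[[Record], Record]


def name_access_copy(record: Record) -> Record:
    """One task: derive an access-copy filename from the source filename."""
    stem = record["source"].rsplit(".", 1)[0]
    return {**record, "access_copy": stem + "_access.mp4"}


def name_metadata_report(record: Record) -> Record:
    """One task: derive a metadata-report filename from the source filename."""
    stem = record["source"].rsplit(".", 1)[0]
    return {**record, "report": stem + "_metadata.xml"}


def run_workflow(record: Record, services: List[Service]) -> Record:
    """Apply each service in order; swapping one out doesn't touch the others."""
    for service in services:
        record = service(record)
    return record
```

Calling `run_workflow({"source": "tape01.mov"}, [name_access_copy, name_metadata_report])` runs both steps, and a different institution could pass a different list to get a different workflow.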
#goals
Earlier I talked about the fears of working in the open, but here are some of the benefits of working in the open.
many hands make light work (and less stress)
When thinking again about the gap between theory and practice, a lot of this weighs more heavily if the codebase burden is just on one person or institution (or one person AT one institution), and can be alleviated when a project exists collaboratively.
staying alive
Staying alive. A project is more likely to stay alive and thriving when many people care about it. As librarians, I think we’re familiar with vendor lock-in, where a vendor keeps an institution locked into one system by preventing free access to its own data. But with our software, we should also be aware of institution lock-in. How can we create things that can be extended to other institutions?
Sharing vulnerability
Sharing vulnerability makes the project stronger. Knowing that your project is out there pushes you to make it stronger, and having more people using your project will help you discover bugs and make it better.
Global impact
Global impact -- not restricted to one geographical location. We are all connected, let's connect in ways that benefit all of us.
Tangibility
Having open workflows can make what are usually described as abstract processes more tangible.
Microservices (at institutional level)
To bring this around to practice, implementation can be translated across institutions.
At NYPL, on my team, we really strive for a microservices approach to development, and I'm speaking about microservices in a larger context than our primary definition in this talk. We want to have the ability to switch out any one part of a project for something better that comes along, and not have everything tangled up in one overly large, bloated application. (I know y'all know what I'm talking about) And we do this by having clear end-points to send data from one place to another. But what better way to know that one software component of your overall workflow can be swapped out in the future than if it can also easily be swapped into another institution's code base?
General tools
What are some general tools that can help facilitate this process of open sourcing workflows?
git-init
I think one of the most promising tools is git, either from the command line or through some graphical user interface. I work with GitHub a lot, and I see it as a possible centralized tool for hosting workflow documentation from different institutions. Sometimes, I think something like GitHub gets seen by archivists as irrelevant to the work that they do because they don’t write code, but the GitHub website doesn’t only have to be a repository for code. It could also contain documentation of workflows, such as PDFs or readmes in markdown that outline an institution’s workflow and specifications. A benefit of using GitHub for workflow and policy documentation is version control: it allows the public to see how and why a workflow or policy has changed, which provides greater transparency into archival policy decisions and labor. Some libraries and archives already do this, and it's awesome to see.
Neatrour, Anna and Woolcott, Liz. (2015, November 24). Library Workflow Exchange: Sharing Library Innovation [blog post]. Retrieved from https://www.diglib.org/archives/10844/
Library Workflow Exchange is not really a tool, but it is a space that I find exciting with regard to sharing knowledge about library and archival workflows. The two librarians who started Library Workflow Exchange, Liz Woolcott and Anna Neatrour, share the same dream that I do, quote, "a magical database that would allow us to find out what other libraries are doing to automate workloads," and so they started the Library Workflow Exchange. There are options to self-submit your workflow for institutions that don't have the ability to host on institutional websites, and the site, quote, "pulls workflows from websites, blogs, conference presentations, Github, and a host of other places".
Dinah's projects
Now I'm going to talk about two pieces of software that are examples of what we mean by open-sourcing audio visual workflows.
media microservices
Media microservices are a set of open source microservice scripts that I’ve been working on as part of my time as a National Digital Stewardship Resident at CUNY Television. Media microservices are written in bash and perform much of the labor of processing digital media, so they essentially are the archiving and preservation workflow, and much of our documentation of how they work or what they do is stored in the comments within the code.
As Ashley noted earlier, what’s useful about implementing a microservices approach is that we can make modifications to individual processes without having to overhaul the entire workflow, and add in new functionality when needed. With microservices, we also aren’t restricted to one workflow imposed by a software system, which makes it easier to adapt as technology changes. This is especially important with AV materials, as digital preservation software doesn't always have the functionality, or isn't optimized, to deal with complex and large audio visual files. While media microservices are developed for our institutional needs, they also work as individual services. The gif I'm showing here is of the makeyoutube microservice, which transcodes one or multiple inputs according to the specifications for upload to YouTube. If an institution wants to use all of the media microservices, they can, as they come with a configuration file that can be set up to process and deliver based on local needs, and there are comprehensive instructions as part of the readme.
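To give a feel for what a single transcoding service does, here is a minimal sketch in Python. It is not the makeyoutube script itself (that one is written in bash), and the function name and the encoding choices (H.264 video, AAC audio, `yuv420p`, `+faststart`) are our own assumptions for a generic web-friendly MP4. It only builds an FFmpeg command, one per input:

```python
from pathlib import Path
from typing import List


def build_access_command(source: str, output_dir: str) -> List[str]:
    """Build (but do not run) an ffmpeg command for one web-friendly MP4."""
    output = Path(output_dir) / (Path(source).stem + ".mp4")
    return [
        "ffmpeg",
        "-i", source,               # input file
        "-c:v", "libx264",          # H.264 video
        "-pix_fmt", "yuv420p",      # widely supported pixel format
        "-c:a", "aac",              # AAC audio
        "-movflags", "+faststart",  # move index to the front for streaming
        str(output),
    ]


def build_batch(sources: List[str], output_dir: str) -> List[List[str]]:
    """One command per input, so one or many files are handled the same way."""
    return [build_access_command(source, output_dir) for source in sources]
```

On a machine with FFmpeg installed, each command could be handed to `subprocess.run(command, check=True)`.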
vrecord
Another example of an open source AV workflow is vrecord, which is open source audio visual digitization software. vrecord was initially created at CUNY Television, following a hardware and software update that made the proprietary software Final Cut Pro unusable for digitization. vrecord is downloadable via GitHub and Homebrew, and has grown into a project that is worked on by many individuals at various institutions. vrecord uses the open source software FFmpeg and works with Blackmagic Design's open source software development kit. While vrecord doesn’t solve the problem of expensive and difficult to obtain hardware needed for digitization, it does make digitization more accessible.
Ashley's projects
I'm going to talk about a few projects that are microservices that are developed in the open, or supplemental non-programming-specific projects that are in the open.
QCTools is a project I work on coming from Bay Area Video Coalition.
It’s software that helps detect problems in digitized analog video. QCTools is for running quality control/analysis on these videos for errors, which is great for doing inspections on video after digitization or after coming back from a vendor. QCTools works for single files right now but has recently received grant funding and support from Indiana University to allow this to work as a microservice, and add a database and web server for batch-level processing and analysis.
MediaConch is another project that I work on. It’s for video file conformance checking: it verifies that your video files are what they say they are. This is funded by the European Commission, and the software is required to be open source.
MediaConch is built on MediaInfo, which is… I’d say the most-used media microservice among information professionals. MediaInfo gives you information about your files! But more importantly, it does it very quickly, and even works on partial files that are in transit. So it’s very easy to integrate MediaInfo into much larger projects (and this is what we intend to do with MediaConch too). So not only do you get information about your files, but you know if those files are happy and healthy and conforming to your institution's policy.
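Here is a minimal sketch of that kind of integration, assuming the `mediainfo` command line tool is installed and that its JSON output keeps its tracks under `media` and `track`, each with an `@type` field. The policy dictionary and the function names are hypothetical, and this is not how MediaConch itself stores or evaluates policies.

```python
import json
import subprocess
from typing import Dict, List

Track = Dict[str, str]


def read_tracks(path: str) -> List[Track]:
    """Run the mediainfo CLI and return its list of track dictionaries."""
    result = subprocess.run(
        ["mediainfo", "--Output=JSON", path],
        capture_output=True, text=True, check=True,
    )
    # Assumed layout: {"media": {"track": [{"@type": "General", ...}, ...]}}
    return json.loads(result.stdout)["media"]["track"]


def policy_failures(tracks: List[Track], policy: Dict[str, Dict[str, str]]) -> List[str]:
    """Compare reported track fields against a local policy; list mismatches."""
    failures = []
    for track_type, expected in policy.items():
        matching = [t for t in tracks if t.get("@type") == track_type]
        if not matching:
            failures.append(f"no {track_type} track found")
        for track in matching:
            for field, wanted in expected.items():
                actual = track.get(field)
                if actual != wanted:
                    failures.append(
                        f"{track_type} {field} is {actual!r}, expected {wanted!r}"
                    )
    return failures
```

An empty list from `policy_failures(read_tracks(path), policy)` means the file matched everything the policy asked for.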
An example of integration of these tools is within Artefactual’s Archivematica, which uses mediainfo and will soon be using mediaconch, as part of the suite of services that get files into digital long-term preservation-level storage.
MediaConch and Artefactual use Matroska, which is an open video format currently going through the standardization process via the Internet Engineering Task Force's CELLAR (Codec Encoding for LossLess Archiving and Realtime transmission) working group. This is to really ensure its longevity as a recommended digital preservation file format.
ffmprovisr is a good example of a small project that helps many people at a lot of different institutions, and has had work contributed by people at many institutions. It’s a platform for sharing very small scripts -- FFmpeg scripts. It's hosted on the AMIA Open Source committee's github page. There’s been a lot of collaboration in the Issues section and through pull requests, even with contributions from FFmpeg contributors. It's great to see so many people come together and share knowledge.
A/VAA is a wiki for sharing video playback problems where people can contribute their video playback errors, explain or describe problems, or figure out why their videos look weird.
... and the rest will follow ...
So free your workflows and the rest will follow!
In conclusion
Grow beyond your institution
More people, more perspectives
More contributions
Supportive environments
No reinventing the wheel
Opens up time for other things
Community
Shared goals
In conclusion... your projects can grow beyond your own institution. More people provide more perspectives. More contributions make projects better, and we can provide each other with supportive environments.
Additionally, when we don't have to spend time reinventing the wheel, it opens up time for us to do other things that are difficult and time consuming, like advocacy, writing documentation, or any other task that computers can't do. Furthermore, when we open source our workflows, we build a community of practitioners who can collaborate to make shared goals a reality.
Links!
These slides: github.com/ablwr/free_your_workflows
github.com/mediamicroservices/mm
github.com/amiaopensource/vrecord
libraryworkflowexchange.org
github.com/mediaarea/mediaconch
github.com/bavc/qctools
github.com/amiaopensource/ffmprovisr
avaa.bavc.org
Thanks!
@dericed, @ndsr, @nypl, FOSS contributors