The "gem" file format is a self-contained, standard way to package and distribute Ruby programs and libraries. It is used by RubyGems, the default package manager for Ruby. A gem is built from a ".gemspec" file, which declares the gem's version, its dependencies and the files to include. The source code is packaged into the gem along with this metadata. In this article, we will see how to extract the Ruby source code files from a packaged ".gem" file.
### Gem File Format
First, we need to understand the contents of a typical Ruby gem. A ".gem" file is just a standard POSIX tar archive. We can confirm this by running the Unix `file` command as follows:

    Asankhayas-MacBook-Pro:Downloads asankhaya$ file thor-0.19.1.gem
    thor-0.19.1.gem: POSIX tar archive

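The same check can be approximated in code. As a rough sketch (the class and method names are illustrative and not part of any library), a tar archive in the POSIX ustar layout carries the ASCII magic `ustar` at byte offset 257 of its first 512-byte header block, so inspecting those bytes is usually enough to tell whether a ".gem" file looks like a tar archive:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of RubyGems or Commons Compress.
final class TarMagic {
    // In the ustar layout the magic field starts at byte 257 of the first header block.
    private static final int MAGIC_OFFSET = 257;
    private static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes of a file carry the ustar magic. */
    static boolean looksLikeTar(byte[] header) {
        if (header.length < MAGIC_OFFSET + MAGIC.length) {
            return false; // too short to contain the magic field
        }
        byte[] candidate = Arrays.copyOfRange(header, MAGIC_OFFSET, MAGIC_OFFSET + MAGIC.length);
        return Arrays.equals(candidate, MAGIC);
    }
}
```

To use it, read the first 512 bytes of the gem file into a byte array and pass them to `looksLikeTar`. Very old tar variants without the magic field would not be detected by this sketch.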
Since it is a standard tar archive, we can rename the file from ".gem" to ".tar" and extract it. Doing so shows that each gem contains three gzip files: `checksums.yaml.gz`, `data.tar.gz` and `metadata.gz`. The `checksums.yaml` file contains the SHA1 and SHA512 hashes of the other two gzip files:

    ---
    SHA1:
      metadata.gz: 3a362ea0b9b3cf1f41649c522ddc312925cf1e47
      data.tar.gz: 9267cf56eb7c014270c8077b1ed4b2c95fcaa7ea
    SHA512:
      metadata.gz: c659d6a5020fa953ec51394089c5a49a4fae9afefc2190a088a8e481a6ac586fbe2c268a47204d66b19ca660468f1e3e87f4e5be9b509faf83fc3992b2b7eb42
      data.tar.gz: fac520f0a428f1cf3ba627b47285c04534e11633599d82c7b15596eb76c1e1762d66b32195e8519252c58a061ccb54e37e8bfa4ba98684c6b0efc8f208cd66f4

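These hashes can be used to check that the two archives were not corrupted or altered. The following is a minimal sketch using only `java.security.MessageDigest`; the class and method names are made up for illustration, and it assumes the caller has already read the raw bytes of `metadata.gz` or `data.tar.gz` into memory. Note that Java's standard algorithm names are `SHA-1` and `SHA-512`, whereas the YAML keys are spelled `SHA1` and `SHA512`.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of RubyGems.
final class GemChecksums {
    /** Computes the lowercase hex digest of the data, e.g. with "SHA-1" or "SHA-512". */
    static String hexDigest(String algorithm, byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Compares the computed digest with the value listed in checksums.yaml. */
    static boolean matches(String algorithm, byte[] data, String expectedHex) {
        return hexDigest(algorithm, data).equalsIgnoreCase(expectedHex.trim());
    }
}
```

For example, `GemChecksums.matches("SHA-512", dataTarGzBytes, expected)` would compare the compressed `data.tar.gz` bytes (here a hypothetical `dataTarGzBytes` array) against the corresponding entry above.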
The `metadata` file contains details about the gem such as its version, author, dependencies and list of files. The `data` gzip file is itself a tar archive which contains, among other things, the source code of the gem. Decompressing and extracting the archive shows the following folder structure:

    Asankhayas-MacBook-Pro:data asankhaya$ ls
    CHANGELOG.md    README.md    bin    spec
    LICENSE.md      Thorfile     lib    thor.gemspec

In this folder structure, the `lib` directory contains the Ruby source code files, while the `bin` directory holds the gem's executables (for a Ruby gem these are normally Ruby scripts rather than compiled binaries). Thus, to extract the source files from a packaged gem, we first extract the gem as a tar archive and then extract the `data.tar.gz` archive inside it; the source files can then be found in the `lib` folder under `data`. Now that we have a basic understanding of the contents of a gem file, let us see how we can process a ".gem" file and extract the Ruby source code using a Java program.
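The metadata file described above is the simplest piece to read from Java, since it is a single gzip-compressed text file and needs only the standard library. A minimal sketch, assuming the caller already has an `InputStream` positioned at the start of the `metadata.gz` data and that the text is UTF-8 (the class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of RubyGems.
final class GemMetadata {
    /** Decompresses a gzip stream (for example the metadata.gz data) into text. */
    static String gunzipToString(InputStream compressed) {
        try {
            // The gzip stream is deliberately not closed here, so that an
            // enclosing archive stream stays usable for the caller.
            GZIPInputStream gz = new GZIPInputStream(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string is the gem specification in YAML form, which can be inspected directly or handed to a YAML parser of your choice.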
### Extracting from the Gem File using Java
To process a ".tar" file in Java we can make use of the Apache Commons Compress library. The library provides `TarArchiveInputStream`, which can be used to read a tar archive and loop over the entries in it. Similarly, for the ".gz" file we can use `GZIPInputStream` from the `java.util.zip` package to decompress the data. The following code snippet shows how to use them to locate the Ruby source files.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.commons.compress.archivers.ArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

    // Treat the gem file as a tar archive; try-with-resources closes the file when done
    try (InputStream gemStream = new FileInputStream("thor-0.19.1.gem");
         TarArchiveInputStream tarGemStream = new TarArchiveInputStream(gemStream)) {
        ArchiveEntry gemEntry;
        while ((gemEntry = tarGemStream.getNextEntry()) != null) {
            if (gemEntry.getName().equals("data.tar.gz")) {
                // data.tar.gz is a gzip-compressed tar archive. It is read directly from
                // the outer stream, so the inner streams are not closed separately.
                GZIPInputStream gzStream = new GZIPInputStream(tarGemStream);
                TarArchiveInputStream dataTarStream = new TarArchiveInputStream(gzStream);
                ArchiveEntry sourceEntry;
                while ((sourceEntry = dataTarStream.getNextEntry()) != null) {
                    // Look for .rb files in the lib directory
                    String name = sourceEntry.getName();
                    if (name.startsWith("lib/") && name.endsWith(".rb")) {
                        // This is a Ruby source file which can be parsed using the JRuby Parser.
                    }
                }
            }
        }
    }

The only tricky bit to note here is that, while processing the `data.tar.gz` file, we need to first wrap the stream in a `GZIPInputStream` and then pass that to the constructor of `TarArchiveInputStream`, so that the `.gz` layer is decompressed before the `.tar` archive inside it is read. Once we have access to the inner archive, we loop over its entries and look in the `lib` folder for Ruby source code files by checking for the `.rb` file extension. This ensures that we process all the Ruby source files under `lib` inside the gem. The Ruby source code itself may then be parsed using the JRuby Parser.
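To hand a source file to a parser, its contents have to be read from the archive stream at the point where the matching entry is found. The sketch below shows one way to do this with the standard library only. It relies on the documented behavior that `TarArchiveInputStream` signals end-of-stream at the end of the current entry, so reading until `-1` yields just that one file. The class and method names are illustrative, and UTF-8 is an assumption about the encoding of the source files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons Compress or JRuby.
final class RubySourceEntries {
    /** The same filter as in the loop: a .rb file somewhere under lib/. */
    static boolean isRubySource(String entryName) {
        return entryName.startsWith("lib/") && entryName.endsWith(".rb");
    }

    /** Reads the stream until it reports end of data, without closing it. */
    static String readCurrentEntry(InputStream entryStream) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = entryStream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            // Assumption: the source file is UTF-8 encoded.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the inner loop, a call such as `RubySourceEntries.readCurrentEntry(dataTarStream)` guarded by `isRubySource(sourceEntry.getName())` would return the text of the current file. The stream is left open so that the loop can continue with the next entry.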