
Contributing to the AWS Amplify Android Open Source Project

AWS Amplify is a product that helps mobile and frontend developers build and deploy secure, scalable full-stack applications on AWS. In a project at my company, I implemented a feature that periodically uploads log files to S3, and I monitored it continuously in production. I discovered that java.lang.OutOfMemoryError crashes were occurring repeatedly across multiple Android devices.

Analyzing the Root Cause

Here is a summary of the collected crash reports:

Fatal Exception: java.lang.OutOfMemoryError: Failed to allocate a 8388620 byte allocation with 8388608 free bytes and 19MB until OOM
       at java.util.IdentityHashMap.resize(IdentityHashMap.java:476)
       at java.util.IdentityHashMap.put(IdentityHashMap.java:452)
       ...
       at com.amplifyframework.storage.s3.transfer.TransferWorkerObserver$attachObserver$2.invokeSuspend(TransferWorkerObserver.kt:199)

It appeared that, in the TransferWorkerObserver class, LiveData.observeForever() was being called repeatedly for the same transfer tag, registering duplicate observers. This caused the internal IdentityHashMap to grow without bound until an allocation finally failed. Memory pressure was amplified further because the observed LiveData instances are coupled with Room's InvalidationTracker.
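To illustrate the failure mode in isolation, here is a minimal, Android-free sketch. FakeLiveData is a stand-in I wrote for this post, not the real androidx class: the point is that when observers are kept in an identity-keyed map, registering the same logical observer through fresh object instances adds a new entry every single time.

```kotlin
import java.util.IdentityHashMap

// Toy model of identity-keyed observer bookkeeping (not the real LiveData).
class FakeLiveData<T> {
    private val observers = IdentityHashMap<(T) -> Unit, Boolean>()

    fun observeForever(observer: (T) -> Unit) {
        // Keyed by object identity: two equal-looking lambdas are
        // still two distinct entries.
        observers[observer] = true
    }

    fun observerCount(): Int = observers.size
}

fun main() {
    val liveData = FakeLiveData<String>()
    repeat(1000) { i ->
        // Each capturing lambda is a fresh object, so this registers
        // 1000 entries for what is logically one observer.
        liveData.observeForever { value -> println("observer $i saw $value") }
    }
    println(liveData.observerCount()) // grows without bound -> 1000
}
```

This mirrors what the crash reports suggested: each periodic work run re-registered an observer for an already-observed tag, and nothing ever deduplicated them.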

Filing the Issue

I first filed an issue, following the repository's issue template, to report the problem.

For security reasons, I could not attach our company code as-is, so I removed all sensitive parts and rewrote only the core logic. Here is the snippet I included in the issue:

@HiltWorker
class LogUploadWorker @AssistedInject constructor(
    @Assisted private val appContext: Context,
    @Assisted workerParams: WorkerParameters,
    @Dispatcher(IO) private val ioDispatcher: CoroutineDispatcher
) : CoroutineWorker(appContext, workerParams) {

    override suspend fun getForegroundInfo(): ForegroundInfo =
        appContext.logSyncForegroundInfo()

    override suspend fun doWork(): Result = withContext(ioDispatcher) {
        try {
            val externalFilesDirPath = inputData.getString("externalFilesDirPath")
            val externalFilesDir = externalFilesDirPath?.let { File(it) }
            if (externalFilesDir != null && externalFilesDir.exists()) {
                uploadLogFiles(externalFilesDir)
            }
            Result.success()
        } catch (e: Exception) {
            Timber.e("Log upload exception: ${e.message}")
            // Returned as the value of withContext so the worker actually retries.
            Result.retry()
        }
    }

    @OptIn(ExperimentalCoroutinesApi::class, FlowPreview::class)
    private suspend fun uploadLogFiles(externalFilesDir: File) {
        val curFiles = externalFilesDir.listFiles()?.filter { file ->
            System.currentTimeMillis() - file.lastModified() < TWO_WEEK_TIME_MILLIS
        }?.sortedByDescending { it.lastModified() }

        if (curFiles.isNullOrEmpty()) {
            Timber.i("No log files or directory found.")
            return
        }

        val environmentPrefix = if (BuildConfig.DEBUG) "debug" else "release"
        val dateFormat = SimpleDateFormat("yyyy-MM-dd", Locale.getDefault())

        curFiles.forEach { file ->
            val date = dateFormat.format(Date(file.lastModified()))
            val fileIndex = file.name.substringBefore("_logs.txt").takeLastWhile { it.isDigit() }
            val key = "$environmentPrefix/sample/$date/log$fileIndex.txt"

            try {
                val result = Amplify.Storage.uploadFile(
                    StoragePath.fromString("public/$key"),
                    file
                ).result()
                Timber.i("Log file upload successful: ${result.path}")
            } catch (error: Exception) {
                Timber.e("Log file upload failed: ${error.message} - ${error.cause}")
            }
        }
    }

    companion object {
        private const val TWO_WEEK_TIME_MILLIS = 14 * 24 * 60 * 60 * 1000L

        fun startUpUploadWork(
            externalFilesDirPath: File?
        ) = PeriodicWorkRequestBuilder<LogUploadWorker>(
            1, TimeUnit.HOURS,
            5, TimeUnit.MINUTES
        )
            .setConstraints(LogSyncConstraints)
            .setBackoffCriteria(
                BackoffPolicy.LINEAR,
                MIN_BACKOFF_MILLIS,
                TimeUnit.MILLISECONDS
            )
            .setInputData(
                Data.Builder()
                    .putString("externalFilesDirPath", externalFilesDirPath?.absolutePath)
                    .build()
            )
            .build()
    }
}

To briefly explain the logic in the snippet above: it builds a File from the log-directory path passed in as input data, keeps only files modified within the last two weeks, and sorts them by most recent modification. For each file, it constructs an AWS S3 storage key by combining the build type, the file's modification date, and the index parsed from the file name, then uploads the file and logs the result with Timber. When an exception occurs, the worker returns Result.retry(), backed by a linear backoff policy. Since this is a periodically executed task, I used WorkManager's PeriodicWorkRequestBuilder.
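The key-construction step can be extracted into a small pure function for clarity (buildLogKey is an illustrative name, not part of the original worker):

```kotlin
import java.text.SimpleDateFormat
import java.util.Date
import java.util.Locale

// Builds the S3 object key from the build type, the file's last-modified
// date, and the numeric index parsed from a "<n>_logs.txt" file name.
fun buildLogKey(isDebug: Boolean, lastModifiedMillis: Long, fileName: String): String {
    val environmentPrefix = if (isDebug) "debug" else "release"
    val date = SimpleDateFormat("yyyy-MM-dd", Locale.getDefault())
        .format(Date(lastModifiedMillis))
    val fileIndex = fileName.substringBefore("_logs.txt").takeLastWhile { it.isDigit() }
    return "$environmentPrefix/sample/$date/log$fileIndex.txt"
}

fun main() {
    // e.g. a debug build uploading "3_logs.txt"
    println(buildLogKey(isDebug = true, lastModifiedMillis = System.currentTimeMillis(), fileName = "3_logs.txt"))
}
```

Factoring it out this way also makes the naming scheme trivially unit-testable, which helps when the bucket layout is shared across platforms.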

Through the comments on that issue, the maintainers asked about the frequency of occurrence and the specifications of the affected devices. At the time, I suspected that the problem might lie in my own code rather than in the aws-amplify/amplify-android library itself, and that further optimization on my side might prevent the OOM errors, so I made several improvements to my implementation.

However, despite these efforts, OOM errors continued to be collected periodically.

Fixing It Myself

When working on open source, it is important to comply with the project's license. Before contributing, you should also read the CONTRIBUTING.md file carefully and follow its guidelines.

I decided to fix the issue myself. I re-analyzed the collected crash reports, modified the code, and created a PR. Here is the modified code:

import java.util.concurrent.ConcurrentHashMap

// Tracks which transfer tags already have an observer attached.
private val observedTags = ConcurrentHashMap.newKeySet<String>()

private suspend fun attachObserver(tag: String) {
    withContext(Dispatchers.Main) {
        // add() returns false if the tag is already tracked,
        // so the same tag is never observed twice.
        if (!observedTags.add(tag)) return@withContext
        val liveData = workManager.getWorkInfosByTagLiveData(tag)
        liveData.observeForever(this@TransferWorkerObserver)
    }
}

private suspend fun removeObserver(tag: String) {
    withContext(Dispatchers.Main) {
        // remove() returns false if the tag was never tracked.
        if (!observedTags.remove(tag)) return@withContext
        workManager.getWorkInfosByTagLiveData(tag)
            .removeObserver(this@TransferWorkerObserver)
    }
}

I added logic to the TransferWorkerObserver class in the AWS Amplify Storage module to prevent duplicate observer registration. ConcurrentHashMap.newKeySet() returns a thread-safe Set view backed by a ConcurrentHashMap. If a tag is already present, add() returns false and the function exits early; likewise, remove() returns false if the tag is not in observedTags and the function exits. This prevents unnecessary duplicate registrations for the same tag.
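The Set contract the fix relies on can be verified in isolation, outside of any Android code:

```kotlin
import java.util.concurrent.ConcurrentHashMap

fun main() {
    val observedTags = ConcurrentHashMap.newKeySet<String>()

    check(observedTags.add("transfer-1"))     // first add: tag becomes tracked
    check(!observedTags.add("transfer-1"))    // duplicate add: false -> early-return path

    check(observedTags.remove("transfer-1"))  // tracked tag is removed
    check(!observedTags.remove("transfer-1")) // unknown tag: false -> early-return path

    println("dedup contract holds")
}
```

Because both add() and remove() are atomic, the check-then-act pattern in attachObserver/removeObserver stays correct even if multiple coroutines race on the same tag.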

A Rewarding Result

After receiving positive feedback from a maintainer calling it a good idea, I addressed a few suggestions, and the PR was ultimately approved. The fix was included in the 2.27.1 release.

Through this experience, I learned the importance of actively contributing PRs to fix errors discovered in open source libraries used in company projects. It also gave me the opportunity to explore the internal workings of a library I had been using frequently but had never examined in depth.

Regrets and Reflections

If I had analyzed heap dumps more persistently, could I have identified the root cause faster and backed the issue report with stronger evidence?

In retrospect, I should have captured heap dumps early on using Android Studio’s Memory Profiler and performed a before-and-after comparison to quantify the leak. A side-by-side analysis of retained object counts and shallow/retained heap sizes for IdentityHashMap and LiveData observer entries would have provided concrete evidence of the unbounded growth, making the issue report far more compelling. Additionally, running the LeakCanary library in a debug build configured with the periodic upload schedule could have surfaced the leak automatically, saving significant manual investigation time.
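For reference, LeakCanary 2.x installs itself automatically once the dependency is on the debug classpath, so enabling it is a one-line change in the app module's Gradle Kotlin DSL (the version number here is illustrative):

```kotlin
dependencies {
    // Debug builds only: LeakCanary watches destroyed components and
    // automatically dumps and analyzes the heap when objects are retained.
    debugImplementation("com.squareup.leakcanary:leakcanary-android:2.14")
}
```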

#opensource #aws-amplify